Go to the first, previous, next, last section, table of contents.

Details of Profiling

Implementation of Profiling

Profiling works by changing how every function in your program is compiled so that when it is called, it will stash away some information about where it was called from. From this, the profiler can figure out what function called it, and can count how many times it was called. This change is made by the compiler when your program is compiled with the `-pg' option, which causes every function to call mcount (or _mcount, or __mcount, depending on the OS and compiler) as one of its first operations.

The mcount routine, included in the profiling library, is responsible for recording in an in-memory call graph table both its parent routine (the child) and its parent's parent. This is typically done by examining the stack frame to find both the address of the child, and the return address in the original parent. Since this is a very machine-dependant operation, mcount itself is typically a short assembly-language stub routine that extracts the required information, and then calls __mcount_internal (a normal C function) with two arguments - frompc and selfpc. __mcount_internal is responsible for maintaining the in-memory call graph, which records frompc, selfpc, and the number of times each of these call arcs was transversed.

GCC Version 2 provides a magical function (__builtin_return_address), which allows a generic mcount function to extract the required information from the stack frame. However, on some architectures, most notably the SPARC, using this builtin can be very computationally expensive, and an assembly language version of mcount is used for performance reasons.

Number-of-calls information for library routines is collected by using a special version of the C library. The programs in it are the same as in the usual C library, but they were compiled with `-pg'. If you link your program with `gcc ... -pg', it automatically uses the profiling version of the library.

Profiling also involves watching your program as it runs, and keeping a histogram of where the program counter happens to be every now and then. Typically the program counter is looked at around 100 times per second of run time, but the exact frequency may vary from system to system.

This is done is one of two ways. Most UNIX-like operating systems provide a profil() system call, which registers a memory array with the kernel, along with a scale factor that determines how the program's address space maps into the array. Typical scaling values cause every 2 to 8 bytes of address space to map into a single array slot. On every tick of the system clock (assuming the profiled program is running), the value of the program counter is examined and the corresponding slot in the memory array is incremented. Since this is done in the kernel, which had to interrupt the process anyway to handle the clock interrupt, very little additional system overhead is required.

However, some operating systems, most notably Linux 2.0 (and earlier), do not provide a profil() system call. On such a system, arrangements are made for the kernel to periodically deliver a signal to the process (typically via setitimer()), which then performs the same operation of examining the program counter and incrementing a slot in the memory array. Since this method requires a signal to be delivered to user space every time a sample is taken, it uses considerably more overhead than kernel-based profiling. Also, due to the added delay required to deliver the signal, this method is less accurate as well.

A special startup routine allocates memory for the histogram and either calls profil() or sets up a clock signal handler. This routine (monstartup) can be invoked in several ways. On Linux systems, a special profiling startup file gcrt0.o, which invokes monstartup before main, is used instead of the default crt0.o. Use of this special startup file is one of the effects of using `gcc ... -pg' to link. On SPARC systems, no special startup files are used. Rather, the mcount routine, when it is invoked for the first time (typically when main is called), calls monstartup.

If the compiler's `-a' option was used, basic-block counting is also enabled. Each object file is then compiled with a static array of counts, initially zero. In the executable code, every time a new basic-block begins (i.e. when an if statement appears), an extra instruction is inserted to increment the corresponding count in the array. At compile time, a paired array was constructed that recorded the starting address of each basic-block. Taken together, the two arrays record the starting address of every basic-block, along with the number of times it was executed.

The profiling library also includes a function (mcleanup) which is typically registered using atexit() to be called as the program exits, and is responsible for writing the file `gmon.out'. Profiling is turned off, various headers are output, and the histogram is written, followed by the call-graph arcs and the basic-block counts.

The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth. This is because samples of the program counter are taken at fixed intervals of the program's run time. Therefore, the time measurements in gprof output say nothing about time that your program was not running. For example, a part of the program that creates so much data that it cannot all fit in physical memory at once may run very slowly due to thrashing, but gprof will say it uses little time. On the other hand, sampling by run time has the advantage that the amount of load due to other users won't directly affect the output you get.

Profiling Data File Format

The old BSD-derived file format used for profile data does not contain a magic cookie that allows to check whether a data file really is a gprof file. Furthermore, it does not provide a version number, thus rendering changes to the file format almost impossible. GNU gprof uses a new file format that provides these features. For backward compatibility, GNU gprof continues to support the old BSD-derived format, but not all features are supported with it. For example, basic-block execution counts cannot be accommodated by the old file format.

The new file format is defined in header file `gmon_out.h'. It consists of a header containing the magic cookie and a version number, as well as some spare bytes available for future extensions. All data in a profile data file is in the native format of the host on which the profile was collected. GNU gprof adapts automatically to the byte-order in use.

In the new file format, the header is followed by a sequence of records. Currently, there are three different record types: histogram records, call-graph arc records, and basic-block execution count records. Each file can contain any number of each record type. When reading a file, GNU gprof will ensure records of the same type are compatible with each other and compute the union of all records. For example, for basic-block execution counts, the union is simply the sum of all execution counts for each basic-block.

Histogram Records

Histogram records consist of a header that is followed by an array of bins. The header contains the text-segment range that the histogram spans, the size of the histogram in bytes (unlike in the old BSD format, this does not include the size of the header), the rate of the profiling clock, and the physical dimension that the bin counts represent after being scaled by the profiling clock rate. The physical dimension is specified in two parts: a long name of up to 15 characters and a single character abbreviation. For example, a histogram representing real-time would specify the long name as "seconds" and the abbreviation as "s". This feature is useful for architectures that support performance monitor hardware (which, fortunately, is becoming increasingly common). For example, under DEC OSF/1, the "uprofile" command can be used to produce a histogram of, say, instruction cache misses. In this case, the dimension in the histogram header could be set to "i-cache misses" and the abbreviation could be set to "1" (because it is simply a count, not a physical dimension). Also, the profiling rate would have to be set to 1 in this case.

Histogram bins are 16-bit numbers and each bin represent an equal amount of text-space. For example, if the text-segment is one thousand bytes long and if there are ten bins in the histogram, each bin represents one hundred bytes.

Call-Graph Records

Call-graph records have a format that is identical to the one used in the BSD-derived file format. It consists of an arc in the call graph and a count indicating the number of times the arc was traversed during program execution. Arcs are specified by a pair of addresses: the first must be within caller's function and the second must be within the callee's function. When performing profiling at the function level, these addresses can point anywhere within the respective function. However, when profiling at the line-level, it is better if the addresses are as close to the call-site/entry-point as possible. This will ensure that the line-level call-graph is able to identify exactly which line of source code performed calls to a function.

Basic-Block Execution Count Records

Basic-block execution count records consist of a header followed by a sequence of address/count pairs. The header simply specifies the length of the sequence. In an address/count pair, the address identifies a basic-block and the count specifies the number of times that basic-block was executed. Any address within the basic-address can be used.

`gprof`'s Internal Operation

Like most programs, gprof begins by processing its options. During this stage, it may building its symspec list (sym_ids.c:sym_id_add), if options are specified which use symspecs. gprof maintains a single linked list of symspecs, which will eventually get turned into 12 symbol tables, organized into six include/exclude pairs - one pair each for the flat profile (INCL_FLAT/EXCL_FLAT), the call graph arcs (INCL_ARCS/EXCL_ARCS), printing in the call graph (INCL_GRAPH/EXCL_GRAPH), timing propagation in the call graph (INCL_TIME/EXCL_TIME), the annotated source listing (INCL_ANNO/EXCL_ANNO), and the execution count listing (INCL_EXEC/EXCL_EXEC).

After option processing, gprof finishes building the symspec list by adding all the symspecs in default_excluded_list to the exclude lists EXCL_TIME and EXCL_GRAPH, and if line-by-line profiling is specified, EXCL_FLAT as well. These default excludes are not added to EXCL_ANNO, EXCL_ARCS, and EXCL_EXEC.

Next, the BFD library is called to open the object file, verify that it is an object file, and read its symbol table (core.c:core_init), using bfd_canonicalize_symtab after mallocing an appropiate sized array of asymbols. At this point, function mappings are read (if the `--file-ordering' option has been specified), and the core text space is read into memory (if the `-c' option was given).

gprof's own symbol table, an array of Sym structures, is now built. This is done in one of two ways, by one of two routines, depending on whether line-by-line profiling (`-l' option) has been enabled. For normal profiling, the BFD canonical symbol table is scanned. For line-by-line profiling, every text space address is examined, and a new symbol table entry gets created every time the line number changes. In either case, two passes are made through the symbol table - one to count the size of the symbol table required, and the other to actually read the symbols. In between the two passes, a single array of type Sym is created of the appropiate length. Finally, symtab.c:symtab_finalize is called to sort the symbol table and remove duplicate entries (entries with the same memory address).

The symbol table must be a contiguous array for two reasons. First, the qsort library function (which sorts an array) will be used to sort the symbol table. Also, the symbol lookup routine (symtab.c:sym_lookup), which finds symbols based on memory address, uses a binary search algorithm which requires the symbol table to be a sorted array. Function symbols are indicated with an is_func flag. Line number symbols have no special flags set. Additionally, a symbol can have an is_static flag to indicate that it is a local symbol.

With the symbol table read, the symspecs can now be translated into Syms (sym_ids.c:sym_id_parse). Remember that a single symspec can match multiple symbols. An array of symbol tables (syms) is created, each entry of which is a symbol table of Syms to be included or excluded from a particular listing. The master symbol table and the symspecs are examined by nested loops, and every symbol that matches a symspec is inserted into the appropriate syms table. This is done twice, once to count the size of each required symbol table, and again to build the tables, which have been malloced between passes. From now on, to determine whether a symbol is on an include or exclude symspec list, gprof simply uses its standard symbol lookup routine on the appropriate table in the syms array.

Now the profile data file(s) themselves are read (gmon_io.c:gmon_out_read), first by checking for a new-style `gmon.out' header, then assuming this is an old-style BSD `gmon.out' if the magic number test failed.

New-style histogram records are read by hist.c:hist_read_rec. For the first histogram record, allocate a memory array to hold all the bins, and read them in. When multiple profile data files (or files with multiple histogram records) are read, the starting address, ending address, number of bins and sampling rate must match between the various histograms, or a fatal error will result. If everything matches, just sum the additional histograms into the existing in-memory array.

As each call graph record is read (call_graph.c:cg_read_rec), the parent and child addresses are matched to symbol table entries, and a call graph arc is created by cg_arcs.c:arc_add, unless the arc fails a symspec check against INCL_ARCS/EXCL_ARCS. As each arc is added, a linked list is maintained of the parent's child arcs, and of the child's parent arcs. Both the child's call count and the arc's call count are incremented by the record's call count.

Basic-block records are read (basic_blocks.c:bb_read_rec), but only if line-by-line profiling has been selected. Each basic-block address is matched to a corresponding line symbol in the symbol table, and an entry made in the symbol's bb_addr and bb_calls arrays. Again, if multiple basic-block records are present for the same address, the call counts are cumulative.

A gmon.sum file is dumped, if requested (gmon_io.c:gmon_out_write).

If histograms were present in the data files, assign them to symbols (hist.c:hist_assign_samples) by iterating over all the sample bins and assigning them to symbols. Since the symbol table is sorted in order of ascending memory addresses, we can simple follow along in the symbol table as we make our pass over the sample bins. This step includes a symspec check against INCL_FLAT/EXCL_FLAT. Depending on the histogram scale factor, a sample bin may span multiple symbols, in which case a fraction of the sample count is allocated to each symbol, proportional to the degree of overlap. This effect is rare for normal profiling, but overlaps are more common during line-by-line profiling, and can cause each of two adjacent lines to be credited with half a hit, for example.

If call graph data is present, cg_arcs.c:cg_assemble is called. First, if `-c' was specified, a machine-dependant routine (find_call) scans through each symbol's machine code, looking for subroutine call instructions, and adding them to the call graph with a zero call count. A topological sort is performed by depth-first numbering all the symbols (cg_dfn.c:cg_dfn), so that children are always numbered less than their parents, then making a array of pointers into the symbol table and sorting it into numerical order, which is reverse topological order (children appear before parents). Cycles are also detected at this point, all members of which are assigned the same topological number. Two passes are now made through this sorted array of symbol pointers. The first pass, from end to beginning (parents to children), computes the fraction of child time to propogate to each parent and a print flag. The print flag reflects symspec handling of INCL_GRAPH/EXCL_GRAPH, with a parent's include or exclude (print or no print) property being propagated to its children, unless they themselves explicitly appear in INCL_GRAPH or EXCL_GRAPH. A second pass, from beginning to end (children to parents) actually propogates the timings along the call graph, subject to a check against INCL_TIME/EXCL_TIME. With the print flag, fractions, and timings now stored in the symbol structures, the topological sort array is now discarded, and a new array of pointers is assembled, this time sorted by propagated time.

Finally, print the various outputs the user requested, which is now fairly straightforward. The call graph (cg_print.c:cg_print) and flat profile (hist.c:hist_print) are regurgitations of values already computed. The annotated source listing (basic_blocks.c:print_annotated_source) uses basic-block information, if present, to label each line of code with call counts, otherwise only the function call counts are presented.

The function ordering code is marginally well documented in the source code itself (cg_print.c). Basically, the functions with the most use and the most parents are placed first, followed by other functions with the most use, followed by lower use functions, followed by unused functions at the end.

Debugging `gprof`

If gprof was compiled with debugging enabled, the `-d' option triggers debugging output (to stdout) which can be helpful in understanding its operation. The debugging number specified is interpreted as a sum of the following options:

2 - Topological sort: Monitor depth-first numbering of symbols during call graph analysis
4 - Cycles: Shows symbols as they are identified as cycle heads
16 - Tallying: As the call graph arcs are read, show each arc and how the total calls to each function are tallied
32 - Call graph arc sorting: Details sorting individual parents/children within each call graph entry
64 - Reading histogram and call graph records: Shows address ranges of histograms as they are read, and each call graph arc
128 - Symbol table: Reading, classifying, and sorting the symbol table from the object file. For line-by-line profiling (`-l' option), also shows line numbers being assigned to memory addresses.
256 - Static call graph: Trace operation of `-c' option
512 - Symbol table and arc table lookups: Detail operation of lookup routines
1024 - Call graph propagation: Shows how function times are propagated along the call graph
2048 - Basic-blocks: Shows basic-block records as they are read from profile data (only meaningful with `-l' option)
4096 - Symspecs: Shows symspec-to-symbol pattern matching operation
8192 - Annotate source: Tracks operation of `-A' option

Go to the first, previous, next, last section, table of contents.

Les analystes,
Yvan Michaud et Mario Paquet
courrier �lectronique: analystes@dmi.usherb.ca