-rw-r--r-- 5960 libcpucycles-20230115/doc/selection.md raw
Here is how libcpucycles decides which cycle counter to use. The
underlying principles are as follows:
* Failure is not allowed. Using a low-resolution timer such as
`gettimeofday()` to estimate cycle counts is not desirable but is better
than providing no information.
* A counter that does well on some CPUs and OSes can do badly on others.
The counter selection in libcpucycles is based not just on rules set
at compile time but also on measurements of how well the counters
perform when the program first calls `cpucycles()`.
* A critical application of cycle counting is collecting cycle counts
for multiple options to see which option is faster. It is the caller's
responsibility to compute medians of cycle counts for many runs of
whatever is being benchmarked: medians filter out occasional
cycle-count jumps caused by migration to another core (if the
benchmark is not pinned to a single core) or interrupts from other OS
activity. libcpucycles does not reject an otherwise attractive counter
merely because of occasional jumps.
* Cycle-counting overhead is not desirable, but does not directly affect
comparisons of multiple options measured using the same cycle counter,
so it is less important than consistent major errors such as treating
2^32 + x cycles as x cycles. (Performance experts seeing a function
that takes billions of cycles usually focus on smaller subroutines,
but libcpucycles should not break larger measurements.) This is why
libcpucycles does not provide direct access to 32-bit cycle counters:
it provides wrappers that combine the counters with gettimeofday() to
produce 64 bits, even though this incurs some extra overhead.
* The noise introduced by typical off-core clocks, such as multiplying a
24MHz clock by 86 to estimate cycles on a 2.064GHz CPU core, comes in
small part from low resolution but much more from changes in CPU
frequency: e.g., a 10000-cycle computation might be measured as 20000
cycles when the CPU enters a power-saving mode. When libcpucycles has
access to what is believed to be an on-core cycle counter, it uses
that even when its measurements show some noise. (Choosing an on-core
cycle counter does not magically eliminate the change in the relative
speed of the CPU and DRAM; the usual advice to warm up the CPU and set
constant frequencies if possible still applies.)
When `cpucycles()` is first called, libcpucycles tries running each
cycle counter that has been compiled into the library. For example, for
64-bit ARM CPUs, libcpucycles will try `arm64-pmc`, `arm64-vct`,
`default-gettimeofday`, `default-mach`, `default-monotonic`, and
`default-perfevent`, minus any of those that failed to compile.
Cycle counters that fail at run time with SIGILL (or SIGFPE or SIGBUS or
SIGSEGV) are eliminated from the list. For example, `arm64-pmc` will
fail with SIGILL if the kernel does not allow user access to
`PMCCNTR_EL0`. Beware that libcpucycles does not catch SIGILL after its
initial tests: if the kernel initially allows user access to
`PMCCNTR_EL0` but later turns it off then `arm64-pmc` will crash.
Independently of these counters, libcpucycles uses various OS mechanisms
to obtain an _estimate_ of the CPU frequency. This estimate is also
available to the caller as `cpucycles_persecond()`.
The methods that libcpucycles uses to ask the OS for an estimated CPU
frequency fail on some OS-CPU combinations, in which case libcpucycles
falls back to a `cpucyclespersecond` environment variable, or, if that
variable does not exist, an estimate of 2399987654 cycles per second.
(This estimate is in a realistic range of CPU speeds, and is close to
multiples of 24MHz, 25MHz, and 19.2MHz, which are common crystal
frequencies.) The sysadmin can create `/etc/cpucyclespersecond` to
override all of the OS mechanisms.
For counters that do not ask for scaling, the estimated CPU frequency is
shown in `cpucycles-info` as a double-check on the counter results. For
counters that ask for scaling, libcpucycles uses the estimated CPU
frequency to compute the scaling, so this is not a double-check. If a
counter asks for scaling and the estimated CPU frequency does not seem
close to a multiple of the counter frequency (possibly with a small
power-of-2 denominator) then libcpucycles will throw the counter away,
except in the case of fixed-resolution OS counters such as
`gettimeofday` and `CLOCK_MONOTONIC`.
libcpucycles computes a precision estimate for each counter (times any
applicable scaling) as follows. Call the counter 1000 times. Check that
the counter has never decreased, and has increased at least once. (A
counter where the decrease/increase checks fail is retried 10 times, so
10000 calls overall, and removed if it fails all 10 times.) The
precision estimate is then the smallest nonzero difference between
adjacent counter results, plus a penalty explained below.
The penalty is 100 cycles for off-core counters (including RDTSC) and
`default-perfevent`, and 200 cycles for fixed-resolution OS counters.
For example, an on-core CPU cycle counter will be selected even if it
actually has, e.g., a resolution of 8 cycles and 50 cycles of overhead.
Finally, libcpucycles selects the counter where the precision estimate
is the smallest number of cycles. Note that an inaccurate estimate of
CPU frequency can influence the choice between a scaled counter and an
unscaled counter.
libcpucycles does _not_ carry out its counter selection (typically tens
of milliseconds, sometimes even more) as a static initializer; callers
are presumed to not want to incur the cost of initialization unless and
until they are actually using `cpucycles()`. A multithreaded caller thus
has to place locks around any possibly-first call to `cpucycles()`, or
create its own static initializer (an `__attribute__((constructor))`
function) with an initial `cpucycles()` call so that all subsequent
`cpucycles()` calls are thread-safe.