-
write isolated benchmarks, i.e. one executable benchmarks one thing
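a minimal sketch of that layout, assuming C++ benchmarks built on Google Benchmark (file and binary names are hypothetical):
  # one source file and one executable per measured thing:
  #   benchmarks/bench_hash.cc   -> ./bench_hash    (hash function only)
  #   benchmarks/bench_parse.cc  -> ./bench_parse   (JSON parser only)
  # run each binary on its own so one workload cannot perturb another's numbers:
  ./bench_hash
  ./bench_parse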
-
benchmark compiler flags:
- optimized (-O3)
- with debug info (-g)
- with frame pointer (-fno-omit-frame-pointer)
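for example, a sketch of a compile line combining all three (reusing the hypothetical bench_hash.cc from above):
  # optimized, but with debug info and frame pointers so profilers can symbolize and walk stacks
  g++ -std=c++17 -O3 -g -fno-omit-frame-pointer \
      benchmarks/bench_hash.cc -lbenchmark -lpthread -o bench_hash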
-
run the benchmarks and, for each run (command sketches follow this list):
- output results as JSON
- record:
- CPU profile (perf and/or gperftools)
- stats with Linux built-in commands (vmstat, mpstat, pidstat, etc.) and/or BPF/BCC tools (ext4slower, runqlen, etc.)
- perf stats (PMCs), mostly IPC (instructions per cycle):
  perf stat -e instructions,cycles
- run them under cachegrind (more stable results, can be compared with cg_diff)
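JSON output, a sketch assuming Google Benchmark binaries (file names are arbitrary):
  # print human-readable output to the console, write machine-readable JSON to a file
  ./bench_hash --benchmark_out=bench_hash.json --benchmark_out_format=json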
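CPU profile, a sketch with perf and with gperftools (the libprofiler.so path varies by distro; output file names are assumptions):
  # linux-perf: sample at 99 Hz with call stacks (frame pointers kept by -fno-omit-frame-pointer)
  perf record -F 99 -g -o perf.data -- ./bench_hash
  # gperftools: preload the profiler and point it at an output file
  LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=bench_hash.prof ./bench_hash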
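OS-level stats, a sketch of the tools run in separate terminals (or redirected to log files) while the benchmark is running; BCC tool locations vary by distribution:
  # classic Linux observability, one-second intervals
  vmstat 1
  mpstat -P ALL 1
  pidstat 1
  # BCC tools, typically installed under /usr/share/bcc/tools (need root)
  sudo /usr/share/bcc/tools/runqlen 1
  sudo /usr/share/bcc/tools/ext4slower 1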
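cachegrind run, a sketch (output file name is arbitrary):
  # simulated cache/branch behaviour; counts are stable enough to diff between runs
  valgrind --tool=cachegrind --cachegrind-out-file=cachegrind.out.bench_hash ./bench_hash
  # human-readable per-function report
  cg_annotate cachegrind.out.bench_hash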
-
best to run them on a native machine (not a VM or container)
dedicated to benchmarks, with only the minimum required software installed (to avoid the noisy neighbour problem)
-
compare results with the last run (with gbenchdiff and cg_diff)
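a sketch of the comparison step, assuming gbenchdiff takes the two Google Benchmark JSON files as arguments (check its --help); cg_diff ships with valgrind:
  # compare Google Benchmark JSON results between the previous and the current run
  gbenchdiff bench_hash.prev.json bench_hash.json
  # compare cachegrind counts between the previous and the current run
  cg_diff cachegrind.out.prev cachegrind.out.bench_hash > cachegrind.diff
  cg_annotate cachegrind.diff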
-
send an email back to the submitter with the results report (or a link where it can be downloaded):
- gbenchdiff and cg_diff output
- cachegrind report
- linux-perf CPU profile (perf.data)
- gperftools CPU profile
- Linux/eBPF tools results
-
interpret the results (command sketches after this list)
- analyze the gperftools CPU profile with pprof
- analyze the perf CPU profile as a flame graph and with FlameScope
- check the OS/BCC/perf stats to better understand what is going on
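sketches of the interpretation commands (file names match the ones assumed above; stackcollapse-perf.pl / flamegraph.pl come from the FlameGraph repo, and the flamescope/ path is a local FlameScope checkout):
  # gperftools profile: per-function text report via pprof
  pprof --text ./bench_hash bench_hash.prof
  # perf profile -> flame graph SVG
  perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl > bench_hash.svg
  # perf profile -> FlameScope: drop the perf script output into FlameScope's examples/ directory
  perf script -i perf.data --header > flamescope/examples/bench_hash.perf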