Profiling with Ctrl-C (yosefk.com)
88 points by jstanley on Sept 2, 2024 | 40 comments


My favorite hack along these lines was to put a timer ISR on an embedded system that did nothing more than crawl up the stack past the two or three addresses that the ISR itself used (yep, it was really just as dumb as [sp + 8] or whatever) to find the interrupted return address, and then dump that address to the serial terminal every second or so.

You can fix a lot of stupid problems that way. (And most problems are stupid.) Yes, yes, a real profiler would be better, but if you don't have the fancy tools because your employer doesn't buy you such things, and it's a primitive and cruddy embedded system so there's no obvious better way to do it, and you built this horrible hack right now and... hey, the hack solved the problem, and what do you know? it keeps on solving things....
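
For the curious, the whole thing is only a handful of lines. A rough, ARM32-flavoured sketch of the shape it takes is below - the frame offset, the uart_* helpers and timer_ack_irq() are hypothetical, target-specific stand-ins, not any real API:

    #include <stdint.h>

    extern void uart_putc(char c);           /* hypothetical board-support calls */
    extern void uart_puthex(uint32_t v);
    extern void timer_ack_irq(void);

    /* Periodic timer ISR: fish the interrupted return address out of the
     * stack at a fixed (entirely target/compiler dependent) offset, and
     * push it out the serial port. */
    void __attribute__((interrupt("IRQ"))) timer_isr(void)
    {
        uint32_t sp_now;
        __asm__ volatile ("mov %0, sp" : "=r"(sp_now));

        uint32_t sampled_pc = ((uint32_t *)sp_now)[8];   /* the "[sp + 8]" bit */

        uart_puthex(sampled_pc);
        uart_putc('\n');

        timer_ack_irq();                     /* hypothetical: clear the interrupt source */
    }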


> Yes, yes, a real profiler would be better

As someone who wrote several profilers for a living... that is a real profiler.


only once you pipe the output to sort | uniq -c
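
i.e. something like this, assuming the ISR's output was captured one address per line into a file (name made up):

    sort samples.log | uniq -c | sort -rn | head

which collapses the raw address stream into a crude "hottest address" histogram.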


It's amusing; your 2nd paragraph drives at the core of the programmer experience - having to disclaim everything you do and say with "yes, I know there are better ways, I'm not dumb, I'm just working within constraints, and this does the job".


that's just the core of the posting-on-hn experience. when i'm programming i don't try to preemptively defuse personal attacks like that, though i do try to accurately document the advantages and disadvantages of things in my comments, which often looks like 'this is a fucking broken piece of shit that only works if none of the input bytes are null and takes cubic time'

but my motivation there isn't to keep people from calling me an idiot. that's a lost cause. it's to save me and them time rediscovering problems i already know about


> My favorite hack along these lines was to put a timer/ISR on an embedded system that did nothing more than crawl up the stack frame the two or three addresses that the ISR used (yep, it was really just as dumb as [sp + 8] or whatever), and then dump that address to the serial terminal every second or so.

Did the same, but added a stack canary to tell if I overflowed, and wrote my call-stack results to a hardcoded address at the very end of memory (last 32 bytes, IIRC). When the chip reset, in addition to the reset flags (brownout, etc), I could peek that memory to see what the last addresses in the callstack were.

Helped immensely in figuring out a transient bug (device resets) which was hard to repro.
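
A rough sketch of the reset-survival part of that idea, with RAM_END and uart_puthex() as hypothetical, target-specific placeholders (the overflow canary is just a known pattern at the stack limit checked the same way, omitted here; the real thing also depends on your startup code not zeroing that region):

    #include <stdint.h>

    #define RAM_END      0x20008000u                      /* hypothetical */
    #define CRASH_AREA   ((volatile uint32_t *)(RAM_END - 32))
    #define CRASH_MAGIC  0xDEADC0DEu

    extern void uart_puthex(uint32_t v);                  /* hypothetical BSP call */

    /* Called from the sampling ISR: keep the latest call-stack sample
     * in the last 32 bytes of RAM (1 magic word + up to 7 frames). */
    void store_last_callstack(const uint32_t *frames, int n)
    {
        CRASH_AREA[0] = CRASH_MAGIC;
        for (int i = 0; i < n && i < 7; i++)
            CRASH_AREA[1 + i] = frames[i];
    }

    /* Called early at boot, before anything touches that region. */
    void report_crash_if_any(uint32_t reset_flags)
    {
        if (reset_flags && CRASH_AREA[0] == CRASH_MAGIC)  /* e.g. brownout/watchdog bits */
            for (int i = 1; i < 8; i++)
                uart_puthex(CRASH_AREA[i]);               /* last known call stack */
        CRASH_AREA[0] = 0;                                /* re-arm for the next run */
    }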


For something more systematic/reproducible, it's possible to use rr[1] to record the program, and then, in a replay, run to the end (or whatever boundary you care about), run "when-ticks", and do various "seek-ticks 123456789" invocations with values below that number to seek to various points in the recording.

I've made a thing[2] that can display that within a visual timeline (interpolated between ticks of the nearest syscalls/events, which do have known real time), essentially giving a sampling flamegraph that can be arbitrarily zoomed-in, with the ability to interact with it at any point in gdb.

Though this is not without its issues - rr's ticks count retired conditional branches, so a loop with a 200-instruction body takes up the same number of ticks, and thus the same amount of visual space, as one with a 5-instruction body; and of course more low-level things like mispredicts/stalls/IPC are entirely lost.

[1]: https://rr-project.org/

[2]: https://github.com/dzaima/grr
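
For reference, the flow with stock rr is roughly this (program name is a placeholder; the tick value is whatever when-ticks reported, minus however far you want to rewind):

    rr record ./myprog               # record the buggy/slow run once
    rr replay                        # deterministic replay under gdb
    (rr) continue                    # run to the end (or a breakpoint)
    (rr) when-ticks                  # print the current tick count
    (rr) seek-ticks 123456789        # jump to an arbitrary earlier tick
    (rr) bt                          # inspect the stack at that point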


rr looks interesting. It should be useful for debugging race conditions or anything similarly "random-based": once you've recorded the issue, it becomes 100% reproducible in your debugger.

Will try it next time I have such an issue. Thank you!


If you do try doing that, use rr's "chaos mode".


I wonder how hard it would be to have a profiler dump a big chunk of stack on each sample interrupt, convert these into core dump format, and then use gdb or whatever to decode the traces for analysis? This ought to have the touted benefits without the downside of it being slow to capture a bunch of samples.


I believe this is essentially what linux perf's "--call-graph dwarf" does. On my system that ends up producing ~33MB/s of recording data for ~4000 samples/s.
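
Concretely, that mode is something like this (binary name is a placeholder):

    perf record --call-graph dwarf -F 4000 ./myprog
    perf report

Each sample carries a copy of the user stack (8 KB by default, tunable as --call-graph dwarf,<size>) for post-hoc DWARF unwinding, which is where the data volume comes from.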


With `-z` (zstd compression) you can bring down the disk-space cost of dwarf unwinding by a factor of ~100 based on my personal experience.

to GP: What you describe sounds like https://github.com/koute/not-perf to me


There are still issues with perf being unable to parse the debug format for some stuff, e.g. code compiled with `-ggdb3` as touched on in TFA. The idea is more: can I take one of the stacks that perf captured, and hand that off to GDB for a stack trace, without perf trying to parse/interpret it itself?


The perf.data file that perf produces is a documented format that you can parse manually if so desired, and in the "dwarf" mode it does in fact directly store the 8KB of stack data (or however much is configured) plus the register values.


Speaking of keyboard shortcuts, I miss BSD's Ctrl-T and SIGINFO. It often helped to see if a process was hung.


I don't know exactly what these BSD things did, but there is a super easy way nowadays to get the stack for any process:

    eu-stack -i -p $(pidof ...)
Thanks to debuginfod this will even give you good backtraces right away (at the cost of some initial delay while the data is fetched from the web; subsequent runs are fast). If you get a "permission denied" error, you probably need to set kernel.yama.ptrace_scope=0
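
In other words, roughly this (process name made up; many distros already export DEBUGINFOD_URLS for you):

    sudo sysctl -w kernel.yama.ptrace_scope=0        # allow ptrace-attaching to non-children
    export DEBUGINFOD_URLS="https://debuginfod.elfutils.org/"
    eu-stack -i -p $(pidof myprog)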


the bsd things still work; you can install a bsd in qemu or a spare laptop and try them

from your reference to kernel.yama.ptrace_scope (and your apparent belief that bsd belongs to the distant past) i infer that eu-stack is a linux thing? this looks pretty awesome, thanks for the tip!

https://stackoverflow.com/questions/12394935/getting-stacktr...


>Apparently gcc generates some DWARF data that gdb is slow to handle. The GNU linker fixes this data, so that gdb doesn’t end up handling it slowly. LLD refuses to emulate this behavior of the GNU linker, because it’s gcc’s fault to have produced that DWARF data in the first place. And gdb refuses to handle LLD’s output efficiently, because it’s LLD’s fault to not have handled gcc’s output the way the GNU linker does. So I just remove -ggdb3 - it gives you a bit richer debug info, but it’s not worth the slower linking with gold instead of LLD, nor the slowdown in gdb that you get with LLD. And everyone links happily ever after.

lol, it's a story as old as time. The infinite loop of ego-entrenched developers not wanting to change something out of some trivial, inconsequential disagreement. The bike shed will be built my way!


I do get moderately annoyed by having to write code that's fundamentally a workaround for somebody else's failure, though I usually still do it anyway.

Sometimes I do so but add something to the stderr output referencing the issue number I'm compensating for - that has a surprisingly good rate of getting somebody who knows what they're doing looking at the issue in the other project and submitting a patch.


I mostly use GUI-based debuggers (and profilers), but even in this case I found it often useful to pause the program at random times when it appears "stuck".

Most of the time I don't even need to reach for a profiler proper.


Random sampling is not only useful for quick and dirty debugging, but also for engineering nuclear bombs: https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm#...


Of course, a lot of that old neutron transport work is now used for ray tracing in cinema production. Metropolis (and MCMC in general) is one example, and I also remember some sort of volumetric scattering thing.


few phenomena in daily life are either quicker or dirtier than a nuclear bomb


Yes, exactly, I do this all the time. Only after having exhausted this, which I consider to be the low-hanging fruits of performance gains, do I start profiling code with an actual profiler.


> what do you know, there’s one billion stack frames from the nlohmann JSON parser, I guess it all gets inlined in the release build;

My guess would be that it's because tail-call optimisation only happens at -O2 and above.

Parsing recursively is frequently the cleanest way to implement a parser of tree-structured input, after all.

If you're doing anything recursively, it makes sense to slightly restructure the recursive call to be the last call in the scope, so that TCO can be applied.
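
A toy illustration of that restructuring (the struct and names are made up): the first version still has an addition to do after the recursive call returns, so the frame has to stay alive; the second passes an accumulator so the recursive call is the last thing the function does and can become a jump under -foptimize-sibling-calls (enabled at -O2).

    struct node { long value; struct node *next; };

    /* Not a tail call: the "+" happens after the recursion returns. */
    long sum_list(const struct node *n)
    {
        return n ? n->value + sum_list(n->next) : 0;
    }

    /* Tail call: nothing left to do after the recursive call,
     * so the compiler can reuse the current stack frame. */
    long sum_list_tail(const struct node *n, long acc)
    {
        return n ? sum_list_tail(n->next, acc + n->value) : acc;
    }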


It looks to me like you can use -foptimize-sibling-calls to get it to happen on gcc below -O2.

There's also https://github.com/pietro/gcc-musttail-plugin to ensure it does happen (and clang has musttail support built in these days).


\o/ looks like that won't be needed much longer: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83324#c27


Not really; knowing the library, I suspect it's just many, many layers of trivial templated functions that normally get optimized away to nothing, but at -O0 bloat the code. -Og can sometimes help in these cases.



The premise of this website and articles like https://yosefk.com/blog/how-profilers-lie-the-cases-of-gprof... just shows that the authors are using the wrong tools. It is nowadays relatively easy to also look at off-CPU time when profiling with perf (e.g. https://github.com/KDAB/hotspot/?tab=readme-ov-file#off-cpu-...). The idea is to use sampling for the on-CPU periods and then combine that with the off-CPU time measured between context switches. VTune has also supported this mode for many years.
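
The rough shape of such a recording combines cycle samples with context-switch events, something along these lines (see the hotspot README for the exact flags it expects; the binary name is a placeholder):

    perf record --call-graph dwarf -e cycles -e sched:sched_switch \
        --switch-events ./myprog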


> The premise of this website and articles like https://yosefk.com/blog/how-profilers-lie-the-cases-of-gprof... just shows that the authors are using the wrong tools. It is nowadays relatively easy to also look at off-CPU time when profiling with perf (e.g. https://github.com/KDAB/hotspot/?tab=readme-ov-file#off-cpu-...).

I think, firstly, that spending 15s trying the Ctrl-C approach is a worthwhile tradeoff. If you don't find anything, then sure, spend another 30-60 minutes setting up perf, KDAB's hotspot, etc. Maybe more if you're on an embedded device.

Secondly, the author seems to say that he's used this on embedded devices with no output but a serial line for the debugger. This is also a 15s effort[1].

It's basically a very low effort task, takes seconds to determine if it worked or not, and if it doesn't work you've only lost a few seconds.

[1] I'm assuming that if you're developing on a device supporting a serial GDB connection, you've already got the debugger working.
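
For the non-embedded case, the whole 15-second experiment is essentially the classic "poor man's profiler" loop (process name is a placeholder):

    for i in $(seq 10); do
        gdb -batch -p "$(pidof myprog)" -ex 'thread apply all bt'
        sleep 1
    done > samples.txt

Sorting and counting the resulting backtraces gives you the same histogram as any sampling profiler, just with far fewer samples.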


perf is easily available through Yocto and Buildroot (and probably other embedded Linux image builders). hotspot can be downloaded as an AppImage. It should not take 30-60 minutes to set this up, but granted, learning the tools the first time always has some cost.

Furthermore, note how your reasoning is quite different from what the website you linked to says - it basically says "there are no good tools" (which is untrue) whereas you are saying "manual GDB sampling might be good enough and is easier to set up than a good tool" (which is certainly true).


the vast majority of embedded cpus cannot run yocto or indeed linux, even the arms

but they all support gdb


True, that's another good point. But again, this reasoning is very different from the one in the linked article and website - if you have oprofile or valgrind's cachegrind available, you clearly could get perf set up instead.

I'm not debating that manual GDB sampling has its place and value. I'm debating that perf is "lying" or that it's impossible to get hold of off-CPU samples, or profiling of multithreaded code in general.


yes, agreed


(well, not all)


kreinin spends a lot of time debugging things that don't run on linux or any cpu architecture linux or vtune supports. even on amd64 linux, perf is not so useful with python, lua, node.js, browser js, shell scripts, etc.


I kept waiting for the guy to actually paste the code somewhere


The default state of regulars in technical IRC channels :D


Signal handler integrations are underrated and great.




