NOT ANOTHER TRACER!!
I'm sure it's impressive engineering work, but why oh why...
How does it compare to Linux uprobes, which are built into Linux mainline? Bear in mind there are different front ends for uprobes (ftrace, perf_events, bcc, ...), and these are also still in development, so if one lacked certain features they needed, such features could be added. There's been a LOT of work in this area in the past 6 months, as well (see lkml).
If the goal was lowest overhead, then why compile with no-op sleds ("negligible overhead") instead of using dynamic tracing (literally "zero overhead" when not tracing)? Or, if the existing kernel-based dynamic tracers benchmarked poorly, then why not something like LTTng?
How does it compare to DTrace, as well? (Doesn't Google have some FreeBSD?).
All the tracers I mentioned can not only do dynamic tracing, but also instrument all user and kernel code, without special recompilation.
Most of XRay was written before uprobes was merged into the Linux kernel (and well before such kernels were widely available).
I don't think any of the alternatives you mentioned are Pareto superior to XRay when considering all of "speed while tracing", "speed while not tracing", and "flexibility".
E.g.:
- In "speed while tracing", anything that takes a context switch per traced function will probably be dramatically slower. Even if there's some fast dispatch mechanism you have in mind that I'm not familiar with when you say dynamic tracing, if it doesn't insert the moral equivalent of a nop-sled, it will have to either choose between logging the whole PC (spending data, which means spending RAM and disk time) or figuring out how to map it to a function-specific unique int (spending cycles).
- In "speed while not tracing", anything much more expensive than nop-sleds will be too slow to run in production.
- In "flexibility", anything that doesn't have a compile-time component probably won't be able to completely hook functions that get inlined, or whose source you aren't able to change, won't be able to pick out information the runtime wants to summarize from function arguments, etc.
To me, the neat thing about XRay isn't so much the "function patching" aspect, except insofar as it serves as a mechanism to execute arbitrary code at function entry or exit in a way that's runtime-customizable and very low overhead when you want it to be.
> "anything that takes a context switch per traced function will probably be dramatically slower."
Good thing uprobes don't context switch:
# perf stat -e context-switches -e probe_libc:re_search_internal sed '/./d' /mnt/data.txt
Performance counter stats for 'sed /./d /mnt/data.txt':
6 context-switches
15,122,432 probe_libc:re_search_internal
19.744738204 seconds time elapsed
You mean mode switch? Cheaper, but yes, still costly. Here's runtime without the probe:
# time sed '/./d' /mnt/data.txt
real 0m3.349s
user 0m3.345s
sys 0m0.004s
Which means we can calculate the cost to be ~1.1 us per probe (on my system). Anyone know what XRay is clocking in at?
AFAIK, LTTng has done work for user<->user instrumentation. I think uBPF will be doing this (https://github.com/iovisor/ubpf) - although that project is very new. Could use some help from some more good engineers (please do!).
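For anyone who wants to check the ~1.1 us figure, here is a minimal sketch of the arithmetic in Python. The three numbers are copied from the perf stat and time output quoted above; the variable and function names are made up for illustration, and it assumes all of the extra wall-clock time is attributable to the probe.

```python
# Figures copied from the perf stat / time output quoted in this comment.
elapsed_with_probe_s = 19.744738204   # "seconds time elapsed" with the uprobe enabled
elapsed_without_probe_s = 3.349       # "real" time for the same sed run, no probe
probe_hits = 15_122_432               # probe_libc:re_search_internal count


def per_probe_cost_us(with_probe_s, without_probe_s, hits):
    """Average added wall-clock time per probe hit, in microseconds.

    Assumes the whole slowdown comes from the probe firing.
    """
    return (with_probe_s - without_probe_s) / hits * 1e6


cost_us = per_probe_cost_us(elapsed_with_probe_s, elapsed_without_probe_s, probe_hits)
print(f"{cost_us:.2f} us per probe")  # prints "1.08 us per probe"
```

This is a single run on one machine, so treat it as a ballpark rather than a benchmark.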
> "In "speed while not tracing", anything much more expensive than nop-sleds will be too slow to run in production."
I'm not sure anyone is suggesting anything more than nop-sleds. Dynamic tracing is zero overhead when not tracing, and static tracing is nop-sleds.
> "probably won't be able to completely hook functions that get inlined"
Sure. Sometimes there are static tracing probes (nop-sled based), sometimes there aren't and it's dynamic probes, sometimes the functions you'd put those dynamic probes on are inlined and you walk up the stack to find one that isn't. If it is inlined, maybe you need to trace the address rather than the function entry.
In my experience it's pretty rare that something is just untraceable because inlining is so insane. But yes, it does happen sometimes. Usually I figure out a workaround before giving up.
> "a mechanism to execute arbitrary code at function entry or exit in a way that's runtime-customizable and very low overhead when you want it to be"
BPF! In-kernel virtual machine that runs JIT'd code on events, and is part of mainline Linux. Lots of enhancements in the Linux 4.x series.