I got run times from the simplest single-threaded directory walk that are only 1.8x slower than `git ls-files`. (Min time of 10 runs with the git repo housed in /dev/shm on Linux 5.15.)

The "simple" code is in https://github.com/c-blake/cligen/blob/master/cligen/dents.n... (just `dents find` does not require the special kernel batch system call module to be fast. That kernel module is more about statx batching but IO uring can also do that these days. For those unfamiliar with Nim, it's a high productivity, high performance systems language.)

I believe that GNU find is slow because it is specifically written to allow arbitrary filesystem depth as opposed to depth limited by the open-file-descriptor limit. (I personally consider this over-/mis-engineered, a holdover from bygone systems with default ulimits of, say, only 64 open fds. There should at least be a "fast mode", since, let's be real, file hierarchies deeper than 256..1024 levels, while possible, are rare; one should optimize for the common case, or at least allow the user to assert that the common case holds. AFAIK there is no such "fast mode". Maybe on some BSD? A sketch of the fd-limited idea follows.)
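
To make the trade-off concrete, here is a minimal sketch (my own illustration, not find's or dents' actual code) of the fd-limited approach: plain recursion via openat()/fdopendir() holds one directory fd open per level, so the reachable depth is bounded by the fd ulimit instead of being unbounded. Error handling is deliberately thin.

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void walk(int parent_fd, const char *name) {
        int fd = openat(parent_fd, name, O_RDONLY | O_DIRECTORY);
        if (fd < 0) return;            /* EMFILE here *is* the depth limit */
        DIR *d = fdopendir(fd);        /* DIR* takes ownership of fd */
        if (!d) { close(fd); return; }
        for (struct dirent *e; (e = readdir(d)); ) {
            if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                continue;
            puts(e->d_name);           /* a real tool would emit full paths */
            if (e->d_type == DT_DIR)   /* may be DT_UNKNOWN on some FSes */
                walk(fd, e->d_name);
        }
        closedir(d);                   /* frees the fd, i.e. one level */
    }

    int main(int argc, char **argv) {
        walk(AT_FDCWD, argc > 1 ? argv[1] : ".");
        return 0;
    }

Supporting arbitrary depth instead means either closing and re-opening ancestor directories as you descend or building absolute paths and re-resolving them, which is the extra machinery I suspect costs find its speed.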

Meanwhile, I think the Rust `fd` is slow because of (probably counterproductive) multi-threading (at least it makes 11,000 calls to futex).

> I believe that GNU find is slow because it is specifically written to allow arbitrary filesystem depth as opposed to depth limited by the open-file-descriptor limit.

I haven't benchmarked find specifically, but I believe the most common Rust library for the purpose, walkdir[1], also allows arbitrary filesystem recursion depth, and is extremely fast. It was fairly close to some "naive" limited-depth code I wrote in C for the same purpose. A lot of go-to C approaches seem to needlessly call stat on every file, so they're even slower.

I'd be curious to see benchmarks of whether this actually makes a difference.

[1] https://github.com/BurntSushi/walkdir

I cannot speak to why your "naive" C variant might have been slower than necessary. I might (wildly) guess that you did unnecessary string handling/allocation. You really just need one re-used buffer and a memcpy out of the dirents onto the tail of that buffer (or even directly into stdio's output buffer). With modern Linux filesystems you can use d_type to decide recursion, not a stat. EDIT: Output might also have been hamstrung by not using fwrite_unlocked on Linux. Really just wild guesses, though - see the sketch below. I can also say that https://github.com/google/walk, mentioned in other subthreads, is almost as fast as `dents find` and over 2x faster than GNU find on the same Linux git-tree problem (up to commit 923dcc5eb0c111eccd51cc7ce1658537e3c38b25, btw).
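
To be concrete about those guesses, here is a rough sketch (mine, not your code's or dents') of what I mean: one re-used path buffer, names memcpy'd from the dirent onto its tail, d_type instead of stat to decide recursion, and fwrite_unlocked for output. Bounds and error checks are elided.

    #define _GNU_SOURCE                 /* for fwrite_unlocked */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    static char path[4096];             /* one re-used path buffer */

    static void walk(size_t len) {      /* path[0..len) holds "dir/" */
        path[len] = '\0';
        DIR *d = opendir(path);
        if (!d) return;
        for (struct dirent *e; (e = readdir(d)); ) {
            if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                continue;
            size_t n = strlen(e->d_name);
            memcpy(path + len, e->d_name, n);   /* no per-entry malloc */
            path[len + n] = '\n';
            fwrite_unlocked(path, 1, len + n + 1, stdout);
            if (e->d_type == DT_DIR) {  /* decide recursion sans stat */
                path[len + n] = '/';
                walk(len + n + 1);
            }
        }
        closedir(d);
    }

    int main(void) {
        memcpy(path, "./", 2);
        walk(2);
        return 0;
    }

The point is that the hot loop never allocates and never stats: each entry costs a strlen, a memcpy, one lock-free fwrite, and at most one recursive opendir.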

It may actually be that because most ftw()s call stat (and so are quite slow, at least without some kernel magic like io_uring or sys_batch), the non-stat-calling mode is poorly optimized. In that context, avoiding the stat may seem like a more minor optimization.
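
For instance, the POSIX nftw() interface bakes the stat in: the callback is handed a filled-in struct stat for every entry, so the library pays for one whether or not the caller looks at it. A minimal example:

    #define _XOPEN_SOURCE 500           /* for nftw() */
    #include <ftw.h>
    #include <stdio.h>
    #include <sys/stat.h>

    static int visit(const char *path, const struct stat *sb,
                     int type, struct FTW *ftwbuf) {
        (void)sb; (void)type; (void)ftwbuf; /* stat was paid for regardless */
        puts(path);
        return 0;                           /* 0 = keep walking */
    }

    int main(int argc, char **argv) {
        /* 64 = max fds nftw may hold open; FTW_PHYS = don't follow links */
        return nftw(argc > 1 ? argv[1] : ".", visit, 64, FTW_PHYS);
    }

With that API contract there is no way for a name-only listing to skip the per-entry stat, so implementations have little incentive to optimize the path where it isn't needed.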