What does HackerNews think of Code-used-on-Daniel-Lemire-s-blog?
This is a repository for the code posted on my blog
Quoting the article "I use GCC 11 on an Ice Lake server. My source code is available.", linking to https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/... .
From the README at the top-level:
> Unless otherwise stated, I make no copyright claim on this code: you may consider it to be in the public domain.
> Don't bother forking this code: just steal it.
Is that not included in what's 6x faster? I agree comparing against strptime is a bit of a gimme and a scalar fixed-format parser would be fairer, but I'm pretty sure it will beat that too.
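To make the comparison concrete, a scalar fixed-format parser is roughly the following (the "YYYYMMDDhhmmss" layout and the helper names are my assumptions, not taken from the post): no format string to interpret, just fixed offsets, which is why strptime is such an easy target.
// Hypothetical scalar fixed-format parser for a "YYYYMMDDhhmmss" timestamp.
static int two_digits(const char *p) { return (p[0] - '0') * 10 + (p[1] - '0'); }

static void parse_fixed(const char *s, int *y, int *mo, int *d,
                        int *h, int *mi, int *sec) {
    *y   = two_digits(s) * 100 + two_digits(s + 2);
    *mo  = two_digits(s + 4);
    *d   = two_digits(s + 6);
    *h   = two_digits(s + 8);
    *mi  = two_digits(s + 10);
    *sec = two_digits(s + 12);
}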
Edit to reply since rate-limited:
> No mention of calendar in either place. Did you see something in the article or code, or are you just asking us to check?
The calendar handling (so far, the lack of it) is mentioned in the blog post - "It does not get all the parsing done... We can just use standard C code for the result." - and is plainly visible in the GitHub code.
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
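For what the "standard C code for the result" step might look like, turning year/month/day into a day count can be done with Howard Hinnant's days-from-civil formula; this is a generic sketch, not a claim about what the repository actually does.
// Days since 1970-01-01 for a proleptic Gregorian date (Hinnant's days_from_civil).
#include <stdint.h>

static int64_t days_from_civil(int64_t y, unsigned m, unsigned d) {
    y -= m <= 2;                                                   // shift so the year starts in March
    int64_t era = (y >= 0 ? y : y - 399) / 400;
    unsigned yoe = (unsigned)(y - era * 400);                      // [0, 399]
    unsigned doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; // [0, 365]
    unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;          // [0, 146096]
    return era * 146097 + (int64_t)doe - 719468;                   // 719468 shifts the epoch to 1970-01-01
}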
However, the code at https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/... shows it's indeed at line 6, because it has an extra newline after the '#include '.
0: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
for (size_t i = 0; i < N; i++) {
    out[i++] = g();
}
N is 20000 and the measured time is divided by N. [1] However, that loop has two increments and only computes 10000 numbers. This is also visible in the assembly:
add x8, x8, #2
So if I see this correctly, the results are off by a factor of 2.
[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
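For comparison, a version that actually fills all N slots (same names as in the snippet above) would be:
for (size_t i = 0; i < N; i++) {
    out[i] = g();   // single increment: N calls, N stores
}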
On Godbolt, clang compiles the original loop to:
.LBB5_2: // =>This Inner Loop Header: Depth=1
mul x13, x11, x10
umulh x14, x11, x10
eor x13, x14, x13
mul x14, x13, x12
umulh x13, x13, x12
eor x13, x13, x14
str x13, [x0, x8, lsl #3]
add x8, x8, #2 // =2
cmp x8, x1
add x11, x11, x9
b.lo .LBB5_2
[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
Great work on reducing the size of the character conversion tables, but I don't see any evidence for the title of the web page.
"Compact character case conversion" is more appropriate currently based on the data given in the readme.
Data can be greatly compressed at the cost of access speed, or the compressed data could even speed up access; given the complexity of today's x86 CPUs, I wouldn't bet on determining the speed from a quick glance at the code.
You state the case conversion is fast, yet I don't see any benchmarks backing up the assertion.
I'd expect a few different types of input to be benchmarked - ASCII upper/lowercase alphabetic chars (A-Z, a-z, with space and punctuation thrown in to resemble typical written-language text), Unicode upper/lowercase characters et al., and random characters.
Not gonna leave people in the lurch though - Daniel Lemire has a recent post which can be used as a scaffold for benchmarking.
Code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
Post: https://lemire.me/blog/2021/02/09/on-the-cost-of-converting-...
If the code is to be timed on Windows, it will need a Windows version of "clock_gettime(CLOCK_REALTIME, ...)" - see https://stackoverflow.com/questions/5404277/porting-clock-ge...
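A minimal sketch of such a shim, assuming QueryPerformanceCounter is an acceptable stand-in for CLOCK_REALTIME when all you need is elapsed time, and a toolchain whose time.h provides struct timespec (recent MSVC and MinGW do):
#ifdef _WIN32
#include <windows.h>
#include <time.h>

/* Fill ts from the high-resolution performance counter (monotonic, not wall-clock). */
static int clock_gettime_win(struct timespec *ts) {
    LARGE_INTEGER freq, count;
    if (!QueryPerformanceFrequency(&freq) || !QueryPerformanceCounter(&count))
        return -1;
    ts->tv_sec  = (time_t)(count.QuadPart / freq.QuadPart);
    ts->tv_nsec = (long)((count.QuadPart % freq.QuadPart) * 1000000000LL / freq.QuadPart);
    return 0;
}
#endif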
If you only look at the article this is true. However, the source code is freely available: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
Different benchmarks favor different processors… Here's another one, now multi-threaded
My quick run of `sysbench cpu` (something with prime numbers) shows:
Skylark 3.3GHz 32-core (at Packet; Ubuntu 18.04): events per second: 44287.87
Zen 3.85GHz 8c16t (my desktop; FreeBSD 13-CURRENT): events per second: 14632.61
And with a single thread, 1386.24 vs 1740.91. Divided by clock speed (1386.24/3.3 ≈ 420 vs 1740.91/3.85 ≈ 452 events per second per GHz), it's about 93% of the performance.
As for loading large constants, if you read the post and follow the link at "reuse my benchmark" (https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...) you will see that these functions, as measured, sit inside hot loops. As such, constant loading is very likely to be hoisted out of these loops on both architectures.
This will make the considerably slower UMULH stick out like a sore thumb. Also note that the measurement loop allows most of the work of each iteration to be done in parallel - the work of the rng is a long dependency chain within the calculation but the update of the seed is quick and independent of that.
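To make that concrete, the measured loop has roughly this shape (my paraphrase of a wyhash-style generator; STRIDE, MUL1 and MUL2 stand in for the generator's 64-bit constants, and __uint128_t is the GCC/Clang extension - this is not the exact code from the repository):
uint64_t seed = 0;
for (size_t i = 0; i < N; i++) {
    seed += STRIDE;                                // cheap, and the only loop-carried dependency
    __uint128_t t = (__uint128_t)seed * MUL1;      // 64x64->128; the high half is the UMULH
    uint64_t m = (uint64_t)t ^ (uint64_t)(t >> 64);
    t = (__uint128_t)m * MUL2;                     // second multiply and fold
    out[i] = (uint64_t)t ^ (uint64_t)(t >> 64);    // long chain, but contained within the iteration
}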
I would guess that the Ampere box has a wretchedly slow multiply. In a comment on the post, Daniel finds an ugly performance corner on A57 (possibly related, possibly not): "On a Cortex A57 processor, to compute the most significant 64 bits of a 64-bit product, you must use the multiply-high instructions (umulh and smulh), but they require six cycles of latency and they prevent the execution of other multi-cycle instructions for an additional three cycles."
Also, he does point out in this paragraph that there is nothing inherently slower about the functional approach: "All these functions achieve nearly the same best performance if you disable safety checks at build time (swift build --configuration release -Xswiftc -Ounchecked). This means, in particular, that we get the fancy abstraction (the reduce approach) for free, as long as we disable safety checks. That makes performance a lot easier to understand."
See the Rust benchmarks posted here.
Send him a pull request, and I'd bet he'd be happy to include them for comparison and mention them in an addendum: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
$ swift build --configuration release
$ cset proc -s nohz -e .build/release/reduce
# count (basic, reduce, unsafe basic, unsafe reduce)
1000 (0.546, 0.661, 0.197, 0.576)
10000 (0.403, 0.598, 0.169, 0.544)
100000 (0.391, 0.595, 0.194, 0.542)
1000000 (0.477, 0.663, 0.294, 0.582)
10000000 (0.507, 0.655, 0.337, 0.608)
100000000 (0.509, 0.655, 0.339, 0.608)
1000000000 (0.511, 0.656, 0.345, 0.611)
$ swift build --configuration release -Xswiftc -Ounchecked
$ cset proc -s nohz -e .build/release/reduce
# count (basic, reduce, unsafe basic, unsafe reduce)
1000 (0.309, 0.253, 0.180, 0.226)
10000 (0.195, 0.170, 0.168, 0.170)
100000 (0.217, 0.203, 0.196, 0.201)
1000000 (0.292, 0.326, 0.299, 0.252)
10000000 (0.334, 0.337, 0.333, 0.337)
100000000 (0.339, 0.339, 0.340, 0.339)
1000000000 (0.344, 0.344, 0.344, 0.344)
Code is from https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/... with a modification to loop over the different array lengths. Numbers are for Skylake at 3.4 GHz with swift-3.0.1-RELEASE-ubuntu16.04. Count is the number of 8B ints in the array being summed. Results shown were truncated by hand - I wasn't sure how to specify precision from within Swift. The execution with "cset proc -s nohz" was to reduce jitter between runs, but doesn't significantly affect total run time. The anomalously fast result for the L3-sized 'unsafe' 'unchecked' case is consistent.
Even if they're doing whole-program analysis, operator<<() and WallClockTimer.split() are side-effectful, and I believe that means it would be invalid for the compiler to find the current time before the first section of the print statement has finished.
Fwiw, I just ran it after fixing the time call, and it doesn't seem to significantly change the result on my (low-spec) machine.