What does HackerNews think of Code-used-on-Daniel-Lemire-s-blog?

This is a repository for the code posted on my blog

Language: C

You should also worry about how other people's time is wasted when you miss important details and then comment about worries that are easily assuaged.

Quoting the article "I use GCC 11 on an Ice Lake server. My source code is available.", linking to https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/... .

From the README at the top-level:

> Unless otherwise stated, I make no copyright claim on this code: you may consider it to be in the public domain.

> Don't bother forking this code: just steal it.

> omits the massively expensive calendar conversion step

Is that not included in what's 6x faster? I agree comparing against strptime is a bit of a gimme and a scalar fixed-format parser would be fairer, but I'm pretty sure it will beat that too.
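For a sense of what such a scalar fixed-format parser looks like, here is a rough sketch assuming a "YYYY-MM-DD hh:mm:ss" layout (the function names are made up and validation is omitted):

    #include <stdbool.h>

    /* Sketch of a scalar fixed-format parser for "YYYY-MM-DD hh:mm:ss".
       Names are illustrative; a real parser would validate digits and
       separators instead of trusting the layout. */
    static inline int two_digits(const char *p) {
      return (p[0] - '0') * 10 + (p[1] - '0');
    }

    static bool parse_fixed_timestamp(const char *s, int *y, int *mo, int *d,
                                      int *h, int *mi, int *sec) {
      *y   = two_digits(s) * 100 + two_digits(s + 2); /* YYYY */
      *mo  = two_digits(s + 5);                       /* MM   */
      *d   = two_digits(s + 8);                       /* DD   */
      *h   = two_digits(s + 11);                      /* hh   */
      *mi  = two_digits(s + 14);                      /* mm   */
      *sec = two_digits(s + 17);                      /* ss   */
      return true;
    }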

Edit to reply since rate-limited:

> No mention of calendar in either place. Did you see something in the article or code, or are you just asking us to check?

The calendar step's absence (so far) is mentioned in the blog post - "It does not get all the parsing done... We can just use standard C code for the result." - and is plainly visible in the GitHub code.

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
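For readers wondering what that "standard C code" calendar step involves, here is a generic sketch using the well-known days-from-civil algorithm - it only illustrates the step, and is not necessarily the code in the repo:

    #include <stdint.h>

    /* Days since 1970-01-01 for a Gregorian year/month/day.  This is the
       standard days-from-civil algorithm, shown to illustrate the calendar
       conversion the post defers to plain C code. */
    static int64_t days_from_civil(int64_t y, unsigned m, unsigned d) {
      y -= m <= 2;  /* treat Jan/Feb as trailing months of the prior year */
      int64_t era = (y >= 0 ? y : y - 399) / 400;
      unsigned yoe = (unsigned)(y - era * 400);                        /* [0, 399] */
      unsigned doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;   /* [0, 365] */
      unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;            /* [0, 146096] */
      return era * 146097 + (int64_t)doe - 719468;  /* shift epoch to 1970-01-01 */
    }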

A newline is missing in the example code. As given, there's a segfault at line 5, not line 6.

However, the code at https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/... shows it's indeed at line 6, because it has an extra newline after the '#include '.

There's a copy of the loop used in the escape function inside the avx512_escape function [0]. Is it needed, or just a copy-and-paste mistake? (I know nothing about vector instructions.)

0: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

This is the benchmarking loop:

  for (size_t i = 0; i < N; i++) {
    out[i++] = g();
  }
N is 20000 and the measured time is divided by N. [1] However, that loop increments i twice per iteration, so it only computes 10000 numbers.

This is also visible in the assembly:

  add     x8, x8, #2
So if I see this correctly, the results are off by a factor of 2.

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
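To make the factor-of-2 point concrete, here is the loop as posted next to a corrected version (assuming g() is the 64-bit generator under test):

    #include <stddef.h>
    #include <stdint.h>

    extern uint64_t g(void);  /* the generator under test; signature assumed */

    /* As posted: i is bumped both by the for-statement and by out[i++], so
       only every other slot is written and just N/2 values are produced. */
    void fill_buggy(uint64_t *out, size_t N) {
      for (size_t i = 0; i < N; i++) {
        out[i++] = g();
      }
    }

    /* With a single increment, N values are produced, so dividing the
       elapsed time by N gives the true per-value cost. */
    void fill_fixed(uint64_t *out, size_t N) {
      for (size_t i = 0; i < N; i++) {
        out[i] = g();
      }
    }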

The blog post links to the benchmark... It's repeatedly populating a 20k entry array with the results.

godbolt clang compiles it to:

    .LBB5_2:                                // =>This Inner Loop Header: Depth=1
        mul     x13, x11, x10
        umulh   x14, x11, x10
        eor     x13, x14, x13
        mul     x14, x13, x12
        umulh   x13, x13, x12
        eor     x13, x13, x14
        str     x13, [x0, x8, lsl #3]
        add     x8, x8, #2                      // =2
        cmp     x8, x1
        add     x11, x11, x9
        b.lo    .LBB5_2
[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
Based on the title, I thought you were Daniel Lemire with another writeup benchmarking character conversion of some sort - for examples, see https://hn.algolia.com/?q=lemire.me.

Great work on reducing the size of the character conversion tables, but I don't see any evidence for the title of the web page.

"Compact character case conversion" is more appropriate currently based on the data given in the readme.

Data can be greatly compressed at the cost of access speed, or the compression could actually speed up access - with x86 CPU complexity these days, I wouldn't bet on determining the speed from a quick glance at the code.

You state the case conversion is fast, yet I don't see any benchmarks backing up the assertion.

I'd expect a few different types of input to be benchmarked - ASCII upper/lower-case alphabet chars (typical A-Z, a-z, with space/punctuation thrown in to look like typical written-language text), Unicode upper/lower case et al., and random characters.

Not gonna leave people in the lurch though - Daniel Lemire has a recent post which can be used as a scaffold for benchmarking.

Code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

Post: https://lemire.me/blog/2021/02/09/on-the-cost-of-converting-...

If the code is to be timed on Windows, it will need a Windows version of "clock_gettime(CLOCK_REALTIME, ...)" - see https://stackoverflow.com/questions/5404277/porting-clock-ge...
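As a starting point, here is a minimal sketch of such a shim built on QueryPerformanceCounter - the name win_clock_gettime is made up, and struct timespec requires MSVC 2015+ or MinGW:

    #ifdef _WIN32
    #include <windows.h>
    #include <time.h>

    /* Stand-in for clock_gettime(CLOCK_REALTIME, ...) on Windows.  Fine for
       measuring benchmark intervals, though QueryPerformanceCounter is a
       monotonic tick count, not true wall-clock time. */
    static int win_clock_gettime(struct timespec *ts) {
      LARGE_INTEGER freq, count;
      QueryPerformanceFrequency(&freq);  /* ticks per second */
      QueryPerformanceCounter(&count);   /* ticks since an arbitrary start */
      ts->tv_sec  = (time_t)(count.QuadPart / freq.QuadPart);
      ts->tv_nsec = (long)((count.QuadPart % freq.QuadPart) * 1000000000LL / freq.QuadPart);
      return 0;
    }
    #endif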

> There's not enough info to figure out what's going on.

If you only look at the article this is true. However, the source code is freely available: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

The current winner is a SIMD-ish solution using 32-bit math as four 8-bit lanes. This has been added to the repo, and is discussed in the comments.

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
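To illustrate the "32-bit math as four 8-bit lanes" (SWAR) idea in general - this is not the repo's winning solution, just a self-contained example of the technique applied to ASCII lowercasing:

    #include <stdint.h>

    /* Treat a 32-bit word as four 8-bit lanes and set the 0x20 case bit on
       every byte in 'A'..'Z', leaving all other bytes untouched. */
    static uint32_t tolower_swar(uint32_t w) {
      const uint32_t ones = 0x01010101u;
      uint32_t heptets = w & (0x7Fu * ones);                /* clear high bits so lanes can't carry */
      uint32_t gt_Z = heptets + (0x7Fu - 'Z') * ones;       /* high bit set where byte > 'Z' */
      uint32_t ge_A = heptets + (0x80u - 'A') * ones;       /* high bit set where byte >= 'A' */
      uint32_t upper = ge_A & ~gt_Z & ~w & (0x80u * ones);  /* uppercase ASCII bytes only */
      return w | (upper >> 2);                              /* move the 0x80 flag to the 0x20 case bit */
    }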

His code https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/... does not use OpenMP.

Different benchmarks favor different processors… Here's another one, now multi-threaded.

My quick run of `sysbench cpu` (something with prime numbers) shows:

Skylark 3.3GHz 32core (at Packet; Ubuntu 18.04):

    events per second: 44287.87
Zen 3.85GHz 8c16t (my desktop; FreeBSD 13-CURRENT):

    events per second: 14632.61
And with a single thread, 1386.24 vs 1740.91. Divided by clock speed (1386.24/3.3 ≈ 420 vs 1740.91/3.85 ≈ 452 events per second per GHz), the Skylark delivers about 93% of the Zen's per-clock performance.
I think Daniel's use of the word "separate" in "separate and expensive" is ill-advised, as it implies a critique of ARM's ISA design in a way that isn't relevant for this case. One might be concerned if one needed all 128 bits for some other use, but not here.

As for loading large constants, if you read the post and follow the link at "reuse my benchmark" (https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...) you will see that these functions, as measured, are inside hot loops. As such, constant loading is very likely to be hoisted out of these loops on both architectures.

This will make the considerably slower UMULH stick out like a sore thumb. Also note that the measurement loop allows most of the work of each iteration to be done in parallel - the work of the RNG is a long dependency chain within each calculation, but the update of the seed is quick and independent of it.
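A rough sketch of that structure, assuming a wyrand-style generator like the one the assembly above suggests (constants are illustrative, and __uint128_t is a GCC/Clang extension):

    #include <stdint.h>

    /* The seed update is a single add - cheap and independent across
       iterations - while producing the value needs two full 64x64->128-bit
       multiplies (mul + umulh on ARM) plus xors: the long dependency chain. */
    static inline uint64_t wyrand_like(uint64_t *seed) {
      *seed += 0xa0761d6478bd642fULL;                   /* quick, independent seed update */
      __uint128_t p = (__uint128_t)(*seed) * 0xe7037ed1a0b428dbULL;
      uint64_t x = (uint64_t)p ^ (uint64_t)(p >> 64);   /* mul + umulh + eor */
      __uint128_t q = (__uint128_t)x * 0xe7037ed1a0b428dbULL;
      return (uint64_t)q ^ (uint64_t)(q >> 64);         /* second mul + umulh + eor */
    }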

I would guess that the Ampere box has a wretchedly slow multiply. In a comment on the post, Daniel finds an ugly performance corner on A57 (possibly related, possibly not): "On a Cortex A57 processor, to compute the most significant 64 bits of a 64-bit product, you must use the multiply-high instructions (umulh and smulh), but they require six cycles of latency and they prevent the execution of other multi-cycle instructions for an additional three cycles."

Given that the title is "Don’t assume that safety comes for free: a Swift case study", I don't think he's being particularly misleading. "Don't assume" usually just means that one should not take something for granted, not that it is never true. And doesn't appending "a Swift case study" make it fairly clear that he is talking specifically about this case in Swift?

Also, he does point out in this paragraph that there is nothing inherently slower about the functional approach: "All these functions achieve nearly the same best performance if you disable safety checks at build time (swift build --configuration release -Xswiftc -Ounchecked). This means, in particular, that we get the fancy abstraction (the reduce approach) for free, as long as we disable safety checks. That makes performance a lot easier to understand."

See the Rust benchmarks posted here.

Send him a pull request, and I'd bet he'd be happy to include them for comparison and mention them in an addendum: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

I thought it might be interesting to see how this effect changes with the size of the array being summed. How do the relative speeds change when operating out of L1, L3, and memory? Does the lower speed of memory access overwhelm the overhead of the overflow checking?

  $ swift build --configuration release
  $ cset proc -s nohz -e .build/release/reduce

  # count  (basic, reduce, unsafe basic, unsafe reduce)
  1000      (0.546, 0.661, 0.197, 0.576)
  10000     (0.403, 0.598, 0.169, 0.544)
  100000    (0.391, 0.595, 0.194, 0.542)
  1000000   (0.477, 0.663, 0.294, 0.582)
  10000000  (0.507, 0.655, 0.337, 0.608)
  100000000 (0.509, 0.655, 0.339, 0.608)
  1000000000(0.511, 0.656, 0.345, 0.611)

  $ swift build --configuration release  -Xswiftc -Ounchecked
  $ cset proc -s nohz -e .build/release/reduce

  # count  (basic, reduce, unsafe basic, unsafe reduce)
  1000      (0.309, 0.253, 0.180, 0.226)
  10000     (0.195, 0.170, 0.168, 0.170)
  100000    (0.217, 0.203, 0.196, 0.201)
  1000000   (0.292, 0.326, 0.299, 0.252)
  10000000  (0.334, 0.337, 0.333, 0.337)
  100000000 (0.339, 0.339, 0.340, 0.339)
  1000000000(0.344, 0.344, 0.344, 0.344)
Code is from https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/... with modification to loop over the different array lengths. Numbers are for Skylake at 3.4 GHz with swift-3.0.1-RELEASE-ubuntu16.04. Count is the number of 8-byte ints in the array being summed. Results shown were truncated by hand --- I wasn't sure how to specify precision from within Swift. The execution with "cset proc -s nohz" was to reduce jitter between runs, but doesn't significantly affect total run time. The anomalously fast result for the L3-sized 'unsafe' 'unchecked' case is consistent across runs.
Here's the link to the github account where the code is: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

Even if they're doing whole-program analysis, operator<<() and WallClockTimer.split() are side-effectful, and I believe that means it would be invalid for the compiler to read the current time before the first section of the print statement has finished.

Code is at https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

Fwiw, I just ran it after fixing the time call, and it doesn't seem to significantly change the result on my (low-spec) machine.