Is the article saying that the M1 is slower than we would have expected in this case?
My understanding, based on the article, is that on a normal processor, we would expect arr[idx] + arr[idx+1] and arr[idx] alone to take the same amount of time.
But the M1 is so parallelized that it issues the loads for arr[idx] and arr[idx+1] separately, so we have to wait for both of them to return. Meanwhile, a less parallelized processor would fetch arr[idx] first, wait for it to return, and then realize it already had arr[idx+1] in hand without needing a second fetch.
Am I understanding this right?
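For concreteness, here is roughly the shape of loop I have in mind -- my own sketch, not the code from the article (the array size, element type, and index generation are all my guesses):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 24)   /* assumed size; the article doesn't say */

    int main(void) {
        uint32_t *arr = malloc(((size_t)N + 1) * sizeof *arr);
        if (!arr) return 1;
        uint64_t seed = 12345;
        for (uint32_t i = 0; i <= N; i++) {
            /* simple LCG so the random indexes cover the whole range */
            seed = seed * 6364136223846793005ull + 1442695040888963407ull;
            arr[i] = (uint32_t)(seed >> 33) % N;
        }

        /* One dependent load per step: the next index comes from the
           previous load, so successive steps cannot overlap. */
        uint32_t idx = 0;
        clock_t t0 = clock();
        for (uint32_t i = 0; i < N; i++)
            idx = arr[idx];
        clock_t t1 = clock();

        /* Two loads per step: a core with more memory-level parallelism
           can issue arr[jdx] and arr[jdx+1] at the same time. */
        uint32_t jdx = 0;
        clock_t t2 = clock();
        for (uint32_t i = 0; i < N; i++)
            jdx = (arr[jdx] + arr[jdx + 1]) % N;  /* mod keeps it in range */
        clock_t t3 = clock();

        /* printing idx/jdx keeps the loops from being optimized away */
        printf("one load:  %.3fs (idx=%u)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, idx);
        printf("two loads: %.3fs (jdx=%u)\n", (double)(t3 - t2) / CLOCKS_PER_SEC, jdx);
        free(arr);
        return 0;
    }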
>> My understanding, based on the article, is that on a normal processor, we would expect arr[idx] + arr[idx+1] and arr[idx] alone to take the same amount of time.
That depends. If the two accesses land on the same cache line, then yes. But since idx is random, sometimes they won't. He never says how big array[] is, either in element count or element size.
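To make the cache-line point concrete: with 64-byte lines and 4-byte elements (both assumptions, since the article doesn't give sizes), only 1 in 16 consecutive pairs straddles a line boundary:

    #include <stdio.h>

    int main(void) {
        const size_t line = 64, elem = 4;  /* assumed sizes, not from the article */
        size_t total = line / elem, sharing = 0;
        for (size_t idx = 0; idx < total; idx++) {
            /* line index of each element's starting byte */
            size_t a = (idx * elem) / line;
            size_t b = ((idx + 1) * elem) / line;
            if (a == b) sharing++;
        }
        printf("%zu of %zu consecutive pairs share a cache line\n", sharing, total);
        return 0;
    }

If the M1's lines are 128 bytes, as is often reported, straddling is rarer still; the occasional split pair is where any extra fetch would show up.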
I thought DRAM also had the ability to stream out consecutive addresses. If so, it looks like Apple could be missing out here.
Then again, if his array fits in cache, he's just measuring instruction counts. His random indexes need to cover that whole range, too. There's not enough info to figure out what's going on.
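As a back-of-the-envelope check (all sizes here are my guesses, not from the article): the array footprint has to comfortably exceed the last-level cache, and the random indexes have to span the whole array, before you're measuring memory at all.

    #include <stdio.h>

    int main(void) {
        size_t n = 1u << 24;          /* hypothetical element count */
        size_t elem = 4;              /* hypothetical element size in bytes */
        size_t footprint = n * elem;  /* 64 MiB with these guesses */
        size_t cache = 12u << 20;     /* assumed: M1's 12 MiB shared P-core L2;
                                         the real hierarchy also has an SLC */
        printf("footprint %zu MiB vs cache ~%zu MiB: measuring %s\n",
               footprint >> 20, cache >> 20,
               footprint > cache ? "memory" : "cache");
        return 0;
    }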
If you only look at the article this is true. However, the source code is freely available: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...