Q: Does this make it even more urgent for git to move to a different hash?
It adds to the weight of the argument, but there isn't a big issue. This article (https://www.zdnet.com/article/linus-torvalds-on-sha-1-and-gi...) and the linked email (https://marc.info/?l=git&m=148787047422954) both seem to still apply.
Further details as to why Torvalds is not concerned:
From the email...
"I haven't seen the attack yet, but git doesn't actually just hash the data, it does prepend a type/length field to it. That usually tends to make collision attacks much harder, because you either have to make the resulting size the same too, or you have to be able to also edit the size field in the header."
[...]
"I haven't seen the attack details, but I bet
(a) the fact that we have a separate size encoding makes it much harder to do on git objects in the first place
(b) we can probably easily add some extra sanity checks to the opaque data we do have, to make it much harder to do the hiding of random data that these attacks pretty much always depend on."
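For reference, the type/length header Linus mentions can be sketched in a few lines of Python. This is a minimal illustration, assuming the blob header layout `blob <size>\0` followed by the raw contents; `git_blob_sha1` is a hypothetical helper name, not part of git.

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Hash data the way git hashes a blob: header first, then contents."""
    # Header is the object type, a space, the decimal size, and a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# The empty blob has a well-known object id.
print(git_blob_sha1(b""))
# A plain SHA-1 of the same data gives a different digest.
print(hashlib.sha1(b"").hexdigest())
```

The result should match what `git hash-object` prints for a file with the same contents.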
That first quote is misleading. Git's hashing scheme doesn't make the attack "much harder". First, the two files in the original shattered collision already have the same length:
$ curl -s https://shattered.io/static/shattered-1.pdf | wc -c
422435
$ curl -s https://shattered.io/static/shattered-2.pdf | wc -c
422435
Second, the length is already hashed along with the content whenever a SHA-1 hash is computed. Look up the Merkle-Damgård construction: https://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_co...
There is a benefit to also storing the length as a prefix, since it lets you avoid length extension attacks, but that doesn't make collision attacks "much harder".
Yeah, that quote doesn't exactly make me confident about Linus's understanding of this particular issue.
It's not about it being the same length, but the length of the data being part of the hashed data, which, Linus assumes, will likely make it more difficult to find a collision. He even says at the beginning that he hasn't had a look at the attack yet and is just making an assumption.
> It's not about it being the same length, but the length of the data being part of the hashed data
As I tried to point out, the length is already part of what the SHA-1 function hashes:
https://tools.ietf.org/html/rfc3174#section-4
As a summary, a "1" followed by m "0"s followed by a 64-
bit integer are appended to the end of the message to produce a
padded message of length 512 * n. The 64-bit integer is the length
of the original message. The padded message is then processed by the
SHA-1 as n 512-bit blocks.
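The padding rule quoted above can be sketched as follows. This is only an illustration of the RFC 3174 summary for byte-aligned messages; `sha1_padding` is a hypothetical helper, not a function from hashlib.

```python
def sha1_padding(message_len: int) -> bytes:
    """Return the bytes SHA-1 appends to a message of message_len bytes."""
    # The "1" bit followed by seven "0" bits is the single byte 0x80.
    # Enough zero bytes follow to leave exactly 8 bytes in the last block.
    zero_bytes = (55 - message_len) % 64
    # The final 64-bit big-endian integer is the message length in bits.
    length_field = (message_len * 8).to_bytes(8, "big")
    return b"\x80" + b"\x00" * zero_bytes + length_field

# Both shattered PDFs are 422435 bytes, so they receive identical padding.
pad = sha1_padding(422435)
print((422435 + len(pad)) % 64)
print(int.from_bytes(pad[-8:], "big"))
```

Since the length always ends up in the final block, two messages of equal length get the same padding, which is why equal-length collisions such as the shattered PDFs are unaffected by it.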
Now, storing the length as a prefix does give you advantages: you can't mount a length extension attack, which limits your ability to reuse one shattered collision, e.g. the PDFs released by Google, for different files or types of files. But it doesn't make mounting a novel shattered attack "much harder" as Linus claims.
> But it doesn't make mounting a novel shattered attack "much harder" as Linus claims.
From what I understood the core of Linus' argument[1] is that it's very hard to make a "bad" variant of the code which has the same length _and_ the same hash while still looking like sane code. For random data files, sure those are more at risk.
That is very hard, but it is not what was quoted above, and the length plays no part in it. The core of the shattered collision attack is essentially a block of binary data:
$ curl -s https://shattered.io/static/shattered-1.pdf | hexdump -C > s1
$ curl -s https://shattered.io/static/shattered-2.pdf | hexdump -C > s2
$ diff s1 s2
13,20c13,20
< 000000c0 73 46 dc 91 66 b6 7e 11 8f 02 9a b6 21 b2 56 0f |sF..f.~.....!.V.|
< 000000d0 f9 ca 67 cc a8 c7 f8 5b a8 4c 79 03 0c 2b 3d e2 |..g....[.Ly..+=.|
< 000000e0 18 f8 6d b3 a9 09 01 d5 df 45 c1 4f 26 fe df b3 |..m......E.O&...|
< 000000f0 dc 38 e9 6a c2 2f e7 bd 72 8f 0e 45 bc e0 46 d2 |.8.j./..r..E..F.|
< 00000100 3c 57 0f eb 14 13 98 bb 55 2e f5 a0 a8 2b e3 31 |<W......U....+.1|
[...]
---
> 000000c0 7f 46 dc 93 a6 b6 7e 01 3b 02 9a aa 1d b2 56 0b |.F....~.;.....V.|
> 000000d0 45 ca 67 d6 88 c7 f8 4b 8c 4c 79 1f e0 2b 3d f6 |E.g....K.Ly..+=.|
> 000000e0 14 f8 6d b1 69 09 01 c5 6b 45 c1 53 0a fe df b7 |..m.i...kE.S....|
> 000000f0 60 38 e9 72 72 2f e7 ad 72 8f 0e 49 04 e0 46 c2 |`8.rr/..r..I..F.|
> 00000100 30 57 0f e9 d4 13 98 ab e1 2e f5 bc 94 2b e3 35 |0W...........+.5|
> 00000110 42 a4 80 2d 98 b5 d7 0f 2a 33 2e c3 7f ac 35 14 |B..-....*3....5.|
> 00000120 e7 4d dc 0f 2c c1 a8 74 cd 0c 78 30 5a 21 56 64 |.M..,..t..x0Z!Vd|
> 00000130 61 30 97 89 60 6b d0 bf 3f 98 cd a8 04 46 29 a1 |a0..`k..?....F).|
An ASCII-formatted file contains only text data. Also, with the shattered attack you can't choose what the two colliding blocks look like, so the rest of the file has to inspect the differing binary data to turn some functionality on or off. So the attack is mostly interesting when the file can include binary data. With a chosen-prefix attack, you can have two arbitrary prefixes, even textual ones, but they still have to be followed by such a binary component.
Also, git now has collision detection code from sha1collisiondetection [1], making attacks even harder.
[1]: https://github.com/cr-marcstevens/sha1collisiondetection
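To illustrate why the binary collision blocks would stand out in source code, here is a rough sketch that counts bytes outside printable ASCII and common whitespace. The helper name and the choice of "text-like" bytes are assumptions made for illustration; this is not how git or sha1collisiondetection detect collisions. The sample bytes are the row at offset 000000c0 of shattered-1.pdf from the hexdump diff in this thread.

```python
# Printable ASCII plus tab, newline, and carriage return (an assumed notion of "text").
TEXT_BYTES = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

def count_non_text_bytes(data: bytes) -> int:
    """Count bytes that would not appear in a plain ASCII source file."""
    return sum(1 for b in data if b not in TEXT_BYTES)

collision_row = bytes.fromhex("73 46 dc 91 66 b6 7e 11 8f 02 9a b6 21 b2 56 0f")
source_line = b"int main(void) { return 0; }\n"

print(count_non_text_bytes(collision_row))
print(count_non_text_bytes(source_line))
```

A check like this is in the spirit of the "extra sanity checks" from the quoted email, though it is only useful for files that are expected to be pure text.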