How feasible is it to store raw content in the Git content-addressable-store (CAS)? Git blobs are Zlib compressed.

I'd like to be able to store audio files uncompressed, so that they could be read directly from the CAS, rather than having to be expanded out into a checkout directory.

IIRC, a git blob has the size of the data encoded in the first 4 bytes of the file, and the data itself appended to it. It could be stored uncompressed, but I don't think there's anything in the git plumbing layer that could deal with it directly.

That said, even if it is compressed, a command like git cat-file could be used to pipe the contents of the file to stdout or any other program that could use them as input without having to create a file on disk.
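A minimal sketch of that streaming idea, assuming `git` is installed. The temporary repository and the variable names (`repo`, `oid`, `streamed_bytes`) are made up for illustration, and `wc -c` stands in for whatever program would consume the audio:

```shell
# Assumption: git is on PATH; the throwaway repository is only for illustration.
repo=$(mktemp -d)
git init -q "$repo"
printf 'hello world\n' > "$repo/HELLO.txt"

# Write the file into the object store and capture its object ID.
oid=$(git -C "$repo" hash-object -w "$repo/HELLO.txt")

# Stream the blob to stdout and pipe it into a consumer; no checkout file is read.
streamed_bytes=$(git -C "$repo" cat-file blob "$oid" | wc -c | tr -d ' ')
echo "$oid $streamed_bytes"
```

The pipe only gives sequential access: the consumer sees the decompressed bytes in order and cannot seek within the object.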

rectang

The header for a blob file is "blob", a space, the length of the content in bytes as an ASCII decimal integer, then a null byte.

    $ echo "hello world" > HELLO.txt
    $ git add HELLO.txt 
    $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \
    > zpipe -d | \
    > hexdump -e '"|"24/1 "%_p" "|\n"'
    |blob 12.hello world.|
    $

The header and the content get concatenated together, and the whole thing gets Zlib compressed. The SHA1 is calculated from the header-plus-content before it gets Zlib compressed.

    $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \
    > zpipe -d | \
    > shasum
    3b18e512dba79e4c8300dd08aeb37f8e728b8dad  -
    $
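The same ID can be reproduced without touching the object store by hashing the header plus the raw content directly. This is a rough sketch, assuming SHA-1 object IDs and that either `sha1sum` or `shasum` is on the PATH; `blob_id` is a hypothetical helper name, not a Git command:

```shell
# Compute the ID Git would assign to a file stored as a blob, without calling git.
# Assumption: sha1sum or shasum is available; blob_id is a hypothetical helper.
blob_id() {
    # Content length in bytes (tr strips the padding some wc versions add).
    size=$(wc -c < "$1" | tr -d ' ')
    if command -v sha1sum >/dev/null 2>&1; then
        hasher=sha1sum
    else
        hasher=shasum
    fi
    # Header is "blob", a space, the decimal size, then a NUL byte,
    # followed immediately by the uncompressed content.
    { printf 'blob %s\000' "$size"; cat "$1"; } | "$hasher" | cut -d' ' -f1
}
```

Running `blob_id HELLO.txt` on the file from the example above should print `3b18e512dba79e4c8300dd08aeb37f8e728b8dad`, the same value `git hash-object HELLO.txt` reports.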

What I would like to do is record an audio file (e.g. LPCM BWF), take its SHA1 and store it in the CAS as raw content, then reference it somehow from a Git commit. That way it will be part of the history and will travel with `push` and `clone`, won't get gc'd, etc.

> That said, even if it is compressed, a command like git cat-file could be used to pipe the contents of the file to stdout or any other program that could use them as input without having to create a file on disk.

That's a neat suggestion! However, I don't see how it would be compatible with random access, which is important for my application.

GauntletWizard

Basically that's what Git-LFS does: it takes the SHA-256 of the file, commits a small pointer file containing that hash in the file's place, and stores the actual contents separately. It's all transparent and works pretty well.

rectang

Hmm, but the point of Git-LFS is to store large files outside the CAS so that they don't burden operations like `clone`. And Git-LFS does lots of magic.

Maybe to achieve what I've laid out, I really would need to write a Git extension a la Git-LFS. But then vanilla Git wouldn't be able to make full use of it, which undermines the purpose of using Git in the first place.

As an alternative, maybe I just commit the darn audio files to the repo.

• In relative terms, audio files grow smaller every year.

• Large repository size isn't as critical for a music composition tool as it is for perpetually maintained software source code.

• I'm imagining a tool to prune edit history which would consolidate commits and potentially garbage collect audio files that become unreferenced.

I wish there were a way in vanilla Git to just associate a CAS object containing arbitrary bytes with a commit object, though.

kvnhn

I agree that decoupling from Git has its benefits, and I've built a tool[1] that seems to meet some of your needs above. The idea is to save binary data in a separate content-addressed store and have Git track references to specific files in said store. If you check it out, I'd be happy to hear what you think!

[1]: https://github.com/kevin-hanselman/dud