As someone who creates open, medium-sized, reusable datasets, is Dat something I should try? Is it too early? The linked page is very much about technical details of the implementation and not about how one would typically use it.

I maintain ConceptNet [1], a multilingual knowledge graph. I do everything I can to make its published results reproducible. The biggest hurdle for people reproducing it has always been getting the data -- building it requires about 100 GB of raw data or 15 GB of computed data that can be imported into PostgreSQL.

I once tried git-annex. It turned out not to be a good choice -- its tools were flaky, its usage patterns confusing, it leaves a permanent record of your mistakes in configuring data sources, and it was very hard to convince it to use ordinary HTTP downloads instead of trying to get read-write access to S3 (which wouldn't work for anyone but me). Now I have weird branches and remotes in my repositories, and weird data in my S3 buckets, that I can't get rid of in case someone tries to use git-annex in a way I told them would work.

After that I just went with distributing the data with plain HTTP downloads from S3. I wish I could do better than this. The only semblance of versioning is putting the date in the URL. People in Asia also tell me that the build fails because their downloads from us-east-1 get interrupted. Oh, and if I ever stop paying for S3, everything will break.

If I tried making data reproducible with Dat, would it be safe to promise people that they could use Dat to get the data? Even if in the future I don't like Dat anymore?

For instance, do I have to commit to hosting the data somewhere? If not, who does? Does it disappear when people lose interest, like BitTorrent?

[1] http://conceptnet.io