As a non-ML person, I have been playing around with torch for the past few weeks. I see that people just share pretrained models on GitHub with random links to download pages (Google Drive links, self-hosted links, etc.). I was quite surprised by this.

Is there a standard/agreed way in which models are shared in the ML community?

Is there some agreed model integrity check or signature when pulling random files?

jeroenhd

The most fun are the ML models shared in pickle format. They can contain executable code and who knows if that Stable Diffusion model you just downloaded will make your image generation dreams come true or is just full of viruses!

There are ways to verify the safety of these models but I doubt most users will go through the effort.

DoingIsLearning

> There are ways to verify the safety of these models but I doubt most users will go through the effort.

Could you expand on this? I assume it's some sort of serialization format, other than parsing it what can you do to inspect?

jeroenhd

It's Python's serialisation format: https://docs.python.org/3/library/pickle.html

There are tools to check the format for suspicious behaviour: https://github.com/mmaitre314/picklescan seems to be the most developed one.

You can also check the format manually (being careful not to call into it), like demonstrated by this more rudimentary scanner: https://github.com/zxix/stable-diffusion-pickle-scanner
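To give an idea of what "checking manually" can look like, here is a minimal sketch using the standard library's pickletools module, which walks the opcode stream without ever unpickling anything. The function name referenced_globals is my own, and the STACK_GLOBAL handling is only a heuristic (it assumes the module and attribute names are the two most recently pushed strings), so treat it as an illustration and not a replacement for a real scanner.

```python
import pickle
import pickletools
from collections import OrderedDict

# Opcodes that push a unicode string onto the pickle VM stack.
STRING_OPCODES = {"UNICODE", "BINUNICODE", "SHORT_BINUNICODE", "BINUNICODE8"}


def referenced_globals(data: bytes) -> set:
    """Return the (module, name) pairs a pickle would import, without loading it.

    Hypothetical helper: it only reads opcodes via pickletools.genops and
    never calls pickle.load / pickle.loads on the data.
    """
    found = set()
    recent_strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            # pickletools reports the argument as "module name".
            module, _, name = arg.partition(" ")
            found.add((module, name))
        elif opcode.name == "STACK_GLOBAL":
            # Heuristic: assume the last two pushed strings are module and name.
            if len(recent_strings) >= 2:
                found.add((recent_strings[-2], recent_strings[-1]))
            else:
                found.add(("?", "?"))  # could not resolve; treat as suspicious
        elif opcode.name in STRING_OPCODES:
            recent_strings.append(arg)
    return found


# A plain dict of strings and ints references no importable objects.
plain = referenced_globals(pickle.dumps({"a": 1}))

# An OrderedDict has to import its class on load, so that shows up.
ordered = referenced_globals(pickle.dumps(OrderedDict(a=1)))
```

Anything in the returned set that is not an expected model or tensor class (for example entries from os, subprocess or builtins) would be a reason not to load the file.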

If you do check for security issues yourself, you'll need to read up on which magic methods/variables may cause code execution. Simple demonstrations of dangerous code can be found all over the web (https://stackoverflow.com/questions/47705202/pickle-exploiti...) but I'm sure there are obfuscation tricks that simple scans won't catch.
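For a feel of the mechanism, here is a deliberately harmless sketch of the usual __reduce__ trick. The class name Payload is made up for illustration, and os.getcwd stands in for whatever function an attacker would actually pick.

```python
import os
import pickle


class Payload:
    """Hypothetical example class; the name is made up for this demo."""

    def __reduce__(self):
        # __reduce__ tells pickle how to rebuild the object: "call this
        # callable with these arguments". Here it is the harmless os.getcwd,
        # but any importable callable could be named instead.
        return (os.getcwd, ())


data = pickle.dumps(Payload())

# Loading the bytes calls os.getcwd() and returns its result, so what comes
# back is a path string and not a Payload instance.
result = pickle.loads(data)
```

The loading side does not need the Payload class at all: the pickle only stores a reference to the callable and its arguments, which is why opening an untrusted file is enough to trigger it.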