> Do the engineers that made this hand edit this file? Or did they have other source that they used and this is the build product?
Does any open source product provide all the tools used to make software? I haven't seen the Linux kernel included in any other open source product, and that would quite frankly be insane. The same goes for including vim/emacs, gcc, gdb, X11, etc.
But I do agree that training data is more important than those things. You need to be clearer about that, though, because people aren't understanding what you're getting at. Don't get mad; refine your communication.
> Windows was free for a year. Did that make it open source?
Windows didn't come with an Apache-2.0 license attached. That license makes this version of the code perpetually open source. They can change the license later, but the change won't apply retroactively to previously released versions. Sorry, but this is just a terrible comparison. Free isn't what makes a thing "open source." Which, let's be clear, is a fuzzy definition too.
> Does any open source product provide all the tools used to make software? I haven't seen the Linux kernel included in any other open source product, and that would quite frankly be insane. The same goes for including vim/emacs, gcc, gdb, X11, etc.
BSD traditionally comes as a full set of source for the whole OS; it's hardly insane.
But the point is that you don't need those things to work on Linux - you can use your own preferred editor, compiler, debugger, ... - and you can use those same tools to work on things that aren't Linux. Calling something "open source" when you can only work on it with proprietary tools would be very dubious (admittedly some people do), and calling a project open source when the missing piece you need to work on it isn't a general-purpose tool at all, but a component used only for building this one project, is an outright falsehood.
But what's proprietary here? That's what I'm not getting from the other person. You have the algorithm. Hell, they even provided the model in PyTorch/Python. They just didn't provide the training parameters and data. But that's not necessary to use or modify the software, just like it isn't necessary for nearly any other open source project. I mean, we aren't calling PyTorch "not open source" because they didn't provide the source code for vim and VS Code. That's what I'm saying. Because at that point I'm not sure how that's different from saying "It's not open source unless you provide at least one node of H100 machines" - that's roughly what you need to train this stuff.
> But what's proprietary here? That's what I'm not getting from the other person. You have the algorithm. Hell, they even provided the model in PyTorch/Python. They just didn't provide the training parameters and data. But that's not necessary to use or modify the software, just like it isn't necessary for nearly any other open source project.
It's necessary if you want to rebuild the weights/factors/whatever the current terminology is, which are a major part of what they're shipping. If they found a major bug in this release, the fix might involve re-running the training process, and currently that's something that they can do and we users can't.
> I mean, we aren't calling PyTorch "not open source" because they didn't provide the source code for vim and VS Code.
You can build the exact same PyTorch by using emacs, or notepad, or what have you, and those are standard tools that you can find all over the place and use for all sorts of things. If you want to fix a bug in PyTorch, you can edit it with any editor you like, re-run the build process, and be confident that the only thing that changed is the thing you changed.
You can't rebuild this model without their training parameters and data. Maybe you could run the same process with an off-the-shelf training dataset, but you'd get a very different result from what they've released - the whole point of their release is that it has the weights they've "compiled" through this training process. If you've built a system on top of this model and you want to fix a bug in it, that's not going to be good enough - without access to the same training dataset, there's no way for you to produce "this model, but with this particular problem fixed".
(And sure, maybe you could try to work around it with finetuning, or manually patch the binary weights, but that's similar to how people patch binaries to fix bugs in proprietary software - yes, it's possible, but the point of open source is to make it easier.)
As a researcher I want to know the hyperparameters and datasets used, but they honestly aren't that important for usage. You're right that one way to "debug" these models would be to retrain from scratch, but it's far more likely you'd do fine-tuning, reinforcement learning, or use a LoRA. Even the company's own engineers would look at those routes before they looked at retraining from scratch. Most of the NLP research world is using pretrained models these days (I don't like this, tbh, but that's a different discussion altogether). Only a handful of companies are actually training models, and I mean companies, not academics. Academics don't have the resources (unless they're partnering), and, without digressing too much, the benchmarkism is severely limiting the ability of academics to be academics. Models are insanely hard to evaluate, especially after they've been RLHF'd to all hell.
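To make that concrete, here's a rough sketch of the LoRA route using Hugging Face's peft library. The checkpoint name, target modules, and hyperparameters below are placeholders, not anything from this particular release:

```python
# Rough sketch: patching a released checkpoint with a LoRA adapter instead of
# retraining from scratch. "some-released-model" and the values below are
# placeholders, not the actual release or its settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("some-released-model")

lora = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which projection layers to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the small adapter weights are trainable

# ...fine-tune on a small corrective dataset, then ship or merge the adapter...
```

The base weights stay frozen and you only train the small adapter, which is why this is the usual way people "patch" a released model without ever touching the original training data.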
> (And sure, maybe you could try to work around it with finetuning, or manually patch the binary weights, but that's similar to how people patch binaries to fix bugs in proprietary software - yes, it's possible, but the point of open source is to make it easier.)
The truth is that this is how most ML refinement happens these days. If you want better refinement, we have to have that other discussion.