> If it is, as you claim, permissible to train the model (and allow users to generate code based on that model) on any code whatsoever and not be bound by any licensing terms, why did you choose to only train Copilot's model on FOSS? For example, why are your Microsoft Windows and Office codebases not in your training set?

This is my favorite question about Copilot ever.

Perhaps because there is a (small) risk of leaking confidential information through its output.

But that's not as damning as it sounds.

First, we know Copilot, if given the right prompt and told to autocomplete repeatedly without any manual input, can regurgitate bits of code it has seen many times in many different repositories, like the famous Quake fast inverse square root function and the text of licenses. That doesn't mean it does so under normal prompts and normal use. Perhaps it does sometimes, and that would be a real concern. But regurgitation outside of normal use, which only happens when the user is deliberately trying to make Copilot regurgitate, is not a problem when it comes to copyright violations of open source code (anyone trying to violate an open source license can do so much more easily without using Copilot), yet it may still be a problem when it comes to leaking confidential information.
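
To give a sense of why that particular function gets memorized: it's short, distinctive, and appears nearly verbatim in an enormous number of repositories. Roughly as it circulates (comments paraphrased from the original):

```c
// Quake III Arena's fast inverse square root, approximately as it is
// copied around the internet; the magic constant is part of what makes
// it so recognizable (and so easy for a model to memorize).
float Q_rsqrt(float number)
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                      // reinterpret the float's bits as an integer
    i  = 0x5f3759df - ( i >> 1 );              // magic-constant initial guess
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );  // one iteration of Newton's method

    return y;
}
```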

Second, whether something is a copyright violation and whether it risks leaking confidential information are somewhat orthogonal. A copyright violation usually requires at least several lines of code, and more if the copying is not verbatim, or if the code is just a series of function calls which must be written near-verbatim in order to use an API. On the other hand, `const char PRIVATE_KEY[] = ` could hypothetically complete to something dangerous in just one line of code. That said, it almost certainly wouldn't, since even if a private key were stored in source code in the first place (obviously it shouldn't be), it probably wouldn't be repeated often enough to be memorized by the model. Yet…

…third, the risk tolerances are different. If, to use completely made-up numbers, 0.1% of Copilot users commit minor copyright violations and 0.001% commit major ones, that's probably not a big deal considering how many copyright violations are committed by hand – sometimes intentionally, mostly unintentionally. (When it comes to unintentional ones, consider: Did you know that if you copy snippets from Stack Overflow, you're supposed to include attribution even in any binary packages you distribute, and that the resulting code is incompatible with several versions of the GPL? Did you know that if you distribute binaries of code written in Rust, you need to include a copy of the standard library's license?) But when it comes to leaking confidential information, even one user receiving a leak would be somewhat bad (though admittedly Microsoft does distribute much of their source code privately to some parties), and taking even a small risk would be a questionable decision when there is a ready alternative.

FWIW, there are some (admittedly fairly naive) checks to prevent PII and other sensitive info from being suggested to users. Copilot looks for things like SSH keys, social security numbers, email addresses, etc., and removes them from the suggestions that get sent down to the client.
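
To give a rough idea of what "fairly naive" means: conceptually it's just pattern matching over a candidate suggestion before it leaves the server. The sketch below is purely illustrative, with made-up marker strings and function names, and is not the actual implementation:

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative marker strings that strongly suggest secret material.
 * (Made up for this sketch; a real list would be longer and smarter.) */
static const char *SECRET_MARKERS[] = {
    "-----BEGIN RSA PRIVATE KEY-----",
    "-----BEGIN OPENSSH PRIVATE KEY-----",
    "ssh-rsa ",
    NULL,
};

/* Very rough check for a US social security number pattern (ddd-dd-dddd). */
static bool looks_like_ssn(const char *s)
{
    for (; *s; s++) {
        if (isdigit((unsigned char) s[0]) && isdigit((unsigned char) s[1]) &&
            isdigit((unsigned char) s[2]) && s[3] == '-' &&
            isdigit((unsigned char) s[4]) && isdigit((unsigned char) s[5]) &&
            s[6] == '-' &&
            isdigit((unsigned char) s[7]) && isdigit((unsigned char) s[8]) &&
            isdigit((unsigned char) s[9]) && isdigit((unsigned char) s[10]))
            return true;
    }
    return false;
}

/* Returns true if a suggestion should be dropped instead of sent to the
 * client. An email-address check and similar checks would follow the
 * same pattern. */
static bool contains_sensitive_info(const char *suggestion)
{
    for (int i = 0; SECRET_MARKERS[i] != NULL; i++)
        if (strstr(suggestion, SECRET_MARKERS[i]) != NULL)
            return true;
    return looks_like_ssn(suggestion);
}
```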

There's also a setting at https://github.com/settings/copilot (the link only works if you've signed up for Copilot) that checks every suggestion on the server against hashes of the training set and blocks anything that exactly duplicates code in the training set (with a minimum length, so very common code doesn't get blocked entirely). Users must choose the value for this setting when they sign up for Copilot.
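
Mechanically, an exact-duplicate check like that only needs hashes of training-set code, not the code itself. Here's a toy sketch of the idea, with a made-up minimum length and a stand-in for the real server-side index (not the actual service):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Made-up minimum length: suggestions shorter than this are too generic
 * to be worth blocking, so they skip the check. */
#define MIN_MATCH_LEN 65

/* FNV-1a, a simple non-cryptographic hash; any stable hash would do. */
static uint64_t fnv1a(const char *s, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char) s[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Stand-in for a precomputed index of hashes of training-set code spans;
 * in reality this would be a large server-side structure built offline. */
static const uint64_t TRAINING_SET_HASHES[] = { 0 /* ... */ };

static bool training_set_contains(uint64_t hash)
{
    size_t n = sizeof(TRAINING_SET_HASHES) / sizeof(TRAINING_SET_HASHES[0]);
    for (size_t i = 0; i < n; i++)
        if (TRAINING_SET_HASHES[i] == hash)
            return true;
    return false;
}

/* Returns true if a suggestion exactly duplicates training-set code and
 * is long enough for blocking to be meaningful. */
static bool should_block_suggestion(const char *suggestion)
{
    size_t len = strlen(suggestion);
    if (len < MIN_MATCH_LEN)
        return false;
    return training_set_contains(fnv1a(suggestion, len));
}
```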

source: I work on copilot at github