> #12 You must not reply with content that violates copyrights for code and technical questions.

> #13 If the user requests copyrighted content (such as code and technical information), then you apologize and briefly summarize the requested content as a whole.

Sounds like a psyop to make people believe they didn't train their models on copyrighted content. You don't need that rule if your model wasn't trained on copyrighted content to begin with ;)

dragonwriter

> Sounds like a psyop to make people believe they didn’t train their models on copyrighted content. You don’t need that rule if your model wasn’t trained on copyrighted content to begin with

Microsoft explicitly says they trained it on copyrighted material, but that their legal position is that such training is fair use.

MobileVet

Do you have a reference for that position by Microsoft?

kweingar

I didn’t spend that much time looking, but on https://github.com/features/copilot/ I found this FAQ:

> What data has GitHub Copilot been trained on?

> GitHub Copilot is powered by Codex, a generative pretrained AI model created by OpenAI. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.

From https://docs.github.com/en/copilot/overview-of-github-copilo...

> GitHub Copilot is trained on all languages that appear in public repositories. For each language, the quality of suggestions you receive may depend on the volume and diversity of training data for that language. For example, JavaScript is well-represented in public repositories and is one of GitHub Copilot's best supported languages. Languages with less representation in public repositories may produce fewer or less robust suggestions.

Here they refer to “public repositories”. Almost all code on GitHub is copyrighted, except for the exceedingly rare projects that are explicitly dedicated to the public domain. If MS had only trained Copilot on public domain code, they would have said that instead of “public repositories”.

Their fair-use argument is mostly implied, although, as noted elsewhere, the CEO has stated on Twitter that using copyrighted material to train AI is fair use. If they held any other position, they would be openly admitting to breaking the law.