What does HackerNews think of copilot?

It is already happening [1][2][3][4]. People want something that makes their jobs easier, not necessarily something they can trust.

Not all tasks need to be 100% accurate, and, to be honest, people are not known for their trustworthiness either.

[1] https://news.ycombinator.com/item?id=36097900

[2] https://futurism.com/neoscope/microsoft-doctors-chatgpt-pati...

[3] https://github.com/features/copilot

[4] https://www.electropages.com/blog/2023/06/researchers-demons...

I didn’t spend that much time looking, but on https://github.com/features/copilot/ I found this FAQ:

> What data has GitHub Copilot been trained on?

> GitHub Copilot is powered by Codex, a generative pretrained AI model created by OpenAI. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.

From https://docs.github.com/en/copilot/overview-of-github-copilo...

> GitHub Copilot is trained on all languages that appear in public repositories. For each language, the quality of suggestions you receive may depend on the volume and diversity of training data for that language. For example, JavaScript is well-represented in public repositories and is one of GitHub Copilot's best supported languages. Languages with less representation in public repositories may produce fewer or less robust suggestions.

Here they refer to “public repositories”. Almost all code on GitHub is copyrighted, except for the exceedingly rare projects that are explicitly dedicated to the public domain. If MS had only trained Copilot on public domain code, they would have said that instead of “public repositories”.

Their argument that this is fair use is implied (though, as noted elsewhere, the CEO has stated on Twitter that using copyrighted material to train AI is fair use). If they had any other position, they would be openly admitting to breaking the law.

I have files open in VSCode that aren't committed to Git, and that I wouldn't be allowed to commit there even in private repos (customer data covered by GDPR).

In this context I think it's important to note the distinction between "Copilot for Individuals" and "Copilot for Business", because for twice the money you potentially get a lot more privacy:

> What data does Copilot for Individuals collect?

> [...] Depending on your preferred telemetry settings, GitHub Copilot may also collect and retain the following, collectively referred to as “code snippets”: source code that you are editing, related files and other files open in the same IDE or editor, URLs of repositories and files path.

> What data does Copilot for Business collect?

> [...] GitHub Copilot transmits snippets of your code from your IDE to GitHub to provide Suggestions to you. Code snippets data is only transmitted in real-time to return Suggestions, and is discarded once a Suggestion is returned. Copilot for Business does not retain any Code Snippets Data.

https://github.com/features/copilot

Did you even read the thread? It's about Copilot X, not Copilot. I use Copilot in IntelliJ daily, and that's exactly what I'm worried about: once MS's AI tools get big enough, they won't be so friendly to third-party IDEs.

It's not a stretch at all. Take a look at Copilot's homepage: https://github.com/features/copilot

> Keep flying with your favorite editor

Now, look at Copilot X's announcement:

> We are bringing a chat interface to the editor that’s focused on developer scenarios and natively integrates with VS Code and Visual Studio

> It recognizes what code a developer has typed, what error messages are shown, and it’s deeply embedded into the IDE.

It reads like "we're going to make AI a competitive advantage of VS and VS Code." Of course they have the right to do it; I'm just saying I hate it.

IMO Copilot for Business has a very reasonable data collection policy. They discard any code snippets once the suggestion is returned.

https://github.com/features/copilot

https://github.com/features/copilot subtitle

> GitHub Copilot uses the OpenAI Codex to suggest code and entire functions in real-time, right from your editor.

(not sure if it's 4 or 3, though)

By using it, you're making their product better; it's in the feeding stage. It generates code, you change the code, and if telemetry is on, the changes get sent back. It's in the TOS.

To quote: "Code Snippets Data

Depending on your preferred telemetry settings, GitHub Copilot may also collect and retain the following, collectively referred to as “code snippets”: source code that you are editing, related files and other files open in the same IDE or editor, URLs of repositories and files path." [0]

[0] https://github.com/features/copilot/#faq-privacy-copilot-for...

An extra $9/mo for:

* Simple license management

* Organization-wide policy management

* Industry-leading privacy

* Corporate proxy support

Wow. Who’s going to pay a 90% premium for these features?

Edit: OK, it seems like different marketing pages have different features. The list above comes from https://github.com/features/copilot/. Still seems like a very steep increase over the base. And I cannot believe there are only 400ish companies using Copilot.

Do you have software engineers? GitHub Copilot is pretty good at writing simple functions and boilerplate: https://github.com/features/copilot

All you need to know about Copilot (taken from “Privacy – Copilot for Individuals” at https://github.com/features/copilot):

—————————

What data does Copilot for Individuals collect? GitHub Copilot relies on file content and additional data to work. It collects data to provide the service, some of which is then retained for further analysis and product improvements. GitHub Copilot collects the following data for individual users:

User Engagement Data When you use GitHub Copilot it will collect usage information about events generated when interacting with the IDE or editor. These events include user edit actions like completions accepted and dismissed, and error and general usage data to identify metrics like latency and features engagement. This information may include personal data, such as pseudonymous identifiers.

Code Snippets Data Depending on your preferred telemetry settings, GitHub Copilot may also collect and retain the following, collectively referred to as “code snippets”: source code that you are editing, related files and other files open in the same IDE or editor, URLs of repositories and files path.

————————

The "Privacy – Copilot for Individuals" section under https://github.com/features/copilot does say that Copilot collects code snippets if allowed by telemetry.

> User Engagement Data When you use GitHub Copilot it will collect usage information about events generated when interacting with the IDE or editor. These events include user edit actions like completions accepted and dismissed, and error and general usage data to identify metrics like latency and features engagement. This information may include personal data, such as pseudonymous identifiers.

> Code Snippets Data Depending on your preferred telemetry settings, GitHub Copilot may also collect and retain the following, collectively referred to as “code snippets”: source code that you are editing, related files and other files open in the same IDE or editor, URLs of repositories and files path.

>If you don't know Microsoft's history, a lot of what more informed people are worried about seems overblown.

Or maybe they do know about it, and don't agree with you. Do you allow for such an option?

https://github.com/features/copilot

"What can I do to reduce GitHub Copilot’s suggestion of code that matches public code?

We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. We plan on continuing to evolve this approach and welcome feedback and comment."

> We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. We plan on continuing to evolve this approach and welcome feedback and comment.

From the FAQ https://github.com/features/copilot/

Code Snippets Data

Depending on your preferred telemetry settings, GitHub Copilot may also collect and retain the following, collectively referred to as “code snippets”: source code that you are editing, related files and other files open in the same IDE or editor, URLs of repositories and files paths.

https://github.com/features/copilot/#faq

Getty's business relies on the legal framework of copyright, and how it enables control (and sale) of the licensing of copyrighted material. And they're saying: nope - AI output is so ambiguous w.r.t. copyright and licensing of the inputs (when it's not flagrantly in violation, as with recreating our watermarks), that we want to steer totally clear of this.

When HN has discussed GitHub's Copilot [1] for coding, it seems like the role of copyright and licensing isn't discussed in much detail [2] (with some exceptions a while back [3, 4]).

Do you think there is a software-development analog to Getty (I mean a company, not FSF), saying "no copilot-generated code here"? Or is the issue of copyright/licensing/attribution even murkier with code than for images?

[1] https://github.com/features/copilot/

[2] https://hn.algolia.com/?dateRange=all&page=1&prefix=false&qu...

[3] https://news.ycombinator.com/item?id=32187362

[4] https://news.ycombinator.com/item?id=31874166

Not presently, per their FAQ.

> [...] The GitHub Copilot extension sends your comments and code to the GitHub Copilot service, and it relies on context, as described in Privacy below - i.e., file content both in the file you are editing, as well as neighboring or related files within a project. It may also collect the URLs of repositories or file paths to identify relevant context. The comments and code along with context are then used by OpenAI Codex to synthesize and suggest individual lines and whole functions.

- from https://github.com/features/copilot/#faq - see "How does GitHub Copilot work?"

I read the article and I have some skepticism. I think my skepticism is well-founded but it may well be the case that a machine will one day do my job. I don't believe that time is at hand, though.

First off, I don't see a link to the "HTTP server in JavaScript" task. It's really hard for me to place much faith in their conclusions when it's not even clear what the problem definition was.

Second, I believe that a lot of more senior developers and development managers who take secure development practices somewhat seriously will not be able or willing to use Copilot in any sort of proprietary setting. Here is a quote from the Copilot FAQ:

> [...] The GitHub Copilot extension sends your comments and code to the GitHub Copilot service, and it relies on context, as described in Privacy below - i.e., file content both in the file you are editing, as well as neighboring or related files within a project. It may also collect the URLs of repositories or file paths to identify relevant context. The comments and code along with context are then used by OpenAI Codex to synthesize and suggest individual lines and whole functions.

- from https://github.com/features/copilot/#faq - see "How does GitHub Copilot work?"

I believe this makes it simply a nonstarter in a lot of environments. I am wondering if there are a number of places that have restrictions on sharing their code with a third party but don't know or don't care and so end up using Copilot anyway. I believe that short-sighted thinking like this is more prevalent in shops that have low-quality code, and I believe that the higher-quality the code, the less likely someone is to use Copilot, simply for the "I can't share my code, even if I use the most restrictive no-telemetry settings" reason. Give me a self-hosted Copilot, and I may try it out in anger.

Finally, I based some of my thinking on a recent Reddit /r/programming discussion of Copilot: https://old.reddit.com/r/programming/comments/wsnend/since_g...

After reading those posts, and internalizing them with my own view of coding, I believe Copilot is not ready for my personal use. Again: licensing considerations aside (if you actually can feel comfortable putting them aside; see NoraCodes' comment in this HN thread, e.g.), it is simply a nonstarter for anything proprietary in nature. I am also of the mind that any code that is of necessity very tedious to write is in dire need of real attention, most likely in the form of tests and quite possibly refactoring to reduce the boilerplate if at all possible. I believe in the value of linters and automated code analysis tools and in continuous integration that runs after every commit. Give me a self-hosted Copilot, and we'll have a real chance to see how it works out - until then it's not going to be a boon to programmers.

> I assume GitHub could force you to allow Copilot to use your code for training with their terms and conditions?

You’d think so, wouldn’t you? You’d think that at least Microsoft had some legal veneer to cover what they are doing. But no. They simply took “source code from publicly available sources, including code in public repositories on GitHub”¹ and used it.

1. https://github.com/features/copilot/#what-data-has-github-co...

A compiler cannot take arbitrary high level descriptions and generate code.

You’ve seen copilot right? (https://github.com/features/copilot)

It’s fundamentally more sophisticated. This is like asking what the difference is between modern ML translation and the previous 20 years of research on language translation: the former actually works.

The latter basically doesn’t, except in very specific circumstances.

Compilers can turn language into instructions only in a limited, extremely specific set of circumstances.

I don't think you're right here frankly, since the buggy snippet is taken from the Copilot marketing page (https://github.com/features/copilot). The examples on that page which could conceivably have missing escape bugs are the sentiment analysis example (sentiments.ts), the tweet fetcher examples (fetch_tweets.js, fetch_tweets.ts, fetch_tweets.go) and the goodreads rating examples (rating.js, rating.py, rating.ts, rating.go). Of all of them, only the rating.go example is without a serious escaping bug, and only because Copilot happened to use a URL string generation library for rating.go.

These are the examples which GitHub itself uses to demonstrate what Copilot is capable of, so it's not just a matter of people tweeting without reading through the code properly. It also suggests that the people behind Copilot do believe that one primary use-case for Copilot is to generate entire functions.

Source? In other places, it is stated they only use public data, such as from public repositories. See the FAQ at the bottom here: https://github.com/features/copilot/

Oh, as of this week Copilot is open to the public: https://github.com/features/copilot/ . You can get a trial, or it is free for students and open-source maintainers.

It's in the FAQ. Scroll to the bottom of this page:

https://github.com/features/copilot/

"What can I do to reduce GitHub Copilot's suggestion of code that matches public code?

"We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. We plan on continuing to evolve this approach and welcome feedback and comment."

Relevant parts from the Copilot FAQ (https://github.com/features/copilot/):

Does GitHub own the code generated by GitHub Copilot?

GitHub Copilot is a tool, like a compiler or a pen. GitHub does not own the suggestions GitHub Copilot generates. The code you write with GitHub Copilot’s help belongs to you, and you are responsible for it. We recommend that you carefully test, review, and vet the code before pushing it to production, as you would with any code you write that incorporates material you did not independently originate.

Does GitHub Copilot recite code from the training set?

The vast majority of the code that GitHub Copilot suggests has never been seen before. Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set. Previous research showed that many of these cases happen when GitHub Copilot is unable to glean sufficient context from the code you are writing, or when there is a common, perhaps even universal, solution to the problem.

The main page [0] shows you awesome demos, but also its weaknesses in the very first example. It doesn't properly encode the URL-encoded request body:

> body: `text=${text}`,

So it breaks if the text contains a '&', and it even allows parameter injection into the call to the third-party service. That isn't critical for a sentiment analysis API, but it could result in actual security holes elsewhere.

I hope the users won't blindly use the generated code without review. These mistakes can be so subtle, nobody even noticed them when they put them on the front page of the product.

[0]: https://github.com/features/copilot/

Nitpick - they're not (just) using public domain code for training - they're using "publicly available sources, including code in public repositories on GitHub."[0]

This includes a lot of code under copyleft licenses, and possibly even more code under no license at all (implicitly All Rights Reserved). It's not obvious to me that it's ethical (or possibly even legal) to sell a model derived from code not in the public domain.

[0]: https://github.com/features/copilot