Bravo! I'm so glad this exists. I'm curious about its model, Salesforce CodeGen. Was it trained on all the public repos on GitHub? Does Copilot have access to private repos that CodeGen cannot access?

Also, it would be really cool if I could personalize FauxPilot by feeding it all my repos on GitHub. Sometimes I just need to reimplement a function I've written before, but it's really hard to find where my old code is.

It is possible to fine-tune CodeGen using Huggingface Transformers! That would let you train it on your own code and use the resulting model. However, training is more expensive than inference -- you'd need an A6000 or better to train the 6B model. Something like the following should work:

    deepspeed --num_gpus 1 --num_nodes 1 run_clm.py \
        --model_name_or_path=Salesforce/codegen-6B-multi \
        --per_device_train_batch_size=1 \
        --learning_rate 2e-5 \
        --num_train_epochs 1 \
        --output_dir=./codegen-6B-finetuned \
        --dataset_name your_dataset \
        --tokenizer_name Salesforce/codegen-6B-multi \
        --block_size 2048 \
        --gradient_accumulation_steps 32 \
        --do_train \
        --fp16 \
        --overwrite_output_dir \
        --deepspeed ds_config.json
Where run_clm.py is this script: https://github.com/huggingface/transformers/blob/main/exampl...
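The command above also expects a ds_config.json. Here's a minimal sketch of one (ZeRO stage 2 with CPU optimizer offload; the "auto" values get filled in from the Trainer's arguments by the Transformers DeepSpeed integration) -- adjust to taste:

    {
        "fp16": {
            "enabled": "auto"
        },
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {
                "device": "cpu"
            }
        },
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto"
    }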

It might be doable to set this up on an AWS machine with a beefy GPU or two. I haven't tried it yet though.

Once you have a model trained in Huggingface Transformers, you'd be able to convert it using this script:

https://github.com/moyix/fauxpilot/blob/main/converter/huggi...
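
As a quick sanity check before converting, you can make sure the fine-tuned checkpoint loads and generates with the plain transformers API. A minimal sketch (the checkpoint path matches the --output_dir above; the prompt is just an example):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the fine-tuned checkpoint produced by run_clm.py
    tok = AutoTokenizer.from_pretrained("./codegen-6B-finetuned")
    model = AutoModelForCausalLM.from_pretrained("./codegen-6B-finetuned")

    # Generate a short completion from an example prompt
    inputs = tok("def hello_world():", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32)
    print(tok.decode(out[0]))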

How do I create a dataset?

Have a look at the datasets library [1], but as a shortcut, you can just create a file named "my_code.json" in JSON Lines format, with one line per source file, that looks like:

   {"text": "contents_of_source_file_1"}
   {"text": "contents_of_source_file_2"}
   ...
And then pass that my_code.json as the dataset name.
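
Here's a minimal Python sketch for producing that file; the source directory and the *.py glob are placeholders to adjust for your own repos and languages:

    import json
    import pathlib

    # Walk a checkout of your repos and emit one JSON object per source file.
    # "path/to/your/repos" and the *.py glob are placeholders -- adjust both.
    with open("my_code.json", "w", encoding="utf-8") as out:
        for path in pathlib.Path("path/to/your/repos").rglob("*.py"):
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip unreadable or non-UTF-8 files
            out.write(json.dumps({"text": text}) + "\n")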

[1] https://github.com/huggingface/datasets