The character limit in GPT is a fundamental limit of the software architecture, not an artificial limitation used as an upsell or rate limit. Uploading a file gives the plugin access to the file and GPT access to the plugin's output, so you might be able to get more informative answers that way, but it's not really fair to say that it "bypasses" the character limit in any meaningful way.

Consider a similar "bypass": you can upload the file to the web and the web browser plugin can read the page. You can do this with the whole git repo if it's public!

The token limit is 100% an artificial limitation. When ChatGPT first released last November I took the opportunity to try pasting 3k-line codebases into it to get it to walk me through them and it worked perfectly fine. Putting that same code into the OpenAI tokenizer tells me it's ~33k tokens, way above today's limits. The reason they do this is that every token takes up ~1 MB of video memory, and that adds up real quick. If you had infinite video memory, there would be no "fundamental limit" to how long an LLM can output.
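For anyone who wants to reproduce that token count, OpenAI's open-source tiktoken library gives the same numbers as the web tokenizer. A minimal sketch, assuming the cl100k_base encoding used by the ChatGPT-era models and a hypothetical file holding the pasted code:

```python
# Count tokens the same way the OpenAI tokenizer page does, using the
# open-source tiktoken library. "cl100k_base" is the encoding used by the
# ChatGPT-era models; swap it if you're targeting a different model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("my_codebase_dump.py") as f:   # hypothetical file with the pasted code
    source = f.read()

print(len(enc.encode(source)))           # ~33k tokens for a 3k-line codebase
```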

OpenAI then has two limits on inputs. The first, artificial one ensures that people don't get overzealous inputting too much; otherwise they'd hit the second, hard limit of how much VRAM their cards have. To the LLM itself there is no difference between characters from the chatbot and from the human; the only hard limiter is the total number of tokens. I tried this out by inputting a 4k-token string into ChatGPT as many times as I could, and it failed on the 20th input, meaning the hard limit sits somewhere just under 80k tokens (19 successful inputs × 4k ≈ 76k, plus the model's replies). Converting this to VRAM at ~1 MB per token gives roughly 80 GB, which matches the 80 GB of memory on an Nvidia A100 card.
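A back-of-envelope version of that arithmetic, taking the ~1 MB-per-token figure above as a given rather than a measured fact:

```python
# Back-of-envelope version of the experiment above. The ~1 MB-per-token
# figure is the parent comment's assumption, not a measured number.
chunk_tokens = 4_000        # size of each pasted string
successful_inputs = 19      # the 20th input failed
mb_per_token = 1            # assumed VRAM cost per token of context

context_tokens = chunk_tokens * successful_inputs    # 76,000 tokens
vram_gb = context_tokens * mb_per_token / 1_000      # ~76 GB

print(context_tokens, vram_gb)  # just under the 80 GB on an 80 GB Nvidia A100
```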

> When ChatGPT first released last November I took the opportunity to try pasting 3k-line codebases into it to get it to walk me through them and it worked perfectly fine.

A common technique to work around the context-length limit is to simply keep the most recent context that fits within the limit. It can be hard to notice when this happens, because oftentimes the full context isn't actually necessary. However, specific details from the truncated portion really are lost. For example, if you ask the model to list the filenames back in the same order after the context has been truncated, it will start from the first non-truncated file and the earlier ones are silently dropped.
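A minimal sketch of that kind of client-side truncation, assuming a plain list of message strings and an illustrative 4k-token budget (this is not OpenAI's actual server-side logic, just the general technique):

```python
# Minimal sketch: keep only the most recent messages that fit in a fixed
# token budget, dropping older ones. Uses the real tiktoken tokenizer;
# the budget and message format are illustrative assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def truncate_history(messages: list[str], budget: int = 4_096) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):      # walk newest -> oldest
        n = len(enc.encode(msg))
        if used + n > budget:
            break                       # everything older is silently dropped
        kept.append(msg)
        used += n
    return list(reversed(kept))         # restore chronological order

# With a long pasted codebase only the tail survives, which is why asking
# for the earliest filenames back comes up empty.
```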

> If you had infinite video memory, there would be no "fundamental limit" to how long an LLM can output.

Well, you've certainly got me there. One of the big limits with the transformer architecture today is that the memory usage grows quadratically with context length due to the attention mechanism. This is why there's so much interest in alternatives like RWKV <https://news.ycombinator.com/item?id=36038868>, and why scaling them is hard <https://news.ycombinator.com/item?id=35948742>.
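To make the quadratic part concrete, here's an illustrative sketch of naive attention for a single head; the sizes are arbitrary, and the point is only that the materialized score matrix is n × n:

```python
# Illustrative only: vanilla (materialized) attention builds an n x n score
# matrix per head, so its memory grows quadratically with context length n.
import torch

def naive_attention(q, k, v):
    # q, k, v: (n, d) for a single head
    scores = q @ k.T / (q.shape[-1] ** 0.5)   # (n, n) -- the quadratic buffer
    return torch.softmax(scores, dim=-1) @ v

n, d = 1_024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
out = naive_attention(q, k, v)                # fine at small n

for n in (8_000, 16_000, 32_000):
    print(f"n={n:>6}: fp32 score matrix alone is {n * n * 4 / 1e9:.2f} GB per head")
```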

FlashAttention has memory linear in sequence length: <https://github.com/HazyResearch/flash-attention>
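If you just want that behavior without building the repo, recent PyTorch exposes a fused kernel that can dispatch to a FlashAttention implementation on supported GPUs; this is a sketch of that route, not the flash-attention repo's own API:

```python
# Sketch (assumes PyTorch 2.x): the fused scaled_dot_product_attention kernel,
# which can dispatch to a FlashAttention implementation on supported GPUs and
# avoids keeping the full (n, n) score matrix around.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 4_096, 64, device=device, dtype=dtype)
k, v = torch.randn_like(q), torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 8, 4096, 64])
```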