Responding in hopes some Git non-novices are here and can give some quick advice.

I have a fairly large Git repo with 5 years of commits from numerous team members including a bunch of non-technical people who had never used Git before.

There were two major issues:

1. We started off storing binary files -- mostly images, but also a ton of raw data files that got versioned every day or two -- in this repo and it spiralled out of control size-wise. I ended up using "BFG" to nuke the binary files and I think that worked, but it still feels like the repo is way too large in terms of file size. Is it possible that orphaned old versions of files are floating around somewhere in .git/? What are the best practices here?

2. The branching strategy was badly wrong. We used two branches: production and dev. New commits were made to dev, and dev was merged into production periodically via GitHub PRs. Dev was NOT deleted and we did not use a squash strategy on merges. Somehow this resulted in the repo having 2, 3, 4, or 5 copies of the same commit back to back. I think that somehow a branch got merged into itself? I don't know how that would be possible, but I can tell you the symptom: there's a period several years ago where every commit appears in quintuplicate, this slowly decreases until every commit is just doubled, and then at some point we're back to a correct commit history. What's the likely cause here and what's the likely solution?

We would like to fix both of these things while preserving history. Obviously the problem could be immediately "solved" with rm -rf .git && git init, but we'd like to avoid that, at least partially so that no one who worked on the project has their historical commit record broken, and so we can still use blame to see who most recently touched some of the older stuff.

My own git-fu is not great, so, thanks for posting these exercises.

For 1), try using git-filter-repo (https://github.com/newren/git-filter-repo). This is the currently recommended alternative to previous tools like filter-branch, and it is much more user-friendly.

`git filter-repo --analyze` will generate a report of blobs stored in the repo at `.git/filter-repo/analysis/blob-shas-and-paths.txt`, and it's very easy to sort them by filesize and strip them out from there.
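
If you want a quick look before installing anything, here is a rough sketch that uses only core git. The function names (`largest_blobs`, `prune_unreachable`) are made up for illustration, and the second one is my guess at what is still needed after a BFG run: a rewrite leaves the old objects reachable from reflogs until they are expired and garbage-collected, which may be why .git/ still looks too big.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, the git commands are core git.

# Print "<size-in-bytes> <path>" for the N largest blobs reachable from any ref.
# Usage: largest_blobs <repo-path> [count]
largest_blobs() {
    repo=$1
    count=${2:-10}
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -rn |
        head -n "$count"
}

# Expire reflog entries and delete unreachable objects so that a history
# rewrite actually shrinks .git/. Destructive: run it on a fresh clone or
# after taking a backup. Adding --aggressive to gc repacks harder but is slow.
# Usage: prune_unreachable <repo-path>
prune_unreachable() {
    repo=$1
    git -C "$repo" reflog expire --expire=now --all
    git -C "$repo" gc --prune=now --quiet
}
```

Something like `largest_blobs . 20` shows what is still reachable, and `git count-objects -vH` before and after `prune_unreachable .` shows whether the pruning helped. Once you know the offenders, filter-repo's `--strip-blobs-bigger-than` option (say `10M`, pick your own threshold) or `--path <file> --invert-paths` can rewrite them out.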