remove large files from git repo & history #2031

Open
sneakers-the-rat opened this issue Mar 27, 2024 · 3 comments
Labels
community-generated, devops (poetry, setuptools, actions, etc. related changes)

Comments

@sneakers-the-rat
Collaborator

Is your feature request related to a problem? Please describe.
i have a tiny lil computer and am always teetering on 100% full disk, so maybe i run ncdu more than the average person, so bear that in mind.

linkml is ~300MB, which is not huge but it's also not small.

Thankfully it looks like there is a pretty clear culprit and a relatively lossless way of thinning the repo:

In-repo space usage

doing ncdu from the root on the main branch shows us that

  • tests - 167MB
    • data - 109MB
      • hp.dill - 63.4MB
      • hp.ttl - 42.2MB
  • .git - 113.9MB

So the major contributors there are those hp.dill and hp.ttl files, and searching for those filenames (as well as just hp) from within the tests directory doesn't match anything. Are those just vestigial historical files? If so, and if nothing outside the repo is likely to depend on them, we could safely remove them and recover the space.
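For anyone who wants to repeat that check, here is a minimal sketch, not necessarily the exact search used above. It assumes it is run from the repository root, and `check_refs` is a hypothetical helper name. It uses `git grep`, so only tracked files are searched, and it looks across the whole repo rather than just tests.

```shell
# check_refs: print every tracked line that mentions either file name.
# -F matches the names as fixed strings, -n adds line numbers.
check_refs() {
  status=0
  git grep -n -F -e 'hp.dill' -e 'hp.ttl' || status=$?
  case $status in
    0) ;;                               # matches were printed above
    1) echo "no references found" ;;    # git grep exits 1 when nothing matches
    *) echo "git grep failed" >&2; return "$status" ;;  # e.g. not inside a repo
  esac
}
```

Calling `check_refs` with no arguments prints either the matching lines or the "no references found" message.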

git history usage

looking at the usage of space within the .git directory with this:

git rev-list --objects --all | \
  git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)' | \
  sort -nr -k 3 | \
  perl -ne 'm#^(\w+) blob (\d+) (.+)# or next; print "$1\t$2\t$3\n";' | \
  head -n 200 | \
  column -t -s $'\t'

shows us that we have a few versions of the hp* files above, and then nearly all the rest of the space in the git history is from historically versioned generated docs files - linkml.generators.html, _modules/linkml_runtime/linkml_model/meta.html and so on.

Since the source of those docs files is likely in the version history, we can also safely remove those html files from the git history.
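The pipeline above lists individual blobs, so a file with many historical versions shows up as many separate rows. As a rough complement, the sketch below sums all versions per path. `history_size_by_path` is a hypothetical helper name, and the sizes are uncompressed object sizes, so they overstate what the packed .git directory actually stores.

```shell
# history_size_by_path: total uncompressed size of all blob versions per path,
# across all refs, largest first.
history_size_by_path() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" {
           n = $2
           sub(/^[^ ]+ [^ ]+ /, "")   # drop type and size, keep the full path
           total[$0] += n
         }
         END { for (p in total) printf "%d\t%s\n", total[p], p }' |
    sort -nr
}
```

Running `history_size_by_path | head -n 20` from the repo root should show whether the hp* files and the generated html really dominate.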

mitigations

removing files from git history is safer than it might appear, though it does require one leap of faith moment (that can also be fully recovered with a single backup).

git filter-repo (a separate install, it doesn't ship with git) is super easy to use: to remove a file from the history you just do

git filter-repo --path file_to_delete.csv --invert-paths

you can test it out by making a fresh clone of the repo, running the git filter-repo command in it, and then diffing that against an untouched clone. you can validate history by iterating through commits and diffing those too (commit hashes change after the rewrite, so match commits by position or message rather than by hash). if all went well the only diff should be the files you removed.
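A sketch of that dry run, under a few assumptions: git-filter-repo is installed, the two paths are the ones implied by the ncdu listing above (tests/data/hp.dill and tests/data/hp.ttl), and `make_cleaned_clone` and `compare_trees` are hypothetical helper names. filter-repo expects a fresh clone by default and removes the origin remote when it finishes, so the cleaned copy can't be pushed by accident.

```shell
# make_cleaned_clone URL DIR: fresh clone of URL into DIR, then drop the two
# hp files from every commit (assumed paths, adjust if they live elsewhere).
make_cleaned_clone() {
  git clone "$1" "$2" || return 1
  ( cd "$2" && git filter-repo \
      --path tests/data/hp.dill \
      --path tests/data/hp.ttl \
      --invert-paths )
}

# compare_trees ORIGINAL CLEANED: report files that differ between the two
# checkouts, ignoring the .git directories.
compare_trees() {
  diff -r -q -x .git "$1" "$2"
}
```

With an untouched clone in `original` and the filtered one in `cleaned`, `compare_trees original cleaned` should print only "Only in ..." lines for the removed files.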

if you want to be extra sure you can make a fork and test force pushing to that before doing so to the main repo, and then yes the final leap of faith is force pushing to main. Even if that were to go catastrophically wrong, as long as you made a full local clone of the repo beforehand, you should be able to fully restore it with another force push.
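One way to take that backup is a mirror clone, which copies every ref rather than just the default branch. This is a sketch under assumptions: `backup_repo` and `restore_repo` are hypothetical helper names, and the restore needs whoever runs it to have force-push rights on the target. When restoring to GitHub, expect the read-only refs/pull/* refs to be rejected while branches and tags go through.

```shell
# backup_repo URL DIR: bare mirror of every ref in URL, stored in DIR.
backup_repo() {
  git clone --mirror "$1" "$2"
}

# restore_repo DIR URL: make URL match the backup again. --mirror overwrites
# refs that changed and deletes refs that are missing from the backup.
restore_repo() {
  git -C "$1" push --mirror "$2"
}
```

Take the backup before the force push, keep it until a fresh clone of the rewritten repo has been checked, and only then delete it.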

anyway, feel free to rapidly triage and close if not something y'all are interested in. it would just be a minor quality of life improvement that lowers barriers to contribution, and also it's sort of an aesthetic thing - we want new contributors to be delighted and pleasantly surprised, and starting with a big clone is a minor code smell. if no, totally cool.

How important is this feature? Select from the options below:
• Low - it's an enhancement but not crucial for work

When will use cases depending on this become relevant? Select from the options below:
• Long-term - 6 months - 1 year

@cmungall
Member

At the very least we should remove these, they are vestigial.

@nlharris added the community-generated and devops (poetry, setuptools, actions, etc. related changes) labels Mar 28, 2024
@nlharris
Contributor

So can @sneakers-the-rat do a PR to delete those, and someone can approve?

@sneakers-the-rat
Collaborator Author

I can PR to remove the files, that's np, but i can't PR to remove them from the git history which is what would save the space :). i can't force push to main (and if i could, i would want that turned off lol, i don't need that kinda stress ;) )


3 participants