Deduplicate Git Forks on a Server


I have decided to do this:

shared-objects-database.git/
foo.git/
    objects/info/alternates (will have ../../shared-objects-database.git/objects)
bar.git/
    objects/info/alternates (will have ../../shared-objects-database.git/objects)
baz.git/
    objects/info/alternates (will have ../../shared-objects-database.git/objects)

All the forks will have an entry in their objects/info/alternates file that gives a relative path to the objects' database repository.

It is important to make the object database a repository, because we can save objects and refs of different users having a repository of the same name.

Steps:

  1. git init --bare shared-objects-database.git
  2. I run the following lines either on every push to any fork (via a post-receive hook) or from a cron job:

    for r in list-of-forks
    do
        (
            cd "$r" &&
            # copy all refs of this fork into its own namespace in the shared repository
            git push ../shared-objects-database.git "refs/*:refs/remotes/$r/*" &&
            # to be safe, I add the "fat" objects to alternates every time
            echo ../../shared-objects-database.git/objects >objects/info/alternates
        )
    done

Then, on the next "git gc", all the objects in the forks that already exist in the alternate will be deleted.

git repack -adl is also an option!

This way we save space so that two users pushing the same data on their respective forks on the server will share the objects.
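A minimal end-to-end sketch of the steps above, using throwaway repositories in a temporary directory. The names work, foo.git and bar.git are made up for the illustration, and the explicit repack at the end of each loop iteration stands in for whatever gc schedule the server really uses.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

root=$(mktemp -d)
cd "$root"

# A throwaway working repository whose history both forks will share.
git init -q work
git -C work symbolic-ref HEAD refs/heads/main
echo "hello" > work/file.txt
git -C work add file.txt
git -C work commit -q -m "initial commit"

# Two bare "forks" (hypothetical names) and the shared object store.
git clone -q --bare work foo.git
git clone -q --bare work bar.git
git init -q --bare shared-objects-database.git

for r in foo.git bar.git
do
    (
        cd "$r" &&
        # Copy every ref of this fork into its own namespace in the shared repository.
        git push -q ../shared-objects-database.git "refs/*:refs/remotes/$r/*" &&
        # Point the fork at the shared store; the path is relative to the fork's objects/ directory.
        echo ../../shared-objects-database.git/objects > objects/info/alternates &&
        # Repack, leaving out objects that the alternate already provides.
        git repack -a -d -l -q
    )
done

fork_tip=$(git -C foo.git rev-parse refs/heads/main)
shared_tip=$(git -C shared-objects-database.git rev-parse refs/remotes/foo.git/heads/main)
echo "fork tip:   $fork_tip"
echo "shared tip: $shared_tip"
```

Each fork gets its own refs/remotes/<fork>/ namespace in the shared repository, so two forks with identically named branches do not overwrite each other.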

We need to set gc.pruneExpire to never in shared-objects-database.git. Just to be safe!

To occasionally prune objects, add all forks as remotes to the shared repository, fetch, and prune! Git will do the rest!
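Roughly, that maintenance could look like the sketch below. It assumes a single fork called foo.git and a remote name foo, both invented for the example; a real server would loop over all forks.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

root=$(mktemp -d)
cd "$root"

# One throwaway fork with a single commit (hypothetical names).
git init -q work
git -C work symbolic-ref HEAD refs/heads/main
echo "hello" > work/file.txt
git -C work add file.txt
git -C work commit -q -m "initial commit"
git clone -q --bare work foo.git

git init -q --bare shared-objects-database.git
cd shared-objects-database.git

# Automatic gc runs must never expire loose objects in the shared store.
git config gc.pruneExpire never

# Register the fork as a remote so that its branches keep their objects reachable.
git remote add foo ../foo.git
git fetch -q --prune foo

# The occasional, deliberate prune: an explicit --prune overrides gc.pruneExpire.
git gc --quiet --prune=now

git for-each-ref
```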

(I finally found a solution that works for me! Not tested in production! :p Thanks to this post.)

How can one safely use a shared object database in git?

Why not just crank the gc.pruneExpire variable up to never? It's unlikely you'll ever have loose objects 1000 years old that you don't want deleted.

To make sure that the things which really should be pruned do get pruned, you can keep one repo which has all the others as remotes. git gc would be quite safe in that one, since it really knows what is unreachable.

Edit: Okay, I was a bit cavalier about the time limit; as is pointed out in the comments, 1000 years isn't gonna work too well, but the beginning of the epoch would, or never.

What is git actually doing when it says it is resolving deltas?

Git uses delta encoding to store some of the objects in packfiles. However, you don't want to have to play back every single change ever on a given file in order to get the current version, so Git also has occasional snapshots of the file contents stored as well. "Resolving deltas" is the step that deals with making sure all of that stays consistent.
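To see packed objects for yourself, something like the following sketch works on a scratch repository. The file name and contents are invented; whether Git actually stores the second version as a delta is up to its heuristics, so the listing is shown without any promised output.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

# Two versions of the same, fairly large file.
awk 'BEGIN { for (i = 1; i <= 2000; i++) print "line " i }' > numbers.txt
git add numbers.txt
git commit -q -m "first version"
echo "one more line" >> numbers.txt
git commit -q -a -m "second version"

# Pack everything into a single packfile and drop the loose copies.
git repack -a -d -q

# Summary of the object store: loose objects vs. packed objects.
git count-objects -v

# List every packed object; entries stored as deltas also show a depth and a base object ID.
git verify-pack -v .git/objects/pack/pack-*.idx

in_pack=$(git count-objects -v | sed -n 's/^in-pack: //p')
loose=$(git count-objects -v | sed -n 's/^count: //p')
```

Two commits, two trees and two blobs end up in the pack, six objects in total.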

Here's a chapter from the "Git Internals" section of the Pro Git book, which is available online, that talks about this.

open repo file instead of a tmp file to modify in git difftool

No: because you've selected two specific commits, there are no files to change.

That seems a little odd to say. It actually is a little odd, until you realize this thing about commits: while commits do contain files,1 the files they contain are frozen. None of them can ever be changed. They are committed, and the contents of any commit are as permanent as the commit itself,2 and totally read-only.

Of course, when you use Git, you have files that you can change. If you didn't, it would be hard to use Git. But those aren't the committed files: those are the work-tree copies. If you direct git difftool --tool=vimdiff to use those files as one side of the operation, it already does open those files directly. To do that:

git difftool --tool=vimdiff <options> <commit>

where <options> includes your --no-prompt and <commit> might be HEAD~1 again, for instance.

(As with git diff, git difftool can be told to compare two commits, or to compare one commit to the current work-tree. With --cached, it can also compare a commit to the current index contents. The rest of this answer does not mention the index, but the index holds a third copy of every file. Files inside commits are in a special, read-only, Git-only format. Files in your work-tree are in a useful format, so that you can read or edit them directly. Files in your index are in a halfway zone, between the HEAD commit where they're frozen and the work-tree where they're normal: the index copies are unfrozen, but still Git-only and compressed. Git makes new commits from the index copy, which is why you have to keep running git add all the time.)
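Since git difftool accepts the same arguments as git diff, the three copies of each file can be illustrated non-interactively with plain git diff. This is only a sketch with invented file names a.txt and b.txt:

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo one > a.txt
echo one > b.txt
git add a.txt b.txt
git commit -q -m "first"

echo two > a.txt
git add a.txt          # a.txt: changed in the index and the work-tree
echo two > b.txt       # b.txt: changed in the work-tree only

staged=$(git diff --cached --name-only HEAD)   # commit vs. index
unstaged=$(git diff --name-only)               # index vs. work-tree
both=$(git diff --name-only HEAD)              # commit vs. work-tree

echo "commit vs index:     $staged"
echo "index vs work-tree:  $unstaged"
echo "commit vs work-tree: $both"
```

Replacing git diff with git difftool --tool=vimdiff --no-prompt in the last form opens the real work-tree files on one side.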


1Technically, a commit doesn't so much contain the files as refer to the files. The files are stored using a series of indirections: commits point to tree objects, which give the name of the file, the mode, and a hash ID for the contents; and then the hash ID in the tree points to the Git blob object that saves the file's content. This allows two different trees (with maybe different modes or different sets of files) to re-use existing frozen file contents, and allows different commits (with maybe different authors or timestamps) to re-use existing frozen trees that re-use existing frozen file contents. This is just one of several tricks that Git uses to save a lot of space, even though every commit stores a full and complete copy of every file: under the hood, there is a whole lot of re-use of old files.

2A commit normally lives forever, but if everyone who has some commit—as identified by some hash ID—agrees to stop using that commit forever and takes it out of the history-list, Git will eventually forget the commit for real. If anyone didn't agree, they can easily reintroduce the commit again, and in fact, that's the default. Commits are therefore hard to get rid of permanently once they've been spread to other Git repositories, because to get rid of them permanently, you have to ditch them from every Git repository that picked them up.

Is there a way to limit the amount of memory that git gc uses?

Yes, have a look at the help page for git config and look at the pack.* options, specifically pack.depth, pack.window, pack.windowMemory and pack.deltaCacheSize.

It's not a totally exact size as git needs to map each object into memory so one very large object can cause a lot of memory usage regardless of the window and delta cache settings.

You may have better luck packing locally and transferring pack files to the remote side "manually", adding .keep files so that the remote git doesn't ever try to completely repack everything.
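As a rough sketch, the settings and the .keep marker might be applied like this. The numeric limits are arbitrary placeholders, not recommendations; tune them for the machine at hand.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main
echo "hello" > file.txt
git add file.txt
git commit -q -m "initial commit"

# Placeholder limits (assumptions for this example only).
git config pack.windowMemory 256m
git config pack.deltaCacheSize 64m
git config pack.window 10
git config pack.depth 50

# Build one pack, then mark it as "keep" so later repacks leave it alone.
git repack -a -d -q
for p in .git/objects/pack/pack-*.pack
do
    : > "${p%.pack}.keep"
done

ls .git/objects/pack
```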

Why does GitHub need me to rebase a Merge Request when nothing has changed in the MR

Git—and GitHub—does not necessarily need this. The things that need it are humans, and/or rules imposed by humans. The following is long but I suggest it is worth reading.

Long: what rebase is about

To use Git and GitHub effectively, you should know the following:

  • Git does not have pull requests or merge requests. These are add-ons, provided by various hosting sites (GitHub, Bitbucket, GitLab, and others).

  • GitHub call their add-ons pull requests; it's GitLab that call theirs merge requests.

These are both relatively minor, but because Git terminology is already horribly confusing, it's best to be as clear as possible.

Regardless of what they call them and how they implement them internally and externally—these also differ from hosting site to hosting site—these do all build on some fundamental Git technologies. Mastering these will help. Here's what to know:

  • Git is built around commits. The commit is the raison d'être for Git. Nothing else really matters except insofar as it acts in service to commits.

  • Every commit has a unique number, usually expressed in hexadecimal, that looks like a big ugly string of letters and digits. In an important way, that number is the commit, which is why it's required to be unique. Two Gits, when talking to each other, will just exchange the raw numbers to see if they both have the commit. If not, one Git may have to send the commit to the other Git. (Git repositories "like to" add new commits to themselves, and "dislike" ever forgetting any commit.1) We call these numbers hash IDs.

  • Every commit contains two things:

    • Each commit has a snapshot of all of the files of some project. The internal storage format here is complicated, and not really relevant, but it's worth knowing that (1) it acts as a snapshot and (2) all the files in it are de-duplicated against all the copies that exist in any other commit, so the fact that most commits mostly contain the same files as most other commits doesn't bloat up the repository too much.

    • Each commit has some metadata, or information about the commit itself: who made it and when, for instance. The metadata include your name and email address (from user.name and user.email settings) and any log message you put in. And—crucial from Git's point of view—each commit contains the raw commit hash ID(s) of some set of earlier commit(s).

  • All parts of any commit, once created, are read-only. One reason for this is that the hash ID of a commit is just a cryptographic checksum of the contents of the commit.2 If you take a commit—or any internal object—out of the Git database, make some changes, and write the result back in, you just get a new object with a different hash ID. The original object remains.

This last bit makes Git repositories generally append-only (which explains the anthropomorphized "liking" of adding new commits). It is possible to take some existing commits that are not "good enough" and copy them to new-and-improved commits, and to stop using the old commits. If every Git repository does so, the old commits can eventually "fall away".3 This is what git rebase is about.
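A small sketch of the read-only, content-addressed behavior described above, on a scratch repository with made-up contents:

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

# The same content always hashes to the same ID; different content gets a different ID.
first=$(printf 'hello\n' | git hash-object --stdin)
second=$(printf 'hello\n' | git hash-object --stdin)
changed=$(printf 'hello!\n' | git hash-object --stdin)

echo "hello" > file.txt
git add file.txt
git commit -q -m "not quite good enough"
old=$(git rev-parse HEAD)

# "Changing" the commit really makes a new commit with a new hash ID.
git commit -q --amend -m "new and improved"
new=$(git rev-parse HEAD)

echo "old commit: $old"
echo "new commit: $new"
# The original object is still in the database.
git cat-file -t "$old"
```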


1Don't anthropomorphize computers; they hate that!

2This implies that each commit must be unique. The stored hash ID of a previous commit, plus the time-stamps, help out here, for instance. It also means that Git must eventually fail: the pigeonhole principle tells us that any hashing scheme will eventually have a collision. The size of the hash ID determines how quickly Git might fail. It's been engineered to make it take many decades before this occurs. Malicious hacking of SHA-1 can lead to earlier failure, though, and the size of repositories is growing in general, both of which are causing Git to move from SHA-1 to SHA-256 eventually.

3The details are complicated and we won't cover them here.



Working in a Git repository

When humans do work in a Git repository, the process generally goes like this:

  1. Clone some existing repository if needed, or update an existing clone if needed and use that.
  2. Do some work. Make new commits, because Git is all about commits.
  3. Test and rework as needed.
  4. Announce readiness.
  5. Go through any review process; this may send one back to step 2 or 3.
  6. Incorporate the work.

If only one person is doing any work at any time, and no reworking or cycling-through-steps occurs, this process is pretty straightforward. The only squirrelly parts happen at steps 1 and 6. If we ignore those, we see a nice, simple process that looks like this. Here, I'll draw commits using single uppercase letters to stand in for their hash IDs:

... <-F <-G <-H   <--main

Right now, the commit whose hash ID is H is the latest commit on the main or master branch. Git itself doesn't care at all about branch names: it just uses them to find commits. Specifically, a branch name holds the hash ID of one commit, and that one commit is the latest commit that we—or Git—will call "part of the branch".

Since each commit holds the hash ID of an earlier commit—or sometimes two earlier commits; we'll see this in a moment—commit H contains the raw hash ID of earlier commit G. We say that commit H points to commit G, hence the backwards arrow in the drawing above.

Commit H also contains a full snapshot of every file. These are the files we get to work on / with, when we run git checkout main. Note that the files we work on / with go in our working tree. They are copied out of the commit: in the commit, they're in some special weird Git-only format, compressed and de-duplicated, not usable by most of the software on the computer.

Git found commit H using the hash ID stored in the name main. That's how git checkout main (or git switch main, in Git 2.23 or later) got all the files out of it. And, that's how git log shows you information about commit H: it uses the name main to look up the hash ID, and then uses the hash ID to look up the internal commit in a big database-of-all-Git-commits-and-other-supporting-objects.

Since commit H stores commit G's hash ID, Git can use that to fish commit G's files out too, and can compare the snapshot in G to that in H. By doing that, Git can show us what files, if any, changed, and what changes were made, even though H is just a snapshot.

Of course, commit G is a full commit, with a previous commit hash ID F, so Git can load both commits F and G and use that to show what changed in commit G. The git log command can also show the log message for commit G, having found the hash ID from commit H.

And of course, commit F is a full commit, so Git can go on doing this. It can keep it up all the way back to the very first commit ever. That commit is special in one way: it doesn't point back to any earlier commit. Git knows, upon reaching that commit, to stop going backwards.
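The backwards walk can be watched on a scratch repository. In this sketch the three commits are simply given the messages F, G and H to match the drawing:

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

for name in F G H
do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "commit $name"
done

# Start at the commit named by main and follow the parent links backwards.
git log --format='%h %s (parent: %p)' main

count=$(git rev-list --count main)
# The root commit is the one with no parent: that is where the walk stops.
root=$(git rev-list --max-parents=0 main)
root_subject=$(git log -1 --format=%s "$root")
root_parents=$(git log -1 --format=%P "$root")
```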

So, Git works backwards. But what about making a new commit? That's actually pretty straightforward too. Before we make our new commit, though, let's make one new branch name, like this:

...--F--G--H   <-- develop, main

Git requires that we pick out one branch name to be the current branch. We do this with git checkout or git switch. To remember which one we picked, we'll draw in the special name HEAD, in parentheses, attached to one of these branch names:

...--F--G--H   <-- develop, main (HEAD)

Here we're on branch main, as git status will say, and using commit H. If we git checkout develop, we get:

...--F--G--H   <-- develop (HEAD), main

Now we're on branch develop, as git status will say, but still using commit H. Note that every commit is on both branches at this point.

We now modify some files in our working tree, in the usual way (they're ordinary files that aren't actually in Git) and then run git add to prepare them for committing. Skimming over some other fairly important stuff, this replaces the copies of the files in Git's staging area. These extra copies are what Git uses to make the new commit, and the staging area actually has copies of the same files Git copied out of commit H. They're just in ready-to-commit, compressed and pre-de-duplicated form at this point. Using git add tells Git to replace some of these files with new copies: Git will compress and de-duplicate the files at git add time, so that git commit can just use whatever is in Git's index right then.
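The index copy can be inspected with git ls-files -s, which prints the blob hash ID that is staged for each file. A sketch with an invented file.txt:

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo v1 > file.txt
git add file.txt
git commit -q -m "first"

# The index holds the same blob that the commit holds.
before=$(git ls-files -s file.txt | awk '{print $2}')

# Editing the working-tree file does not touch the index copy...
echo v2 > file.txt
still=$(git ls-files -s file.txt | awk '{print $2}')

# ...until git add compresses the new content and swaps it in.
git add file.txt
after=$(git ls-files -s file.txt | awk '{print $2}')

echo "before edit: $before"
echo "after edit:  $still"
echo "after add:   $after"
```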

Finally, we run git commit. This:

  • Obtains metadata from us: a log message to put in the new commit, the setting of user.name and user.email right now, and so on. The date and time stamps are "now" and the parent commit, for the new commit, is the current commit, as found by the current branch name. So that's commit H, in this case.

  • Makes a permanent snapshot from whatever is in Git's index right now. Since we used git add to update these files, that's the right snapshot.

  • Writes all of this out as a new commit. This obtains a new, unique hash ID for the new, unique commit. The commit object goes into the big database:

    ...--F--G--H
                \
                 I

    Note how new I points back to existing commit H. (I can't draw great arrows here, so I've gone to lines. H can't point to I though: H was made long ago, and can't be changed. So I must point back to H.)

  • Last, Git does its special trick: it writes the new commit's hash ID into the current branch name.

This last trick is what gets us:

...--F--G--H   <-- main
            \
             I   <-- develop (HEAD)

This lets us add new commits, one at a time, with each one single new commit advancing the current branch (develop, since that's where HEAD is attached). If we add one more commit, we get:

...--F--G--H   <-- main
            \
             I--J   <-- develop (HEAD)

If all looks great here, we can now simply git checkout main and tell Git: commit J is great, use it as the last commit in main too. Skipping over the details—Git does this with what it calls a fast-forward merge—the result is:

...--F--G--H--I--J   <-- develop, main (HEAD)

and now all commits are on both branches and it's safe to delete either name—the other one will find all commits.
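The whole sequence, from creating develop to the fast-forward, might be replayed like this on a scratch repository (commit messages H, I and J are chosen to match the drawings):

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo H > file.txt
git add file.txt
git commit -q -m H
h=$(git rev-parse main)

# New name, same commit; then attach HEAD to it.
git branch develop
git checkout -q develop

# Each commit advances develop only; main still names H.
echo I > file.txt
git commit -q -a -m I
echo J > file.txt
git commit -q -a -m J
main_during=$(git rev-parse main)

# Fast-forward: slide the name main forward to J, no new commit needed.
git checkout -q main
git merge -q --ff-only develop

git log --format='%h %s %d' main
```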

Note that, before the fast-forward, it was safe (in some sense) to delete the name main: we could find all the commits by starting at J and working backwards. That's the point of branch names: they give us places to start from, to work backwards. The only reason to keep main is to remember H specially for a while, but that's a fairly decent reason, especially if we decide that I-J are terrible commits after all and want to throw them out.

Throwing out old commits

Suppose we do decide that I-J are terrible. Here's one way to throw them away, instead of merging them in:

git checkout main
git branch -D develop

The first step, which we'd also do if we wanted to do the fast-forward merge, gets us:

...--G--H   <-- main (HEAD)
         \
          I--J   <-- develop

If we run git log now, we don't see commits I-J. We have to run git log develop to see them, using the name develop to find J.

The second command tells Git to delete the name develop—forcibly, because without forcing Git, it will say no: this would lose us access to commits I-J. By deleting the name develop, we end up with:

...--G--H   <-- main (HEAD)
         \
          I--J   [abandoned]

By deleting the name, we can't find the commits any more and we will never be bothered with them again—as long as we didn't send them to any other Git yet, that is.
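A sketch of the same throw-away on a scratch repository. It also shows that the abandoned commits linger in the object database for a while: they are just no longer findable by name.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo H > file.txt
git add file.txt
git commit -q -m H

git checkout -q -b develop
echo I > file.txt
git commit -q -a -m I
echo J > file.txt
git commit -q -a -m J
j=$(git rev-parse develop)

git checkout -q main

# The gentle form refuses, because deleting the name would lose access to I-J.
if git branch -d develop 2>/dev/null
then
    refused=no
else
    refused=yes
fi

# The forcible form goes ahead.
git branch -q -D develop

# git log from main no longer shows I-J...
git log --format='%s' main
# ...but the commit object itself has not been removed yet.
git cat-file -t "$j"
```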

We can now create develop again, pointing to H again, and try our development work again, this time knowing what we did wrong:

          K--L   <-- develop (HEAD)
         /
...--G--H   <-- main
         \
          I--J   [abandoned]

When we incorporate the (new, good) commits, we could just call them I-J if we like, as if the abandoned commits are totally gone. Their real names are some big ugly hash IDs; we're just making these one-letter names up, after all.

That's all great if we're the only one doing any work, but that's not realistic in a lot of cases.

Parallel development

Let's start with two users. I'll use the standard "Alice and Bob" here, although apparently this idiom is falling out of favor for some reason. Each person makes their own clone, so that each person has their own branch names. This gets into a small side discussion:

  • Each repository shares commits.
  • But each repository has its own branch names.

On Alice's system, she gets:

...--G--H   <-- main

and on Bob's, he gets:

...--G--H   <-- main

When Alice makes two new commits (on main or any other branch name), her commits get unique hash IDs:

          I--J   <-- alice
         /
...--G--H

Meanwhile, when Bob makes two new commits, his commits also get unique hash IDs:

...--G--H
         \
          K--L   <-- bob

If we take all these commits and combine them in a single repository, and use the names alice and bob to find the last ones, we get this picture:

          I--J   <-- alice
         /
...--G--H
         \
          K--L   <-- bob

(with, perhaps, main pointing to H—though we don't need a name for H, as we can find it by starting at either branch tip and working backwards).

Given the existence of parallel development, we now have a problem: How do we join these parallel lines of development?

Merge (true merges)

One way to do this is to use Git's ability to merge work. We obtain all the commits, into some Git repository somewhere, and use branch names like the above to find them. Then we pick one of the two branches to check out / switch to, and run git merge with the other:

git checkout alice
git merge bob

for instance.

Git's merge engine now does what I like to call merge as a verb: the action of combining changes since some common starting point. The common starting point is obvious from the drawing: it's commit H.

Git will now use its comparison software—git diff, more or less—to compare the snapshot in commit H to that in commit J, to see what files Alice changed, and what changes Alice made to those files. Git will also use git diff to compare H vs L, to see what Bob changed. Then, for each file:

  • If nobody changed the file, we just take the original file.
  • If one person changed the file and the other didn't, we take the changed file.
  • If both persons changed the same file, we have Git work hard to combine their changes.

Git's combining is done with a simple and stupid algorithm, that just looks at line-by-line changes. If the changed lines don't "touch" or "overlap", Git will take both changes. If they do touch-or-overlap, Git will generally declare a merge conflict and force the human running git merge to clean up the mess. There are many special cases here, but if Alice and Bob are working on different parts of the system, Git will often be able to do all the work-combining on its own.
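The first half of that process, finding the common starting point and the two sets of changes, can be reproduced by hand. In this sketch Alice only touches a.txt and Bob only touches b.txt, which is an assumption made to keep the example conflict-free:

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo base > a.txt
echo base > b.txt
git add a.txt b.txt
git commit -q -m H
h=$(git rev-parse HEAD)

git checkout -q -b alice
echo "alice 1" > a.txt; git commit -q -a -m I
echo "alice 2" > a.txt; git commit -q -a -m J

git checkout -q -b bob main
echo "bob 1" > b.txt; git commit -q -a -m K
echo "bob 2" > b.txt; git commit -q -a -m L

# The common starting point of the two branch tips.
base=$(git merge-base alice bob)

# What each side changed since that starting point.
alice_changed=$(git diff --name-only "$base" alice)
bob_changed=$(git diff --name-only "$base" bob)

echo "merge base:    $base"
echo "alice changed: $alice_changed"
echo "bob changed:   $bob_changed"
```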

Since we're not really covering git merge properly here, let's just assume that Git thinks all went well, so that Git makes its own new commit for you. Git applies the combined changes to the snapshot from the common starting point—what Git calls the merge base—in commit H, which keeps both sets of changes. Git writes all of the resulting files to both your working tree and Git's own index / staging-area. Then, Git makes a new commit from these files:

          I--J
         /    \
...--G--H      M   <-- alice (HEAD)
         \    /
          K--L   <-- bob

Since we ran git checkout alice to start this, we're on branch alice, so the new merge commit's hash ID goes into the name alice. The resulting commit has a snapshot, just like any commit, made by applying the combined changes to the snapshot from H. It has metadata, just like any ordinary commit, saying that we made this commit just now. The only thing special about this commit is that, instead of pointing back to just commit J, it points back to both branch-tip commits, J and L.

We are now allowed to delete any name we don't need. The name we don't need here is bob: we can find commit L by working backwards through M. Git will work backwards to both commits, following both backwards-pointing arrows automatically.

This is a true merge, and is one way to combine work. Git can do this; GitHub can do this; and a GitHub pull request can be handled through this kind of process, as long as there are no merge conflicts. But some humans don't like to do this.
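Here is a sketch of such a true merge on a scratch repository, again assuming Alice and Bob edited different files so that no conflict arises:

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo base > a.txt
echo base > b.txt
git add a.txt b.txt
git commit -q -m H

git checkout -q -b alice
echo "alice 1" > a.txt; git commit -q -a -m I
echo "alice 2" > a.txt; git commit -q -a -m J
j=$(git rev-parse alice)

git checkout -q -b bob main
echo "bob 1" > b.txt; git commit -q -a -m K
echo "bob 2" > b.txt; git commit -q -a -m L
l=$(git rev-parse bob)

# Merge bob into alice; -m supplies the log message so no editor opens.
git checkout -q alice
git merge -q -m M bob

# The merge commit has two parents: the previous tip of alice, then the tip of bob.
parents=$(git log -1 --format=%P alice)
echo "parents of M: $parents"

# The name bob is no longer needed: L is reachable through M.
git branch -q -d bob
git log --graph --format='%s' alice
```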

Rebasing instead of merging

Let's suppose that we are Alice, and we have this situation:

          I--J   <-- alice (HEAD)
         /
...--G--H   <-- main

But Bob gets his commits added to some repository first:

...--G--H--K--L   <-- main

We can now run git fetch against this other repository—we'll call it origin here—to pick up new commits K-L. Here is what we will see in our own local repository:

          I--J   <-- alice (HEAD)
         /
...--G--H   <-- main
         \
          K--L   <-- origin/main

If we have a "no merges" rule—who knows why we have this rule4—we have to take our perfectly good I-J commits and "improve" them, by making new commits that add on to L.

To do this manually, we'd use git cherry-pick twice, with a new temporary branch name, then (e.g.) change the branch names. But the git rebase command can do this for us all in one go:

git rebase origin/main
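For comparison, the manual route could look roughly like this. To keep the sketch self-contained there is no second repository: a local branch named upstream stands in for origin/main, and alice-new is a made-up temporary branch name.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo base > a.txt
echo base > b.txt
git add a.txt b.txt
git commit -q -m H

git checkout -q -b alice
echo "alice 1" > a.txt; git commit -q -a -m I
i=$(git rev-parse HEAD)
echo "alice 2" > a.txt; git commit -q -a -m J
j=$(git rev-parse HEAD)

# Stand-in for origin/main: Bob's commits K-L, added on top of H.
git checkout -q -b upstream main
echo "bob 1" > b.txt; git commit -q -a -m K
echo "bob 2" > b.txt; git commit -q -a -m L
l=$(git rev-parse upstream)

# Temporary branch at L, then copy I and J one at a time.
git checkout -q -b alice-new upstream
git cherry-pick "$i" > /dev/null
git cherry-pick "$j" > /dev/null

# Move the real name to the copies and drop the temporary one.
git branch -f alice alice-new
git checkout -q alice
git branch -q -D alice-new

git log --format='%h %s' alice
```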

The rebase operation copies the effects of some set of commits. To do this, it has to use Git's merge machinery, the merge-as-a-verb part of the idea. The git cherry-pick command implements this, one commit at a time, and git rebase runs git cherry-pick repeatedly.5 Once it has the right snapshot prepared, each cherry-pick step commits this snapshot, re-using the original commit's log message, but as a regular single-parent commit, not as a merge commit. So once this stage is done, we have:

          I--J   <-- alice
         /
...--G--H   <-- main
         \
          K--L   <-- origin/main
              \
               I'-J'  <-- HEAD

where I' is Git's automated copy of commit I, and J' is Git's automated copy of commit J. This drawing also illustrates a trick that rebase uses internally: it runs in what Git calls detached HEAD mode, to avoid having to make up a temporary branch name.6

Once all the copies are done, though, Git uses another internal command to force the original branch name to point to the last copied commit. It then re-attaches HEAD, so that we have:

          I--J   [abandoned]
         /
...--G--H   <-- main
         \
          K--L   <-- origin/main
              \
               I'-J'  <-- alice (HEAD)

Note how this resembles what happens when we deliberately throw out never-sent-to-anyone-else commits.
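The end state can be checked on a scratch repository. As a simplification, the sketch uses a local branch called upstream in place of origin/main, and Alice and Bob edit different files so the rebase never stops for a conflict:

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo base > a.txt
echo base > b.txt
git add a.txt b.txt
git commit -q -m H

git checkout -q -b alice
echo "alice 1" > a.txt; git commit -q -a -m I
echo "alice 2" > a.txt; git commit -q -a -m J
j=$(git rev-parse alice)

# Stand-in for origin/main: Bob's commits K-L.
git checkout -q -b upstream main
echo "bob 1" > b.txt; git commit -q -a -m K
echo "bob 2" > b.txt; git commit -q -a -m L
l=$(git rev-parse upstream)

# Copy I-J on top of L and move the name alice to the copies.
git checkout -q alice
git rebase -q upstream

new_tip=$(git rev-parse alice)
current=$(git symbolic-ref --short HEAD)

git log --format='%h %s' alice
# The original J is abandoned, not destroyed: the object is still there.
git cat-file -t "$j"
```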

(An unfortunate side effect here is that usually, with this kind of work-flow, we've already sent these commits somewhere for review. We'll touch on this in the next section.)


4This is, in my opinion, not really a good rule. I've followed it before in projects—it's not a terrible rule—but just blindly saying "no merges" is, I think, wrong. Still, people like it.

5In fact, git rebase is a horrifically complicated command, that can do its job in one of many different ways. In modern Git, it now defaults to using git cherry-pick internally. In slightly older Git versions, you need -m or -i or similar options to get it to use git cherry-pick. Other options add special features, and rebase already has numerous special features so some options disable these features. But mostly, it's about copying some set of existing not-quite-good-enough commits to new-and-improved commits, and that's what git cherry-pick is also about, so they're closely related.

6This detached HEAD mode implementation detail "leaks out" if the rebase has to stop to get help with a merge conflict, or if you use the git rebase -i variant and make it stop on purpose: when rebase stops in the middle of the operation, you're still in this detached-HEAD mode. You must tell Git to resume the rebase, or terminate it, to get out of the detached-HEAD mode. This gets messy, and you should be careful not to use git checkout to get out of the detached HEAD mode.
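The leak described in footnote 6 can be provoked on purpose. In this sketch both sides edit the same line of a.txt to force a conflict, and a local branch upstream again stands in for origin/main:

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo base > a.txt
git add a.txt
git commit -q -m H

git checkout -q -b alice
echo "alice" > a.txt; git commit -q -a -m I
orig=$(git rev-parse alice)

git checkout -q -b upstream main
echo "bob" > a.txt; git commit -q -a -m K

git checkout -q alice

# The rebase stops with a conflict and exits non-zero.
if git rebase upstream > /dev/null 2>&1
then
    stopped=no
else
    stopped=yes
fi

# While it is stopped, HEAD is not attached to any branch name.
if git symbolic-ref -q HEAD > /dev/null
then
    during=attached
else
    during=detached
fi

# Terminate the rebase properly instead of using git checkout to escape.
git rebase --abort
after=$(git symbolic-ref --short HEAD)

echo "rebase stopped: $stopped, HEAD while stopped: $during, branch afterwards: $after"
```

After --abort, alice points at its original commit again; git rebase --continue is the other proper way out, once the conflict has been resolved and staged.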



Rebasing in the GitHub fork world

All of the above is stuff we can do in base Git. GitHub and other hosting sites, however, add on a bunch of features, in the hope that we'll like those features enough to actually pay for services on those hosting sites.

The first GitHub specific feature is the GitHub fork. A fork, on GitHub, is like a clone, but with two changes:


