Linux Kernel "Historical" Git Repository with Full History

How to split a historical git repository?

You can simply use something like

git init
git add .
git commit -m "Import project blablabla"

in a directory containing all the files you want to import (but not the directories like .svn), and get a new Git repository without history.

git checkout --orphan would also work if you already have an imported history and if the latest commit of this history contains the right files.

Then, you want to graft this commit on top of the imported history. Graft points were initially created for this use-case, but they were superseded by git replace. The manual page for git replace is hard to read, but there's a Replace Kicker tutorial.

How to fetch all git history after I clone the repo with `--depth 1`?

Use git pull --unshallow and it will download the entire commit history.
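
To see the effect without touching a real project, the following is a minimal sketch against a throwaway repository. The directory names (upstream, shallow) and the demo identity are made up for illustration; it uses git fetch --unshallow, which downloads the same history as the pull variant but does not merge anything.

```shell
# Throwaway demo: all paths and names below are hypothetical.
set -e
demo=$(mktemp -d)
cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Build a small "upstream" repository with three commits.
git init -q upstream
for n in 1 2 3; do
    echo "$n" > upstream/file.txt
    git -C upstream add file.txt
    git -C upstream commit -q -m "commit $n"
done

# --depth is ignored for plain local paths, so clone through a file:// URL.
git clone -q --depth 1 "file://$demo/upstream" shallow
shallow_count=$(git -C shallow rev-list --count HEAD)

# Convert the shallow clone into a complete one.
git -C shallow fetch -q --unshallow
full_count=$(git -C shallow rev-list --count HEAD)

echo "before: $shallow_count commit(s), after: $full_count commit(s)"
```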

How to prepend the past to a git repository?

For importing the old snapshots, you may find some of the tools in Git's contrib/fast-import directory useful. Or, if you already have each old snapshot in a directory, you might do something like this:

# Assumes the v* glob will sort in the right order
# (i.e. zero padded, fixed width numeric fields)
# For v1, v2, v10, v11, ... you might try:
#     v{1..23}     (1 through 23)
#     v?{,?}       (v+one character, then v+two characters)
#     v?{,?{,?}}   (v+{one,two,three} characters)
#     $(ls -v v*)  (GNU ls has "version sorting")
# Or, just list them directly: ``for d in foo bar baz quux; do''
(git init import)
for d in v*; do
    if mv import/.git "$d/"; then
        (cd "$d" && git add --all && git commit -m"pre-Git snapshot $d")
        mv "$d/.git" import/
    fi
done
(cd import && git checkout HEAD -- .)

Then fetch the old history into your working repository:

cd work && git fetch ../import master:old-history

Once you have both the old history and your Git-based history in the same repository, you have a couple of options for the prepend operation: grafts and replacements.

Grafts are a per-repository mechanism to (possibly temporarily) edit the parentage of various existing commits. Grafts are controlled by the $GIT_DIR/info/grafts file (described under “info/grafts” of the gitrepository-layout manpage).

INITIAL_SHA1=$(git rev-list --reverse master | head -1)
TIP_OF_OLD_HISTORY_SHA1=$(git rev-parse old-history)
echo $INITIAL_SHA1 $TIP_OF_OLD_HISTORY_SHA1 >> .git/info/grafts

With the graft in place (the original initial commit did not have any parents, the graft gave it one parent), you can use all the normal Git tools to search through and view the extended history (e.g. git log should now show you the old history after your commits).

The main problem with grafts is that they are limited to your repository. But, if you decide that they should be a permanent part of the history, you can use git filter-branch to make them so (make a tar/zip backup of your .git dir first; git filter-branch will save original refs, but sometimes it is just easier to use a plain backup).

git filter-branch --tag-name-filter cat -- --all
rm .git/info/grafts

The replacement mechanism is newer (Git 1.6.5+), but replacements can be disabled on a per-command basis (git --no-replace-objects …) and they can be pushed for easier sharing. Replacement works on individual objects (blobs, trees, commits, or annotated tags), so the mechanism is also more general. The replace mechanism is documented in the git replace manpage. Due to the generality, the “prepending” setup is a little more involved (we have to create a new commit instead of just naming the new parent):

# the last commit of old history branch
oldhead=$(git rev-parse --verify old-history)
# the initial commit of current branch
newinit=$(git rev-list master | tail -n 1)
# create a fake commit based on $newinit, but with a parent
# (note: at this point, $oldhead must be a full commit ID)
newfake=$(git cat-file commit "$newinit" \
        | sed "/^tree [0-9a-f]\+\$/aparent $oldhead" \
        | git hash-object -t commit -w --stdin)
# replace the initial commit with the fake one
git replace -f "$newinit" "$newfake"
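
On newer Git versions (2.1 and later, as far as I know), git replace --graft builds the fake commit for you, so the cat-file/sed/hash-object pipeline is not needed. The sketch below tries it in a throwaway repository; the branch names, file contents, and demo identity are invented for the example.

```shell
# Throwaway demo: repository layout and names are hypothetical.
set -e
demo=$(mktemp -d)
cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q work
cd work

# Two commits standing in for the imported old history.
echo old > file.txt;    git add file.txt; git commit -q -m "old 1"
echo older >> file.txt; git add file.txt; git commit -q -m "old 2"
git branch old-history

# Two commits on an unrelated root, standing in for the Git-era history.
git checkout -q --orphan current
echo new >> file.txt;   git add file.txt; git commit -q -m "new 1"
echo newer >> file.txt; git add file.txt; git commit -q -m "new 2"

oldhead=$(git rev-parse --verify old-history)
newinit=$(git rev-list --max-parents=0 HEAD)   # root commit of the current branch
before=$(git rev-list --count HEAD)

# Give the root commit a parent by way of a replacement object.
git replace --graft "$newinit" "$oldhead"
after=$(git rev-list --count HEAD)

# The original, unreplaced history is still reachable on request.
unreplaced=$(git --no-replace-objects rev-list --count HEAD)

echo "before: $before, after: $after, with --no-replace-objects: $unreplaced"
```

The replacement still lives under refs/replace, so the sharing and filter-branch steps described here apply to it unchanged.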

Sharing this replacement is not automatic. You have to push part of (or all of) refs/replace to share the replacement.

git push some-remote 'refs/replace/*'

If you decide to make the replacement permanent, use git filter-branch (same as with grafts; make a tar/zip backup of your .git directory first):

git filter-branch --tag-name-filter cat -- --all
git replace -d "$newinit"

search manual containing git

You can get a list of all man pages with man -k . (the trailing '.' is a regular expression that matches any character). To search for "git", you could simply use man -k git, but this also returns pages like "isdigit", which is not what we want. To match "git" only as a whole word, use word-boundary anchors in the regular expression: man -k '\<git\>'. Now you get a long list of all the git pages.

For your second question, there is no single "command learning resource". The commands available on a Linux/Unix system are simply the programs installed on it, so the more programs you install, the more "commands" you have. You may want to start by learning the shell's built-in commands, so take a look at bash or whichever shell you prefer.
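
What man -k prints depends on which pages are installed, so here is a small sketch that applies the same word-boundary pattern with grep to a few sample lines instead. The sample lines are only a stand-in for real man -k output.

```shell
# Stand-in for `man -k git` output; these sample lines are illustrative only.
names='git (1) - the stupid content tracker
git-log (1) - Show commit logs
isdigit (3) - character classification functions'

# \< and \> anchor the match at word boundaries, so "isdigit" is skipped
# while "git" and "git-log" still match ("-" is not a word character).
matches=$(printf '%s\n' "$names" | grep '\<git\>')
printf '%s\n' "$matches"
```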

As always: A good beginner book is always a good resource!

What is the difference between linux-next and linux-next-history git repositories?

I think this explains the reason https://lkml.org/lkml/2011/8/2/95:

Date      Tue, 2 Aug 2011 20:08:34 +1000
From      Stephen Rothwell <>
Subject   linux-next changes

Hi all,

Noone seems to have noticed, but I have mode the following changes to the
linux-next repository on git.kernel.org:

git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git now
contains just the last 90 (or so) linux-next trees.  I have removed the
"history" branch from this tree as it was serving no real purpose.  You
can fetch a particular tree by using its name as a tag.  It is now aonly
about 40MB relative to Linus' tree (as opposed to 300M for the complete
tree.

I have put the old tree into
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next-history.git
which I will keep maintaining with the new trees, but is just there for
historical purposes.  It also no longer has the "history" branch.

If anyone had a tree left over that referenced linux-next through an
alternate, then you should probably change that to reference
linux-next-history until you have fixed it to not reference it at all.  
-- 
Cheers,  
Stephen Rothwell                    sfr@canb.auug.org.au  

How to find/identify large commits in git history?

I've found this script very useful in the past for finding large (and non-obvious) objects in a git repository:

http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/

#!/bin/bash
#set -x 
 
# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
 
# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';
 
# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`
 
echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."
 
output="size,pack,SHA,location"
allObjects=`git rev-list --all --objects`
for y in $objects
do
    # extract the size in bytes
    size=$((`echo $y | cut -f 5 -d ' '`/1024))
    # extract the compressed size in bytes
    compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
    # extract the SHA
    sha=`echo $y | cut -f 1 -d ' '`
    # find the objects location in the repository tree
    other=`echo "${allObjects}" | grep $sha`
    #lineBreak=`echo -e "\n"`
    output="${output}\n${size},${compressedSize},${other}"
done
 
echo -e $output | column -t -s ', '

That will give you the object name (SHA1sum) of the blob, and then you can use a script like the one in "Which commit has this blob?" to find the commit that points to each of those blobs.
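
If your Git is recent enough (2.16 or later, as far as I know), git log --find-object can do the blob-to-commit lookup without a separate script. A minimal sketch in a throwaway repository follows; the file names and the demo identity are made up.

```shell
# Throwaway demo: file names and sizes are hypothetical.
set -e
demo=$(mktemp -d)
cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
echo small > small.txt
git add small.txt
git commit -q -m "add small file"

# A 200 kB file standing in for a large blob that was later deleted.
head -c 200000 /dev/zero > big.bin
git add big.bin
git commit -q -m "add big file"
git rm -q big.bin
git commit -q -m "remove big file"

# Object name of the blob, as a size-listing script would report it.
blob=$(git rev-parse HEAD~1:big.bin)

# List the commits in which the number of occurrences of that blob changes,
# i.e. the commits that introduced or removed it.
commits=$(git log --all --find-object="$blob" --format='%h %s')
printf '%s\n' "$commits"
```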

How to clone a git repo piecemeal

Cloning cannot be resumed; if it is interrupted, you need to start over. There are a couple of workarounds, though:

You can use a shallow clone, i.e. git clone --depth=1, and then deepen the repository using git fetch --depth=N with increasing N. Disclaimer: I have never tried this myself.
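
The deepening loop could look roughly like the sketch below, run here against a throwaway local repository. The directory names, the depth steps (2, 4, 8), and the demo identity are arbitrary choices for illustration; for a real clone over a slow link you would pick larger steps and rerun any fetch that gets interrupted.

```shell
# Throwaway demo: names and depth steps are hypothetical.
set -e
demo=$(mktemp -d)
cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# An "upstream" repository with five commits.
git init -q upstream
for n in 1 2 3 4 5; do
    echo "$n" > upstream/file.txt
    git -C upstream add file.txt
    git -C upstream commit -q -m "commit $n"
done

# Start with the smallest possible download (file:// so --depth is honoured).
git clone -q --depth 1 "file://$demo/upstream" piecemeal

# Deepen step by step; each fetch only transfers the missing commits.
for depth in 2 4 8; do
    git -C piecemeal fetch -q --depth="$depth"
    echo "depth $depth: $(git -C piecemeal rev-list --count HEAD) commit(s)"
done

final=$(git -C piecemeal rev-list --count HEAD)
```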

Another option could be git-bundle. The bundle itself is a single file, which you can download via HTTP or FTP with resume support (or via BitTorrent, rsync, or any download manager). You can have somebody create a bundle for you, download it, and create a clone from it. Then correct the remote configuration and fetch from the original repo.
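
A minimal sketch of the bundle route, again in a throwaway directory. The names repo.bundle, upstream and fromfile are made up, and the "download" step is just a local file here; in practice the bundle would be created on the server side and fetched with a resumable tool.

```shell
# Throwaway demo: names are hypothetical, and the bundle never leaves the disk.
set -e
demo=$(mktemp -d)
cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q upstream
for n in 1 2 3; do
    echo "$n" > upstream/file.txt
    git -C upstream add file.txt
    git -C upstream commit -q -m "commit $n"
done

# Server side: pack all refs and their history into a single file.
git -C upstream bundle create "$demo/repo.bundle" --all 2>/dev/null

# Client side: clone from the downloaded file ...
git clone -q "$demo/repo.bundle" fromfile
before=$(git -C fromfile rev-list --count --remotes)

# ... then point "origin" at the real repository instead of the bundle.
git -C fromfile remote set-url origin "file://$demo/upstream"

# Upstream moves on; a normal fetch now picks up only the new commit.
echo 4 > upstream/file.txt
git -C upstream add file.txt
git -C upstream commit -q -m "commit 4"
git -C fromfile fetch -q origin
after=$(git -C fromfile rev-list --count --remotes)

echo "from bundle: $before commit(s), after fetch: $after commit(s)"
```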

Kernel building: how are the torvalds and stable repos related?

Yes, you'd mostly want to build off stable unless you're working on bleeding-edge stuff.

Tags are merely pointers to commits: the fact that one repo has a tag and the other doesn't does not mean that the commit isn't present in both repos. (For instance, 'stable' could have a tag 'Foo' that points to commit 'A' - torvalds might also have that commit A as part of some branches, but doesn't have the named tag.)
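
The point about tags can be checked with two throwaway repositories. In this sketch, stable-like and torvalds-like are made-up stand-ins for the two kernel trees, and Foo is the hypothetical tag from the example above.

```shell
# Throwaway demo: both repositories and the tag name are hypothetical.
set -e
demo=$(mktemp -d)
cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q stable-like
echo fix > stable-like/fix.txt
git -C stable-like add fix.txt
git -C stable-like commit -q -m "commit A"

# The second repository gets the commit, but is cloned before the tag exists.
git clone -q stable-like torvalds-like
git -C stable-like tag Foo

commit=$(git -C stable-like rev-parse 'Foo^{commit}')

# Is the commit object present in the other repository?
if git -C torvalds-like cat-file -e "$commit"; then
    has_commit=yes
else
    has_commit=no
fi

# Is the tag itself present there?
if git -C torvalds-like rev-parse -q --verify refs/tags/Foo >/dev/null; then
    has_tag=yes
else
    has_tag=no
fi

echo "commit present: $has_commit, tag present: $has_tag"
```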