Git Sparse-Checkout Ignore Specific File Type

Git sparse checkout with exclusion

With Git 2.25 (Q1 2020), management of sparsely checked-out working trees has gained a dedicated "sparse-checkout" command.

Git 2.37 (Q3 2022) makes the cone mode the default. See last section of this answer.


First, here is an extended example, starting with a fast clone using a --filter option:

git clone --filter=blob:none --no-checkout https://github.com/git/git
cd git
git sparse-checkout init --cone
# that sets git config core.sparseCheckoutCone true
git read-tree -mu HEAD

Using the cone option (detailed/documented below) means your .git/info/sparse-checkout will include patterns starting with:

/*
!/*/

Meaning: only top files, no subfolder.

If you do not want top-level files, you need to avoid cone mode:

# Disable cone mode in .git/config.worktree
git config core.sparseCheckoutCone false

# remove .git/info/sparse-checkout
git sparse-checkout disable

# Add the expected pattern, to include just a subfolder without top files:
git sparse-checkout set /mySubFolder/

# populate working-tree with only the right files:
git read-tree -mu HEAD

In detail:

(See more at "Bring your monorepo down to size with sparse-checkout" from
Derrick Stolee)

So not only does excluding a subfolder work, it also works faster with the "cone" mode of a sparse checkout (Git 2.25+).

See commit 761e3d2 (20 Dec 2019) by Ed Maste (emaste).

See commit 190a65f (13 Dec 2019), and commit cff4e91, commit 416adc8, commit f75a69f, commit fb10ca5, commit 99dfa6f, commit e091228, commit e9de487, commit 4dcd4de, commit eb42fec, commit af09ce2, commit 96cc8ab, commit 879321e, commit 72918c1, commit 7bffca9, commit f6039a9, commit d89f09c, commit bab3c35, commit 94c0956 (21 Nov 2019) by Derrick Stolee (derrickstolee).

See commit e6152e3 (21 Nov 2019) by Jeff Hostetler (Jeff-Hostetler).

(Merged by Junio C Hamano -- gitster -- in commit bd72a08, 25 Dec 2019)

sparse-checkout: add 'cone' mode

Signed-off-by: Derrick Stolee

The sparse-checkout feature can have quadratic performance as the number of patterns and number of entries in the index grow.

If there are 1,000 patterns and 1,000,000 entries, this time can be very significant.

Create a new Boolean config option, core.sparseCheckoutCone, to indicate that we expect the sparse-checkout file to contain a more limited set of patterns.

This is a separate config setting from core.sparseCheckout to avoid breaking older clients by introducing a tri-state option.

The config man page includes:

`core.sparseCheckoutCone`:

Enables the "cone mode" of the sparse checkout feature.

When the sparse-checkout file contains a limited set of patterns, then this mode provides significant performance advantages.

The git sparse-checkout man page details:

CONE PATTERN SET

The full pattern set allows for arbitrary pattern matches and complicated inclusion/exclusion rules.

These can result in O(N*M) pattern matches when updating the index, where N is the number of patterns and M is the number of paths in the index. To combat this performance issue, a more restricted pattern set is allowed when core.sparseCheckoutCone is enabled.

The accepted patterns in the cone pattern set are:

  1. Recursive: All paths inside a directory are included.
  2. Parent: All files immediately inside a directory are included.

In addition to the above two patterns, we also expect that all files in the root directory are included. If a recursive pattern is added, then all leading directories are added as parent patterns.

By default, when running git sparse-checkout init, the root directory is added as a parent pattern.
At this point, the sparse-checkout file contains the following patterns:

/*
!/*/

This says "include everything in root, but nothing two levels below root."

If we then add the folder A/B/C as a recursive pattern, the folders A and A/B are added as parent patterns.

The resulting sparse-checkout file is now

/*
!/*/
/A/
!/A/*/
/A/B/
!/A/B/*/
/A/B/C/

Here, order matters, so the negative patterns are overridden by the positive
patterns that appear lower in the file.

If core.sparseCheckoutCone=true, then Git will parse the sparse-checkout file expecting patterns of these types.

Git will warn if the patterns do not match.

If the patterns do match the expected format, then Git will use faster hash-based algorithms to compute inclusion in the sparse-checkout.
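
To make the pattern layout quoted above concrete, here is a minimal sketch in POSIX shell that prints the cone-mode patterns for a single recursive directory. The function name cone_patterns is hypothetical (it is not a Git command), and the sketch only mirrors the A/B/C example from the man page; it does not try to reproduce how Git merges several directories.

```shell
#!/bin/sh
# Print the cone-mode patterns for one recursive directory such as "A/B/C".
# cone_patterns is a hypothetical helper for illustration, not part of Git.
cone_patterns() {
    dir=${1#/}          # drop a leading slash, if any
    dir=${dir%/}        # drop a trailing slash, if any

    # The root is always a parent pattern: root files, no subdirectories.
    printf '%s\n' '/*' '!/*/'

    prefix=
    rest=$dir
    # Every leading directory becomes a parent pattern pair.
    while [ "${rest#*/}" != "$rest" ]; do
        prefix="$prefix/${rest%%/*}"
        rest=${rest#*/}
        printf '%s/\n!%s/*/\n' "$prefix" "$prefix"
    done

    # The requested directory itself is a recursive pattern.
    printf '%s/\n' "$prefix/$rest"
}

cone_patterns A/B/C
```

Running it for A/B/C should print the same seven lines as the sparse-checkout file shown above; in practice "git sparse-checkout set A/B/C" writes that file for you.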

So:

sparse-checkout: init and set in cone mode

Helped-by: Eric Wong

Helped-by: Johannes Schindelin

Signed-off-by: Derrick Stolee

To make the cone pattern set easy to use, update the behavior of 'git sparse-checkout (init|set)'.

Add '--cone' flag to 'git sparse-checkout init' to set the config option 'core.sparseCheckoutCone=true'.

When running 'git sparse-checkout set' in cone mode, a user only needs to supply a list of recursive folder matches. Git will automatically add the necessary parent matches for the leading directories.


Note, the --cone option is only documented in Git 2.26 (Q1 2020)

(Merged by Junio C Hamano -- gitster -- in commit ea46d90, 05 Feb 2020)

doc: sparse-checkout: mention --cone option

Signed-off-by: Matheus Tavares

Acked-by: Derrick Stolee

In af09ce2 ("sparse-checkout: init and set in cone mode", 2019-11-21, Git v2.25.0-rc0 -- merge), the '--cone' option was added to 'git sparse-checkout init'.

Document it in git sparse-checkout:

That includes:

When --cone is provided, the core.sparseCheckoutCone setting is also set, allowing for better performance with a limited set of patterns.

("set of patterns" presented above, in the "CONE PATTERN SET" section of this answer)


How much faster is this new "cone" mode?

sparse-checkout: use hashmaps for cone patterns

Helped-by: Eric Wong

Helped-by: Johannes Schindelin

Signed-off-by: Derrick Stolee

The parent and recursive patterns allowed by the "cone mode" option in sparse-checkout are restrictive enough that we can avoid using the regex parsing. Everything is based on prefix matches, so we can use hashsets to store the prefixes from the sparse-checkout file. When checking a path, we can strip path entries from the path and check the hashset for an exact match.

As a test, I created a cone-mode sparse-checkout file for the Linux repository that actually includes every file. This was constructed by taking every folder in the Linux repo and creating the pattern pairs here:

/$folder/
!/$folder/*/

This resulted in a sparse-checkout file with 8,296 patterns.

Running 'git read-tree -mu HEAD' on this file had the following performance:

core.sparseCheckout=false     : 0.21 s (0.00 s)
core.sparseCheckout=true      : 3.75 s (3.50 s)
core.sparseCheckoutCone=true  : 0.23 s (0.01 s)

The times in parentheses above correspond to the time spent in the first clear_ce_flags() call, according to the trace2 performance traces.

While this example is contrived, it demonstrates how these patterns can slow the sparse-checkout feature.
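
The prefix lookup described in that commit message can be modeled roughly as follows. This is a simplified sketch, not Git's implementation: newline-delimited strings stand in for the two hashsets, the names RECURSIVE, PARENTS, in_set and is_included are all hypothetical, and the sets are filled in by hand for the A/B/C example.

```shell
#!/bin/sh
# Simplified model of cone-mode matching for a cone built from A/B/C.
# Newline-delimited strings stand in for Git's hashsets (an assumption
# made to keep the sketch in plain POSIX shell).
RECURSIVE='/A/B/C'
PARENTS='/
/A
/A/B'

# in_set SET KEY: succeed if KEY is an exact line of SET.
in_set() {
    printf '%s\n' "$1" | grep -qxF -- "$2"
}

# is_included PATH: succeed if PATH (no leading slash) is inside the cone.
is_included() {
    dir=/$1
    dir=${dir%/*}               # directory that contains the file
    # Parent match: the file sits immediately inside a parent directory.
    in_set "$PARENTS" "${dir:-/}" && return 0
    # Recursive match: strip one path entry at a time and look for an
    # exact match among the recursive directories.
    while [ -n "$dir" ]; do
        in_set "$RECURSIVE" "$dir" && return 0
        dir=${dir%/*}
    done
    return 1
}

for p in README A/x.txt A/B/C/D/e.c A/Z/f.c; do
    if is_included "$p"; then echo "in   $p"; else echo "out  $p"; fi
done
```

Each lookup is an exact string comparison on a prefix of the path, which is why no pattern matching is needed once the file is known to contain only cone patterns.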

And:

sparse-checkout: respect core.ignoreCase in cone mode

Signed-off-by: Derrick Stolee

When a user uses the sparse-checkout feature in cone mode, they add patterns using "git sparse-checkout set <dir1> <dir2> ..." or by using "--stdin" to provide the directories line-by-line over stdin.

This behaviour naturally looks a lot like the way a user would type "git add <dir1> <dir2> ..."

If core.ignoreCase is enabled, then "git add" will match the input using a case-insensitive match.

Do the same for the sparse-checkout feature.

Perform case-insensitive checks while updating the skip-worktree bits during unpack_trees(). This is done by changing the hash algorithm and hashmap comparison methods to optionally use case-insensitive methods.

When this is enabled, there is a small performance cost in the hashing algorithm.

To tease out the worst possible case, the following was run on a repo with a deep directory structure:

git ls-tree -d -r --name-only HEAD |
git sparse-checkout set --stdin

The 'set' command was timed with core.ignoreCase disabled or enabled.

For the repo with a deep history, the numbers were

core.ignoreCase=false: 62s
core.ignoreCase=true: 74s (+19.3%)

For reproducibility, the equivalent test on the Linux kernel repository had these numbers:

core.ignoreCase=false: 3.1s
core.ignoreCase=true: 3.6s (+16%)

Now, this is not an entirely fair comparison, as most users will define their sparse cone using more shallow directories, and the performance improvement from eb42feca97 ("unpack-trees: hash less in cone mode" 2019-11-21, Git 2.25-rc0) can remove most of the hash cost. For a more realistic test, drop the "-r" from the ls-tree command to store only the first-level directories.

In that case, the Linux kernel repository takes 0.2-0.25s in each case, and the deep repository takes one second, plus or minus 0.05s, in each case.

Thus, we can demonstrate a cost to this change, but it is unlikely to matter to any reasonable sparse-checkout cone.


With Git 2.25 (Q1 2020), "git sparse-checkout list" subcommand learned to give its output in a more concise form when the "cone" mode is in effect.

See commit 4fd683b, commit de11951 (30 Dec 2019) by Derrick Stolee (derrickstolee).

(Merged by Junio C Hamano -- gitster -- in commit c20d4fd, 06 Jan 2020)

sparse-checkout: list directories in cone mode

Signed-off-by: Derrick Stolee

When core.sparseCheckoutCone is enabled, the 'git sparse-checkout set' command takes a list of directories as input, then creates an ordered list of sparse-checkout patterns such that those directories are recursively included and all sibling entries along the parent directories are also included.

Listing the patterns is less user-friendly than the directories themselves.

In cone mode, and as long as the patterns match the expected cone-mode pattern types, change the output of 'git sparse-checkout list' to only show the directories that created the patterns.

With this change, the following piped commands would not change the working directory:

git sparse-checkout list | git sparse-checkout set --stdin

The only time this would not work is if core.sparseCheckoutCone is true, but the sparse-checkout file contains patterns that do not match the expected pattern types for cone mode.


The code recently added in this release to move to the entry beyond the ones in the same directory in the index in the sparse-cone mode counted the number of entries to skip over incorrectly, which has been corrected with Git 2.25.1 (Feb. 2020).

See commit 7210ca4 (27 Jan 2020) by Junio C Hamano (gitster).

See commit 4c6c797 (10 Jan 2020) by Derrick Stolee via GitGitGadget.

(Merged by Junio C Hamano -- gitster -- in commit 043426c, 30 Jan 2020)

unpack-trees: correctly compute result count

Reported-by: Johannes Schindelin

Signed-off-by: Derrick Stolee

The clear_ce_flags_dir() method processes the cache entries within a common directory. The returned int is the number of cache entries processed by that directory.

When using the sparse-checkout feature in cone mode, we can skip the pattern matching for entries in the directories that are entirely included or entirely excluded.

eb42feca ("unpack-trees: hash less in cone mode", 2019-11-21, Git v2.25.0-rc0 -- merge listed in batch #0) introduced this performance feature. The old mechanism relied on the counts returned by calling clear_ce_flags_1(), but the new mechanism calculated the number of rows by subtracting "cache_end" from "cache" to find the size of the range.

However, the equation is wrong because it divides by sizeof(struct cache_entry *). This is not how pointer arithmetic works!

A coverity build of Git for Windows in preparation for the 2.25.0 release found this issue with the warning:

Pointer differences, such as cache_end - cache, are automatically
scaled down by the size (8 bytes) of the pointed-to type (struct cache_entry *).
Most likely, the division by sizeof(struct cache_entry *) is extraneous
and should be eliminated.

This warning is correct.


With Git 2.26 (Q1 2020), some rough edges in the sparse-checkout feature, especially around the cone mode, have been cleaned up.

See commit f998a3f, commit d2e65f4, commit e53ffe2, commit e55682e, commit bd64de4, commit d585f0e, commit 4f52c2c, commit 9abc60f (31 Jan 2020), and commit 9e6d3e6, commit 41de0c6, commit 47dbf10, commit 3c75406, commit d622c34, commit 522e641 (24 Jan 2020) by Derrick Stolee (derrickstolee).

See commit 7aa9ef2 (24 Jan 2020) by Jeff King (peff).

(Merged by Junio C Hamano -- gitster -- in commit 433b8aa, 14 Feb 2020)

sparse-checkout: fix cone mode behavior mismatch

Reported-by: Finn Bryant

Signed-off-by: Derrick Stolee

The intention of the special "cone mode" in the sparse-checkout feature is to always match the same patterns that are matched by the same sparse-checkout file as when cone mode is disabled.

When a file path is given to "git sparse-checkout set" in cone mode, then the cone mode improperly matches the file as a recursive path.

When setting the skip-worktree bits, files were not expecting the MATCHED_RECURSIVE response, and hence these were left out of the matched cone.

Fix this bug by checking for MATCHED_RECURSIVE in addition to MATCHED and add a test that prevents regression.

The documentation now includes:

When core.sparseCheckoutCone is enabled, the input list is considered a
list of directories instead of sparse-checkout patterns.

The command writes patterns to the sparse-checkout file to include all files contained in those directories (recursively) as well as files that are siblings of ancestor directories.

The input format matches the output of git ls-tree --name-only. This includes interpreting pathnames that begin with a double quote (") as C-style quoted strings.


With Git 2.26 (Q1 2020), "git sparse-checkout" learned a new "add" subcommand.

See commit 6c11c6a (20 Feb 2020), and commit ef07659, commit 2631dc8, commit 4bf0c06, commit 6fb705a (11 Feb 2020) by Derrick Stolee (derrickstolee).

(Merged by Junio C Hamano -- gitster -- in commit f4d7dfc, 05 Mar 2020)

sparse-checkout: create 'add' subcommand

Signed-off-by: Derrick Stolee

When using the sparse-checkout feature, a user may want to incrementally grow their sparse-checkout pattern set.

Allow adding patterns using a new 'add' subcommand.

This is not much different from the 'set' subcommand, because we still want to allow the '--stdin' option and interpret inputs as directories when in cone mode and patterns otherwise.

When in cone mode, we are growing the cone.

This may actually reduce the set of patterns when adding directory A when A/B is already a directory in the cone. Test the different cases: siblings, parents, ancestors.

When not in cone mode, we can only assume the patterns should be appended to the sparse-checkout file.
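
The remark that adding a directory can shrink the pattern set can be illustrated with a small sketch. It assumes the cone is tracked as a plain list of recursive directories; add_to_cone is a hypothetical helper written for this illustration and is not how Git stores or updates its patterns.

```shell
#!/bin/sh
# add_to_cone NEW DIR...: print the smallest list of recursive directories
# that covers DIR... plus NEW. Hypothetical helper, not Git's own code.
add_to_cone() {
    new=${1%/}
    shift
    covered=no
    for d in "$@"; do
        d=${d%/}
        case "$d/" in
            "$new"/*) continue ;;   # d equals NEW or lies below it: NEW replaces it
        esac
        case "$new/" in
            "$d"/*) covered=yes ;;  # an existing directory already contains NEW
        esac
        printf '%s\n' "$d"
    done
    # Only list NEW if no existing directory already includes it.
    [ "$covered" = yes ] || printf '%s\n' "$new"
}

# The cone holds A/B and X; adding A makes the A/B entry redundant.
add_to_cone A A/B X
```

With A/B and X already in the cone, adding A leaves just X and A, so the generated sparse-checkout file needs fewer lines than before.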

And:

sparse-checkout: work with Windows paths

Signed-off-by: Derrick Stolee

When using Windows, a user may run 'git sparse-checkout set A\B\C' to add the Unix-style path A/B/C to their sparse-checkout patterns.

Normalizing the input path converts the backslashes to slashes before we add the string 'A/B/C' to the recursive hashset.
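
As a rough illustration of that normalization step (a sketch only; normalize_path is a hypothetical shell helper, while Git does this internally in C):

```shell
#!/bin/sh
# normalize_path is a hypothetical helper that mimics the normalization
# described above: every backslash becomes a forward slash.
normalize_path() {
    printf '%s\n' "$1" | tr '\\' '/'
}

normalize_path 'A\B\C'
```

The single quotes matter on a Unix shell, where an unquoted backslash would be consumed before the function ever sees it.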


The sparse-checkout patterns have been forbidden from excluding all paths, leaving an empty working tree, for a long time.

With Git 2.27 (Q2 2020), this limitation has been lifted.

See commit ace224a (04 May 2020) by Derrick Stolee (derrickstolee).

(Merged by Junio C Hamano -- gitster -- in commit e9acbd6, 08 May 2020)

sparse-checkout: stop blocking empty workdirs

Reported-by: Lars Schneider

Signed-off-by: Derrick Stolee

Remove the error condition when updating the sparse-checkout leaves an empty working directory.

This behavior was added in 9e1afb167 ("sparse checkout: inhibit empty worktree", 2009-08-20, Git v1.7.0-rc0 -- merge).

The comment was added in a7bc906f2 ("Add explanation why we do not allow to sparse checkout to empty working tree", 2011-09-22, Git v1.7.8-rc0 -- merge) in response to a "dubious" comment in 84563a624 ("[unpack-trees.c](https://github.com/git/git/blob/ace224ac5fb120e9cae894e31713ab60e91f141f/unpack-trees.c): cosmetic fix", 2010-12-22, Git v1.7.5-rc0 -- merge).

With the recent "cone mode" and "git sparse-checkout init [--cone]" command, it is common to set a reasonable sparse-checkout pattern set of

/*
!/*/

which matches only files at root. If the repository has no such files, then their "git sparse-checkout init" command will fail.

Now that we expect this to be a common pattern, we should not have the commands fail on an empty working directory.

If it is a confusing result, then the user can recover with "git sparse-checkout disable" or "git sparse-checkout set". This is especially simple when using cone mode.


With Git 2.37 (Q3 2022), deprecate non-cone mode of the sparse-checkout feature.

See commit 5d4b293, commit a8defed, commit 72fa58e, commit 5d295dc, commit 0d86f59, commit 71ceb81, commit f69dfef, commit 2d95707, commit dde1358 (22 Apr 2022) by Elijah Newren (newren).

(Merged by Junio C Hamano -- gitster -- in commit 377d347, 03 Jun 2022)

sparse-checkout: make --cone the default

Signed-off-by: Elijah Newren

Make cone mode the default, and update the documentation accordingly.

git config now includes in its man page:

The "non-cone mode" can be requested to allow specifying more flexible
patterns by setting this variable to 'false'.

git sparse-checkout now includes in its man page:

Unless core.sparseCheckoutCone is explicitly set to false, Git will
parse the sparse-checkout file expecting patterns of these types. Git will
warn if the patterns do not match. If the patterns do match the expected
format, then Git will use faster hash-based algorithms to compute inclusion
in the sparse-checkout.

And:

git-sparse-checkout.txt: wording updates for the cone mode default

Signed-off-by: Elijah Newren

Now that cone mode is the default, we'd like to focus on the arguments to set/add being directories rather than patterns, and it probably makes sense to provide an earlier heads up that files from leading directories get included as well.

git sparse-checkout now includes in its man page:

By default, the input list is considered a list of directories, matching
the output of git ls-tree -d --name-only.

This includes interpreting
pathnames that begin with a double quote (") as C-style quoted strings.

Note that all files under the specified directories (at any depth) will
be included in the sparse checkout, as well as files that are siblings
of either the given directory or any of its ancestors (see 'CONE PATTERN
SET' below for more details).

In the past, this was not the default,
and --cone needed to be specified or core.sparseCheckoutCone needed to be enabled.

Git: How to ignore/specify files for *checkout*

If you want to package up files for deployment, you probably don't need - or want - the repo itself. This is exactly what git archive is for. A couple examples from the manpage (linked):

git archive --format=tar --prefix=junk/ HEAD | (cd /var/tmp/ && tar xf -)

Create a tar archive that contains the contents of the latest commit on the current branch, and extract it in the /var/tmp/junk directory.

git archive --format=tar --prefix=git-1.4.0/ v1.4.0 | gzip > git-1.4.0.tar.gz

Create a compressed tarball for v1.4.0 release.

You ought to be able to get it to do exactly what you want, with the help of the export-ignore attribute:

export-ignore

Files and directories with the attribute export-ignore won’t be added to archive files. See gitattributes(5) for details.

For example, to exclude the directory private and the files mine.txt and secret.c, you could put in the file .gitattributes:

private/ export-ignore
mine.txt export-ignore
secret.c export-ignore

Just like gitignore files, you can put those anywhere in your repository, and they'll operate from that directory, but starting from the top level is a good bet.
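
To see the attribute in action, here is a small self-contained sketch that builds a throwaway repository and lists what git archive would export. The file names, the temporary directory, and the demo identity passed to git commit are all made up for the example; it assumes git and tar are installed.

```shell
#!/bin/sh
set -e
# Throwaway repository; every path and file name here is invented.
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir private
echo 'internal' > private/notes.txt
echo 'mine' > mine.txt
echo 'public' > README
printf '%s\n' 'private/ export-ignore' 'mine.txt export-ignore' > .gitattributes
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'demo'

# List what git archive would ship; private/ and mine.txt should be absent.
git archive --format=tar HEAD | tar tf -
```

Note that the attributes are read from the tree being archived, so the .gitattributes file has to be committed (or the --worktree-attributes option used) before it takes effect.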

Is it possible to do a sparse checkout without checking out the whole repository first?

Works in Git 2.37.1

git clone --filter=blob:none --no-checkout --depth 1 --sparse <project-url>
cd <project>

Specify the folders you want to check out:

git sparse-checkout add <folder1> <folder2>
git checkout

How to use git sparse-checkout in 2.27+

I believe I found the reason for this. Commit f56f31af0301 to Git changed the implementation of sparse-checkout so that, when you have an uninitialized working tree (as you would right after running git clone --no-checkout), running git sparse-checkout init will not check out any files into your working tree. In previous versions, the command would actually check out files, which could have unexpected effects given that you wouldn't have an active branch at that point.

The relevant commit, f56f31af0301, was included in Git 2.27, but not in 2.25. That accounts for why the behavior you see is not the behavior shown on the web page you're trying to follow. Basically, the behavior on the web page was a bug that nobody realized was a bug at the time, but with Git 2.27, it has been fixed.

This is explained very well, I think, in the message for commit b5bfc08a972d:

So...that brings us to the special case: a git clone performed with
--no-checkout. As per the meaning of the flag, --no-checkout does not
check out any branch, with the implication that you aren't on one and
need to switch to one after the clone. Implementationally, HEAD is
still set (so in some sense you are partially on a branch), but

  • the index is "unborn" (non-existent)
  • there are no files in the working tree (other than .git/)
  • the next time git switch (or git checkout) is run it will run
    unpack_trees with initial_checkout flag set to true.

It is not until you run, e.g. git switch <somebranch> that the index
will be written and files in the working tree populated.

With this special --no-checkout case, the traditional read-tree -mu HEAD
behavior would have done the equivalent of acting like checkout --
switch to the default branch (HEAD), write out an index that matches
HEAD, and update the working tree to match. This special case slipped
through the avoid-making-changes checks in the original sparse-checkout
command and thus continued there.

After update_sparsity() was introduced and used (see commit f56f31a
("sparse-checkout: use new update_sparsity() function", 2020-03-27)),
the behavior for the --no-checkout case changed: Due to git's
auto-vivification of an empty in-memory index (see do_read_index() and
note that must_exist is false), and due to sparse-checkout's
update_working_directory() code to always write out the index after it
was done, we got a new bug. That made it so that sparse-checkout would
switch the repository from a clone with an "unborn" index (i.e. still
needing an initial_checkout), to one that had a recorded index with no
entries. Thus, instead of all the files appearing deleted in git status
being known to git as a special artifact of not yet being on a
branch, our recording of an empty index made it suddenly look to git as
though it was definitely on a branch with ALL files staged for deletion!
A subsequent checkout or switch then had to contend with the fact that
it wasn't on an initial_checkout but had a bunch of staged deletions.

How to sparsely checkout only one single file from a git repository?

Originally (2012), I mentioned git archive (see Jared Forsyth's answer and Robert Knight's answer); since Git 1.7.9.5 (March 2012), see also Paul Brannan's answer:

git archive --format=tar --remote=origin HEAD:path/to/directory -- filename | tar -O -xf -

But: in 2013, that was no longer possible for remote https://github.com URLs.

See the old page "Can I archive a repository?"

The current (2018) page "About archiving content and data on GitHub" recommends using third-party services like GHTorrent or GH Archive.


So you can also deal with local copies/clone:

You could alternatively do the following if you have a local copy of the bare repository as mentioned in this answer,

git --no-pager --git-dir /path/to/bar/repo.git show branch:path/to/file >file

Or you must first clone the repo, meaning you get the full history:
- in the .git repo
- in the working tree.

  • But then you can do a sparse checkout (if you are using Git 1.7+):

    • enable the sparse checkout option (git config core.sparsecheckout true)
    • add what you want to see to the .git/info/sparse-checkout file
    • re-read the working tree to only display what you need

To re-read the working tree:

$ git read-tree -m -u HEAD

That way, you end up with a working tree including precisely what you want (even if it is only one file)


Richard Gomes points (in the comments) to "How do I clone, fetch or sparse checkout a single directory or a list of directories from git repository?"

A bash function which avoids downloading the history, which retrieves a single branch and which retrieves a list of files or directories you need.

GIT checkout except one folder

You could use sparse checkout to exclude the committed contents of the node_modules directory from the working tree. As the documentation says:

"Sparse checkout" allows populating the working directory sparsely. It uses the skip-worktree bit to tell Git whether a file in the working directory is worth looking at.

Here's how you use it. First, you enable the sparseCheckout option:

git config core.sparseCheckout true

Then, you add the node_modules path as a negation in the .git/info/sparse-checkout file:

echo -e "/*\n!node_modules" >> .git/info/sparse-checkout

This will create a file called sparse-checkout containing:

/*
!node_modules

which effectively means ignore the node_modules directory when reading a commit's tree into the working directory.
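
The steps above can be tried end to end in a throwaway repository. This is a sketch under a few assumptions: git is installed, the file names are invented, core.sparseCheckoutCone is set to false explicitly so that newer Git versions treat the file as plain patterns, and printf is used instead of echo -e for portability.

```shell
#!/bin/sh
set -e
# Throwaway repository; the file names are invented for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir node_modules src
echo 'dep' > node_modules/pkg.js
echo 'app' > src/app.js
# -f in case a global ignore file already lists node_modules.
git add -f .
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'demo'

# Classic (non-cone) sparse checkout: keep everything except node_modules.
git config core.sparseCheckout true
git config core.sparseCheckoutCone false
mkdir -p .git/info
printf '%s\n' '/*' '!node_modules' > .git/info/sparse-checkout
git read-tree -mu HEAD

# node_modules should now be gone from the working tree, src/ still present.
ls
```

The files are still part of the commit; they are only marked skip-worktree, so "git sparse-checkout disable" (or removing the patterns and re-reading the tree) brings them back.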


