Git Diff with Line Numbers and Proper Code Alignment/Indentation

git diff with line numbers and proper code alignment/indentation

1/3: the bug fix

@Inian was correct: I just needed commas between arguments. I've put in the work (perhaps as much as ~20~30 hrs since posting this question) and I'm pretty decent at the basics of using awk now. I've learned a ton.

For the sake of answering this question, here's the solution I came up with right after @Inian posted his answer, based on his feedback. The key parts to focus in on are the printf calls. Notice I've added commas in between the format string and each argument thereafter. As he said, that's the fix.

Parts to focus on:

printf "-%+4s     :%s\n", left++, line
printf "+%+4s     :%s\n", right++, line
printf " %+4s,%+4s:%s\n", left++, right++, line

Whole thing in context:

git diff HEAD~..HEAD --color=always | \
  gawk '{bare=$0;gsub("\033[[][0-9]*m","",bare)};\
    match(bare,"^@@ -([0-9]+),[0-9]+ [+]([0-9]+),[0-9]+ @@",a){left=a[1];right=a[2];next};\
    bare ~ /^(---|\+\+\+|[^-+ ])/{print;next};\
    {line=gensub("^(\033[[][0-9]*m)?(.)","\\2\\1",1,$0)};\
    bare~/^-/{printf   "-%+4s     :%s\n", left++, line;next};\
    bare~/^[+]/{printf "+%+4s     :%s\n", right++, line;next};\
    {printf            " %+4s,%+4s:%s\n", left++, right++, line;next}'

Here's some sample output I get just by copying and pasting the above script into my terminal. If you'd like to duplicate this exactly, go git clone my dotfiles repo and run git checkout 4386b089f163d9d5ff26d277b53830e54095021c. Then, copy and paste the above script into your terminal. The output looks pretty good. The alignment of the numbers and things on the left now looks nice:

$     git diff HEAD~..HEAD --color=always | \
>       gawk '{bare=$0;gsub("\033[[][0-9]*m","",bare)};\
>         match(bare,"^@@ -([0-9]+),[0-9]+ [+]([0-9]+),[0-9]+ @@",a){left=a[1];right=a[2];next};\
>         bare ~ /^(---|\+\+\+|[^-+ ])/{print;next};\
>         {line=gensub("^(\033[[][0-9]*m)?(.)","\\2\\1",1,$0)};\
>         bare~/^-/{printf   "-%+4s     :%s\n", left++, line;next};\
>         bare~/^[+]/{printf "+%+4s     :%s\n", right++, line;next};\
>         {printf            " %+4s,%+4s:%s\n", left++, right++, line;next}'
diff --git a/useful_scripts/git-diffn.sh b/useful_scripts/git-diffn.sh
index 22c74e2..cf8ba08 100755
--- a/useful_scripts/git-diffn.sh
+++ b/useful_scripts/git-diffn.sh
   49,  49: #   4. `git-gs_diffn`
   50,  50: #   3. `gs_git-diffn`
   51,  51: 
+  52     :+# FUTURE WORK:
+  53     :+# 1. Make work with standard awk?
+  54     :+#    This has been tested on Linux Ubuntu 18.04. If anyone can't get this working on their system,
+  55     :+#    such as in the git bash terminal that comes with Git for Windows, or on MacOS, due to 
+  56     :+#    compatibility probems with `gawk`, I can rewrite the few places relying on `gawk` extensions
+  57     :+#    to just use basic awk instead. That should solve any compatibility problems, but there's no
+  58     :+#    sense in doing it if there's no need. If I ever need to do this in the future though, I'm
+  59     :+#    going to need this trick to obtain a substring using standard awk:
+  60     :+#    https://stackoverflow.com/questions/5536018/how-to-print-matched-regex-pattern-using-awk/5536342#5536342
+  61     :+#   1. Also, look into this option in gawk for testing said compatibility: 
+  62     :+#     1. `--lint` - https://www.gnu.org/software/gawk/manual/html_node/Options.html
+  63     :+#     1. `--traditional` and `--posix` - https://www.gnu.org/software/gawk/manual/html_node/Compatibility-Mode.html
+  64     :+#   1. Currently, `--lint` is telling me that the 3rd argument to `match()` (ie: the array 
+  65     :+#      parameter) is a gawk extension.
+  66     :+
   52,  67: # References:
   53,  68: # 1. This script borrows from @PFudd's script here:
   54,  69: #    https://stackoverflow.com/questions/24455377/git-diff-with-line-numbers-git-log-with-line-numbers/33249416#33249416
  133, 148: # "41", "42", etc. codes is this:
  134, 149: #       ^(\033\[(([0-9]{1,2};?){1,10})m)?
  135, 150: 
+ 151     :+# Be sure to place all args (`"$@"`) AFTER `--color=always` so that if the user passes in
+ 152     :+# `--color=never` or `--no-color` they will override my `--color=always` here, since later
+ 153     :+# options override earlier ones.
  136, 154: git diff --color=always "$@" | \
  137, 155: gawk \
  138, 156: '

Here's a screenshot to show the nice color output:

Sample Image

The original script, shown here:

git diff HEAD~..HEAD --color=always | \
  gawk '{bare=$0;gsub("\033[[][0-9]*m","",bare)};\
    match(bare,"^@@ -([0-9]+),[0-9]+ [+]([0-9]+),[0-9]+ @@",a){left=a[1];right=a[2];next};\
    bare ~ /^(---|\+\+\+|[^-+ ])/{print;next};\
    {line=gensub("^(\033[[][0-9]*m)?(.)","\\2\\1",1,$0)};\
    bare~/^-/{print "-"left++ ":" line;next};\
    bare~/^[+]/{print "+"right++ ":" line;next};\
    {print "("left++","right++"):"line;next}'

produces pretty awful-looking (in comparison), unaligned output:

$     git diff HEAD~..HEAD --color=always | \
>       gawk '{bare=$0;gsub("\033[[][0-9]*m","",bare)};\
>         match(bare,"^@@ -([0-9]+),[0-9]+ [+]([0-9]+),[0-9]+ @@",a){left=a[1];right=a[2];next};\
>         bare ~ /^(---|\+\+\+|[^-+ ])/{print;next};\
>         {line=gensub("^(\033[[][0-9]*m)?(.)","\\2\\1",1,$0)};\
>         bare~/^-/{print "-"left++ ":" line;next};\
>         bare~/^[+]/{print "+"right++ ":" line;next};\
>         {print "("left++","right++"):"line;next}'
diff --git a/useful_scripts/git-diffn.sh b/useful_scripts/git-diffn.sh
index 22c74e2..cf8ba08 100755
--- a/useful_scripts/git-diffn.sh
+++ b/useful_scripts/git-diffn.sh
(49,49): #   4. `git-gs_diffn`
(50,50): #   3. `gs_git-diffn`
(51,51): 
+52:+# FUTURE WORK:
+53:+# 1. Make work with standard awk?
+54:+#    This has been tested on Linux Ubuntu 18.04. If anyone can't get this working on their system,
+55:+#    such as in the git bash terminal that comes with Git for Windows, or on MacOS, due to 
+56:+#    compatibility probems with `gawk`, I can rewrite the few places relying on `gawk` extensions
+57:+#    to just use basic awk instead. That should solve any compatibility problems, but there's no
+58:+#    sense in doing it if there's no need. If I ever need to do this in the future though, I'm
+59:+#    going to need this trick to obtain a substring using standard awk:
+60:+#    https://stackoverflow.com/questions/5536018/how-to-print-matched-regex-pattern-using-awk/5536342#5536342
+61:+#   1. Also, look into this option in gawk for testing said compatibility: 
+62:+#     1. `--lint` - https://www.gnu.org/software/gawk/manual/html_node/Options.html
+63:+#     1. `--traditional` and `--posix` - https://www.gnu.org/software/gawk/manual/html_node/Compatibility-Mode.html
+64:+#   1. Currently, `--lint` is telling me that the 3rd argument to `match()` (ie: the array 
+65:+#      parameter) is a gawk extension.
+66:+
(52,67): # References:
(53,68): # 1. This script borrows from @PFudd's script here:
(54,69): #    https://stackoverflow.com/questions/24455377/git-diff-with-line-numbers-git-log-with-line-numbers/33249416#33249416
(133,148): # "41", "42", etc. codes is this:
(134,149): #       ^(\033\[(([0-9]{1,2};?){1,10})m)?
(135,150): 
+151:+# Be sure to place all args (`"$@"`) AFTER `--color=always` so that if the user passes in
+152:+# `--color=never` or `--no-color` they will override my `--color=always` here, since later
+153:+# options override earlier ones.
(136,154): git diff --color=always "$@" | \
(137,155): gawk \
(138,156): '

Screenshot:

Sample Image

2/3: making the numbers colored as well:

To answer the 2nd part of my question:

Additionally, the numbers and +/- signs should be green and red, respectively, like in normal git diff output.

I then added some ANSI color codes for red (\033[31m) and green (\033[32m) to the 3rd-from-last and 2nd-from-last lines shown below:

git diff HEAD~..HEAD --color=always | \
  gawk '{bare=$0;gsub("\033[[][0-9]*m","",bare)};\
    match(bare,"^@@ -([0-9]+),[0-9]+ [+]([0-9]+),[0-9]+ @@",a){left=a[1];right=a[2];next};\
    bare ~ /^(---|\+\+\+|[^-+ ])/{print;next};\
    {line=gensub("^(\033[[][0-9]*m)?(.)","\\2\\1",1,$0)};\
    bare~/^-/{printf   "\033[31m-%+4s     :%s\n", left++, line;next};\
    bare~/^[+]/{printf "\033[32m+%+4s     :%s\n", right++, line;next};\
    {printf                    " %+4s,%+4s:%s\n", left++, right++, line;next}'

and got this nicer-looking output. Notice the numbers at the far left are now colored too:

Sample Image

3/3: Lastly:

I then:

dissected the code above for days
studied awk like crazy
threw away the whole code above because it
1. was impossible to read and horribly formatted
2. did some really weird things that didn't make any sense and weren't needed, and
3. didn't handle any edge cases or custom git diff colors
kept the really good parts that were absolutely ingenious and perfect, and
wrote this really nice and full implementation of git diffn which:
1. handles all edge cases I could think of
2. acts as a drop-in replacement to git diff
3. handles all colors and text formatting possible (as far as I can tell), and
4. produces really nice output with or without color on

See here for git diffn info & installation instructions: Git diff with line numbers (Git log with line numbers)

The end.

Git diff with line numbers (Git log with line numbers)

You can't get human-readable line numbers with `git diff`

There aren't currently any options to get line-numbers displayed vertically on the side with git diff.

Unified-diff format

That information is available in the (c)hunk headers for each change in the diff though, it's just in unified-diff format:

@@ -start,count +start,count @@

The original state of the file is represented with -, and the new state is represented with + (they don't mean additions and deletions in the hunk header. start represents the starting line number of each version of the file, and count represents how many lines are included, starting from the start point.

Example

diff --git a/osx/.gitconfig b/osx/.gitconfig
index 4fd8f04..fcd220c 100644
--- a/osx/.gitconfig
+++ b/osx/.gitconfig
@@ -11,7 +11,7 @@ <== HERE!
 [color "branch"]
        upstream = cyan
 [color "diff"]
-       meta = yellow
+       meta = cyan
        plain = white dim
        old = red bold
        new = green bold

The hunk header

@@ -11,7 +11,7 @@

says that the previous version of the file starts at line 11, and includes 7 lines:

11  [color "branch"]
12         upstream = cyan
13  [color "diff"]
14 -       meta = yellow
14 +       meta = cyan
15         plain = white dim
16         old = red bold
17         new = green bold

while the next version of the file also starts at line 11, and also includes 7 lines.

Unified-diff format isn't really for human consumption

As you can probably tell, unified-diff format doesn't make it easy to figure out line numbers (at least if you're not a machine). If you really want line numbers that you can read, you'll need to use a diffing tool that will display them for you.

Additional Reading

Official git-diff(1) Manual Page

Git diff to show only lines that have been modified

What you want is a diff with 0 lines of context. You can generate this with:

git diff --unified=0

git diff -U0

You can also set this as a config option for that repository:

git config diff.context 0

To have it set globally, for any repository:

 git config --global diff.context 0

How can I format the code in a multi-branch project?

Edit, Jun 2022

I'm just boosting the signal from Rufus' comment below:

https://github.com/emilio/clang-format-merge contains code that provides a merge driver, rather than clean and smudge filters. It looks likely to be useful though, especially for repositories that have never had standard formatting enforced.

Recipe with assumptions

(note: I have not tested any of this)

We'll assume the reformatter is in ~/Downloads/android-studio/bin/format.sh and [note: apparently this is a bad assumption!] that it reads stdin and writes stdout, and works on one file at a time. (It's possible, but very difficult, to make this work with something that needs more than one file at a time. You cannot use this recipe for this case, though. Git's basic filtering mechanism requires that each filter simply read stdin and write stdout. By default Git assumes the filter works, even if it exits with a failure status.)

Choose where to run the filter as well; here I've set it up as the "clean" filter only.

In ~/.gitconfig or .git/config, add the definition for the filter:

[filter "my-xyz-language-formatter"]
    clean = ~/Downloads/android-studio/bin/format.sh
    smudge = cat

(this assumes that running cat runs a filter that writes, to its stdout, its unchanged input; this is true on any Unix-like system).

Then, create a .gitattributes file if needed. It will apply to the directory you create it in, and all sub-directories, unless overridden in those sub-directories, so place it in the highest sensible location, usually the root of the repository, but sometimes underneath a source/ or src/ or whatever directory. Add line(s) to direct file(s) matching some pattern(s) through your formatter. We'll assume here that all files named *.xyz should be formatted:

*.xyz   filter=my-xyz-language-formatter

This filter will now apply to all extractions and insertions of *.xyz files. The gitattributes documentation talks about these being applied at check-out and check-in time, but that's not quite precisely correct. Instead, a clean filter is applied whenever Git copies from work-tree to index (essentially, git add—well before git commit unless you use git commit -a or similar flags). A smudge filter is applied whenever Git copies from index to work-tree (essentially, git checkout, but also some additional cases, such as git reset --hard).

Note that spinning up one filter for each file can be quite slow. There's a "long running filter process" protocol you can use if you have a lot of control over the filter, which can speed this up (especially on Windows). That's beyond the scope of this answer, though.

Running git merge normally does not use the filters (it works on the copies that are already in the index, which is outside the filtering step). However, adding -X renormalize to a standard merge will make git merge do the "virtual check-in and check-out" described below, so that it will apply the filters. This happens for all three commits involved in the merge (and in both directions—clean and smudge—so it's roughly 6x slower than for just one commit).

Description (see below)

Git itself is only partially helpful here.

Fundamentally, the problem is that Git is stupid and line-oriented: it runs git diff from the merge base commit to each tip commit. If one or both of these git diffs sees a lot of formatting changes, it considers those significant and worthy of applying to the base. It has no semantic knowledge of the input code.

(Since you can take over the entire merge process, you could write a smarter merge that does use semantic analysis. This is pretty difficult, though. The only system I know of that does this, or something approaching this, is Ira Baxter's commercial software, and I've never actually used that; I just understand the theory behind it.)

There is a solution that does not depend on making Git smarter. If you have a semantic analyzer that outputs consistently formatted code, regardless of the input form, you can feed all three versions—B for base, L for left or local or --ours, and R for right or remote or other or --theirs—into this formatter:

reformat < B > B.formatted
reformat < L > L.formatted
reformat < R > R.formatted

Now you can have Git merge all three formatted versions, rather than merging the original possibly-not-yet-formatted (but maybe formatted) versions.

The result of this merge will, of course, be re-formatted. But presumably this is what you'd like anyway.

The way to achieve this with Git's built-in tools is to use what it calls smudge and clean filters. A smudge filter is applied to files as they are extracted from the repository into the work-tree. A clean filter is applied to files whenever they go from the work-tree into the repository.

In this case, the smudge filter can be "do nothing to the data", preserving exactly what was committed. The clean filter can be the reformatter. Or, if you prefer, the smudge filter can be the reformatter, and the clean filter can be the reformatter again, or a no-op filter. Once you have this in place—this is something you set up in .gitattributes, by defining a filter for particular files by path names, and the filter-driver in .git/config or your main (user or system wide) .gitconfig.

Once you have all that set up, you can run git merge -X renormalize. Git will extract the B, L, and R versions as usual, but then run them through a "virtual check-out and check-in" step, making three temporary commits,¹ B.formatted and so on. It then does the merge using the three temporary commits, rather than from the original three commits.

The hard part is finding a reformatter that does just what you want / need. Some modern systems have them, e.g., gofmt or clang-format. If there's one that does what you need, it just becomes a matter of plugging all this together—and getting buy-in from the rest of your group, that this reformatting is a good idea.

¹Technically it just makes tree objects; there's no need for actual commits.

How do I run a code formatter over my source without modifying git history?

What you are trying to do is impossible. You cannot, at some point in time, change a line of code, and yet have git report that the most recent change to that line of code is something that happened before that point in time.

I suppose a source control tool could support the idea of an "unimportant change", where you mark a commit as cosmetic and then history analysis would skip over that commit. I'm not sure how the tool would verify that the change really was cosmetic, and without some form of tool enforcement the feature would assuredly be misused resulting in bug introductions potentially being hidden in "unimportant" commits. But really the reasons I think it's a bad idea are academic here - the bottom line is, git doesn't have such a feature. (Nor can I think of any source control tool that does.)

You can change the formatting going forward. You can preserve the visibility of past changes. You can avoid editing history. But you cannot do all three at the same time, so you're going to have to decide which one to sacrifice.

There are actually a couple down-sides to the history rewrite, by the way. You mentioned processing time, so let's look at that first:

As you've noted, the straightforward way to do this with filter-branch would be very time consuming. There are things you can do to speed it up (like giving it a ramdisk for its working tree), but it's a tree-filter and it involves processing of each version of each file.

If you did some pre-processing, you could be somewhat more efficient. For example, you might be able to preprocess every BLOB in the database and create a mapping (where a TREE contains BLOB X, replace it with BLOB Y), and then use an index-filter to perform the substitutions. This would avoid all the checkout and add operations, and it would avoid repeatedly re-formatting the same code files. So that saves a lot of I/O. But it's a non-trivial thing to set up, and still might be time consuming.

(It's possible to write a more specialized tool based on this same principle, but AFAIK nobody has written one. There is precedent that more specialized tools can be faster than filter-branch...)

Even if you come to a solution that will run fast enough, bear in mind that the history rewrite will disturb all of your refs. Like any history rewrite, it will be necessary for all users of the repo to update their clones - and for something this sweeping, the way I recommend to do that is, throw the clones out before you start the rewrite and re-clone afterward.

That also means if you have anything that depends on commit ID's, that will also be broken. (That could include build infrastructure, or release documentation, etc.; depending on your project's practices.)

So, a history rewrite is a pretty drastic solution. And on the other hand, it also seems drastic to suppose that formatting the code is impossible simply because it wasn't done from day 1. So my advice:

Do the reformatting in a new commit. If you need to use git blame, and it points you to the commit where reformatting occurred, then follow up by running git blame again on the reformat commit's parent.

Yeah, it sucks. For a while. But a given piece of history tends to become less important as it ages, so from there you just let the problem gradually diminish into the past.

Quantifying the amount of change in a git diff?

wdiff does word-by-word comparison. Git can be configured to use an external program to do the diffing. Based on those two facts and this blog post, the following should do roughly what you want.

Create a script to ignore most of the unnecessary arguments that git-diff provides and pass them to wdiff. Save the following as ~/wdiff.py or something similar and make it executable.

#!/usr/bin/python

import sys
import os

os.system('wdiff -s3 "%s" "%s"' % (sys.argv[2], sys.argv[5]))

Tell git to use it.

git config --global diff.external ~/wdiff.py
git diff filename

My diff contains trailing whitespace - how to get rid of it?

Are you sure those are hard errors? By default, git will warn about whitespace errors, but will still accept them. If they are hard errors then you must have changed some settings. You can use the --whitespace= flag to git apply to control this on a per-invocation basis. Try

git apply --whitespace=warn patchname.patch

That will force the default behavior, which is to warn but accept. You can also use --whitespace=nowarn to remove the warnings entirely.

The config variable that controls this is apply.whitespace.

For reference, the whitespace errors here aren't errors with your patch. It's a code style thing that git will, by default, complain about when applying patches. Notably, it dislikes trailing whitespace. Similarly git diff will highlight whitespace errors (if you're outputting to a terminal and color is on). The default behavior is to warn, but accept the patch anyway, because not every project is fanatical about whitespace.

Git status shows files as changed even though contents are the same

Update: as per the comment on this question, the problem has been solved:

That is easy: the first file has CRLF line-ends (windows), the second
LF (Unix). The file util (available in git\usr\bin) will show you that (file a b will reply
something like a: ASCII text, with CRLF line terminators b: ASCII text)

Original answer below:

The diff you show does not show a single different line. Can you post .git/config (or better git config -l).

You might have some whitespace ignores activated

You should try to disable core.whitespace=fix,-indent-with-non-tab,trailing-space,cr-at-eol;

also

git show HEAD:myfile|md5sum
md5sum myfile

could be used to verify that the files are in fact different. Using external diff could work as well

git show HEAD:myfile > /tmp/myfile.HEAD

diff -u myfile /tmp/myfile.HEAD

# or if you prefer an interactive tool like e.g.:
vim -d myfile /tmp/myfile.HEAD

Git Diff with Line Numbers and Proper Code Alignment/Indentation