Bounding Unicorns

More Advanced Git Features

Git is a very powerful version control system. It can be used the same way one would use, say, Subversion, doing just the basic checkout/commit/branch operations. But the more advanced capabilities of git are well worth learning because they not only provide a marginal boost in productivity but enable undertakings not easily possible with traditional version control systems.

Reading source

Source code exists for two purposes: writing it, which typically accomplishes some task, and reading it, which typically is done during debugging or otherwise to answer questions like "why was this done?", "when was this done?", "what was this doing before?".

Developers read source code every day, because most real-life software has a long life span (in contrast to, say, college assignments where each program is written from scratch and takes several hours). But maybe you only read the current source. Revision history gives a temporal dimension to the source, allowing you to examine what the source looked like at any point in the past. I'm sure I said nothing new, but how often do you actually browse source history?

If the answer is "pretty much never", we have some work to do.

The first tool I am going to introduce is gitk. It is part of git but sometimes needs to be installed separately. Try it now in whichever repository is most convenient:

gitk

This produces a tree of commits in the current branch. Depending on the project, it may be a list of commits that does not branch at all, or it may be a tree with more merges than non-merge commits, where figuring out which commit belongs where is far from trivial.

The other gitk mode of operation you need to know is viewing history of all branches at the same time:

gitk --all

If your project has a single master branch, this view will be identical to the current branch gitk view. On a project with many branches, this view will be more involved and can be difficult to follow.

Now, what's the point of looking at gitk, you ask? Here are some typical use cases:

Suppose you identified a commit that changes behavior in some way, but this commit is part of a larger set of changes (think of a pull request). How can you figure out which pull request the commit was part of, and what other changes were made in that pull request?

Open gitk and enter the commit's hash into the appropriate box. Look up and down the history around the commit. In a repository with clean history, the branch on which the commit was made was branched off master and merged back to master. So you can follow the branch up and down to find the pull request number and identify other commits in the branch as well as see what they changed.

In a repository with messy history, especially for long lived branches, there might be other merges present. These additional merges make it more difficult to follow history. We'll get to examples of useless merges later.
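If you would rather stay on the command line, the same question can be answered roughly as follows. This is a sketch in a throwaway repository; the branch name feature and all commit messages are made up for illustration. It assumes clean history, where the oldest merge commit that descends from your commit is the merge of its branch into master. If master was merged into the feature branch along the way, that assumption breaks and gitk is the better tool.

```shell
set -e
# Throwaway repository so nothing real is touched.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the main branch is called master
git config user.name 'Example User'
git config user.email 'user@example.com'

echo base > file.txt
git add file.txt
git commit -q -m 'Initial commit'

# A hypothetical feature branch with two commits.
git checkout -q -b feature
echo one >> file.txt
git commit -q -am 'Feature part 1'
commit=$(git rev-parse HEAD)              # the commit we are curious about
echo two >> file.txt
git commit -q -am 'Feature part 2'

# Merge it back the way a pull request would be merged.
git checkout -q master
git merge -q --no-ff -m 'Merge pull request: feature' feature
echo other > other.txt
git add other.txt
git commit -q -m 'Unrelated work on master'

# Oldest merge that is a descendant of $commit and an ancestor of master.
merge=$(git log --merges --ancestry-path --format=%H "$commit..master" | tail -n 1)
git log -1 --format='%h %s' "$merge"

# Everything that came in with that merge: commits reachable from the
# merge's second parent but not from its first parent.
siblings=$(git log --format=%s "$merge^1..$merge^2")
echo "$siblings"
```

The merge commit's message is where hosting services usually record the pull request number, so that is the place to look for it.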

Identifying common and different commits between branches

Suppose you have two branches that share some commits. How do you identify where the branches diverged, how much they have in common and how much they differ? You can do all of this with command line git tools but it is typically much more efficient to use gitk. Open it with the branches in question:

gitk branch1 branch2

and navigate to one of the branches. Follow the branch's history until you see a split point - that is where the other branch splits off. Keep following the history of the original branch back to master or whatever the main branch is.
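For comparison, here is roughly what the command line equivalent looks like. It is a minimal sketch in a throwaway repository; branch1 and branch2 match the names used above, while the file names and commit messages are invented for the example.

```shell
set -e
# Throwaway repository with two branches that share one commit.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name 'Example User'
git config user.email 'user@example.com'

echo base > file.txt
git add file.txt
git commit -q -m 'Initial commit'
git branch branch1
git branch branch2

git checkout -q branch1
echo a > a.txt
git add a.txt
git commit -q -m 'Only on branch1'

git checkout -q branch2
echo b > b.txt
git add b.txt
git commit -q -m 'Only on branch2'

# Where the two branches diverged.
base=$(git merge-base branch1 branch2)
git log -1 --format='%h %s' "$base"

# Commits that are on one branch but not on the other.
only1=$(git log --format=%s branch2..branch1)
only2=$(git log --format=%s branch1..branch2)

# Both sides at once; < marks branch1 commits, > marks branch2 commits.
git log --oneline --left-right branch1...branch2

# A text rendition of what gitk draws.
git log --oneline --graph branch1 branch2
```

This works, but on anything larger than a toy repository I find the gitk picture much quicker to read than the text output.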

If the history is clean, you often can (and I do) have a single instance of gitk --all running with all branches visible, and you can locate any history information you need from that view.

Reviewing commits

The next two commands I am going to cover are git status and git diff. I'm sure you have used them, but you should use them each time you are about to commit something. I see people working like this all the time:

... make some changes in an editor ...
git add .
git commit -m 'Fixed bug whatever'

This workflow simply commits whatever changes are in your working tree. Ideally those are exactly the changes that should be committed, but every now and again they will include some debugging output, a syntax error or a mix of changes for unrelated bugs or features that should have been separate commits.

Instead I suggest the following workflow:

... make some changes in an editor ...

git status
... review the list of changed files, are any irrelevant files changed?
... should any untracked files be added?
... are there any untracked files that should NOT be added?

If there are no untracked files:

git diff
... review the changes you are about to commit.
... are there any irrelevant changes? are the changes complete?

# -a stages all modified and deleted tracked files, so no separate add step is needed
git commit -am 'Fixed bug whatever'

If there are untracked files:

git add .
git diff --cached
... review the changes you are about to commit ...

git commit -m 'Fixed bug whatever'

If your review of the changes finds that there are multiple unrelated changes that you are about to commit, stop and split the commit:

git reset
# add all changes in a file
git add <file>
# interactively add changes in all files
git add -p
# interactively add changes in a file
git add -p <file>
# repeat the addition for other files as necessary
# when done, review again:
git diff --cached

I alias git diff --cached to git dc to save my fingers:

git config --global alias.dc 'diff --cached'

... along with a bunch of other aliases that you can see here.

Rebasing and interactive rebasing

Rebasing is a feature that, as far as I know, first appeared in git. It allows you to change development history. Traditional version control systems, Subversion for example, insist that once a commit is made it cannot be altered or removed. Other version control systems started out with the traditional view of immutable commits but over time recognized the power of editing history that git offers.

In any event, git permits anyone to do anything to any of the existing commits - combine them, split them, change their contents or commit message, change the author, and - not to be overlooked - to take them from one branch and apply them to another branch instead.

Rebasing is a topic well covered in various guides on the Internet, so I will not spend time explaining how to do it. I will, however, explain why you should do it.

Commit management

Each commit should ideally do one thing, do that thing completely, and be easily understandable. Well, when a "thing" is a major feature, completeness is at odds with readability. A 2,000 line commit is not readable. Its commit message probably overlooks many fine gotchas that are hiding in those 2,000 lines of changes.

Therefore, I suggest you favor small, readable commits over "complete" commits.

As we are all human, we don't always commit often enough. Especially if we have to think "are these really all of the changes I am going to make?", it is easy to have commits that are too large. With rebasing in your tool belt, you can commit as often as you take breaks. Each tiny change that you are mentally done with goes into a commit. Later, when you are finished with development, you can perform an interactive rebase and squash together commits that are, for example, implementation of some feature and a trivial bug fix in the same feature, where the value of the bug fix being standalone from the feature is nil.
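Here is one way the feature-plus-trivial-fix case can play out. This is only a sketch in a throwaway repository with invented file names and commit messages. It uses git commit --fixup and git rebase --autosquash, which is one of several ways to squash; reordering lines by hand in the interactive rebase editor gets you the same result. Setting GIT_SEQUENCE_EDITOR=true is there only so the example runs unattended; normally you would leave it out and review the todo list in your editor.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name 'Example User'
git config user.email 'user@example.com'

echo base > file.txt
git add file.txt
git commit -q -m 'Initial commit'

# The feature itself.
echo feature > feature.txt
git add feature.txt
git commit -q -m 'Add feature'
feature_commit=$(git rev-parse HEAD)

# Some other work in between.
echo docs > docs.txt
git add docs.txt
git commit -q -m 'Add docs'

# A trivial bug fix in the feature, recorded as a fixup of the feature commit.
echo fixed >> feature.txt
git commit -q -a --fixup "$feature_commit"

# History is now: Initial commit, Add feature, Add docs, fixup! Add feature.
# Rebase the last three commits; --autosquash moves the fixup next to its
# target and folds it in. The todo list is accepted unchanged here.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash HEAD~3

git log --oneline
```

After the rebase the bug fix no longer exists as a separate commit; its change is part of 'Add feature'.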

Similarly you have the tools to take a commit that you realize is too big, or does too much, and split it into several smaller commits. This typically does not happen often on short lived branches but it becomes crucial when dealing with complex changes on long lived branches.
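Splitting is easiest to show on the most recent commit. The sketch below uses a throwaway repository, and the file names and commit messages are invented. For an older commit the idea is the same, except that you first stop at it by marking it as edit in an interactive rebase and finish with git rebase --continue.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name 'Example User'
git config user.email 'user@example.com'

echo base > file.txt
git add file.txt
git commit -q -m 'Initial commit'

# A commit that does two unrelated things.
echo 'bug fix' > fix.txt
echo 'new feature' > feature.txt
git add fix.txt feature.txt
git commit -q -m 'Fix bug and add feature'

# Undo the commit but keep its changes in the working tree.
git reset -q HEAD~1

# Commit the pieces separately. With changes mixed inside one file,
# git add -p would be used here instead of adding whole files.
git add fix.txt
git commit -q -m 'Fix bug'
git add feature.txt
git commit -q -m 'Add feature'

git log --oneline
```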

Clean history

Once you understand rebasing, you can rebase your feature branches on the main development branch, usually master, before you open pull requests for them. This gets rid of merges of master into your feature branches, which are typically nothing but distracting noise.
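A minimal sketch of what that looks like, again in a throwaway repository with an invented feature branch. In a real project you would typically fetch first and rebase on the remote's master rather than a local one.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name 'Example User'
git config user.email 'user@example.com'

echo base > file.txt
git add file.txt
git commit -q -m 'Initial commit'

# Start a feature branch.
git checkout -q -b feature
echo feature > feature.txt
git add feature.txt
git commit -q -m 'Add feature'

# Meanwhile master moves on.
git checkout -q master
echo more > master.txt
git add master.txt
git commit -q -m 'Master moved on'

# Instead of merging master into feature, replay feature on top of master.
git checkout -q feature
git rebase -q master

# History is a straight line: feature sits directly on the tip of master.
git log --oneline --graph
```

Keep in mind that rebasing rewrites the feature branch's commits, so if the branch was already pushed, the next push has to be forced.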

With rebasing you will no longer have merge commits between the same feature branch in different repositories, if you are using multiple computers. Such merges are nearly universally noise because you are merging with yourself and as such the probability of conflicts is nearly zero.
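The multiple computers case can be sketched with two clones of one repository. Everything here is hypothetical: the clone names desktop and laptop, the branch name feature and the commit messages. A local bare repository stands in for the shared remote. The key command is git pull --rebase, which replays your local commits on top of what was fetched instead of creating a merge commit.

```shell
set -e
work=$(mktemp -d) && cd "$work"
# Identity via environment so it applies to every repository below.
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

# A seed repository, a bare "remote" made from it, and two clones.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
echo base > file.txt
git add file.txt
git commit -q -m 'Initial commit'
cd ..
git clone -q --bare seed remote.git
git clone -q remote.git desktop
git clone -q remote.git laptop

# Start the feature branch on the desktop and push it.
cd desktop
git checkout -q -b feature
echo 1 > one.txt
git add one.txt
git commit -q -m 'Desktop work 1'
git push -q -u origin feature

# Pick the branch up on the laptop and commit there without pushing.
cd ../laptop
git fetch -q origin
git checkout -q -b feature origin/feature
echo 2 > two.txt
git add two.txt
git commit -q -m 'Laptop work'

# More work lands from the desktop in the meantime.
cd ../desktop
echo 3 > three.txt
git add three.txt
git commit -q -m 'Desktop work 2'
git push -q origin feature

# Back on the laptop: pull with rebase instead of merging with yourself.
cd ../laptop
git pull -q --rebase
git log --oneline --graph
```

If you always want this behavior, git config --global pull.rebase true makes it the default for git pull.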

Success

If you got this far, your development histories are readable, your commits are readable and other people can easily read your code. Congratulations!