Michael Rash, Security Researcher

git Repositories Safe After github Hack

After a widely publicized hack of github (now fixed), I thought it would be a good idea to ensure that the git repositories remain secure both on github and on the webserver. The techniques in this blog post may not be well suited to large git repositories with a lot of different people able to commit code, but for my repositories these checks provide a fairly high level of confidence that no malicious code has been introduced.

First, the github hack was made possible through a "mass assignment" vulnerability in Ruby on Rails, and would have permitted an attacker to gain admin privileges to any project on github. Once admin access is acquired, an attacker would be in a position to do anything to the underlying code base - including adding new code that implements "undocumented features".

Now, in order to add a backdoor into a code base on github, what would an attacker need to do?

Altering the code base for a project would need to be done through standard git operations as a new commit - i.e., make code changes to a local copy, git add ..., git commit ... - as opposed to manually editing previously committed code in the git repository itself. This is because every commit must match a corresponding SHA1 hash according to git's object model, and a SHA1 collision such that the bogus data is also working code would be "computationally difficult" to say the least (attacks against SHA1 notwithstanding). As a basic check, one can create a git repository for testing purposes, write a random byte to a random position within the .git/objects/pack/*.pack file and then try to clone it to see what happens:
error: packfile ./objects/pack/pack-9c886ed427d9a7538093f09edf516a0a718201ac.pack does not match index
error: packfile ./objects/pack/pack-9c886ed427d9a7538093f09edf516a0a718201ac.pack cannot be accessed
fatal: git upload-pack: cannot find object 7e8e48412ff985461095a09874059e955145d513:
fatal: The remote end hung up unexpectedly
The repository has essentially been corrupted and the clone operation fails.
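To make the hashing argument a bit more concrete, here is a minimal sketch of how git derives the object ID of a file stored as a blob: it is the SHA1 of a short header ("blob <size>" plus a NUL byte) followed by the raw content. The function name git_blob_id is made up for illustration, and the sketch assumes that sha1sum is available; git hash-object <file> should report the same value.

```shell
# Hypothetical helper (not part of git): compute the object ID git would
# assign to the file given as $1 when it is stored as a blob.
git_blob_id() {
    # Size of the content in bytes (tr strips padding some wc versions add).
    size=$(wc -c < "$1" | tr -d '[:space:]')
    # A blob is hashed as: "blob <size>" + NUL byte + raw content.
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}
```

For a file containing just "hello" and a newline this yields ce013625030ba8dba906f756967f9e9ca394464a. Changing even a single byte of the content changes the ID, which is why editing an already committed object in place breaks the repository instead of quietly altering it.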

Ok, so a malicious code modification would most likely need to be done via an entirely new commit. This could certainly be done by an attacker, but anyone who has cloned the repository would be able to see the change. For a large, highly active project without rigorous code review and committer hierarchy, it is conceivable that such a change might just get lost within the noise of lots of commits. After all, a human would need to review the change and recognize it as being malicious.

Could such a malicious code change affect the projects? In a word, "no", and here's why: all commits to the code bases are pushed to github from private git repositories on a dedicated system that is not generally accessible, git pull ... is only done occasionally, and every change comes from a known source and is reviewed. The publicly accessible (non-private) git repositories on the webserver are mirrors of the github repositories, so if a malicious change were introduced into github, then the mirrors would have this change too. The private repositories would still be safe, however.

So, given this workflow, what I need is a way to verify that there are no commits in either the github repositories or the mirrors that are not in the private repositories. Further, I would like to be able to verify this without having to push or pull code into the private repositories (so I can check at any time without any modifications coming through). There are probably many ways to do this in the git world - for example, one could just use git fetch to bring changes into remote tracking branches and compare these against local branches (any malicious code would not be merged into a local branch after it is discovered), but here is an alternate solution:

  • On my system where the private repositories live, create two directories GH/ and CD/, and clone all of the github repositories into the GH directory and all repository mirrors into the CD directory. We'll assume that the private repositories live in a directory called private/.
  • For each repository pair in the GH and CD directories, diff the output of git rev-list --all. There should be zero differences here.
  • If the step above checks out, then diff the git rev-list --all output across the GH/<repo> and private/<repo> pairs. For this step we expect that the private repositories will have local commits that are not necessarily pushed upstream - what we're concerned about is any commit in a GH/<repo> that is not in a private/<repo>.
Here are a few commands to accomplish the above (for brevity, we'll assume that the git clone --bare <repo> steps have already been done):
$ for r in fwknop psad gpgdir fwsnort IPTables-Parse IPTables-ChainMgr
> do
> echo "[+] Checking $r...";
> diff -u <(cd ~/CD/$r.git && git rev-list --all ) <( cd ~/GH/$r.git && git rev-list --all)
> done
[+] Checking fwknop...
[+] Checking psad...
[+] Checking gpgdir...
[+] Checking fwsnort...
[+] Checking IPTables-Parse...
[+] Checking IPTables-ChainMgr...
The output above indicates there are identical git commits in the github repositories vs. the mirrors. Good. Now, let's compare the github repositories vs the private ones - we grep on "+" which would indicate new commits in github that are not in the private repositories:
$ for r in fwknop psad gpgdir fwsnort IPTables-Parse IPTables-ChainMgr
> do
> echo "[+] Checking $r...";
> diff -u <(cd ~/private/$r.git && git rev-list --all ) <( cd ~/GH/$r.git && git rev-list --all) | egrep "^\+" | grep -v @
> done
[+] Checking fwknop...
+++ /dev/fd/62  2012-03-07 22:50:48.004281002 -0500
[+] Checking psad...
+++ /dev/fd/62  2012-03-07 22:50:48.054281002 -0500
[+] Checking gpgdir...
+++ /dev/fd/62  2012-03-07 22:50:48.164281002 -0500
[+] Checking fwsnort...
+++ /dev/fd/62  2012-03-07 22:50:48.194281002 -0500
[+] Checking IPTables-Parse...
[+] Checking IPTables-ChainMgr...
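Again, good: there are no commits in github that are not in the private repositories. The only "+" lines above are diff's own "+++" file headers, which show up for the repositories where the private copy has local commits that have not been pushed yet. If there had been a line consisting of "+" followed by a commit hash, I would have been concerned.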
Rather than modifying code and committing it to a git repository on github, it would have been far more damaging for an attacker to just alter the github website to serve up drive-by exploits for a popular web browser. Either way, I'm glad they fixed the vulnerability.
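Returning to the commit comparison for a moment: the diff plus egrep "^\+" pipeline also prints diff's "+++" header whenever the two lists differ at all, so the output has to be read with some care. As a hedged alternative, the sketch below uses comm on sorted commit lists so that only commit IDs present in the second list and absent from the first are printed. The function name commits_missing_from_first is an assumption for illustration, and both arguments are expected to be files holding saved git rev-list --all output.

```shell
# Hypothetical helper: print commit IDs that appear in the second list
# but not in the first. $1 and $2 are files holding the output of
# `git rev-list --all`, one commit ID per line.
commits_missing_from_first() {
    trusted_sorted=$(mktemp) || return 1
    # comm needs both inputs sorted the same way, so sort under LC_ALL=C.
    LC_ALL=C sort -u "$1" > "$trusted_sorted"
    # -13 suppresses lines unique to the first input and lines common to
    # both, leaving only lines unique to the second input (read from STDIN).
    LC_ALL=C sort -u "$2" | LC_ALL=C comm -13 "$trusted_sorted" -
    rm -f "$trusted_sorted"
}
```

Saving the git rev-list --all output from private/<repo> and GH/<repo> to two files and passing them in that order should print nothing as long as github holds no commits that are unknown to the private repository.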

Bing Indexing of gitweb.cgi Links

In June, 2011, all of the software projects were switched over to git from svn, and at the same time the web interface was switched from trac to gitweb (along with hosting at github). Given the switch, I knew there would be a change in how search engines indexed the code/data, and one question was whether any particular search engine would take a specific interest in the code provided via git and/or gitweb. Note that each of the fwknop, psad, fwsnort, and gpgdir projects has a raw git repository that can be cloned directly over HTTP (a nice feature of git), or viewed with any browser through gitweb. (Personally, I like the "links2" text-based browser rendering of gitweb pages - nice and clean.)

First, here are some stats for indexing bots from major search engines across all Apache log data for hits against gitweb.cgi from June, 2011 to today:

Hits      Percent   User agent
505055    81.01%    Mozilla/5.0 (compatible; bingbot/2.0;)
50242     8.06%     msnbot/2.0b (+
25707     4.12%     Mozilla/5.0 (compatible; Ezooms/1.0;
6583      1.06%     Feedfetcher-Google; (+;)
4310      0.69%     Mozilla/5.0 (compatible; Googlebot/2.1; +
1956      0.31%     Mozilla/5.0 (compatible; SISTRIX Crawler;
1905      0.31%     Mozilla/5.0 (compatible; Purebot/1.1; +
1751      0.28%     Mozilla/5.0 (compatible; Yahoo! Slurp;)
1625      0.26%     Mozilla/5.0 (compatible; MJ12bot/v1.4.0;)
1451      0.23%     TwengaBot-Discover (

Wow! So bots associated with Microsoft's Bing search engine take the top two spots for a combined hit total of well over 500,000 since June, 2011. If spread out over the entire time period (which it is not, as we'll see) that would be an average of about 2,600 hits per day, and this figure is more than 20 times that of the third-place bot. Google is in a distant fourth place, even though Google used to heavily index Trac repositories.
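A table like the one above can be approximated with standard tools. The following is only a sketch, not the script used for this post: the function name agent_stats is made up, and it assumes Apache "combined" format logs in which the user agent is the sixth field when a line is split on double quotes (so no escaped quotes inside the request or referer).

```shell
# Hypothetical helper: tally hits per user agent for requests whose
# request line mentions gitweb.cgi. Reads Apache "combined" format log
# lines on STDIN and prints: hits, percent of matching hits, user agent.
agent_stats() {
    awk -F'"' '
        # $2 is the request line, $6 is the user agent string.
        $2 ~ /gitweb\.cgi/ { hits[$6]++; total++ }
        END {
            for (agent in hits)
                printf "%d\t%.2f%%\t%s\n", hits[agent], 100 * hits[agent] / total, agent
        }
    ' | sort -rn   # lines start with the hit count, so this puts the busiest agent first
}
```

Something like zcat ../logs/*.gz | agent_stats | head would then list the busiest agents first.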

So, let's see how the search engine hits are distributed since June, 2011. First, here is a graph of just gitweb hits by the top five crawlers:

[Figure: top 5 gitweb indexers]

Clearly, that is not a very uniform distribution from day to day. It looks like Bing has been hitting the gitweb interface at a rate of over 17,000 hits per day for a significant portion of late December and early January. The other search engines hardly even show up in the graph - you know there are big spikes when everything looks better on a logarithmic scale:

[Figure: top 5 gitweb indexers, logarithmic]

With some additional work, it looks like the gitweb.cgi links that Bing is indexing are not all unique. That is, one might expect that Bing would hit a link, grab the content, and then not return to the same link for a while. Some gitweb.cgi links were hit more than 10 times and more than 100,000 links were hit more than once during this time period.
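The repeated-link count can be sketched along the same lines. Again this is an illustration under assumptions rather than the script used here: repeated_links is a made-up name, and the input is assumed to be Apache "combined" format log lines in which the request line is the second field when split on double quotes.

```shell
# Hypothetical helper: list request URLs that were hit more than once,
# most-hit first. Reads Apache "combined" format log lines on STDIN and
# prints one "count url" pair per line.
repeated_links() {
    awk -F'"' '
        {
            # The request line looks like: GET /path?query HTTP/1.1
            split($2, req, " ")
            if (req[2] != "") seen[req[2]]++
        }
        END {
            for (url in seen)
                if (seen[url] > 1) print seen[url], url
        }
    ' | sort -rn
}
```

To restrict the count to one crawler, the log data can be filtered first, for example zcat ../logs/*.gz | grep bingbot | repeated_links | head.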

How does this compare with hits across other portions of the site? Bing indexing is still far and away the largest outlier:

[Figure: top 5 indexers]

Given that 1) all of the information gitweb displays is derived from the underlying git repositories, and 2) the git repositories are directly accessible via HTTP anyway, it would seem that a better way for search engines to behave would be to just ignore gitweb altogether and pull directly from git. That would certainly cut down on the server-side resources necessary to service search engine requests. Perhaps, though, the general strategy of search engines is not to be too smart about such things - they probably just want access to data, and when they see a link they go after it. Either way, the kind of dedicated and repetitive indexing that Bing is doing against gitweb is a bit much, and it certainly seems as though they could implement a less intensive crawler. I'm curious if other server admins are seeing similar behavior.

Update 01/23: There are tons of web analysis tools out there, but I wrote a couple of quick scripts to generate the data in this blog post. The first parses Apache logs and produces user-agent graphs with Gnuplot as shown in this post. The second is extremely simple and just counts the number of hits against the same links within the Apache log data. Both scripts accept log data via STDIN - here is an example where user agents who hit any "index.html" link are plotted (graph is not shown):
$ zcat ../logs/*.gz |grep "index.html" | ./ -p index_hits
[+] Parsing Apache log data...
[+] Total agents: 1769 (abbreviated to: 174 agents)
[+] Executing gnuplot...
Plot file: index_hits.gif
Agent stats: index_hits.agents

Switched from subversion to git

After using subversion for several years, I've switched to git for all projects. Subversion has certainly served its purpose, but it is hard to look at git and not feel a compelling draw. Further, with easy-to-set-up web interfaces to git repositories such as gitweb and free hosting services such as github, providing a public git repository is trivial. Git itself allows repositories to be cloned directly over HTTP without needing infrastructure like WebDAV, and here are links for the projects (github and gitweb links too):

The trac interface will remain active for a little while to see the legacy svn repositories, but the git repositories were all converted from these in order to preserve the history so trac is no longer important. If you are interested in the latest code changes in, say, fwsnort then just clone the repository and then you can make your own changes:
$ git clone
Initialized empty Git repository in /home/mbr/tmp/git/fwsnort/.git/
$ cd fwsnort
$ git status
# On branch master
nothing to commit (working directory clean)
$ git show --summary
commit 00c4379a69975097948ed9e5ba356eeba69c0c93
Author: Michael Rash <>
Date: Mon Jun 20 21:00:57 2011 -0400

Added the --Conntrack-state argument

Added the --Conntrack-state argument to specify a conntrack state in place of
the "established" state that commonly accompanies the Snort "flow" keyword.
By default, fwsnort uses the conntrack state of "ESTABLISHED" for this. In
certain corner cases, it might be useful to use "ESTABLISHED,RELATED" instead
to apply application layer inspection to things like ICMP port unreachable
messages that are responses to real attempted communications. (Need to add
UDP tracking for the _ESTAB chains for this too - coming soon.)