You have a frontend application that gets deployed to production for millions of people to use. The source code is private, your development workflow is all sorted out, and the build and release process works without any glitches. Then one day a researcher sends you a vulnerability report: your source code can be recreated because your .git folder is reachable through the production URLs, like https://myapp.com/.git/. You may or may not have directory listing enabled, so the attacker might not be able to see the contents of this folder, but as we will see, that does not really matter to the attacker.
How can you end up in a situation like this?
This usually happens because of an incorrect deployment. Suppose you have a way to deploy the build code manually: for example, you use S3 for static web hosting and anyone can upload the build code to the S3 bucket. Then someone might end up copying the .git folder to the bucket, and S3 web hosting would make it available in production on the corresponding path. You could argue that during a manual deployment someone might just as well copy the whole src folder. That is less likely, because developers generally understand the src folder and the implications of exposing it, but they might not understand the importance of the .git folder.
What is the .git folder?
The .git folder is where git keeps all its metadata. Inside this folder is information about all the commits, branches, files, and everything else related to your git repository. This folder alone can be used to recreate your whole git repository, with the history, the files, and everything ever committed to git. So, if an unauthorized person gets hold of your .git folder, they can recreate your private codebase.
You might argue that the built frontend code is served to every visitor anyway. True, but that is still very limited information. For one, it is not the source code, it is the built code, which is a lot different from your source code. Your source code might contain tests, stories, dev dependencies, and dev-only code, and you might not have the strictest guidelines for dev-only code. For example, you might have some credentials hardcoded in your tests, or a Zeplin token hardcoded in your Storybook setup, or maybe you call an API as part of the build process and the auth it requires is hardcoded in your webpack config. All of this can be recreated and exposed if your .git folder is exposed. The source code might also have other files with sensitive information which are not part of the build but are checked in to git.
How can someone recreate the codebase using the .git folder?
The folder has a standard directory structure, so, the attacker would already know where to look for what. This folder is used by git to recreate the repository at the time of git clone.
For example, if the attacker has the full folder, or can see the whole folder (because you have enabled directory listing on your server):
- They can check refs/heads/master, which is the reference to the head of the master branch. This will have the commit id for the latest commit in the master branch.
- The commit details for this commit id would be stored in the objects folder, or better, the git cat-file -p commit_id command can be used to check the commit details.
- The commit details would have the hash for the directory structure, or the tree. We can get the details of this tree by inspecting the tree object in the same way.
- This would give us the hash for each and every file that was part of the commit.
- Now, we can just use cat-file to get the contents of each file.
- This process can be automated, and the whole source code can be generated.
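The steps above can be sketched in a few lines of Python, without even calling git. This is a minimal sketch and not a full implementation. It assumes the objects are stored as loose files (zlib-compressed files under objects/, not packfiles), that the repository uses SHA-1 object ids, and that the branch ref exists as a plain file under refs/heads. The function names (read_object, tree_of_commit, parse_tree, dump_files, recover_branch) are hypothetical and only used for illustration.

```python
import zlib
from pathlib import Path
from typing import Dict, Iterator, Tuple


def read_object(git_dir: Path, sha: str) -> Tuple[str, bytes]:
    """Read one loose object and return (type, body)."""
    # Loose objects live at objects/<first 2 hex chars>/<remaining 38>.
    raw = zlib.decompress((git_dir / "objects" / sha[:2] / sha[2:]).read_bytes())
    # Decompressed layout: b"<type> <size>\x00<body>"
    header, _, body = raw.partition(b"\x00")
    obj_type, _size = header.decode().split(" ")
    return obj_type, body


def tree_of_commit(commit_body: bytes) -> str:
    """The first line of a commit body is 'tree <sha>'."""
    first_line = commit_body.split(b"\n", 1)[0].decode()
    if not first_line.startswith("tree "):
        raise ValueError("not a commit body")
    return first_line.split(" ")[1]


def parse_tree(tree_body: bytes) -> Iterator[Tuple[str, str, str]]:
    """Yield (mode, name, sha) for every entry of a tree object."""
    pos = 0
    while pos < len(tree_body):
        space = tree_body.index(b" ", pos)
        nul = tree_body.index(b"\x00", space)
        mode = tree_body[pos:space].decode()
        name = tree_body[space + 1:nul].decode(errors="replace")
        sha = tree_body[nul + 1:nul + 21].hex()  # 20 raw bytes (SHA-1 assumed)
        pos = nul + 21
        yield mode, name, sha


def dump_files(git_dir: Path, tree_sha: str, prefix: str = "") -> Dict[str, bytes]:
    """Walk a tree recursively and return {path: file contents}."""
    files: Dict[str, bytes] = {}
    _, body = read_object(git_dir, tree_sha)
    for mode, name, sha in parse_tree(body):
        if mode == "40000":  # subdirectory: another tree object
            files.update(dump_files(git_dir, sha, prefix + name + "/"))
        elif mode == "160000":  # submodule link: object is not in this repo
            continue
        else:  # regular file, executable, or symlink: a blob
            files[prefix + name] = read_object(git_dir, sha)[1]
    return files


def recover_branch(git_dir: Path, branch: str = "master") -> Dict[str, bytes]:
    """ref -> commit -> tree -> blobs, as described in the list above."""
    commit_sha = (git_dir / "refs" / "heads" / branch).read_text().strip()
    _, commit_body = read_object(git_dir, commit_sha)
    return dump_files(git_dir, tree_of_commit(commit_body))

# Usage, assuming a copied folder at ./leaked/.git:
#   files = recover_branch(Path("leaked/.git"), "master")
```

Repeating the same walk for the parent hashes listed in each commit would recover the history as well, which is why secrets removed in a later commit are still exposed.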
Even if the attacker cannot see the whole folder (because you have not enabled directory listing on your server), they can still recreate everything. If they know that the .git folder is exposed but they can't see the filenames, they can still try accessing the HEAD file, which will give them a branch name (or tag name). They can then find that in refs/heads, which will again give them an object hash, which can be found in the objects folder. So, one by one, they can traverse the whole structure without knowing any of the specific file names initially.
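To make the "no directory listing needed" point concrete, here is a small sketch of how each response tells the attacker which path to request next. It only builds paths and makes no network requests. The helper names (ref_path_from_head, loose_object_path, git_url) are hypothetical, and the sketch assumes SHA-1 object ids stored as loose objects.

```python
import re
from typing import Optional

SHA1_HEX = re.compile(r"[0-9a-f]{40}")


def ref_path_from_head(head_text: str) -> Optional[str]:
    """'ref: refs/heads/master' -> 'refs/heads/master'.

    Returns None when HEAD does not point at a ref (for example a
    detached HEAD, which holds a commit hash directly).
    """
    text = head_text.strip()
    if text.startswith("ref: "):
        return text[len("ref: "):]
    return None


def loose_object_path(sha: str) -> str:
    """Map an object hash to its well-known location inside .git."""
    if not SHA1_HEX.fullmatch(sha):
        raise ValueError("expected a 40-character hex SHA-1: %r" % sha)
    return "objects/%s/%s" % (sha[:2], sha[2:])


def git_url(base_url: str, relative_path: str) -> str:
    """Build the URL for a path inside the exposed .git folder."""
    return base_url.rstrip("/") + "/.git/" + relative_path

# The chain of requests, each one derived from the previous response:
#   1. git_url(base, "HEAD")                    -> "ref: refs/heads/master"
#   2. git_url(base, "refs/heads/master")       -> a commit hash
#   3. git_url(base, loose_object_path(commit)) -> commit object, names a tree
#   4. git_url(base, loose_object_path(tree))   -> tree object, names blobs
```

Every path in that chain is either fixed by git's layout or read from the previous response, so nothing ever has to be guessed or listed.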
What is the solution?
- Improve your deployment process: only ever deploy the build files and nothing else, automate it, and restrict access for manual deployment.
- Never assume that your source code cannot end up in production or be exposed. So, for example, never keep secrets in the source code.
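One way to back up the first point is a guard step in the automated deployment that refuses to upload a build directory containing anything that should never ship. The following is a rough sketch under assumptions: the function name find_forbidden and the default deny-list are hypothetical, and a real pipeline would adapt the list (for example, adding src or .env) to its own layout.

```python
import os
from typing import Iterable, List


def find_forbidden(build_dir: str, forbidden: Iterable[str] = (".git",)) -> List[str]:
    """Return paths (relative to build_dir) whose name is on the deny-list."""
    deny = set(forbidden)
    hits: List[str] = []
    for root, dirs, files in os.walk(build_dir):
        for name in dirs + files:
            if name in deny:
                rel = os.path.relpath(os.path.join(root, name), build_dir)
                hits.append(rel.replace(os.sep, "/"))
        # No need to descend into a directory that is already flagged.
        dirs[:] = [d for d in dirs if d not in deny]
    return sorted(hits)

# Usage in a deploy script (hypothetical build path):
#   problems = find_forbidden("build")
#   if problems:
#       raise SystemExit("refusing to deploy, found: %s" % problems)
```

A check like this does not replace restricting who can deploy manually, but it turns the mistake described earlier into a failed deployment instead of an exposed repository.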