
How to Scrub Sensitive Information from Git

09-02-20 Melissa Thompson

Did you accidentally commit sensitive data? Learn how to scrub it from your Git repository and prevent committing sensitive data in the future.

As humans, we sometimes make mistakes. One of those is committing sensitive data to our Git repositories. Fixing those mistakes is often called Git scrubbing, which is just a fancy phrase for removing passwords, API tokens, license keys, etc. from a Git repository. Keep in mind that preventing those mistakes is the ideal solution, but if you find yourself in the position where this has already happened, I want to help. Let’s take a look at a couple of situations where you may need to scrub your repo and how you would go about doing so.

I Committed Some Sensitive Info

Hey, it happens. Do you have the power to change the value of the sensitive data?

Whew, I Can Change the Password/Token/Key

Great! Let’s scrub, or remove, the file containing the sensitive data from our repository. Assuming the file was added in your most recent commit, run the following commands in order:

git rm --cached <path-to-file>
git commit --amend -C HEAD

These commands will remove the file containing your password from Git’s index (the file itself stays on disk) and rewrite your most recent commit without it. If you did not push your commit containing sensitive data to a Git repo hosting service like GitHub, you can keep using the existing password/token/key after running those commands since you only worked with it locally. If you did push the sensitive data up, however, you need to git push -f the amended commit you just made, and you need to change your password/token/key.
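As a rough sketch, the two commands can be tried safely in a throwaway repo. The temp directory, file names, commit messages, and placeholder identity below are assumptions made for illustration. Note that the commit also adds a second file: if .env were the only change, the amend would leave the commit empty and Git would refuse it.

```shell
#!/bin/sh
# Sketch: drop a secrets file from the most recent commit (throwaway repo).
set -eu
repo=$(mktemp -d)                        # hypothetical scratch location
cd "$repo"
git init -q
git config user.email "dev@example.com"  # placeholder identity
git config user.name "Example Dev"

echo "<h1>hi</h1>" > index.html
git add index.html
git commit -q -m "feat: init html"

# The mistake: .env sneaks into the latest commit alongside real work.
echo "console.log('hi')" > app.js
echo "API_TOKEN=not-a-real-token" > .env
git add app.js .env
git commit -q -m "feat: init js"

# The fix: remove .env from the index only, then rewrite the last commit.
git rm -q --cached .env
git commit -q --amend -C HEAD

# .env is still on disk, but the amended commit no longer tracks it.
echo "tracked files:"
git ls-files
```

Because the file is now untracked, adding it to .gitignore right away keeps it from being staged again.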

PSA: Any time you send passwords or other sensitive data to your remote repo, you should consider it compromised and update the password.

Let’s talk about what to do when your sensitive data is compromised and you can’t change the password, token, or key.

Oh No! I Can’t Change the Sensitive Info

Not ideal, but we can still fix it. An example of unchangeable sensitive data might be a license key that you only get one of (like old school Photoshop or a CMS) and can’t change without communicating with customer service. Another example might be if your private package artifacts are committed publicly. In this case, we need to rewrite our Git history, both on local and remote, to remove any trace of our sensitive data.

There are a few tools that can help you do this (git filter-branch, BFG Repo Cleaner, and git filter-repo to name a few), and all of the documentation around them seems intended to frighten the reader, who I imagine is an already frantic developer who’s just realized they’ve compromised their application. Why are these documents written with such strong warnings? Because rewriting the entirety of a project’s Git history is a serious action. It’s important to understand why you’re doing it and how it’s going to work. Rewriting the history of a repo is a pretty powerful move, but it’s no more dangerous than rebasing, which we do a lot at Sparkbox in order to maintain a linear history. The key is to know when to do it and why. So with intentionality and understanding, let’s move forward with a git filter-branch.

Git Filter-branch in Action

First, I’m going to cd into my Git scrubbing example repo. If you clone this repo down and run git log --oneline, you’ll notice that I have a suspicious commit, feat: init env:

de6515f docs: link to git scrubbing article in readme
d13467e feat: hello world
6ec9e03 feat: init env
a7c3ac2 feat: init js
8d46088 feat: init html
fc670d3 Initial commit

Usually, we don’t check .env files into our repos because they hold a lot of sensitive information. Before we fix that, let’s check the state of our tags by running git tag:

tag1
tag2

Now that we know the state of our repo, we can prepare to remove the .env file with the following git filter-branch command, which I found in GitHub’s docs, but the command could also be pieced together using the git filter-branch documentation:

git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch .env" \
  --prune-empty --tag-name-filter cat

That’s a lot of flags. Let’s dig into what all this means before we run it:

  • --force

    • You can try running without this, but you might get this error: Cannot create a new backup. A previous backup already exists in refs/original/. Force overwriting the backup with -f. That’s because filter-branch won’t start if there’s an existing refs/original/ directory, so we need to force remove the existing files.

  • --index-filter

    • The filter is what tells Git how to rewrite the history. There are other filter options, but here’s a pro tip: this is faster than --tree-filter because it doesn’t check out the tree.

  • --cached

    • This flag removes and unstages paths from the index and only the index. So your working files won’t be affected.

  • --ignore-unmatch

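    • This is a flag for git rm, not for filter-branch itself. If no files match the path you’ve supplied (for example, in commits made before .env existed), it tells git rm to exit with a zero status so the rewrite doesn’t stop on those commits.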

  • --prune-empty

    • Sometimes after running filter-branch you’re left with empty commits, such as a commit whose only change was adding the file you just removed. This flag drops them, which is nice for keeping the commit history clean. You wouldn’t want to see empty commits in your history.

  • --tag-name-filter cat

    • This is our filter for rewriting our repo’s tag names. For every reference that is rewritten using filter-branch, this filter says “change the tag name to XYZ” depending on what name you provide it. Here, we’ve passed cat, which just accepts the updated reference SHA without changing the tag name.
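To make the --ignore-unmatch point concrete, here is a small sketch in a throwaway repo (the directory, file name, and placeholder identity are assumptions). It compares the exit status of git rm --cached with and without the flag when no .env is tracked, which is the situation the index filter hits on every commit made before the file existed.

```shell
#!/bin/sh
# Sketch: exit status of `git rm --cached` with and without --ignore-unmatch.
set -eu
repo=$(mktemp -d)                        # hypothetical scratch location
cd "$repo"
git init -q
git config user.email "dev@example.com"  # placeholder identity
git config user.name "Example Dev"

echo "<h1>hi</h1>" > index.html
git add index.html
git commit -q -m "feat: init html"

# No .env is tracked here, so a plain `git rm --cached` reports a failure...
if git rm --cached .env > /dev/null 2>&1; then
  plain_status=0
else
  plain_status=$?
fi

# ...while --ignore-unmatch treats "nothing to remove" as success.
if git rm --cached --ignore-unmatch .env > /dev/null 2>&1; then
  ignore_status=0
else
  ignore_status=$?
fi

echo "plain git rm exit status:          $plain_status"
echo "with --ignore-unmatch exit status: $ignore_status"
```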

Now that we understand all the pieces of our command, we can run it. We get the following output, which tells us that our commit and tag SHAs are being rewritten and our .env file removed.

Rewrite 6ec9e03d1ab89c8374f624015853705c6147786a (3/6) (1 seconds passed, remaining 1 predicted)    rm '.env'
Rewrite d13467efe14dd0bb5bce2578fb0e48d5a36f35c7 (3/6) (1 seconds passed, remaining 1 predicted)    rm '.env'
Rewrite de6515f8df775d0871cd2cc4400ea9352ce635cb (3/6) (1 seconds passed, remaining 1 predicted)    rm '.env'

Ref 'refs/heads/master' was rewritten
tag1 -> tag1 (de6515f8df775d0871cd2cc4400ea9352ce635cb -> 1904bea774c1060d793dc615dbd438f52650139d)
tag2 -> tag2 (de6515f8df775d0871cd2cc4400ea9352ce635cb -> 1904bea774c1060d793dc615dbd438f52650139d)

Let’s see what our git log --oneline looks like now:

1904bea docs: link to git scrubbing article in readme
4b3afd8 feat: hello world
c9b8844 feat: init js
e9dc984 feat: init html
07e01dc Initial commit

All our commit SHAs were rewritten, and the feat: init env commit (6ec9e03) was completely removed because of our --prune-empty flag. If we run git tag again, we’ll see that the tag names haven’t changed because of our --tag-name-filter cat flag:

tag1
tag2
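If you would rather rehearse the whole rewrite before touching a real project, the walkthrough can be approximated in a scratch repo. This is only a sketch: the commits, the tag, and the placeholder identity are made up to mirror the example, and FILTER_BRANCH_SQUELCH_WARNING is set because recent Git versions otherwise print a warning and pause before filter-branch starts.

```shell
#!/bin/sh
# Sketch: rehearse the filter-branch rewrite in a throwaway repo.
set -eu
repo=$(mktemp -d)                        # hypothetical scratch location
cd "$repo"
git init -q
git config user.email "dev@example.com"  # placeholder identity
git config user.name "Example Dev"

echo "<h1>hi</h1>" > index.html
git add index.html
git commit -q -m "feat: init html"

echo "API_TOKEN=not-a-real-token" > .env
git add .env
git commit -q -m "feat: init env"

echo "console.log('hello world')" > app.js
git add app.js
git commit -q -m "feat: hello world"
git tag tag1

# Skip the warning and pause that newer Git versions add to filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch .env" \
  --prune-empty --tag-name-filter cat > /dev/null 2>&1

# filter-branch keeps a backup of the old branch under refs/original/, so the
# check looks only at branches and tags rather than at every ref.
env_hits=$(git log --oneline --branches --tags -- .env | wc -l | tr -d ' ')
echo "commits touching .env on branches/tags: $env_hits"
git log --oneline
git tag
```

The commit that only added .env disappears from the log, and tag1 keeps its name while pointing at the rewritten tip.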

We’ve successfully rewritten our commit history, so now let’s push it up with git push origin --force --all and git push origin --force --tags to get our remote repo up to date.
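The sketch below shows why the --force flags are needed. It uses a local bare repository as a stand-in for the hosted remote, which is an assumption made so the example runs offline; the file names, tag, and identity are placeholders too. A plain push of the rewritten branch is refused because it is no longer a fast-forward of what the remote has, and the forced pushes replace the remote branch and tag.

```shell
#!/bin/sh
# Sketch: force-push rewritten history to a local bare repo acting as "origin".
set -eu
work=$(mktemp -d)                        # hypothetical scratch location
remote="$work/origin.git"                # stand-in for the hosted remote
git init -q --bare "$remote"

git init -q "$work/repo"
cd "$work/repo"
git config user.email "dev@example.com"  # placeholder identity
git config user.name "Example Dev"
git remote add origin "$remote"

echo "<h1>hi</h1>" > index.html
git add index.html
git commit -q -m "feat: init html"
echo "API_TOKEN=not-a-real-token" > .env
git add .env
git commit -q -m "feat: init env"
echo "console.log('hello world')" > app.js
git add app.js
git commit -q -m "feat: hello world"
git tag tag1

# The leak: history containing .env reaches the remote.
git push -q origin --all
git push -q origin --tags

# Rewrite local history with the same command used in the walkthrough.
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch .env" \
  --prune-empty --tag-name-filter cat > /dev/null 2>&1

# A plain push is refused: the rewritten branch is not a fast-forward.
if git push -q origin --all > /dev/null 2>&1; then
  plain_push="accepted"
else
  plain_push="rejected"
fi

git push -q origin --force --all
git push -q origin --force --tags

remote_hits=$(git --git-dir="$remote" log --oneline --branches --tags -- .env | wc -l | tr -d ' ')
echo "plain push: $plain_push"
echo "remote commits touching .env on branches/tags: $remote_hits"
```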

At this point, you might be good to go. However, if you have any pull requests, open or closed, that include the sensitive data, you’ll need to contact your Git hosting provider and ask them to remove them.

For any work that existed prior to the git filter-branch operation, the team should rebase off the repo’s default branch. Sparkbox prefers rebasing commits on top of the default branch instead of merging because rebasing gives us a clean Git history without any merge commits. In this case, if we did create a merge commit, we would risk reintroducing all of the old history that we just removed with git filter-branch. That’s why work needs to be rebased at this point.
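One way to do that rebase is git rebase --onto, which replays only the commits made after the old base. The sketch below simulates the situation inside a single throwaway repo; the feature branch name, the files, and the old_base variable are hypothetical. On a real team, the old base would be wherever the default branch pointed before the rewrite.

```shell
#!/bin/sh
# Sketch: move in-progress work from the old history onto the rewritten branch.
set -eu
repo=$(mktemp -d)                        # hypothetical scratch location
cd "$repo"
git init -q
git config user.email "dev@example.com"  # placeholder identity
git config user.name "Example Dev"

echo "<h1>hi</h1>" > index.html
git add index.html
git commit -q -m "feat: init html"
echo "API_TOKEN=not-a-real-token" > .env
git add .env
git commit -q -m "feat: init env"

default_branch=$(git symbolic-ref --short HEAD)
old_base=$(git rev-parse HEAD)           # default branch tip before the rewrite

# In-progress work that was started from the old history.
git checkout -q -b feature
echo "console.log('feature')" > feature.js
git add feature.js
git commit -q -m "feat: feature work"

# Rewrite the default branch only; feature still sits on the old history.
git checkout -q "$default_branch"
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch .env" \
  --prune-empty > /dev/null 2>&1

# Replay just the commits made after the old base onto the rewritten branch.
git rebase -q --onto "$default_branch" "$old_base" feature

feature_hits=$(git log --oneline feature -- .env | wc -l | tr -d ' ')
echo "commits touching .env on feature: $feature_hits"
git log --oneline feature
```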

Prevention

Remember how I said humans make mistakes? You know what else humans do? Learn from their mistakes. Removing sensitive data from a repo can be a lengthy process. Ideally, we would never commit sensitive data. Here are some ways you can prevent committing sensitive data to your repositories.

  • Create a .gitignore file at the beginning of your project as a first line of defense

  • Look at the changes you’re going to stage before you stage them. You can do this using git add --interactive, using your code editor, or using other third-party tools like Kaleidoscope

  • Be intentional and thoughtful about the files you’re committing by staging files individually instead of using commands like git add .

  • Use a tool like git-secrets (check out how Sparkbox uses git-secrets) or gitleaks, which use pre-commit hooks to automatically scan your changes before they are committed (or make your own!)
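For the "make your own" route, here is a minimal sketch of the first and last ideas in the list combined: a .gitignore entry plus a hand-rolled pre-commit hook. The hook and its three patterns are hypothetical and far cruder than what git-secrets or gitleaks check, so treat it as an illustration of the mechanism rather than a replacement for those tools.

```shell
#!/bin/sh
# Sketch: .gitignore plus a hypothetical pre-commit hook, in a throwaway repo.
set -eu
repo=$(mktemp -d)                        # hypothetical scratch location
cd "$repo"
git init -q
git config user.email "dev@example.com"  # placeholder identity
git config user.name "Example Dev"

# First line of defense: ignore env files from the start.
printf '.env\n' > .gitignore

# Hand-rolled hook: reject staged additions that look like secrets.
mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'HOOK'
#!/bin/sh
# Scan only the added lines of the staged diff for a few illustrative patterns.
if git diff --cached -U0 | grep -E '^\+' |
   grep -Eiq '(api_token|password|secret)[[:space:]]*='; then
  echo "pre-commit: possible secret in staged changes; commit aborted" >&2
  exit 1
fi
HOOK
chmod +x .git/hooks/pre-commit

git add .gitignore
git commit -q -m "chore: add gitignore"

# Even a broad `git add .` skips the ignored .env file.
echo "API_TOKEN=not-a-real-token" > .env
git add .
staged_after_add=$(git diff --cached --name-only)

# A secret in a tracked file is caught by the hook instead.
echo 'const password = "not-a-real-password";' > config.js
git add config.js
if git commit -q -m "feat: config" 2> /dev/null; then
  hook_result="committed"
else
  hook_result="blocked"
fi
echo "commit with secret: $hook_result"
```

Hooks live in .git/hooks and are not shared through the repository itself, so each teammate would need to install the hook (or a tool that manages it) on their own machine.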

If you commit sensitive data, don’t panic. Remember that you can remove the sensitive files if the data is changeable, and you can use a tool to rewrite repo history to remove sensitive info if it’s not. Choose to learn from your mistakes and implement ways to prevent future issues. And now that you know how to prevent committing sensitive data, and how to fix it if it does get committed, go forth and fearlessly write great code!
