Git Internals: Objects, Refs, and the DAG
A deep dive into Git's internal data model including blob, tree, commit, and tag objects, references, the DAG structure, packfiles, and plumbing commands.
Most developers use Git without understanding how it works internally. They memorize commands and workflows without knowing what commits actually are, how branches are implemented, or why merge and rebase produce different results. Understanding Git's internals transforms it from a mysterious tool into a predictable system. When something goes wrong, you know where to look. When a command does something unexpected, you understand why.
I learned Git internals by accident while debugging a corrupted repository. The knowledge has saved me countless hours since. This guide covers everything under the hood — the object model, the reference system, and the DAG that connects it all.
Prerequisites
- Git installed
- Basic Git usage (commit, branch, merge)
- Comfort with the command line
- Curiosity about how tools work
The Object Database
Git is fundamentally a content-addressable filesystem. Everything Git stores — files, directories, commits — is an object identified by its SHA-1 hash. The object database lives in .git/objects/.
The Four Object Types
blob — file content (no filename, just data)
tree — directory listing (maps filenames to blobs and subtrees)
commit — snapshot pointer with metadata (author, message, parent)
tag — named pointer to a commit (annotated tags)
Blob Objects
A blob stores file content. Nothing else — no filename, no permissions, no metadata.
# Create a file and add it
echo "test content" > greeting.txt
git add greeting.txt
# Find the blob object
git ls-files --stage
# 100644 d670460b4b4aece5915caf5c68d12f560a9fe3e4 0 greeting.txt
# Inspect the blob
git cat-file -t d670460
# blob
git cat-file -p d670460
# test content
# Two files with identical content share the same blob
echo "test content" > duplicate.txt
git add duplicate.txt
git ls-files --stage
# Both files point to the same blob hash
The hash is computed from the content: SHA-1("blob <size>\0<content>"). Same content always produces the same hash. This is how Git deduplicates identical files.
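To make the formula concrete, here is a minimal sketch that reproduces a blob hash without calling Git at all. It assumes the content is the string "test content" followed by a newline (an example string chosen for illustration) and that the `sha1sum` utility is available, as it is on most Linux systems.

```shell
# Content used for this sketch (an illustrative assumption)
content='test content'

# Byte length of the content, including the trailing newline that echo adds
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')

# Hash the header "blob <size>\0" followed by the content
manual=$({ printf 'blob %s\0' "$size"; printf '%s\n' "$content"; } | sha1sum | cut -d' ' -f1)

echo "$manual"
# d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Running `echo 'test content' | git hash-object --stdin` in any repository should print the same value, because the filename never enters the calculation.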
Tree Objects
A tree maps filenames to blobs (files) and other trees (subdirectories):
# After committing (entries are listed sorted by name)
git cat-file -p 'main^{tree}'
# 100644 blob a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9 app.js
# 100644 blob d670460b4b4aece5915caf5c68d12f560a9fe3e4 greeting.txt
# 040000 tree 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b src
Entry format:
<mode> <type> <hash> <name>
100644 blob abc123 file.js (regular file)
100755 blob def456 script.sh (executable)
040000 tree ghi789 src/ (directory)
120000 blob jkl012 link (symlink)
Trees are recursive. A tree for a project with nested directories contains subtrees:
root tree
├── 100644 blob abc123 package.json
├── 100644 blob def456 app.js
└── 040000 tree ghi789 src/
    ├── 100644 blob jkl012 index.js
    ├── 100644 blob mno345 utils.js
    └── 040000 tree pqr678 routes/
        ├── 100644 blob stu901 home.js
        └── 100644 blob vwx234 api.js
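The nesting can be seen directly with `git ls-tree`. This is a small sketch in a throwaway repository; the file names and the "Demo" identity are placeholders I made up for the example.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
# Hypothetical identity so the commit works on a machine without Git config
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

mkdir -p src/routes
echo '{}' > package.json
echo 'index' > src/index.js
echo 'home' > src/routes/home.js
git add .
git commit -q -m "nested layout"

# Root tree only: one blob (package.json) and one subtree (src)
top_level=$(git ls-tree HEAD)
echo "$top_level"

# -r recurses into subtrees, -t also prints the subtree entries themselves
all_entries=$(git ls-tree -r -t HEAD)
echo "$all_entries"
```

The first listing stops at `src`; the second follows each `tree` entry down until only blobs remain.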
Commit Objects
A commit points to a tree (the snapshot) and includes metadata:
git cat-file -p HEAD
# tree 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
# parent abc1234def5678901234567890abcdef12345678
# author Shane <[email protected]> 1707840000 -0800
# committer Shane <[email protected]> 1707840000 -0800
#
# feat: add user authentication
Fields:
- tree: the root tree object (the complete snapshot)
- parent: the previous commit (merge commits have multiple parents)
- author: who wrote the code (name, email, timestamp)
- committer: who committed it (can differ from author in cherry-picks)
- The commit message follows a blank line
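A commit object is small, typically a few hundred bytes, regardless of how many files changed. Its size depends on the message length and the number of parents, not on the size of the project. The tree it points to captures the entire project state.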
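A quick way to check the size claim is to commit many files at once and compare object sizes with `git cat-file -s`. This sketch assumes a throwaway repository, a made-up "Demo" identity, and no commit signing configured (a signature would make the commit object larger).

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Commit 200 files in one go
for i in $(seq 1 200); do echo "file $i" > "file$i.txt"; done
git add .
git commit -q -m "add 200 files"

commit_size=$(git cat-file -s HEAD)
tree_size=$(git cat-file -s 'HEAD^{tree}')
echo "commit object: $commit_size bytes, root tree: $tree_size bytes"
```

The commit stays tiny because it only names one tree; the tree is where the 200 entries live.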
Tag Objects
Annotated tags are objects that usually point to a commit (the type field records what kind of object is tagged):
git tag -a v1.0.0 -m "Release 1.0.0"
git cat-file -p v1.0.0
# object abc1234def5678901234567890abcdef12345678
# type commit
# tag v1.0.0
# tagger Shane <[email protected]> 1707840000 -0800
#
# Release 1.0.0
Lightweight tags are not objects — they are just refs (pointers) without metadata.
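The difference shows up when you ask Git for the type behind each tag name. A minimal sketch, using the made-up tag names `light` and `annotated` in a throwaway repository:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo hi > f.txt
git add f.txt
git commit -q -m init

git tag light                       # lightweight: only a ref
git tag -a annotated -m "Release"   # annotated: a tag object plus a ref

light_type=$(git cat-file -t light)            # the ref resolves straight to a commit
annotated_type=$(git cat-file -t annotated)    # the ref resolves to a tag object
peeled=$(git rev-parse 'annotated^{commit}')   # follow the tag object to its commit

echo "light -> $light_type, annotated -> $annotated_type"
```

Both names end up at the same commit; the annotated one just takes an extra hop through the tag object that carries the tagger and message.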
Object Storage
Loose Objects
New objects are stored as individual files:
# Object abc1234... is stored at:
.git/objects/ab/c1234def5678901234567890abcdef12345678
# First two characters = directory
# Remaining characters = filename
Objects are zlib-compressed. You can inspect them with plumbing commands:
# Type
git cat-file -t abc1234
# Content
git cat-file -p abc1234
# Size
git cat-file -s abc1234
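Putting the path rule and the inspection commands together, this sketch writes one blob and then locates its loose file by splitting the hash. It assumes a fresh repository using the default SHA-1 object format; the content string is an arbitrary example.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .

# Write a blob into the object database and capture its hash
hash=$(echo 'test content' | git hash-object -w --stdin)

# First two characters name the directory, the rest name the file
dir=$(printf '%s' "$hash" | cut -c1-2)
file=$(printf '%s' "$hash" | cut -c3-)
path=".git/objects/$dir/$file"
ls -l "$path"

obj_type=$(git cat-file -t "$hash")
obj_size=$(git cat-file -s "$hash")
echo "$obj_type, $obj_size bytes"
```

The size reported by `cat-file -s` is the uncompressed content length, so it will not match the size of the zlib-compressed file on disk.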
Packfiles
Git periodically packs loose objects into packfiles for efficiency:
ls .git/objects/pack/
# pack-abc123def456.idx (index — lookup table)
# pack-abc123def456.pack (data — compressed objects)
Packfiles use delta compression — similar objects are stored as deltas (differences) from a base object. This is extremely efficient for files that change incrementally.
# Manually trigger packing
git gc
# Verify packfile integrity
git verify-pack -v .git/objects/pack/pack-*.idx | head -20
# abc1234 commit 234 156 12
# def5678 tree 145 102 168
# ghi9012 blob 2890 1205 270
# jkl3456 blob 45 58 1475 1 ghi9012 ← delta from ghi9012
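The columns are object hash, type, size, size in the packfile, and offset in the packfile. The last entry is a delta object: its two extra columns give the delta depth and the base object, so it stores only the difference from ghi9012, saving space when similar file versions exist.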
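To watch loose objects move into a packfile, you can compare `git count-objects -v` before and after `git gc`. A small sketch in a throwaway repository with a made-up identity and file name; every object here is reachable, so all of them should end up packed.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Three commits, each adding a line: 3 blobs + 3 trees + 3 commits
for i in 1 2 3; do
  echo "version $i" >> notes.txt
  git add notes.txt
  git commit -q -m "v$i"
done

loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
git gc --quiet
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose before: $loose_before, loose after: $loose_after, packed: $packed_after"
```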
References (Refs)
Refs are human-readable names that point to commit hashes. They are stored as files in .git/refs/.
Branch Refs
cat .git/refs/heads/main
# abc1234def5678901234567890abcdef12345678
cat .git/refs/heads/feature-auth
# def5678901234567890abcdef12345678abc1234
A branch is literally a file containing a 40-character commit hash. Creating a branch is creating a file. That is why branches are cheap.
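You can prove this to yourself by creating a branch with nothing but a file write. This is for illustration only (use `git branch` or `git update-ref` in real work), and it assumes the default file-based ref storage and SHA-1 hashes. The branch name `by-hand` is made up.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo hi > f.txt
git add f.txt
git commit -q -m init

head_hash=$(git rev-parse HEAD)

# "Create a branch" by writing the hash into a file under refs/heads
echo "$head_hash" > .git/refs/heads/by-hand

by_hand=$(git rev-parse refs/heads/by-hand)
ref_bytes=$(wc -c < .git/refs/heads/by-hand | tr -d ' ')

git branch --list by-hand
echo "ref file is $ref_bytes bytes"
```

Git lists `by-hand` like any other branch, and the file holds the 40-character hash plus a newline.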
HEAD
HEAD is a symbolic ref that points to the current branch:
cat .git/HEAD
# ref: refs/heads/main
# After checkout to a branch:
git checkout feature
cat .git/HEAD
# ref: refs/heads/feature
# Detached HEAD (pointing directly to a commit):
git checkout abc1234
cat .git/HEAD
# abc1234def5678901234567890abcdef12345678
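Scripts often need to know which of the two states HEAD is in. One way to check, sketched here in a throwaway repository, is `git symbolic-ref`, which succeeds only when HEAD points at a branch.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main   # make the branch name predictable
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo hi > f.txt
git add f.txt
git commit -q -m init

# Attached: HEAD is a symbolic ref, so this prints the branch name
on_branch=$(git symbolic-ref --short HEAD)

# Detach HEAD at the current commit
git checkout -q --detach HEAD

# Detached: symbolic-ref -q exits non-zero without printing an error
if git symbolic-ref -q HEAD >/dev/null; then
  state=attached
else
  state=detached
fi

echo "was on: $on_branch, now: $state"
```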
Tag Refs
cat .git/refs/tags/v1.0.0
# abc1234def5678901234567890abcdef12345678
Remote Tracking Refs
cat .git/refs/remotes/origin/main
# abc1234def5678901234567890abcdef12345678
Packed Refs
When there are many refs, Git packs them into a single file:
cat .git/packed-refs
# # pack-refs with: peeled fully-peeled sorted
# abc1234def5678901234567890abcdef12345678 refs/heads/main
# def5678901234567890abcdef12345678abc1234 refs/heads/develop
# ghi9012345678901234567890abcdef12345678ab refs/tags/v1.0.0
Loose refs in .git/refs/ override packed refs. Git checks loose refs first, then falls back to packed-refs.
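The override rule is easy to observe. In this sketch (throwaway repository, default file-based ref storage assumed, branch name `develop` chosen for the example), the branch is packed, then moved forward by a new commit. The packed entry keeps the old hash while the new loose file wins.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo one > f.txt
git add f.txt
git commit -q -m first
old=$(git rev-parse HEAD)

git branch develop
git pack-refs --all

# After packing, the loose file is gone and the ref lives in packed-refs
if [ -f .git/refs/heads/develop ]; then loose_after_pack=yes; else loose_after_pack=no; fi

# Move develop forward: Git writes a new loose ref and leaves packed-refs alone
git checkout -q develop
echo two >> f.txt
git commit -q -am second

new=$(git rev-parse develop)
packed=$(grep ' refs/heads/develop$' .git/packed-refs | cut -d' ' -f1)
loose=$(cat .git/refs/heads/develop)

echo "packed: $packed"
echo "loose:  $loose (this one wins)"
```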
The Directed Acyclic Graph (DAG)
Every commit points to its parent(s), forming a directed acyclic graph:
A ← B ← C ← D ← E   (main)
     \          ↑
      F ← G ← H  (feature, merged at E with parents D and H)
Properties:
- Directed — commits point to parents, not children
- Acyclic — no cycles (a commit cannot be its own ancestor)
- Roots — initial commits have no parent
Walking the DAG
# Show the DAG structure
git log --oneline --graph --all
# List all commits reachable from HEAD
git rev-list HEAD
# List commits reachable from main but not from feature
git rev-list feature..main
# List commits reachable from either but not both
git rev-list main...feature
# Find the common ancestor of two branches
git merge-base main feature
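Here is a small sketch that builds a diverged history and runs the range queries against it. The helper function `commit` and the commit names A, B, C, D, F, G are made up for the example; main ends up with A B C D and feature with A B F G.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Helper: append a line and commit it with the given name as the message
commit() { echo "$1" >> log.txt; git add log.txt; git commit -q -m "$1"; }

commit A; commit B
git branch feature            # feature forks at B
commit C; commit D            # main:    A B C D
git checkout -q feature
commit F; commit G            # feature: A B F G

base=$(git merge-base main feature)
base_subject=$(git log -1 --format=%s "$base")

only_main=$(git rev-list --count feature..main)      # C and D
only_feature=$(git rev-list --count main..feature)   # F and G
either=$(git rev-list --count main...feature)        # C, D, F, G

echo "merge base: $base_subject"
echo "only on main: $only_main, only on feature: $only_feature, symmetric: $either"
```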
How Merge Works in the DAG
git merge feature
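When the two branches have diverged, Git finds the merge base (common ancestor), computes the diff from the base to each branch tip, and combines them in a three-way merge. If main has no commits of its own since the fork, Git fast-forwards instead and creates no merge commit. A true merge commit has two parents: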
git cat-file -p HEAD # After merge
# tree ...
# parent abc1234 (main's previous HEAD)
# parent def5678 (feature's HEAD)
# ...
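The parent order can be checked with `HEAD^1` and `HEAD^2`. A minimal sketch with two diverged branches in a throwaway repository; file names and the identity are placeholders.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo base > a.txt; git add a.txt; git commit -q -m base

git checkout -q -b feature
echo feat > b.txt; git add b.txt; git commit -q -m "feature work"
feature_tip=$(git rev-parse HEAD)

git checkout -q main
echo main > c.txt; git add c.txt; git commit -q -m "main work"
main_tip=$(git rev-parse HEAD)

# Histories have diverged, so this creates a real merge commit
git merge -q --no-edit feature

parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
first_parent=$(git rev-parse 'HEAD^1')    # the branch you were on
second_parent=$(git rev-parse 'HEAD^2')   # the branch you merged in

echo "parents: $parent_count"
```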
How Rebase Works in the DAG
git rebase main # On feature branch
Git replays each commit from the feature branch on top of main. Each replayed commit is a new object with a new hash, because its parent has changed. The old commits become unreachable from the branch but remain recoverable through the reflog until its entries expire.
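The following sketch rebases a one-commit feature branch and compares the tip before and after. It uses a throwaway repository with placeholder file names, and assumes reflogs are enabled, which is the default for non-bare repositories.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo base > a.txt; git add a.txt; git commit -q -m base

git checkout -q -b feature
echo feat > b.txt; git add b.txt; git commit -q -m "feature work"
old_tip=$(git rev-parse HEAD)

git checkout -q main
echo main > c.txt; git add c.txt; git commit -q -m "main work"

git checkout -q feature
git rebase -q main
new_tip=$(git rev-parse HEAD)

new_parent=$(git rev-parse 'HEAD^')          # now main's tip
old_type=$(git cat-file -t "$old_tip")       # the old commit object still exists
from_reflog=$(git rev-parse 'feature@{1}')   # previous position of the branch

echo "old tip: $old_tip"
echo "new tip: $new_tip"
```

Same message, same change, different hash: the replayed commit is a copy, and the original is still one `feature@{1}` away.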
Plumbing Commands
Git has two layers: porcelain (user-facing) and plumbing (low-level).
Creating Objects Manually
# Create a blob from content
echo "hello" | git hash-object -w --stdin
# ce013625030ba8dba906f756967f9e9ca394464a
# Create a blob from a file
APP_BLOB=$(git hash-object -w myfile.js)
# Create a tree. mktree needs a TAB between the hash and the filename,
# and every referenced object must already exist in the object database.
printf '100644 blob %s\thello.txt\n100644 blob %s\tapp.js\n' \
  ce013625030ba8dba906f756967f9e9ca394464a "$APP_BLOB" | git mktree
# Returns the new tree hash
# Create a commit
echo "Initial commit" | git commit-tree <tree-hash>
# Returns the new commit hash
Inspecting the Index (Staging Area)
The index is a binary file at .git/index that tracks what will go into the next commit:
# Show the index contents
git ls-files --stage
# 100644 abc1234def5678901234567890abcdef12345678 0 app.js
# 100644 def5678901234567890abcdef12345678abc1234 0 package.json
# Show unmerged entries (conflicts)
git ls-files --unmerged
# Update the index manually
git update-index --add --cacheinfo 100644,<hash>,filename
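One thing these commands make clear is that the index is independent of the working tree. In this sketch, an entry for a made-up file name `app.js` is staged without the file ever existing on disk, and `git write-tree` still turns the index into a tree object.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .

# Store a blob directly; no file is created in the working tree
blob=$(echo 'console.log("hi");' | git hash-object -w --stdin)

# Register it in the index under the name app.js
git update-index --add --cacheinfo 100644,"$blob",app.js

staged=$(git ls-files --stage)
echo "$staged"

# Snapshot the index as a tree object
tree=$(git write-tree)
tree_type=$(git cat-file -t "$tree")
git ls-tree "$tree"
```

`git status` in this repository would report app.js as staged and simultaneously deleted in the working tree, which is exactly what the index and the disk disagree about.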
Ref Operations
# Read a ref
git rev-parse HEAD
git rev-parse main
git rev-parse v1.0.0
# Create a ref
git update-ref refs/heads/new-branch abc1234
# Delete a ref
git update-ref -d refs/heads/old-branch
# List all refs
git for-each-ref
git for-each-ref --format='%(refname:short) %(objectname:short) %(subject)' refs/heads/
Complete Working Example: Building a Commit from Scratch
#!/bin/bash
# Build a commit using only plumbing commands
# Initialize a new repo
mkdir plumbing-demo && cd plumbing-demo
git init
# 1. Create blob objects
BLOB_APP=$(echo 'var app = require("express")();' | git hash-object -w --stdin)
BLOB_PKG=$(echo '{"name": "demo", "version": "1.0.0"}' | git hash-object -w --stdin)
echo "Created blobs: app=$BLOB_APP pkg=$BLOB_PKG"
# 2. Create a tree object
TREE=$(printf "100644 blob $BLOB_APP\tapp.js\n100644 blob $BLOB_PKG\tpackage.json\n" | git mktree)
echo "Created tree: $TREE"
# 3. Verify the tree
git cat-file -p $TREE
# 4. Create a commit object
COMMIT=$(echo "Initial commit (built manually)" | \
GIT_AUTHOR_NAME="Shane" GIT_AUTHOR_EMAIL="[email protected]" \
GIT_COMMITTER_NAME="Shane" GIT_COMMITTER_EMAIL="[email protected]" \
git commit-tree $TREE)
echo "Created commit: $COMMIT"
# 5. Point main to the new commit
git update-ref refs/heads/main $COMMIT
# 6. Set HEAD to point to main
git symbolic-ref HEAD refs/heads/main
# 7. Check out the working tree
git checkout -f
# 8. Verify everything works
git log --oneline
git status
ls -la
cat app.js
echo "Done! A complete Git commit built from plumbing commands."
Common Issues and Troubleshooting
"loose object is corrupt"
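A file in .git/objects/ has been corrupted, possibly by a disk error or an incomplete write.
Fix: Run git fsck --full to identify the damaged objects, then remove the corrupted loose file. If the same object also exists in a packfile, Git reads the packed copy from then on. If it does not, copy the object from another clone or clone fresh from the remote; a plain git fetch origin may not help, because Git only fetches objects it believes are missing. Note that git unpack-objects skips objects the repository already contains, so it does nothing when run on a packfile that still sits in .git/objects/pack/; move the packfile elsewhere first.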
Repository is very large despite few files
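Old large objects remain in packfiles even after the files are deleted from the working tree.
Fix: Objects referenced by any commit in history persist. Use git filter-repo (a separate tool, not bundled with Git) to rewrite history without the large files. The old objects are only deleted once nothing references them, including the reflog, so if the repository does not shrink, run git reflog expire --expire=now --all followed by git gc --prune=now --aggressive.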
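Before rewriting anything, it helps to find out which objects are actually large. A common recipe, sketched here on a throwaway repository with a made-up 200 KB file, pipes every object in history through `git cat-file --batch-check` and sorts blobs by size.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Add a large file, then delete it in a later commit
head -c 200000 /dev/zero > big.bin
echo small > small.txt
git add .
git commit -q -m "add files"
git rm -q big.bin
git commit -q -m "remove big file"

# List every object reachable from any ref, keep blobs, largest first
largest=$(git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
  | awk '$1 == "blob"' \
  | sort -k2,2nr \
  | head -n 1)

echo "$largest"
# blob 200000 big.bin
```

big.bin no longer exists in the working tree, yet it tops the list, because the first commit still references it.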
"unable to resolve reference" errors
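A ref file in .git/refs/ is corrupted or empty.
Fix: Check the ref file: cat .git/refs/heads/main. If it is empty or garbled, delete it; Git then falls back to the entry for that ref in .git/packed-refs, if one exists. If there is no packed entry, find the last good hash in the branch reflog (.git/logs/refs/heads/main, or git reflog if it still runs) and, with the broken file removed, restore the ref: git update-ref refs/heads/main <hash>.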
Detached HEAD state after checkout
You checked out a commit hash or tag instead of a branch name.
Fix: Create a branch at the current position: git checkout -b my-branch. Or return to a branch: git checkout main. Detached HEAD is not an error — it just means HEAD points directly to a commit instead of through a branch ref.
Best Practices
- Understand that branches are pointers. A branch is a 41-byte file (a 40-character hash plus a newline). Creating, deleting, and switching branches is nearly free. Use branches liberally.
- Use git cat-file -p to inspect objects. When debugging, look at the raw objects. They tell you exactly what Git sees, without porcelain formatting.
- Let Git manage the object database. Do not manually modify files in .git/objects/. Use plumbing commands if you need low-level access.
- Run git gc periodically on large repos. Garbage collection packs loose objects and removes unreachable ones. The maintenance system handles this automatically if enabled.
- Remember the reflog is your safety net. Even after destructive operations, the reflog keeps references to old commits: by default 90 days for entries still reachable from the current tip and 30 days for unreachable ones. Use git reflog to find lost work.
- Study the DAG to understand merge and rebase. Once you see commits as nodes in a graph, merge creates a node with two parents, and rebase copies nodes onto a different parent. The mental model makes all Git operations predictable.
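As a closing illustration of the reflog as a safety net, this sketch throws away a commit with a hard reset and then gets it back. The branch name `rescue` and the file contents are made up; reflogs are assumed to be enabled, which is the default for non-bare repositories.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
# Hypothetical identity for the demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo one > f.txt; git add f.txt; git commit -q -m one
echo two >> f.txt; git commit -q -am two
lost=$(git rev-parse HEAD)

# Destructive: the branch no longer reaches commit "two"
git reset -q --hard HEAD~1

# The reflog still remembers where HEAD was one move ago
git reflog
recovered=$(git rev-parse 'HEAD@{1}')

# Pin the commit with a new branch so it is reachable again
git branch rescue "$recovered"
rescued_subject=$(git log -1 --format=%s rescue)

echo "recovered commit: $rescued_subject"
```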