AI-Powered Commit Analysis: Building Gitography's Intelligence Layer
Beyond Commit Messages
Most developer tools treat commits as metadata -- a hash, a message, a timestamp. But the real story lives in the diff. A commit message says "fix: resolve authentication issue." The diff shows you that someone added rate limiting to the login endpoint, switched from bcrypt to argon2, and patched a timing attack vulnerability. That's three different stories in one commit, and the message told you almost nothing.
Gitography was born from this gap. It analyzes actual code diffs to understand what changes mean semantically -- not what the developer said they did, but what they actually did. The intelligence layer powers everything else in the app: contribution insights, project timelines, complexity tracking.
Building this meant solving a surprisingly hard problem: how do you take a raw git diff, send it to an AI model, get structured analysis back, and do it reliably at scale without going broke? Here's how I built it.
The Architecture
Gitography runs on Next.js 15 with Tailwind and shadcn/ui for the frontend, Supabase for the database, and a BullMQ + Redis pipeline for background processing. The AI analysis uses Claude's API -- specifically claude-3-5-haiku-latest for the balance of speed, quality, and cost.
The flow looks like this:
GitHub Push Event
--> Webhook Handler (validate signature)
--> Ingestion Queue (fetch commit details + diffs)
--> Diff Filter (ignore/shallow/deep classification)
--> Analysis Queue (Claude API call)
--> Store insights in Supabase
--> Deduct user credits
Each step is isolated, retryable, and observable. If Claude's API goes down, the analysis queue backs up but nothing is lost. When it recovers, the queue drains and everything catches up.
GitHub Integration: Webhooks and Polling
The primary data source is a GitHub App. When a user connects their repository, Gitography installs a webhook that fires on every push event.
The webhook handler validates the request first:
import { createHmac, timingSafeEqual } from "crypto";

function validateWebhookSignature(
  payload: string,
  signature: string,
  secret: string
): boolean {
  const expected = `sha256=${createHmac("sha256", secret)
    .update(payload)
    .digest("hex")}`;
  const received = Buffer.from(signature);
  const computed = Buffer.from(expected);
  // timingSafeEqual throws if the buffers differ in length, so check first
  if (received.length !== computed.length) return false;
  return timingSafeEqual(received, computed);
}
HMAC-SHA256 validation prevents anyone from sending fake push events. The timingSafeEqual comparison prevents timing attacks -- a detail that's easy to overlook but important when you're validating secrets.
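The handler itself stays thin: validate the signature, enqueue, and return quickly so GitHub doesn't mark the delivery as failed. A sketch of what the Next.js route handler might look like (the route path, env var name, and payload fields are my assumptions; the ingestion queue is defined in the next section):

// app/api/webhooks/github/route.ts (path is an assumption)
export async function POST(req: Request) {
  const payload = await req.text();
  const signature = req.headers.get("x-hub-signature-256") ?? "";

  if (!validateWebhookSignature(payload, signature, process.env.GITHUB_WEBHOOK_SECRET!)) {
    return new Response("invalid signature", { status: 401 });
  }

  const event = JSON.parse(payload);
  // Hand off immediately; the ingestion worker fetches diffs asynchronously
  await ingestionQueue.add("ingest-push", {
    repoId: event.repository?.id,
    commits: event.commits?.map((c: { id: string }) => c.id),
  });

  return new Response("ok", { status: 200 });
}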
Not every repository has webhooks configured, though. Some users connect existing repos where they can't install apps (org restrictions, forked repos). For those, Gitography falls back to a polling sync that periodically checks for new commits.
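For those repos, a repeatable BullMQ job is a natural fit. A minimal sketch, using the polling-sync queue defined in the next section and assuming a 15-minute cadence (the real interval isn't specified here):

// Schedule a recurring poll for a repo that has no webhook
async function schedulePolling(repoId: string, lastSyncedSha: string) {
  await pollingSyncQueue.add(
    "poll-repo",
    { repoId, lastSyncedSha },               // assumption: payload shape
    { repeat: { every: 15 * 60 * 1000 } }    // re-run every 15 minutes
  );
}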
Three Queues, Three Concerns
I split the pipeline into three BullMQ queues, each with its own concurrency and rate limits:
import { Queue } from "bullmq";

// Queue 1: Ingestion
const ingestionQueue = new Queue("commit-ingestion", {
defaultJobOptions: {
attempts: 3,
backoff: { type: "exponential", delay: 1000 },
},
});
// Queue 2: Analysis
const analysisQueue = new Queue("commit-analysis", {
defaultJobOptions: {
attempts: 3,
backoff: { type: "exponential", delay: 1000 },
},
});
// Queue 3: Polling Sync
const pollingSyncQueue = new Queue("polling-sync", {
defaultJobOptions: {
attempts: 3,
backoff: { type: "exponential", delay: 2000 },
},
});
Each queue runs its own worker with different concurrency settings:
| Queue | Concurrency | Rate Limit | Purpose |
|---|---|---|---|
| Ingestion | 5 | 100/min | Fetch commit details + diffs from GitHub |
| Analysis | 3 | 60/min | Call Claude API, store insights |
| Polling Sync | 2 | 30/min | Check repos without webhooks |
The rate limits are deliberate. GitHub's API allows 5,000 requests per hour for an authenticated app, but bursty traffic trips its secondary rate limits sooner. Capping ingestion at 100/min smooths those bursts; real traffic rarely sustains that rate, so the hourly quota holds even during heavy push periods.
The analysis queue is more conservative at 60/min because Claude API calls are the most expensive operation, in both latency and cost. A concurrency of 3 means at most three simultaneous Claude requests, which keeps spend predictable.
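The concurrency and rate limits live on the workers rather than the queues. A sketch of what the analysis worker's options might look like with BullMQ's built-in limiter (the Redis connection details are an assumption; processAnalysisJob is shown later in this post):

import { Worker } from "bullmq";

// At most 3 analysis jobs in flight, and at most 60 started per minute
const analysisWorker = new Worker(
  "commit-analysis",
  async (job) => processAnalysisJob(job),
  {
    connection: { host: "localhost", port: 6379 }, // assumption: local Redis
    concurrency: 3,
    limiter: { max: 60, duration: 60_000 },        // 60 jobs per 60-second window
  }
);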
The Three-Tier Diff Filter
Not every file in a commit deserves AI analysis. Sending a 50,000-line package-lock.json diff to Claude would be expensive, slow, and useless. The diff filter classifies every file into one of three tiers before anything touches the AI.
IGNORED (100+ patterns): Files that are never analyzed. Lock files, binary assets, generated code, vendor directories. These add noise and zero signal.
const IGNORED_PATTERNS = [
// Lock files
"package-lock.json", "yarn.lock", "pnpm-lock.yaml",
"Gemfile.lock", "poetry.lock", "Cargo.lock",
// Binary and assets
"*.png", "*.jpg", "*.gif", "*.ico", "*.woff", "*.woff2",
// Generated
"*.min.js", "*.min.css", "*.map",
"dist/**", "build/**", ".next/**",
// Vendor
"node_modules/**", "vendor/**",
// ... 90+ more patterns
];
SHALLOW (30+ patterns): Files that get basic metadata extraction but no deep AI analysis. Configuration files, documentation, CI configs. Gitography notes that they changed but doesn't spend tokens understanding how.
const SHALLOW_PATTERNS = [
// Config
"*.config.js", "*.config.ts", "tsconfig.json",
".eslintrc*", ".prettierrc*",
// Docs
"*.md", "*.mdx", "LICENSE", "CHANGELOG*",
// CI/CD
".github/workflows/*", "Dockerfile", "docker-compose*",
// ... 20+ more patterns
];
DEEP (50+ language patterns): The actual source code. TypeScript, Python, Go, Rust, Ruby, Java -- anything that represents real logic changes. These are the files that get sent to Claude.
const DEEP_PATTERNS = [
// JavaScript/TypeScript
"*.ts", "*.tsx", "*.js", "*.jsx",
// Python
"*.py",
// Systems
"*.go", "*.rs", "*.c", "*.cpp",
// JVM
"*.java", "*.kt", "*.scala",
// ... 40+ more patterns
];
This filtering is the single biggest cost optimization in the pipeline. A typical push event might touch 15 files, but only 4-6 of those are source code worth analyzing. Filtering before the AI step cuts token usage by 60-70%.
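Applying the tiers is a first-match lookup over those pattern lists. A sketch of what the classifier might look like, here using the minimatch package for glob matching (the actual matching library and priority order are assumptions):

import { minimatch } from "minimatch";

type DiffTier = "ignored" | "shallow" | "deep";

// Checked in priority order: ignored wins over shallow, shallow over deep
function classifyFile(path: string): DiffTier {
  const matches = (patterns: string[]) =>
    patterns.some((p) => minimatch(path, p, { matchBase: true, dot: true }));

  if (matches(IGNORED_PATTERNS)) return "ignored";
  if (matches(SHALLOW_PATTERNS)) return "shallow";
  if (matches(DEEP_PATTERNS)) return "deep";
  return "shallow"; // assumption: unknown file types get metadata only, no AI tokens
}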
Token Management and Diff Truncation
Claude has context limits, and diffs can be enormous. A single file refactor might produce a 2,000-line diff. I needed a strategy for fitting diffs into the analysis window without losing important context.
The approach: estimate tokens, truncate intelligently, preserve line boundaries.
const CHARS_PER_TOKEN = 4; // Rough estimation
const MAX_DIFF_TOKENS = 8000;
const MAX_DIFF_CHARS = MAX_DIFF_TOKENS * CHARS_PER_TOKEN; // 32,000
function truncateDiff(diff: string): string {
if (diff.length <= MAX_DIFF_CHARS) return diff;
// Truncate at last complete line before the limit
const truncated = diff.substring(0, MAX_DIFF_CHARS);
const lastNewline = truncated.lastIndexOf("\n");
return lastNewline > 0
? truncated.substring(0, lastNewline) + "\n[... diff truncated]"
: truncated + "\n[... diff truncated]";
}
The ~4 characters per token is a rough heuristic. Actual tokenization varies by content -- code with lots of symbols tokenizes differently than prose. But for estimation purposes, 4 chars/token gets you within 20% accuracy, which is good enough for budget management.
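In code, that heuristic is nothing more than a division over the constants above (a trivial helper, shown for completeness):

// Rough token count from the ~4 chars/token heuristic
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}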
Preserving line boundaries matters because a truncated line in a diff looks like a syntax error to the model. Cutting at const user = await getUs versus cutting at the end of the previous complete line produces meaningfully different analysis quality.
The 8,000 token limit for the diff leaves room for the system prompt, output tokens, and overhead within the model's context window. I'd rather send a clean, truncated diff than a complete but rushed analysis.
The Claude Analysis Call
The core of the intelligence layer is a carefully structured Claude API call:
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic();
interface CommitAnalysis {
change_type: ChangeType;
complexity: Complexity;
summary: string;
affected_modules: string[];
languages_touched: string[];
technical_details: string;
confidence_score: number;
}
type ChangeType =
| "feature"
| "bugfix"
| "refactor"
| "performance"
| "security"
| "documentation"
| "testing"
| "dependency"
| "configuration"
| "style";
type Complexity =
| "trivial"
| "minor"
| "moderate"
| "significant"
| "major";
async function analyzeCommit(
message: string,
diff: string,
files: string[]
): Promise<CommitAnalysis> {
const response = await anthropic.messages.create({
model: "claude-3-5-haiku-latest",
max_tokens: 1024,
temperature: 0,
system: `You are a code analysis expert. Analyze the following git commit and respond with a JSON object containing:
- change_type: one of [feature, bugfix, refactor, performance, security, documentation, testing, dependency, configuration, style]
- complexity: one of [trivial, minor, moderate, significant, major]
- summary: 1-2 sentence description of what this commit actually does
- affected_modules: array of module/feature areas affected
- languages_touched: array of programming languages in the changes
- technical_details: brief technical explanation of the approach
- confidence_score: 0.0-1.0 how confident you are in this analysis
Respond ONLY with valid JSON. No markdown, no explanation.`,
messages: [
{
role: "user",
content: `Commit message: ${message}\n\nFiles changed: ${files.join(", ")}\n\nDiff:\n${truncateDiff(diff)}`,
},
],
});
const text = response.content[0].type === "text"
? response.content[0].text
: "";
return JSON.parse(text);
}
A few deliberate choices here:
temperature: 0 -- I want deterministic, consistent analysis. The same commit should produce the same classification every time. Creative variation is a liability when you're building analytics on top of AI output.
max_tokens: 1024 -- The response is a JSON object with short fields. 1024 tokens is generous but bounded. It prevents runaway responses that would eat the budget.
claude-3-5-haiku-latest -- Haiku is the right model for this task. The analysis doesn't require deep reasoning or nuanced understanding -- it needs fast, accurate classification of code changes. Haiku processes these in under a second at a fraction of the cost of Sonnet or Opus. At scale, the difference between Haiku and Sonnet pricing is the difference between a viable product and a money pit.
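One fragile spot in analyzeCommit is the bare JSON.parse at the end: despite the instructions, a model can occasionally wrap the object in a code fence or a stray sentence. A more forgiving parse step might look like this (my addition, not necessarily what Gitography ships; failures still fall through to the retry and fallback paths below):

// Pull the first {...} span out of the response before parsing,
// so a stray code fence or preamble doesn't crash the job
function parseAnalysisResponse(text: string): CommitAnalysis {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end === -1 || end <= start) {
    throw new Error("No JSON object found in model response");
  }
  return JSON.parse(text.slice(start, end + 1)) as CommitAnalysis;
}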
Retry Logic and Error Handling
AI APIs fail. They rate-limit you. They have bad days. The retry logic handles this gracefully:
const RETRY_CONFIG = {
maxAttempts: 3,
initialDelay: 1000, // 1 second
maxDelay: 30000, // 30 seconds
multiplier: 2, // exponential backoff
retryableStatuses: [429, 500, 502, 503, 529],
};
// Promise-based sleep used for backoff waits below
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function analyzeWithRetry(
message: string,
diff: string,
files: string[]
): Promise<CommitAnalysis> {
let lastError: Error | null = null;
let delay = RETRY_CONFIG.initialDelay;
for (let attempt = 1; attempt <= RETRY_CONFIG.maxAttempts; attempt++) {
try {
return await analyzeCommit(message, diff, files);
} catch (error: any) {
lastError = error;
const status = error?.status ?? error?.response?.status;
if (!RETRY_CONFIG.retryableStatuses.includes(status)) {
throw error; // Non-retryable, fail immediately
}
if (attempt < RETRY_CONFIG.maxAttempts) {
await sleep(delay);
delay = Math.min(delay * RETRY_CONFIG.multiplier, RETRY_CONFIG.maxDelay);
}
}
}
throw lastError;
}
The retryable status codes cover the common failure modes:
- 429: Rate limited -- back off and try again
- 500: Internal server error -- transient, usually resolves
- 502/503: Gateway errors -- infrastructure issues, temporary
- 529: Overloaded -- Claude-specific, means the API is busy
Non-retryable errors (400 bad request, 401 unauthorized) fail immediately. No point retrying a malformed request.
The exponential backoff (1s, then 2s, doubling toward the 30-second cap) prevents hammering the API while also not waiting unnecessarily long. With three attempts there are at most two waits, so the worst case is about 3 seconds of backoff before giving up.
Fallback Analysis
Sometimes Claude fails all retries. Sometimes the response isn't valid JSON. The system needs to keep moving. That's where fallback analysis comes in:
function fallbackAnalysis(
message: string,
files: string[]
): CommitAnalysis {
// Rule-based classification
const lowerMessage = message.toLowerCase();
let change_type: ChangeType = "feature";
if (lowerMessage.startsWith("fix")) change_type = "bugfix";
else if (lowerMessage.startsWith("refactor")) change_type = "refactor";
else if (lowerMessage.startsWith("test")) change_type = "testing";
else if (lowerMessage.startsWith("docs")) change_type = "documentation";
else if (lowerMessage.startsWith("chore")) change_type = "configuration";
else if (lowerMessage.startsWith("perf")) change_type = "performance";
else if (lowerMessage.startsWith("style")) change_type = "style";
const extensions = files.map((f) => f.split(".").pop() ?? "");
const languages = [...new Set(extensions.map(extToLanguage).filter(Boolean))];
return {
change_type,
complexity: files.length > 10 ? "significant" : "minor",
summary: message.replace(/^(feat|fix|chore|docs|refactor|test|perf|style)(\(.+\))?:\s*/i, ""),
affected_modules: extractModulesFromPaths(files),
languages_touched: languages,
technical_details: "Analysis based on commit message and file paths (AI analysis unavailable)",
confidence_score: 0.3,
};
}
The fallback uses conventional commit prefixes and file paths to make a best guess. The 0.3 confidence score is the key -- it signals to every downstream consumer that this analysis is approximate. The dashboard can show these differently, filter them out of high-confidence reports, or flag them for re-analysis when the AI is available again.
This is a pattern I use everywhere in Gitography: never block on AI, always have a degraded path, always label the quality of the data.
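For completeness, fallbackAnalysis leans on two helpers that aren't shown above. A minimal sketch of how they might work (the names come from the snippet; the bodies are my assumption):

// Map a file extension to a language name; unknown extensions map to ""
function extToLanguage(ext: string): string {
  const map: Record<string, string> = {
    ts: "TypeScript", tsx: "TypeScript", js: "JavaScript", jsx: "JavaScript",
    py: "Python", go: "Go", rs: "Rust", rb: "Ruby", java: "Java", kt: "Kotlin",
  };
  return map[ext.toLowerCase()] ?? "";
}

// Sketch: treat the first path segment ("src/auth/login.ts" -> "src") as the module
function extractModulesFromPaths(files: string[]): string[] {
  return [...new Set(files.map((f) => f.split("/")[0]))];
}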
Credit System and Cost Control
Running Claude on every commit for every user would bankrupt any indie project. The credit system makes costs predictable for both me and users:
| Plan | Price | Credits/Month |
|---|---|---|
| Free | $0 | 100 |
| Basic | $7/mo | 1,000 |
| Pro | $15/mo | 5,000 |
One credit equals one commit analysis. The math works out: Haiku costs roughly $0.00025-0.001 per analysis depending on diff size. At 5,000 analyses per month on the Pro plan, my worst-case cost is about $5, leaving $10 in margin. That's sustainable.
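A quick back-of-the-envelope check with those numbers:

// Worst case on the Pro plan, using the upper end of the per-analysis estimate above
const costPerAnalysis = 0.001;                         // USD, rough upper bound
const proCredits = 5_000;
const proPriceUsd = 15;
const worstCaseAiSpend = proCredits * costPerAnalysis; // $5
const margin = proPriceUsd - worstCaseAiSpend;         // $10 left for everything else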
Credits are deducted after a successful analysis, not before:
import type { Job } from "bullmq";

async function processAnalysisJob(job: Job) {
const { commitId, userId, message, diff, files } = job.data;
// Check credits before starting
const credits = await getUserCredits(userId);
if (credits <= 0) {
// Store with fallback analysis instead
const fallback = fallbackAnalysis(message, files);
await storeInsight(commitId, fallback);
return;
}
try {
const analysis = await analyzeWithRetry(message, diff, files);
await storeInsight(commitId, analysis);
await deductCredit(userId);
} catch (error) {
// AI failed, use fallback, don't deduct credit
const fallback = fallbackAnalysis(message, files);
await storeInsight(commitId, fallback);
}
}
If the AI fails, the user doesn't lose a credit. They get the fallback analysis for free. This feels fair and builds trust -- you only pay for real AI analysis.
Lessons from 30+ Commits of Rebuilding
The intelligence layer went through Phases 2 through 4 of Gitography's rebuild, spanning January 31 to February 2. Thirty-plus commits across three days. Here's what I learned.
Separate ingestion from analysis. My first design had one queue that fetched the diff and called Claude in the same job. When Claude was slow, it blocked the GitHub API calls. Splitting into two queues means ingestion stays fast and analysis can take its time.
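In practice the split is just the ingestion worker enqueueing an analysis job once the diff has been fetched and filtered. A sketch, assuming the payload shape that processAnalysisJob destructures above:

// Hand off from ingestion to analysis once the diff is fetched and tier-filtered
async function enqueueAnalysis(
  commitId: string,
  userId: string,
  message: string,
  diff: string,    // DEEP-tier hunks only
  files: string[]  // DEEP-tier file paths
) {
  await analysisQueue.add("analyze-commit", { commitId, userId, message, diff, files });
}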
Filter aggressively before the AI step. Early versions sent everything to Claude, including config file changes and documentation updates. The analysis was technically correct but useless -- "this commit updates the ESLint configuration to add the no-unused-vars rule" is not an insight anyone needs AI for. The three-tier filter cut token usage dramatically and improved the quality of what was left.
Temperature 0 is not optional for classification tasks. With temperature > 0, the same commit would sometimes be classified as "refactor" and sometimes as "feature." When you're building charts and trends on top of this data, inconsistency is a bug.
Always have a fallback. The 0.3 confidence fallback saved me during a Claude API outage early in testing. Commits kept flowing in, analysis kept happening (at lower quality), and when the API came back, everything was already processed. No backlog, no data gaps.
Budget for the worst case. Token estimation is an approximation. Some diffs tokenize heavier than expected. I set the diff limit conservatively at 8,000 tokens rather than pushing the context window. The marginal analysis quality from a longer diff isn't worth the marginal cost.
What's Next
The current pipeline handles single-commit analysis. The next step is cross-commit intelligence: understanding how a series of commits relates to each other, identifying patterns in a developer's work over time, and detecting when a project is shifting direction.
There's also batch re-analysis. When I improve the prompt or switch models, I want to re-run analysis on historical commits. The queue architecture makes this straightforward -- just enqueue the old commits with a reanalyze flag.
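A sketch of what that enqueue might look like, assuming a reanalyze flag on the same job payload and that the worker re-loads the stored diff for each commit:

// Re-queue historical commits after a prompt or model change
async function enqueueReanalysis(commits: { id: string; userId: string }[]) {
  await analysisQueue.addBulk(
    commits.map((c) => ({
      name: "analyze-commit",
      data: { commitId: c.id, userId: c.userId, reanalyze: true },
      opts: { priority: 10 }, // assumption: deprioritize behind fresh commits
    }))
  );
}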
The foundation is solid: GitHub webhooks for real-time data, BullMQ for reliable processing, Claude for intelligent analysis, and a credit system that keeps the lights on. Everything else builds on top.