Building a Gmail CLI for Claude Code (And Why the MCP Wasn't Enough)

The MCP Wasn't Enough

Claude.ai ships a Gmail MCP. I used it for months. It reads mail fine. It searches fine. The moment you ask it to do anything that mutates state (send a reply, label a thread, archive a conversation, save a draft), a few things break down.

No log of what just happened. If Claude sent an email on my behalf, my only record was in the Sent folder, mixed with everything else.

Thread replies sometimes flatten. In-Reply-To and References weren't consistently assembled from the parent's headers, so Gmail would occasionally split a thread.

And it doesn't work on my Mac Mini at all. My Mini is where the morning-brief pipeline runs. It has no claude.ai desktop client, so it has no MCP. Which means any automation that needed to touch email had to happen on the MacBook. Which means cross-device friction.

None of this is Anthropic's fault. The MCP is a reasonable generalist. I wanted a specialist.

So I built one.

What I Actually Needed

Before writing code, I wrote down what I wanted from the tool. Not features. Guarantees.

Every write, logged. Sending, replying, labeling, drafting, archiving. One greppable line per action, appended to a file in my vault. If I ever ask "did Claude send something on Thursday?", the answer is a grep for Thursday's date in ACTIVITY_LOG.md.

Thread-aware replies that don't break conversations. Fetch the parent, pull Message-ID, extend References, inherit the Re: subject without double-prefixing.

A dry-run mode for every outbound. Show me the MIME. Let me confirm. Then send. Same contract as git push: visible externally, not fully reversible.

Works the same on both machines. Mini, MacBook, any directory. No cross-device special cases.

Distributes via one command. I don't want to maintain a PyPI package for a private tool. uv tool install git+ssh://... from my private repo. Done.

Write those down first, in that order, and the architecture writes itself.

The Shape

claude-gmail is a Python CLI. One binary, seven subcommands:

claude-gmail auth    init | status
claude-gmail search  --query 'from:foo after:2026/04/10'
claude-gmail read    --message-id ... | --thread-id ...
claude-gmail send    --to ... --subject ... --body-file ...
claude-gmail reply   --message-id ... --body-file ...
claude-gmail label   list | add | remove | replace
claude-gmail draft   create | list | send

Two runtime dependencies: google-auth and google-auth-oauthlib. Everything else is stdlib. imaplib and smtplib for the wire protocols. email.message.EmailMessage for MIME. argparse for the CLI. That's the whole tool.

Keeping the dependency surface tiny matters. This is a tool I hand OAuth credentials to. I want to read every line it runs. Two deps I can audit. Twenty deps I cannot.

Six Phases, One Commit Per Phase

I split the build into six phases, locked the plan into a VALIDATE.md in the repo, and ran the whole thing in one day with one atomic commit per phase.

The discipline matters. If you try to build an OAuth-authed, SMTP-dispatching, IMAP-mutating, vault-logging CLI as one lump, you get a lump. Phase it. Let each commit be a thing you can revert without tearing out the rest.

Phase 0: Skeleton

The argparse router and the package layout. Nothing else. Seven subcommands registered as stubs that just print "phase N will implement this." uv.lock committed. src/claude_gmail/ directory structure finalized.
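A minimal sketch of what a do-nothing router like this could look like. The subcommand names come from the list above; the phase numbers are read off the phase headings, and the wiring itself is an assumption, not the actual source.

```python
import argparse
from typing import Optional, Sequence

# Subcommand -> phase that implements it (inferred from the phase headings).
SUBCOMMANDS = {
    "auth": 1, "search": 1, "read": 1,
    "send": 2, "reply": 2,
    "label": 3,
    "draft": 4,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="claude-gmail")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, phase in SUBCOMMANDS.items():
        stub = sub.add_parser(name)
        # Each stub only carries the message it will print.
        stub.set_defaults(message=f"phase {phase} will implement this")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    print(args.message)
    return 0
```

Calling main(["send"]) prints the stub message and returns 0, which is all a Phase 0 needs to prove.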

If your Phase 0 already works and does nothing, your plan is probably sound.

Phase 1: OAuth, IMAP, SMTP, Read-Only

The biggest phase. OAuth Installed App flow. Token file at ~/.config/claude-gmail/token.json, mode 0600, not synced between machines. A refresh flow that renews the access token before it expires.
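One way to get the 0600 guarantee is to create the file with restrictive permissions from the start instead of writing it and chmod-ing afterwards. A sketch, assuming the token is a JSON-serializable dict; save_token is a hypothetical name, not a function from the tool.

```python
import json
import os
from pathlib import Path


def save_token(token: dict, path: Path) -> None:
    """Write the token as JSON, readable and writable by the owner only."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Passing the mode to os.open means a new file is never created
    # with wider permissions, not even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        json.dump(token, fh)
    # The mode argument is ignored for a file that already existed,
    # so tighten it explicitly as well.
    os.chmod(path, 0o600)
```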

Then the two network clients.

imap_client.py speaks XOAUTH2 to imap.gmail.com:993. It uses Gmail's Special-Use LIST to find the All Mail and Drafts folders instead of hardcoding paths. Gmail localizes those names ("Enviados" in Portuguese, "Sent" in English), so I can't rely on the string. It fetches messages by X-GM-MSGID and threads by X-GM-THRID because those are the stable Gmail IDs that don't change when a message moves.
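The folder lookup can be sketched as a small parser over the lines that imaplib's list() returns. This is an illustration, not the code in imap_client.py: the function name and the regex are assumptions, and it ignores mailbox names sent as IMAP literals or containing escaped quotes.

```python
import re

# One LIST response line looks like:
#   (\HasNoChildren \Drafts) "/" "[Gmail]/Rascunhos"
_LIST_RE = re.compile(rb'^\((?P<flags>[^)]*)\) (?P<delim>"[^"]*"|NIL) (?P<name>.+)$')

# Special-use attributes from RFC 6154 that are worth resolving.
_WANTED = {b"\\All", b"\\Drafts", b"\\Sent", b"\\Trash"}


def special_use_folders(list_lines):
    """Map special-use flags to the mailbox name as sent on the wire."""
    found = {}
    for line in list_lines:
        match = _LIST_RE.match(line)
        if not match:
            continue
        flags = set(match.group("flags").split())
        # The name stays in its wire form (still modified UTF-7).
        name = match.group("name").strip(b'"').decode("ascii")
        for flag in flags & _WANTED:
            found[flag.decode("ascii")] = name
    return found
```

With this, the localized name never appears in the code; callers ask for "\\Drafts" or "\\All" and get whatever the account calls it.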

smtp_client.py speaks XOAUTH2 to smtp.gmail.com:587 with STARTTLS. No app passwords. Not even as a fallback. Google discourages app passwords; I don't want code paths that encourage their use.
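For reference, the XOAUTH2 initial response is a single string with Ctrl-A separators, and the same string serves IMAP and SMTP. A sketch under the assumption that an access token is already in hand; the function names are hypothetical and smtp_connect is not exercised against a live server here.

```python
import base64
import smtplib


def xoauth2_string(user: str, access_token: str) -> bytes:
    """Build the raw (not yet base64-encoded) XOAUTH2 initial response."""
    return f"user={user}\x01auth=Bearer {access_token}\x01\x01".encode("ascii")


def smtp_connect(user: str, access_token: str,
                 host: str = "smtp.gmail.com", port: int = 587) -> smtplib.SMTP:
    """Open an SMTP session, upgrade to TLS, then authenticate with XOAUTH2."""
    smtp = smtplib.SMTP(host, port)
    smtp.ehlo()
    smtp.starttls()
    smtp.ehlo()
    payload = base64.b64encode(xoauth2_string(user, access_token)).decode("ascii")
    code, reply = smtp.docmd("AUTH", "XOAUTH2 " + payload)
    if code != 235:  # 235 means authentication succeeded
        smtp.close()
        raise smtplib.SMTPAuthenticationError(code, reply)
    return smtp

# For IMAP, imaplib does the base64 step itself:
#   imap.authenticate("XOAUTH2", lambda _challenge: xoauth2_string(user, token))
```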

Then two subcommands that only read: search (X-GM-RAW query syntax, same as Gmail's search box) and read (message or full thread, plain text or JSON, optional attachment save).

No writes in Phase 1. Every piece of the network and auth layer gets validated against a read-only workload before anything mutates state.

Phase 2: Send, Reply, And The Log

This is where the tool becomes dangerous, so this is where the log lands.

log.py is 130 lines. It writes one greppable line per action:

2026-04-17T05:38:39-03:00 hels-Mac-mini [ayeeye] reply msgid=... to=... subject='Re: ...' ok
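A sketch of a formatter that produces lines of this shape. The layout (timestamp, host, optional [context], action, key=value pairs, status) is read off the sample lines in this post; the quoting rules in render_value are a guess, not the actual log.py.

```python
import socket
from datetime import datetime
from typing import Optional


def render_value(value) -> str:
    """Assumed rules: lowercase booleans, quote anything containing whitespace."""
    if isinstance(value, bool):
        return "true" if value else "false"
    text = str(value)
    return repr(text) if any(ch.isspace() for ch in text) else text


def format_line(action: str, fields: dict, *, context: Optional[str] = None,
                status: str = "ok", now: Optional[datetime] = None,
                host: Optional[str] = None) -> str:
    now = now or datetime.now().astimezone()  # local time with UTC offset
    host = host or socket.gethostname()
    parts = [now.isoformat(timespec="seconds"), host]
    if context:
        parts.append(f"[{context}]")
    parts.append(action)
    parts.extend(f"{key}={render_value(val)}" for key, val in fields.items())
    parts.append(status)
    return " ".join(parts)
```

Keeping the timestamp first and the status last means both a date prefix and a trailing ok are easy grep targets.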

Canonical path: ~/helsky-vault/contexts/helrabelo/claude-gmail/ACTIVITY_LOG.md. Every write appends here.

When --context <tag> is passed, the same line mirrors to a second log under a work context:

--context ayeeye         -> contexts/ayeeye/ACTIVITY_LOG.md
--context planetary:dtf  -> contexts/planetary/dtf/ACTIVITY_LOG.md

The colon is the path separator. No alias translation, no clever inference from email content. If I want a log entry to show up under a client's folder, I pass the flag. If I forget, it only lands in the canonical log.
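The tag-to-path rule is small enough to show in full. A sketch, assuming the mirror paths are relative to the vault root; mirror_path is a hypothetical name and the rejection of empty or path-like segments is an added safety assumption.

```python
from pathlib import Path


def mirror_path(context: str, vault: Path) -> Path:
    """Turn 'planetary:dtf' into <vault>/contexts/planetary/dtf/ACTIVITY_LOG.md."""
    parts = context.split(":")
    # Refuse empty segments and anything that could escape the contexts dir.
    if any(p in ("", ".", "..") or "/" in p for p in parts):
        raise ValueError(f"invalid context tag: {context!r}")
    return vault.joinpath("contexts", *parts, "ACTIVITY_LOG.md")
```

There is deliberately no lookup table here: the tag is the path, which is what keeps the rule free of aliases and inference.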

Mirror failures don't abort the canonical write. If I typo --context hlesky-labs, the canonical log still gets the line, and stderr tells me the mirror failed. The tool does the safe thing by default.
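The isolation described above comes down to ordering plus one try/except. A sketch, assuming an unknown context means the mirror's directory does not exist (so a typo never creates a new folder); write_log and append_line are hypothetical names.

```python
import sys
from pathlib import Path
from typing import Optional


def append_line(path: Path, line: str) -> None:
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def write_log(line: str, canonical: Path, mirror: Optional[Path] = None) -> bool:
    """Append to the canonical log, then try the mirror.

    Returns False when the mirror write failed. A canonical failure is
    not caught: that one should be loud.
    """
    canonical.parent.mkdir(parents=True, exist_ok=True)
    append_line(canonical, line)
    if mirror is None:
        return True
    try:
        if not mirror.parent.is_dir():
            raise FileNotFoundError(f"unknown context directory: {mirror.parent}")
        append_line(mirror, line)
    except OSError as exc:
        print(f"warning: mirror log failed: {exc}", file=sys.stderr)
        return False
    return True
```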

Dry runs skip both logs. --no-log also skips both but performs the real action. Those are the two escape hatches.

send has the flags you'd expect: --to, --cc, --bcc, --subject, --body or --body-file, --html, repeatable --attachment. Bcc is kept off the MIME and only used for the envelope recipient list.

reply is the one that actually needed design. Given a --message-id or --thread-id, it fetches the parent, then builds:

In-Reply-To: the parent's Message-ID.
References: parent's existing References chain, plus the parent's Message-ID appended. If the parent had no References but had an In-Reply-To, promote that into the chain. If neither existed, start the chain with the parent's Message-ID.
Subject: the parent's subject, with Re: prepended unless it already starts with a case-insensitive Re:.
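The three rules above fit in one function. This is a sketch of the logic as described, not the tool's own code; reply_headers is a hypothetical name and the parent is assumed to be an already-parsed email.message.Message.

```python
from email.message import Message


def reply_headers(parent: Message) -> dict:
    """Derive In-Reply-To, References and Subject for a reply to parent."""
    parent_id = (parent.get("Message-ID") or "").strip()

    # Start from the parent's References chain; fall back to its In-Reply-To.
    refs = str(parent.get("References") or "").split()
    if not refs:
        refs = str(parent.get("In-Reply-To") or "").split()
    # Then append the parent itself (this also covers the "neither" case).
    if parent_id and parent_id not in refs:
        refs.append(parent_id)

    subject = str(parent.get("Subject") or "").strip()
    if not subject.lower().startswith("re:"):
        subject = f"Re: {subject}"

    return {
        "In-Reply-To": parent_id,
        "References": " ".join(refs),
        "Subject": subject,
    }
```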

--reply-all unions the parent's To and Cc minus my own address. Dedup across both lists so my address doesn't sneak back in via Cc when it's already in To.
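A sketch of the recipient math. One assumption goes beyond the description above: the reply's To is taken from the parent's Reply-To or From, which the prose does not state but the dry-run output below suggests. The function name is hypothetical.

```python
from email.message import Message
from email.utils import getaddresses


def reply_all_recipients(parent: Message, me: str):
    """Return (to, cc) for a reply-all, with my address and duplicates removed."""
    seen = {me.lower()}  # seeding with my own address filters it everywhere

    def take(header_values):
        picked = []
        for _name, addr in getaddresses([str(v) for v in header_values]):
            key = addr.lower()
            if addr and key not in seen:
                seen.add(key)
                picked.append(addr)
        return picked

    # Assumption: the original sender (or their Reply-To) becomes To.
    to = take(parent.get_all("Reply-To") or parent.get_all("From") or [])
    # Everyone else on the parent's To and Cc lands in Cc, once.
    cc = take((parent.get_all("To") or []) + (parent.get_all("Cc") or []))
    return to, cc
```

Sharing one seen set across both lists is what stops an address from appearing twice, regardless of casing or which header it came from.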

$ claude-gmail reply --thread-id 186xxxx --body-file /tmp/reply.txt \
    --reply-all --context ayeeye --dry-run
=== DRY RUN (reply) ===
From: Hel Rabelo <helrabelo@gmail.com>
To: michael@example.com
Cc: legal@example.com
Subject: Re: Phase 2 scope
In-Reply-To: <69e1f13f.050a0220.90519.ea11@mx.google.com>
References: <69e1f13f.050a0220.90519.ea11@mx.google.com>
Content-Type: text/plain
---
Hi Michael,

Acknowledged. I'll send the invoice by Friday.

Hel
=== END DRY RUN ===

Confirm. Re-run without --dry-run. Two lines in the log: canonical plus the ayeeye mirror. Done.

Phase 3: Labels and Modified UTF-7

Gmail's IMAP wire format for labels is the single most annoying detail in this whole project.

Labels come through X-GM-LABELS. System labels (starting with a backslash, like \Inbox, \Sent, \Category_Promotions) are flag-like and pass through as-is. User labels encode as modified UTF-7 per RFC 3501 section 5.1.3.

Modified UTF-7 is like regular UTF-7, with three modifications that will absolutely catch you:

Printable ASCII (0x20-0x7E) passes through, except & which becomes &-.
Non-ASCII runs are encoded as &<base64>- where the base64 uses , instead of / and omits padding.
The surrounding & and - are delimiters, not part of the payload.

So R&D becomes R&-D. 日本 becomes &ZeVnLA-. Não lido (Portuguese for "Unread", a label I actually use) becomes N&AOM-o lido.

I wrote a 90-line labels.py with mutf7_encode and mutf7_decode and an encode_label_imap that additionally quotes the result when it contains IMAP-unsafe characters. System labels bypass everything.

>>> encode_label_imap("\\Inbox")
'\\Inbox'
>>> encode_label_imap("R&D")
'R&-D'
>>> encode_label_imap("Não lido")
'"N&AOM-o lido"'
>>> encode_label_imap("日本")
'&ZeVnLA-'
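For readers who want the rules as code, here is a compact sketch of the three functions. It follows RFC 3501 section 5.1.3 and reproduces the examples above, but it is an independent illustration, not the 90-line labels.py; the set of characters that trigger quoting is an assumption.

```python
import base64


def mutf7_encode(text: str) -> str:
    out, pending = [], []

    def flush():
        # Encode the buffered non-ASCII run as UTF-16BE, base64 with ',' for '/'.
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=").replace("/", ",")
            out.append(f"&{b64}-")
            pending.clear()

    for ch in text:
        if 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append("&-" if ch == "&" else ch)
        else:
            pending.append(ch)
    flush()
    return "".join(out)


def mutf7_decode(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        end = text.index("-", i)          # closing delimiter of this run
        chunk = text[i + 1:end]
        if not chunk:                     # "&-" is a literal ampersand
            out.append("&")
        else:
            b64 = chunk.replace(",", "/")
            b64 += "=" * (-len(b64) % 4)  # restore the omitted padding
            out.append(base64.b64decode(b64).decode("utf-16-be"))
        i = end + 1
    return "".join(out)


def encode_label_imap(label: str) -> str:
    if label.startswith("\\"):            # system labels pass through untouched
        return label
    encoded = mutf7_encode(label)
    if encoded == "" or any(ch in ' "\\(){%*]' for ch in encoded):
        escaped = encoded.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return encoded
```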

Archive is just label remove --thread-id ... "\\Inbox". Star is label add ... "\\Starred". The label subcommand doubles as an archive and star primitive, which means I don't need dedicated archive or star subcommands. Less surface area.

Phase 4: Drafts

Three actions: create, list, send.

draft create builds the MIME, then APPENDs the bytes to \Drafts. Gmail doesn't give you the new X-GM-MSGID from the APPEND response, so after the write I issue an X-GM-RAW rfc822msgid:<...> search to resolve it, using the Message-ID header we stamped on the way out. That msgid gets printed to the user and logged.
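The stamp-then-resolve step might look like the sketch below. The function names are hypothetical, the imap argument is assumed to be an authenticated imaplib connection with the target folder already selected, and passing the Message-ID through with its angle brackets is an assumption about how the search term is written.

```python
import re
from email.message import EmailMessage
from email.utils import make_msgid
from typing import Optional


def stamp_message_id(msg: EmailMessage, domain: Optional[str] = None) -> str:
    """Give the message a Message-ID before APPEND so it can be found again."""
    if msg["Message-ID"] is None:
        msg["Message-ID"] = make_msgid(domain=domain)
    return str(msg["Message-ID"]).strip()


def resolve_gm_msgid(imap, message_id: str) -> Optional[str]:
    """Look up the X-GM-MSGID of the message carrying this Message-ID."""
    query = '"rfc822msgid:%s"' % message_id
    status, data = imap.uid("SEARCH", "X-GM-RAW", query)
    if status != "OK" or not data or not data[0]:
        return None
    uid = data[0].split()[-1].decode("ascii")  # newest match wins
    status, data = imap.uid("FETCH", uid, "(X-GM-MSGID)")
    if status != "OK" or not data or data[0] is None:
        return None
    match = re.search(rb"X-GM-MSGID (\d+)", data[0])
    return match.group(1).decode("ascii") if match else None
```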

Threaded drafts get the Phase 2 reply machinery for free: pass --thread-id, the tool fetches the parent, inherits In-Reply-To + References + Re: subject.

draft list enumerates open drafts with subject, recipients, age, size. Read-only. Not logged.

draft send --draft-id ... fetches the draft bytes, reads recipients from To + Cc + Bcc, strips the Bcc header from the outgoing MIME (Bcc recipients should not appear in the message the recipients receive), dispatches via SMTP, then marks the draft \Deleted and issues an EXPUNGE on \Drafts.
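The recipient handling in draft send can be sketched as one function over the stored draft bytes. The name is hypothetical; the order of operations (collect recipients first, strip Bcc second) is the point.

```python
import email
from email import policy
from email.utils import getaddresses


def prepare_draft_for_send(raw: bytes):
    """Return (message without Bcc, envelope recipient list)."""
    msg = email.message_from_bytes(raw, policy=policy.SMTP)

    # Collect envelope recipients BEFORE touching the headers.
    header_values = []
    for name in ("To", "Cc", "Bcc"):
        header_values.extend(str(v) for v in msg.get_all(name, []))
    recipients = []
    for _name, addr in getaddresses(header_values):
        if addr and addr not in recipients:
            recipients.append(addr)

    # Bcc recipients still get the mail, but nobody sees the header.
    del msg["Bcc"]
    return msg, recipients
```

The result can be handed to smtplib as smtp.send_message(msg, to_addrs=recipients). send_message already refuses to transmit Bcc headers, so the explicit delete is a second guard that also protects any other transport.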

The Bcc strip is the detail I'd have gotten wrong if I hadn't written it down in VALIDATE.md first.

Phase 5: Distribution

No code in this phase. Just plumbing.

uv tool install git+ssh://git@github.com/helsky-labs/claude-gmail.git
ln -s ~/code/tooling/claude-gmail/.env ~/.config/claude-gmail/.env
claude-gmail auth init

Three commands per machine. The uv tool install drops the binary at ~/.local/bin/claude-gmail, so it's on PATH from any directory. The symlink is there because the tool looks up .env in a specific order: $CLAUDE_GMAIL_ENV, ~/.config/claude-gmail/.env, $CWD/.env, <repo>/.env. Without the symlink, running the tool from ~ would fail because it can't find credentials.
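The lookup order above, sketched as a function. How the installed tool locates the repo directory is not covered in this post, so repo_root is passed in as a parameter; falling through when $CLAUDE_GMAIL_ENV points at a missing file is also an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_env_file(repo_root: Path,
                  environ: Mapping[str, str] = os.environ,
                  home: Optional[Path] = None,
                  cwd: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing .env in the documented search order."""
    home = home or Path.home()
    cwd = cwd or Path.cwd()
    candidates = []
    override = environ.get("CLAUDE_GMAIL_ENV")
    if override:
        candidates.append(Path(override).expanduser())
    candidates += [
        home / ".config" / "claude-gmail" / ".env",
        cwd / ".env",
        repo_root / ".env",
    ]
    for candidate in candidates:
        if candidate.is_file():  # follows symlinks, so the ln -s setup works
            return candidate
    return None
```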

I rewrote my global ~/.claude/CLAUDE.md "Email Handling" section to document this. Every machine now has an install ritual that takes one minute and survives across sessions.

Phase 6: Tests

Optional per the plan. I shipped it anyway.

81 unit tests. All pure: no network, no real filesystem outside pytest's tmp_path. Five files:

test_labels.py: mUTF-7 round-trip across ASCII, quoted, system labels, & escape, non-ASCII. 21 tests.
test_log.py: line format, value rendering rules, mirror resolution (happy, empty, unknown), canonical + mirror write, mirror failure isolated from canonical. 26 tests.
test_send.py, test_reply.py, test_draft.py: MIME assembly, thread header chains, Re: dedupe, reply-all self-filter, Bcc strip, Message-ID stamping. 34 tests total.

Runtime: 0.20 seconds. pytest lives in the dev group under [dependency-groups], so it's a dev-only dep; runtime deps stayed at two.

No E2E tests. The OAuth flow would need a throwaway Gmail account, and the dry-run ritual already catches the regressions that would actually bite. If and when I move OAuth out of Google's Testing mode (so refresh tokens stop expiring weekly), E2E becomes worth it.

The Log Is The Feature

If I had to point at one thing that makes this tool different from the MCP, it's the log.

2026-04-17T05:37:20-03:00 hels-Mac-mini send to=helrabelo@gmail.com subject='claude-gmail phase 2 smoke (send real)' attachments=0 html=false ok
2026-04-17T05:38:39-03:00 hels-Mac-mini [ayeeye] reply msgid=... to=... subject='Re: ...' ok
2026-04-17T06:43:57-03:00 hels-Mac-mini label action=add thrid=... labels-added=claude/phase3-smoke ok
2026-04-17T07:23:01-03:00 hels-Mac-mini draft action=create msgid=... to=... subject='...' ok

Every write that happened. Every context tag. Every machine. Every timestamp. Grep-friendly.

This is what makes me trust Claude with the inbox. Not the dry-run, not the tests, not the OAuth scoping, though all three matter. The log. I can audit what happened last week in three seconds.

The MCP does not have this. It couldn't, really. The MCP lives in claude.ai's process, and my vault doesn't. Once I had the constraint "every write, logged, in my vault", the shape of the tool fell out.

If you're thinking about building your own Claude Code tooling for a specific system (not a generalist, but a specialist for the one workflow you do every day), the log is where I'd start. Not the feature list. The log.

Build the thing that writes the receipts first. Everything else is easier once you trust what you built.

What's Left

A few things the tool does not yet do.

HTML-reply quoting. Today reply --html sets the Content-Type but does not wrap the parent body in a <blockquote>. Fine for plain-text threads. Less graceful for HTML-heavy ones.

OAuth consent screen productionization. Google's Testing mode limits refresh tokens to seven days. Every Monday I re-run auth init on each machine. A small tax, but a tax. Moving the consent screen to production unlocks proper refresh-token lifetimes and E2E tests against a throwaway account.

Shared attachment helper. Three subcommands (send, reply, draft create) each have their own _attach_file. Same code. When the fourth subcommand needs attachments, I'll factor it out. Not yet.

None of these are blocking. The tool ships today with everything I actually need for the morning-brief pipeline, AyeEye email, Planetary client threads, and personal correspondence. v2 is triggered by a real need, not a refactor itch.

The Honest Takeaway

I wrote this in one day because I had written the plan first. RESEARCH.md, PLAN.md, VALIDATE.md, six phases with clear exit criteria. The code was the easy part.

The MCP is a good tool. For reads, for casual use, for workflows where the log doesn't matter, keep using it. I do.

For anything that mutates state, my rule is now: if a tool can send a message on my behalf, I want to see the receipt in a file I control. Everything else (the subcommands, the threading, the OAuth, the tests) was just the work required to earn that one guarantee.