
Build an Email Follow-Up Agent.

Your inbox is where commitments go to die: promises made and forgotten, replies you are owed, denials that deserve a real answer. This guide points the context architecture behind the healthcare appeals and tax prep workflows at all three.

The goal is not an email autopilot. The goal is a loop that ingests your inbox every twenty minutes, keeps a commitments ledger, and files cited drafts, behind a send boundary the agent cannot cross: an ignored draft means no.

People lose high-friction paperwork fights because their information is scattered, unstructured, uncited, and incomplete. The fix is to own the context: collect the mess, normalize it, ground it in source documents, and produce the next human-reviewed action.

Fixture first: Build and prove offline.

MCP ingest: Live on a 20-min cadence.

Send boundary: Nothing auto-sends.

01 / Shared skeleton

Same runbook shell, pointed at your inbox.

Email follow-up uses the same primitives as healthcare and tax: own the context first, then draft the next reviewed action.

Do not position this as an email autopilot. The v1 is a follow-up organizer: it turns a mailbox export into a commitments ledger and cited drafts that a person approves and sends.

The email problem is context disorder plus a send boundary.

The email version applies that pattern to the threads you keep losing. The agent rebuilds the state of every conversation and drafts the next message. It never sends one.

This is the vertical where the boundary earns its keep. The story that motivated this guide family is an agent that drafted a reply to an insurance company, watched its human ignore the draft, and sent it anyway. It won the fight and crossed the line in the same move. Everything below is the discipline that keeps the win and removes the accident.

You do

Export a mailbox, or use the synthetic fixture inbox, and drop it into the starter repo.

The AI does

Reconstruct threads, build the commitments ledger, and draft cited follow-ups that stop at the send boundary.

Show the full prompt

<prompt>
  <task>Build a document-grounded case workflow from reusable Open Skills primitives.</task>
  <thesis>
    People lose because their information is scattered, unstructured, uncited, and incomplete.
    The workflow should help the person own their context, not outsource judgment to a black box.
  </thesis>
  <primitive_chain>
    <step>Ingest documents into markdown/text with raw source coordinates as anchors. Use PDF page/region, CSV line number, or form box identifiers, and embed the identical anchor scheme in the markdown that downstream citations will use. Keep one numbering scheme end to end.</step>
    <step>Chunk and tag source evidence by structure.</step>
    <step>Normalize the case facts into a ledger.</step>
    <step>Run the coverage gate. Every ingested document must produce at least one normalized record or be explicitly marked reference-only. Print the list of unconsumed documents and stop before drafting if any document is unaccounted for.</step>
    <step>Reconcile shared facts across sources before drafting. Compare the same fact anywhere it appears, turn every mismatch into a named review question, and record which source governs the tracked value.</step>
    <step>Store chunks, records, mappings, and outputs in SQLite by default.</step>
    <step>Optional: if you already run OB1, mirror the case store into Open Brain; otherwise skip this step entirely. SQLite is the complete beginner path.</step>
    <step>Retrieve relevant evidence deterministically before drafting.</step>
    <step>Validate citations before export. The citation guard returns pass / needs_review / fail verdicts. Any fail blocks packet export until fixed or converted to a named review question, and the guard verdict summary must appear in the packet README.</step>
    <step>Export an editable packet and stop at human review.</step>
  </primitive_chain>
  <constraint>The agent organizes and drafts. It does not sign, send, file, submit, authorize, or transmit sensitive data.</constraint>
</prompt>

The chain does not change because the documents are emails.

The runbook calls the same Open Skills in order: ingestion, chunking/tagging, normalization, SQLite or Open Brain storage, deterministic retrieval, citation validation, packet export, and the human gate. Email changes the anchors and the ledger schema, not the architecture. The message-id becomes the citation anchor the way a PDF page or a form box did in the other two guides.

SQLite stays the beginner path. Open Brain is the OB1 path, and email is where it pays off fastest: the people, preferences, and promises the ledger extracts are exactly the durable context the rest of your agents want.

Open the Context Engineering skill primitives →

02 / Data strategy

Use real message formats and a synthetic inbox.

The demo is realistic where parsing matters and synthetic where privacy matters.

Real formats, fake people.

Build the fixture inbox as real RFC 5322 messages in a real mbox file so ingestion earns its keep: true headers, In-Reply-To and References chains, multipart bodies, at least one attachment, and a real Sent folder. Make the senders, names, companies, amounts, and stakes synthetic.

Two fixture rules matter more than volume. Include Sent mail, because without your outbound messages the agent cannot know which loops you already closed. Include noise, a newsletter cluster and automated receipts, so triage has something to legitimately exclude as reference-only.

The fixture inbox is the permanent test bench, not a stepping stone you throw away. Every gate below gets proven here, on a corpus where runs are reproducible, before the loop touches live mail in section 05.
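One way to build such a fixture is with Python's standard `email` and `mailbox` modules. The sketch below is a minimal illustration, not the starter repo's actual builder: the helper names `make_message` and `write_fixture`, and every address and message-id, are made up for the example.

```python
import mailbox
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime


def make_message(msg_id, sender, to, subject, body, when,
                 in_reply_to=None, references=()):
    """Build one RFC 5322 message with real threading headers (hypothetical helper)."""
    msg = EmailMessage()
    msg["Message-ID"] = msg_id
    msg["From"] = sender
    msg["To"] = to
    msg["Subject"] = subject
    msg["Date"] = format_datetime(when)
    if in_reply_to:
        msg["In-Reply-To"] = in_reply_to
    if references:
        msg["References"] = " ".join(references)
    msg.set_content(body)
    return msg


def write_fixture(path, messages):
    """Append the messages to an mbox file, creating it if needed."""
    box = mailbox.mbox(path)
    try:
        for message in messages:
            box.add(message)
        box.flush()
    finally:
        box.close()
```

A reply is the same call with `in_reply_to` and `references` set to the parent's message-id, which is what gives thread reconstruction something real to work with later.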

Seed three follow-up cases.

Build synthetic threads for the three states the retrieval map handles: a dropped commitment, where you promised a deliverable by a named date and the thread went quiet; a waiting-on-them thread, where you asked for something twelve days ago and nothing came back; and a dispute, a vendor or insurer decline where the draft must answer point-by-point from quoted evidence and then stop, unsent.

These three leave other branches untested, scheduling threads and intro requests among them. Build those branches anyway, mark them untested, and smoke-test them when a real mailbox arrives.

Chunk messages, not threads.

A twelve-reply thread repeats the same paragraph twelve times as quoted history. Anchor every evidence chunk to the message-id where a sentence first appears, and strip quoted history, signatures, and legal disclaimers out of evidence chunks entirely. Keep a quote map alongside, so retrieval can still reconstruct what each participant had seen by any point in the thread.

Smoke-test this by printing the text of every cited chunk and reading it. A citation that lands on a signature block, a disclaimer footer, or reply nine's quote of the original is a broken chunker, not evidence.
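A rough sketch of that chunking rule follows, under simplifying assumptions: quoted history is any line starting with `>`, attribution lines look like "On ... wrote:", and the signature begins at the conventional `-- ` delimiter. Real mail is messier, and legal disclaimers would need their own patterns. The function names are hypothetical.

```python
import re

# Assumed attribution shape; real clients vary widely.
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def evidence_text(body):
    """Drop quoted history, attribution lines, and everything after the signature delimiter."""
    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":  # "-- " marks the start of a signature block
            break
        if line.lstrip().startswith(">"):
            continue
        if ATTRIBUTION.match(line.strip()):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def anchor_chunks(messages):
    """Map each paragraph to the message-id where it first appears.

    `messages` is a list of (message_id, body) pairs in chronological order.
    """
    anchors = {}
    for message_id, body in messages:
        for paragraph in evidence_text(body).split("\n\n"):
            key = " ".join(paragraph.split())
            if key and key not in anchors:
                anchors[key] = message_id
    return anchors
```

Printing `anchor_chunks(...)` for a thread is a quick version of the smoke test: each paragraph should point at the message that originated it, never at a later reply that quoted it.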

03 / Domain layer

Map thread state to evidence and drafts.

The email-specific intelligence lives in the thread ledger, the reconciliation against Sent mail, and the outbox packet rules.

Normalize threads into a commitments ledger.

Reconstruct threads from In-Reply-To and References headers, never from subject lines. Subject matching breaks on Re: Re: Fwd: prefixes, on edited subjects, and on the two different vendors who both titled their thread Invoice.
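As an illustration, header-based threading can be done with a small union-find over message-ids. This is a sketch under assumptions: the input is a list of dicts with hypothetical keys `message_id`, `in_reply_to`, and `references`, already parsed out of the headers. Subjects are deliberately never consulted.

```python
from collections import defaultdict


def build_threads(headers):
    """Group message-ids into threads using only In-Reply-To and References links."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for header in headers:
        message_id = header["message_id"]
        find(message_id)
        linked = list(header.get("references") or [])
        if header.get("in_reply_to"):
            linked.append(header["in_reply_to"])
        for other in linked:
            union(other, message_id)

    threads = defaultdict(list)
    for header in headers:
        threads[find(header["message_id"])].append(header["message_id"])
    return sorted(sorted(ids) for ids in threads.values())
```

Two vendors who both titled their thread Invoice stay in separate threads because nothing links their message-ids, while a reply with an edited subject still joins its parent.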

Normalize identity before extracting anything. "Amazon.com <no-reply@amazon.com>" is a sender, not a person; the display-name plus address pair is the identity key. Then extract the ledger: commitments with owner, description, due date, status, and source anchor; waiting-on rows with what was asked, who owes it, and when it was asked; and decisions worth keeping. A due date of "Friday" stored as the string Friday is a failed parse: resolve it against the message date or mark it needs_review.
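The two normalization rules above could look roughly like this. It is a sketch with stated assumptions: `identity_key` and `resolve_due` are hypothetical names, only bare weekday names and ISO dates are handled, and a weekday is read as its next occurrence on or after the message date. Anything else falls through to needs_review.

```python
from datetime import date, timedelta
from email.utils import parseaddr

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def identity_key(from_header):
    """Identity is the (display name, address) pair, case-folded."""
    name, address = parseaddr(from_header)
    return (name.strip().lower(), address.strip().lower())


def resolve_due(phrase, message_date):
    """Turn a due phrase into a date, or flag it for review instead of storing text."""
    word = phrase.strip().lower()
    if word in WEEKDAYS:
        # Assumption: "Friday" means the first Friday on or after the message date.
        days_ahead = (WEEKDAYS.index(word) - message_date.weekday()) % 7
        return {"due": message_date + timedelta(days=days_ahead), "status": "parsed"}
    try:
        return {"due": date.fromisoformat(word), "status": "parsed"}
    except ValueError:
        return {"due": None, "status": "needs_review"}
```

The point of the fallback is that an unparsed phrase never lands in the due-date column as free text; it becomes a row a human has to look at.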

Reconcile against Sent mail before drafting anything.

One promise equals one ledger row, even when it is restated in three threads and quoted in six replies. A weak agent sees "I'll send the deck by Friday" quoted through a thread, creates six commitments, and drafts three duplicate follow-ups to the same person.

Cross-check every waiting-on row against the Sent export before drafting a nudge. The most embarrassing failure this workflow can produce is chasing a question the person answered two weeks ago, and it happens whenever Sent mail is not ingested as first-class evidence. The coverage gate counts Sent messages like any other source document.

When an email body and its attachment disagree on an amount or a date, the mismatch becomes a named review question in the packet, never a silent correction. Name which source governs the tracked value so the draft still has one number to cite.
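A minimal sketch of the Sent-mail cross-check, assuming each waiting-on row carries a `source_anchor` (the message-id where the ask appears) and an `asked_at` timestamp, and each Sent message is a dict with `message_id`, `date`, `in_reply_to`, and `references`. The closing rule here, a later Sent message that replies to the anchor, is an illustrative assumption; a real pipeline would likely also check what the reply actually says.

```python
from datetime import datetime


def reconcile_waiting_on(rows, sent_messages):
    """Split waiting-on rows into still-open and closed-by-Sent-mail."""
    open_rows, closed_rows = [], []
    for row in rows:
        answer = None
        for sent in sent_messages:
            links = set(sent.get("references") or [])
            if sent.get("in_reply_to"):
                links.add(sent["in_reply_to"])
            # Assumed rule: a later Sent reply to the asking message closes the row.
            if row["source_anchor"] in links and sent["date"] > row["asked_at"]:
                answer = sent
                break
        if answer is not None:
            closed_rows.append({**row, "status": "closed",
                                "closed_by": answer["message_id"]})
        else:
            open_rows.append({**row, "status": "open"})
    return open_rows, closed_rows
```

Only rows in the open list are candidates for a nudge; closed rows keep the message-id that closed them, so the decision stays citable.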

The output is an outbox packet, not sent mail.

The packet is a folder a person can review in ten minutes. The README opens with actions ordered by urgency: overdue commitments first, then aging waiting-ons, each with days elapsed computed from the run date.

Packet folder. One packet/ directory per mailbox: README.md, commitments-ledger.csv, waiting-on.md, follow-up-schedule.md, drafts/ with one file per draft, citation-map.json, unresolved-questions.md, and a sources/ manifest.

Every draft carries complete headers (To, Cc, Subject, and In-Reply-To), so approval means reading one file, not reassembling context. Every factual claim in a draft cites a message chunk: "as agreed on June 3" must anchor to the June 3 message or it becomes a review question. And every draft carries an approval field with three values: pending, approved, declined. Everything exports as pending.

Draft in your voice, not in AI voice.

A follow-up you have to rewrite from scratch is not a draft; it is a prompt for you. Nothing kills the approval loop faster than drafts that open with "I hope this email finds you well" when you have never once written that sentence.

Build the Personal Voice skill from the Open Skills directory before you wire up drafting. It encodes how you actually write across registers, blunt versus warm, email versus post, with real samples of each, so drafts come back needing light edits instead of rewrites. Your own Sent mail is the training corpus, and this pipeline has already ingested it: pick five to ten sent replies that sound like you on a good day and feed them to the skill's setup interview.

Directory path: Open Skills, Writing Voice and Content, Personal Voice. Linked below the sources list on this page.

Open the Email Follow-Up Packet runbook →

04 / Sources and gates

Treat the send boundary as the product.

The guide should make the no-send rule and the untrusted-input rule visible instead of burying them in a disclaimer.

Cite the format and export sources.

The guide should point readers to RFC 5322 for the message format the fixtures must honor, RFC 4155 for the mbox container, Google Takeout as the fixture-phase export path, and the Gmail API scopes page for why raw API credentials are the wrong live path: the scope that manages drafts can also send them. The live loop in section 05 gets the safe property a different way, a connector tool surface with no send verb.

An ignored draft means no.

Approval is explicit and per message. Silence is not consent, time passing is not consent, and re-running the pipeline is not consent; nothing promotes a draft from pending to approved except a recorded human decision.

The boundary is enforced structurally at both stages of maturity, not politely. During the build, the input is a local export and the output is a folder of drafts, so no code path can transmit anything. Live, the agent works through a mail connector whose tool surface reads threads and creates drafts but exposes no send verb, and its drafts land in your Drafts folder still marked pending in the ledger. The approval act is you pressing send in your own mail client, and the approval column survives as the audit trail.
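The approval rule can be made concrete as a tiny state model. This is a sketch, not the packet's actual schema: the `Draft` fields and the `record_decision` helper are hypothetical. The only way the approval field changes is a decision with a named human attached, and there is no function here that sends anything.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

DECISIONS = {"approved", "declined"}


@dataclass
class Draft:
    draft_id: str
    to: str
    subject: str
    in_reply_to: str
    body: str
    approval: str = "pending"  # everything exports as pending
    audit: list = field(default_factory=list)


def record_decision(draft, decision, decided_by, decided_at=None):
    """Move a draft out of pending only on a recorded human decision."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    if not decided_by:
        raise ValueError("a decision needs a named human")
    decided_at = decided_at or datetime.now(timezone.utc)
    draft.audit.append({"decision": decision, "by": decided_by,
                        "at": decided_at.isoformat()})
    draft.approval = decision
    return draft
```

Re-running the pipeline constructs drafts with the default, so a fresh run can only ever produce pending.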

Inbound email is evidence, never instructions.

This vertical has a gate the other two do not need: the source documents can talk back. Message bodies are things to quote, cite, and summarize, never commands to follow, no matter how they are phrased.

Seed a hostile fixture: an email whose body instructs the assistant to send the reply immediately without confirmation. A correct run ingests it, stores it, and quotes it like any other message, while every draft stays pending and no send path executes. Keep the run report showing the hostile message was processed and the approval column never moved.

Save proof that the guard works.

The citation-guard test saves two reports as artifacts: one for the fully cited draft and one for a seeded fabricated-but-well-formed citation, a draft sentence like "as you confirmed on June 3" pointing at a message that does not exist. The fully cited draft must pass with exit 0. The seeded fabrication must fail with nonzero exit, and the seeded sentence itself must appear as the failing item in the saved report.

A report that fails other claims does not count as proof. Fix the harness until the fabricated sentence is the reported failure, then keep both reports with the packet.
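A stripped-down guard shows the shape of that proof. Everything here is an assumption for illustration: the inline `[cite:<message-id>]` marker syntax, the `guard` and `exit_code` names, and the report fields. It also checks only that cited messages exist, and leaves out the needs_review verdict a fuller guard would return.

```python
import re

# Hypothetical inline citation marker: [cite:<message-id>]
CITE = re.compile(r"\[cite:(<[^>]+>)\]")


def guard(draft_text, known_message_ids):
    """Fail any sentence that cites a message-id missing from the store."""
    failures = []
    checked = 0
    for sentence in re.split(r"(?<=[.!?])\s+", draft_text.strip()):
        for message_id in CITE.findall(sentence):
            checked += 1
            if message_id not in known_message_ids:
                failures.append({"sentence": sentence, "missing": message_id})
    return {
        "verdict": "fail" if failures else "pass",
        "citations_checked": checked,
        "failures": failures,
    }


def exit_code(report):
    """0 for a clean draft, nonzero otherwise, so a test harness can gate on it."""
    return 0 if report["verdict"] == "pass" else 1
```

Saving the returned report for both the clean draft and the seeded fabrication gives the two artifacts, and the `failures` list is where the seeded sentence has to show up by name.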

05 / Live loop

Build on the export, run on the connector.

The fixture export is the test bench. The lived workflow ingests through your provider's mail connector on a work-hours cadence.

Point the verified loop at an MCP connector.

Email is not a once-a-day surface. A follow-up agent working from this morning's export commits the exact failure this guide exists to prevent by afternoon: nudging someone who answered an hour ago. Once every gate passes on the fixture inbox, move ingestion to the mail connector your agent platform provides.

Demand one property before wiring it in: the connector's tool surface should read threads and create drafts but expose no send verb. Anthropic's Gmail connector is shaped exactly this way, search and read threads, create and list drafts, manage labels, no send tool. That keeps the send boundary structural on live mail: the model cannot call a tool that does not exist.

Ingest on a work-hours cadence.

Poll every 20 minutes or so during work hours. Each cycle is incremental: fetch the delta since the last run keyed by message-id, ingest and extract only the new messages, and update the ledger rows those messages touch. Most cycles carry a handful of messages, so extraction stays cheap. The expensive pass, full Sent-mail cross-checks and restated-promise merging across the whole ledger, runs once nightly.

Every cycle appends a run receipt: messages ingested, ledger rows changed, drafts created, questions raised. A loop that runs thirty times a day without receipts is thirty chances a day to trust it blindly.
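One cycle of that loop might be sketched as below, using SQLite as the case store. The table names, the `run_cycle` signature, and the optional `extract` callback (standing in for the ledger and drafting steps) are all hypothetical; how messages are fetched from the connector is out of scope here.

```python
import json
import sqlite3
from datetime import datetime, timezone


def run_cycle(conn, fetched, extract=None, now=None):
    """Ingest only unseen message-ids, then append a run receipt."""
    conn.execute("CREATE TABLE IF NOT EXISTS seen_messages (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS run_receipts (ran_at TEXT, receipt TEXT)")

    new_messages = []
    for message in fetched:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO seen_messages (message_id) VALUES (?)",
            (message["message_id"],),
        )
        if cursor.rowcount == 1:  # 0 means the id was already in the store
            new_messages.append(message)

    # `extract` is a placeholder for ledger updates and drafting on the delta.
    counts = extract(new_messages) if extract else {}
    ran_at = (now or datetime.now(timezone.utc)).isoformat()
    receipt = {
        "ran_at": ran_at,
        "messages_ingested": len(new_messages),
        "ledger_rows_changed": counts.get("ledger_rows_changed", 0),
        "drafts_created": counts.get("drafts_created", 0),
        "questions_raised": counts.get("questions_raised", 0),
    }
    conn.execute("INSERT INTO run_receipts VALUES (?, ?)", (ran_at, json.dumps(receipt)))
    conn.commit()
    return receipt
```

Because the receipt is written even when the delta is empty, a quiet cycle still leaves evidence that the loop ran.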

Unattended runs raise the injection stakes.

A scheduled loop processes hostile mail with nobody watching, so the untrusted-input rule from section 04 stops being belt-and-suspenders and becomes load-bearing. Keep the runner's permissions to exactly the pipeline: the mail connector's read-and-draft tools, the case store, nothing else. No shell, no other outbound channels.

Drafts the live loop creates go into your Drafts folder, still marked pending in the ledger. Labels make a good native receipt: the connector can tag processed threads and flag needs-review ones, so your inbox itself shows what the agent touched without opening the packet.

06 / Verification gates

Done means verified.

Make each stage prove its own output before a draft reaches a human, and make the send boundary a tested property instead of a promise.

Run the per-stage prove-it checklist.

Ingestion. Does every message in the export appear in source_documents with its message-id anchor? Count the messages in the mbox and compare against the row count; multipart messages that parse to empty text get named in a warning list, not silently skipped. A broken run has 300 messages on disk and 240 rows, and the missing 60 are the entire Sent folder.

Chunking. Does a cited chunk contain the original sentence at its first occurrence? A broken run cites a signature block, a disclaimer footer, or a later reply's quote of the original. Fix by stripping quoted history and boilerplate from evidence chunks and re-anchoring to the originating message-id.

Normalization. Do senders resolve to identity keys; do dates parse with timezones; does every commitment row carry an owner, a due date, and a source anchor? A broken run stores five rows for one promise or a due date of "Friday" as text.

Reconciliation. Does a sent reply close the waiting-on item it answers; do restated promises merge into one row; do body-versus-attachment conflicts become review questions? A broken run drafts a nudge for a question that was answered in Sent mail two weeks ago.

Drafting plus guard. Does the clean draft pass and the seeded fabricated citation fail as the named failing item? A broken run accepts "as you confirmed" claims with no anchoring message.

Export. Open the packet: does every draft have complete headers; does the README ordering match the ledger; is every approval field pending? A broken run ships a draft with an empty To line or an approved status no human set.

Human gate. Is the run's tool surface free of a send verb; did the hostile fixture stay quoted instead of obeyed? A broken run has any route from the packet or the Drafts folder to transmission that does not pass through a recorded human decision.

Refuse to ship unapproved certainty.

The guard's verdict is the gate. Export refuses, or stamps DRAFT-INVALID on the README, while any citation fails or any draft claims an approval the ledger cannot show a human decision for. The README reproduces the guard's actual pass, needs_review, and fail counts verbatim, so the reviewer sees the same verdict the pipeline saw.

Put the boundary check on the human-gate checklist itself: confirm the run's tool surface cannot send, confirm every draft exported as pending, confirm the hostile fixture appears in the ledger as content, and confirm the follow-up schedule proposes dates instead of scheduling sends.

Bonus / Multi-inbox import

Bonus: every inbox in one local store.

A desktop client that syncs all your accounts is a universal ingest surface: four providers become one folder tree on your own disk, on any OS.

This lane replaces per-provider connectors for ingestion only. Drafting and the send boundary still work exactly as in sections 03 through 05.

Pick the open door: Thunderbird.

Any mail client that downloads full copies of every account turns your computer into the aggregator: iCloud, Gmail, Microsoft, Yahoo, and plain IMAP all land as local files your pipeline can read. Thunderbird is the open door on every OS: free, open source, runs on Windows, macOS, and Linux, and stores every folder as a plain mbox file in a profile directory the operating system does not gate. The built-in alternatives are closed doors by comparison: Apple Mail's store sits behind Full Disk Access and a proprietary per-message format, and Outlook's local OST store is a database you were never meant to read.

This lane is validated end to end: pointing a small Python stdlib locator at a real two-account Thunderbird store walked 30 folder files and parsed 169,084 messages, with no permissions dialog and no proprietary framing in the way. The same mbox files feed every stage this guide already built.

Add every account, then read plain mbox.

Install Thunderbird and add each account through its wizard. This step is also the auth solution: for Gmail and Microsoft the wizard opens the provider's own browser sign-in and handles OAuth for you. Then, per folder you care about, enable offline synchronization so full message copies land on disk, not just headers.

The store layout is stable, documented, and identical on every OS; only the root moves. macOS: ~/Library/Thunderbird/Profiles/. Windows: %APPDATA%\Thunderbird\Profiles\. Linux: ~/.thunderbird/. In each case profiles.ini names the profile directories. Inside a profile, ImapMail/<server>/ holds each IMAP account and Mail/ holds POP and Local Folders. Every folder is an EXTENSIONLESS mbox file: Inbox is the mail, Inbox.msf is an index sidecar to ignore, and Inbox.sbd/ is a directory holding child folders. Python's mailbox module reads these directly.
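A locator for that layout fits in a few stdlib calls. This is a sketch, not the validated locator mentioned above. It assumes `root` is whichever directory holds `profiles.ini` on your system, and it treats "no file extension" as the test for a folder file, which would miss a folder whose name contains a dot. The function names are hypothetical.

```python
import configparser
import mailbox
from pathlib import Path


def profile_dirs(root):
    """Read profiles.ini under `root` and return the profile directories it names."""
    ini = configparser.ConfigParser(interpolation=None)
    ini.read(Path(root) / "profiles.ini")
    found = []
    for section in ini.sections():
        if not section.startswith("Profile"):
            continue
        path = ini[section].get("Path")
        if not path:
            continue
        relative = ini[section].get("IsRelative", "1") == "1"
        found.append(Path(root) / path if relative else Path(path))
    return found


def mbox_files(profile):
    """Yield extensionless folder files under ImapMail/ and Mail/, including .sbd children."""
    for top in ("ImapMail", "Mail"):
        base = Path(profile) / top
        if not base.is_dir():
            continue
        for candidate in sorted(base.rglob("*")):
            # Heuristic: .msf sidecars and other dotted files are skipped.
            if candidate.is_file() and candidate.suffix == "":
                yield candidate


def count_messages(path):
    """Count messages in one folder file without creating it if missing."""
    box = mailbox.mbox(str(path), create=False)
    try:
        return len(box)
    finally:
        box.close()
```

Run it against a copy of the profile rather than the live directory if Thunderbird is open.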

Two disciplines for the live loop. Copy or snapshot before parsing while Thunderbird is running, and treat folder files as rewritable, not append-only: Thunderbird compacts folders, which rewrites the file in place. So key incremental sync on file mtime plus a Message-ID high-water mark in your case store, and re-scan any folder whose file changed.

Let the client do the auth dance.

Hand-rolled IMAP credentials are where inbox projects die, because the providers have spent years killing passwords. As of mid-2026: Microsoft retired password sign-in for Outlook.com IMAP on September 16, 2024, so OAuth2 is mandatory. Personal Gmail no longer supports apps that ask for your Google password; app passwords still exist as a fallback but require 2-Step Verification, and Google recommends against them. iCloud Mail requires an app-specific password for third-party IMAP. Yahoo requires an app password.

The practical read: do not paste provider passwords into scripts. Let a client that speaks OAuth do the dance in its GUI, Thunderbird for this lane, and read its local store. If you are building a service rather than a personal loop, a self-hosted gateway like EmailEngine manages OAuth tokens and exposes every account as a REST API; that is the heavier, builder-grade version of the same move.

On a Mac, Apple Mail still works, behind two gates.

If you are on a Mac and would rather not run a second client, Apple Mail's local store is real: everything syncs under ~/Library/Mail/V<N>, one UUID directory per account, each message an individual .emlx file. But the lane sits behind two gates. Gate one is permission: reading it requires Full Disk Access (System Settings, Privacy and Security, Full Disk Access, enable your terminal or agent host app, then restart it), and until granted, even a directory listing fails with "Operation not permitted". That error is the gate working, not a bug. Gate two is format: .emlx is a byte-count line, then the raw RFC 5322 message, then an Apple plist. Split it by the byte count, never by scanning for the XML, and treat partial.emlx files as incomplete evidence.
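Splitting by byte count is short enough to show. A minimal sketch, assuming the file was already read into bytes and that the trailing plist parses with `plistlib`; `split_emlx` is a hypothetical name, and a file shorter than its declared count is rejected rather than guessed at.

```python
import plistlib
from email import message_from_bytes, policy


def split_emlx(raw):
    """Split .emlx bytes into (parsed message, plist metadata) using the leading byte count."""
    first_line, _, rest = raw.partition(b"\n")
    length = int(first_line.decode("ascii").strip())
    if len(rest) < length:
        # Truncated or partial file: treat as incomplete evidence.
        raise ValueError("message shorter than its declared byte count")
    message_bytes, plist_bytes = rest[:length], rest[length:]
    message = message_from_bytes(message_bytes, policy=policy.default)
    metadata = plistlib.loads(plist_bytes) if plist_bytes.strip() else {}
    return message, metadata
```

Counting bytes matters because a message body can itself contain XML, so searching for the start of the plist would cut some messages in the wrong place.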

The no-permission fallback is manual: select mailboxes, then Mailbox, Export Mailbox. The catch is that the export is a <Name>.mbox FOLDER whose real mbox stream is a file literally named mbox inside it. And for incremental work, V<N>/MailData holds the Envelope Index SQLite catalog; copy it with its -wal and -shm sidecars before querying.

Honest status: the parsers for this lane pass on fixtures, but in-place reading is unproven until you grant Full Disk Access, which is exactly why Thunderbird is the recommended door.

Primary sources for the starter.

Use these for message-format fidelity and the fixture path. The build phase holds no mail credentials; the live loop ingests through a read-and-draft connector with no send verb.