Feb 24, 2026

Lessons from Reverse Engineering Excel AI Agents

Nicolas Bustamante (@nicbstme)


How different are these systems really? How much do they diverge from just giving a general purpose agent access to a spreadsheet?

Excel AI agents are everywhere right now. Every major platform is shipping one, and the demos all look the same: type a prompt, get a spreadsheet. I kept seeing new ones pop up and started wondering: is there actually a tech edge here, or is this going to be a pure distribution play where the biggest platform wins regardless of architecture?

I reverse engineered three production Excel AI agents: Claude in Excel (Anthropic), Microsoft's Copilot Excel Agent, and Shortcut AI.

I dug into their tool schemas, stress tested their error handling, mapped their verification loops, and pushed each one to its limits. It was a lot of fun!

I learned a lot. Claude's tool design is genuinely impressive. Shortcut has ambitious scaffolding, some of it quite clever. Microsoft's Copilot is not where it should be yet, but the "yet" matters more than the current state. I think Microsoft is gonna win ultimately. They are moving fast and have a lot of distribution.

Interestingly, these agents are not wrappers around an LLM. They are tool-calling agents with structured schemas, Python sandboxes, overwrite protection protocols, and carefully designed verification loops. The differences between them reveal fundamental tradeoffs in agent design that apply to every AI agent being built today, not just Excel.

Here's what I'll cover:

- The Three Architectures - 14 structured tools vs. 2 raw tools vs. 11 tools with a helper API

- How They See Your Spreadsheet - Lazy loading vs. eager loading, and why it matters more than the model

- The Overwrite Protection Spectrum - Tool-enforced vs. behavioral, and only one gets it right

- The Two-Tier Tool Hierarchy - Why every agent needs a safe path and an escape hatch

- The Blind Agents Problem - Two out of three can't see your spreadsheet

- The Python Sandbox Bridge - Two isolated worlds connected by one agent

- The Bloomberg Formula Trick - Writing formulas for add-ins you don't control

- The Self-Verification Loop - How each agent checks its own work

- Memory, Simulation, and What's Broken - Features that ship before they work

- The DCF Test - Same prompt, three agents, three very different models

- What This Means for Agent Design - Five questions every agent builder must answer

The Three Architectures

The first surprise: the model matters less than the tools.

All three agents I tested use frontier models. Claude in Excel runs Claude and is the only one locked to a single model provider. Microsoft's Copilot Excel Agent routes between Claude and GPT. Shortcut AI uses a mix of Anthropic and OpenAI models with routing abstracted away. Smart move from Microsoft and Shortcut to stay model agnostic: the model layer is commoditizing fast, and the ability to swap providers without rearchitecting is a real advantage.

What's not equivalent is the tool architecture. And that's where the real differences live.

Let me show you what each agent actually has access to.

Claude in Excel: 14 tools (11 spreadsheet + 3 non-spreadsheet)

Claude has the most opinionated tool design. Each operation gets its own tool with a specific schema:

Each tool has a typed schema. `set_cell_range` takes a `cells` parameter: a 2D array where each cell object can contain `value`, `formula`, `note`, `cellStyles`, and `borderStyles`. Plus `allow_overwrite`, `explanation` (shown in the UI), `copyToRange` (for pattern expansion with formula translation), and `resizeHeight`/`resizeWidth`. The tool validates every parameter before executing. If something is wrong, it returns a structured error, not a JavaScript stack trace.
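To make concrete what tool-level validation buys you, here is a minimal sketch of a `set_cell_range`-style validator. The parameter names (`cells`, `allow_overwrite`) come from the article; the validation logic itself is invented for illustration:

```javascript
// Hypothetical validator for a set_cell_range-style tool (illustrative only).
// Parameter names follow the article; the checks are an invented sketch.
function validateSetCellRange(params) {
  const errors = [];
  if (!Array.isArray(params.cells) || !params.cells.every(Array.isArray)) {
    errors.push("cells must be a 2D array of cell objects");
  }
  if (params.allow_overwrite !== undefined &&
      typeof params.allow_overwrite !== "boolean") {
    errors.push("allow_overwrite must be a boolean");
  }
  // Structured result: the agent gets a named error, not a stack trace.
  return errors.length === 0 ? { ok: true } : { ok: false, errors };
}

console.log(validateSetCellRange({ cells: "A1" }).errors[0]);
// → "cells must be a 2D array of cell objects"
```

The point is the failure mode: bad input is rejected *before* execution, with a message the model can act on.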

Microsoft's Copilot Excel Agent: 2 tools, raw power

Microsoft took a radically different approach. Two tools total.

Every spreadsheet operation funnels through one generic tool that generates and executes raw Office.js. Write a value? Generate Office.js. Create a chart? Generate Office.js. Format cells? Generate Office.js. The tool schema itself is remarkably minimal: a single `program` parameter of type string. No timeout, no error handling mode, no return format options. The code string is the entire interface.

The system prompt does impose structure: a progressive pattern of load initial state, apply changes, verify results, return confirmation. But within that template, the actual operations are freeform JavaScript.

This makes Copilot the most token efficient architecture of the three for simple tasks. A single tool call can pack an entire section of a financial model into one `Excel.run` block: headers, values, formulas, formatting, all in one shot. Fewer LLM round trips, lower latency. But it comes at a real cost.

The problem with running everything through raw code generation is threefold. First, the agent has to produce syntactically correct Office.js every single time. No typed parameters, no schema validation, no structured error messages. When something fails, you get a JavaScript stack trace instead of a clear explanation of which parameter was wrong. Second, there is no way to enforce safety at the tool level. Overwrite protection, input validation, range checking: all of that has to live in the generated code itself or in the system prompt, neither of which is reliable across millions of agent sessions. Third, debugging is harder. A 40-line Office.js script that fails on line 32 gives you far less signal than a structured tool call that rejects bad input before execution.

That said, the architecture is straightforward to improve. There is nothing fundamentally complex about adding structured tools, parameter validation, or tool-level safety on top of the existing Office.js execution layer. Microsoft owns both Excel and the Office.js API surface. That is a massive platform advantage, and closing the gap is engineering work, not a research problem.

Shortcut AI: 11 tools, one generic + rich helpers

Shortcut sits in the middle, but with an interesting twist. They also have one generic execution tool for spreadsheets, but they've built a rich helper API on top of it. Plus ten support tools for everything else.

The key architectural insight: where Claude gives each spreadsheet operation its own structured tool, Shortcut has 1 generic `execute_code` tool with a rich TypeScript API layered on top. It's architecturally closer to Microsoft's raw Office.js approach, but with much better developer ergonomics.

Shortcut also has the best UX of the three. Plan mode breaks complex requests into structured steps before executing. Queries are queued, so you can fire off multiple requests without waiting. And follow-up interactions present structured options that let you steer without re-prompting. Small details individually, but in practice they make multi-step workflows noticeably smoother than the other two.

Here's what the same operation looks like across all three:

WRITING "Revenue" TO CELL A1

Claude:
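The original snippet isn't reproduced here; a plausible reconstruction of Claude's structured call, based on the parameters described earlier, would look like:

```javascript
// Plausible shape of Claude's set_cell_range call (a reconstruction, not
// the actual wire format): one typed tool call, validated before execution.
const claudeToolCall = {
  tool: "set_cell_range",
  input: {
    range: "A1",
    cells: [[{ value: "Revenue" }]],      // 2D array of cell objects
    allow_overwrite: false,               // default-deny overwrite protection
    explanation: "Add the Revenue header" // shown in the UI
  }
};
console.log(claudeToolCall.input.cells[0][0].value); // → "Revenue"
```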

Microsoft:
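For Copilot, the single `program` parameter would carry a raw Office.js script as one string (illustrative; the exact generated code varies per request):

```javascript
// Microsoft's entire interface is one string of generated Office.js.
// This is an illustrative sketch of what that string might contain.
const program = `
await Excel.run(async (context) => {
  const sheet = context.workbook.worksheets.getActiveWorksheet();
  sheet.getRange("A1").values = [["Revenue"]];
  await context.sync();
});`;
console.log(program.includes("Revenue")); // → true
```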

Shortcut:
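And for Shortcut, a hypothetical `execute_code` body using the helper API. `sheet.setCell` is named later in the article; the surrounding shape is assumed:

```javascript
// Hypothetical body of Shortcut's execute_code tool using its helper API.
// sheet.setCell is from the article; everything around it is assumed.
const executeCodeBody = `
const sheet = workbook.getActiveSheet();
sheet.setCell("A1", { value: "Revenue" });
`;
console.log(executeCodeBody.includes("setCell")); // → true
```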

Three ways to write "Revenue" to a cell. The difference isn't the result. It's what happens when something goes wrong, and how many tokens it costs to get there.

Microsoft's approach is the most token efficient for simple operations: one tool call can write an entire section of a financial model. Claude's structured tools add overhead per operation but provide validation and automatic verification at every step, which pays off in complex models where a single wrong cell reference cascades across 50 formulas. Shortcut sits in between: cleaner developer ergonomics than raw Office.js, but without Claude's tool-level safety. For a quick formatting task, Copilot wins on speed. For a 500-row DCF with cross-sheet references, Claude's architecture is better suited to catch errors before they compound.

How They See Your Spreadsheet

Most people assume Excel AI agents can see your spreadsheet. They can't.

When you open Claude in Excel and type "analyze this data," Claude doesn't receive a snapshot of your workbook. It receives a tiny metadata summary. Sheet names, dimensions, which sheet is active, what cell you have selected. That's it. No values. No formulas. No formatting.

WHAT CLAUDE ACTUALLY RECEIVES PER MESSAGE

This is lazy loading. Claude knows a sheet called "DCF" exists with 45 rows and 12 columns, but it has no idea what's in any of those cells until it explicitly calls `get_cell_ranges` or `get_range_as_csv`.
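A sketch of the kind of metadata summary described above (field names are illustrative, not the actual payload):

```javascript
// Illustrative lazy-loading context payload: metadata only, no cell contents.
// Field names are invented; the shape mirrors what the article describes.
const claudeContext = {
  sheets: [
    { name: "DCF", rows: 45, cols: 12, active: true },
    { name: "Assumptions", rows: 20, cols: 4, active: false }
  ],
  selection: "B5"
  // No values, no formulas, no formatting. The agent must call
  // get_cell_ranges / get_range_as_csv to see any of that.
};
console.log(JSON.stringify(claudeContext).length); // tiny: well under 1 KB
```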

Microsoft's agent does the opposite. It gets eager loaded with a preview of actual values from the used ranges:

WHAT MICROSOFT'S AGENT RECEIVES PER MESSAGE
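An illustrative eager-loading payload, by contrast, front-loads actual values from the used range (structure assumed, not the real wire format; the revenue figures are the FY2024/FY2025 numbers cited later in the article):

```javascript
// Illustrative eager-loading payload: values are included up front.
// The shape is assumed; revenue figures ($391B / $416B) are from the article.
const copilotContext = {
  activeSheet: "DCF",
  usedRange: "A1:L45",
  preview: [
    ["", "FY2024", "FY2025"],
    ["Revenue", 391000, 416000]
    // ...and so on for every row of the used range.
  ]
};
console.log(copilotContext.preview[1][0]); // → "Revenue"
```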

Shortcut uses lazy loading like Claude: sheet names and used ranges, no values upfront.

The tradeoff is real:

For a quick "what's in cell B5?" question, eager loading wins. The agent already has the value. For a complex financial model with 50 sheets and 100K rows, lazy loading wins. You don't want 100K cells of context when the agent only needs to read 200 of them.

This matters more than most people realize. Context isn't free. Every token costs money, adds latency, and past a certain point, actually makes the model dumber. I wrote about this in detail in my article on the LLM Context Tax. Lazy loading is a context management strategy disguised as a technical implementation detail.

The Overwrite Protection Spectrum

This is the most important design decision in the entire architecture, and the one that separates good agent design from bad.

The question is simple: what happens when an AI agent tries to write data into cells that already contain data?

All three agents answer this question differently. The differences reveal fundamentally different philosophies about where safety should live in an AI system.

OVERWRITE PROTECTION: THREE PHILOSOPHIES

Let me walk through Claude's protocol because it's the most sophisticated. When Claude calls `set_cell_range` to write to a range that already contains data:

Step 1: Claude calls `set_cell_range` with `allow_overwrite: false` (the default).

Step 2: The tool detects existing data. Instead of writing, it returns an error: "These cells contain data: A1='Revenue', A2=1500000, A3='=A2*1.1'. Cannot overwrite without permission."

Step 3: Claude reads the error, sees what's there, and presents it to the user: "This range already contains a revenue projection. The values are [shows values]. Do you want me to overwrite them?"

Step 4: User approves.

Step 5: Claude retries with `allow_overwrite: true`. Now the write succeeds.

The critical insight: the blocking is in the tool, the consent is in the prompt. When `allow_overwrite` is omitted, the API physically refuses the write. But the tool doesn't check whether user consent was actually obtained. If Claude passed `allow_overwrite: true` without asking, the write would succeed. There's also an exception path: if the user's original request uses explicit overwrite language ("replace", "overwrite", "change the existing value"), Claude can skip the consent flow. Still, the default-deny at the API level means accidental overwrites require an active decision, not an accidental omission. This is the difference between a guardrail and a suggestion.
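The five-step protocol can be sketched as tool-side logic. This is a minimal reconstruction against a toy grid, not Anthropic's implementation:

```javascript
// Minimal sketch of tool-enforced overwrite protection (toy grid model).
const grid = { A1: "Revenue", A2: 1500000 };

function setCellRange(addresses, values, allowOverwrite = false) {
  const occupied = addresses.filter((a) => grid[a] !== undefined);
  if (occupied.length > 0 && !allowOverwrite) {
    // Default-deny: the API refuses the write and reports what is there,
    // so the agent can surface it to the user and ask for consent.
    return {
      ok: false,
      error: `These cells contain data: ${occupied
        .map((a) => `${a}=${JSON.stringify(grid[a])}`)
        .join(", ")}. Cannot overwrite without permission.`
    };
  }
  addresses.forEach((a, i) => { grid[a] = values[i]; });
  return { ok: true };
}

console.log(setCellRange(["A1"], ["Sales"]).ok);       // false: blocked
console.log(setCellRange(["A1"], ["Sales"], true).ok); // true: user approved
```

Note what the tool does *not* check: whether consent was actually obtained. That part still lives in the prompt, exactly as the article describes.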

Microsoft's agent currently has no overwrite protection. I asked it directly: "When you write to cells that already have data, what happens?" The answer: "I just overwrite. There's no blocking mechanism or confirmation dialog."

Shortcut's system prompt says "Do not overwrite existing data, especially existing user-filled data unless explicitly requested by the user." But the API itself will happily overwrite anything. The protection exists only in the model's compliance with a text instruction.

If you're building any AI agent that modifies user data, this is the lesson: behavioral safety fails. Models skip instructions. They hallucinate. They get confused in long conversations. The only reliable safety is structural safety, baked into the tool interface itself.

The Two-Tier Tool Hierarchy

Every Excel agent I tested has some version of the same pattern: a safe path for common operations and an escape hatch for everything else.

Claude's system prompt has an explicit hierarchy: always prefer structured tools, only escalate to `execute_office_js` when the structured tools can't do it. Conditional formatting, data validation, sorting, filtering, named ranges, those go through the escape hatch because there aren't structured tools for them yet.

Shortcut's version is subtler. They have one `execute_code` tool, but the helper API inside it creates a de facto two-tier system. `sheet.setCell()` and `sheet.addChart()` are the safe path: clean, validated, well-documented in the system prompt. Raw Office.js calls inside `execute_code` are the escape hatch. One subtle difference: raw Office.js errors include a `.code` property (e.g., `"ItemNotFound"`) that the helper API strips away, making the escape hatch slightly more debuggable for experienced developers but less user-friendly.

Microsoft operates entirely in escape hatch mode. There is no structured safe path yet. Every operation, from writing "hello" to cell A1 to building a 20-chart dashboard, goes through the same raw Office.js tool. The agent generates a full JavaScript program every single time.

This pattern isn't unique to Excel agents. It appears everywhere in agent design:

THE TWO-TIER PATTERN ACROSS AI AGENTS

The principle: constrained tools for common operations (with validation, safety checks, and structured responses). Raw power for edge cases (when the structured tools can't do it). The ratio matters. Too many structured tools and the agent gets slow, making many small calls. Too few and you lose the guardrails. Claude's 10 structured spreadsheet tools + 1 escape hatch is a deliberate bet that guardrails are worth the extra tool calls.
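One way to picture the two-tier pattern is as dispatch logic (entirely hypothetical, not any vendor's code; the tool names echo the article):

```javascript
// Hypothetical two-tier dispatcher: structured tools first, escape hatch last.
const structuredTools = new Set([
  "set_cell_range", "get_cell_ranges", "create_chart" // the safe path
]);

function routeOperation(op) {
  if (structuredTools.has(op)) {
    return { tier: "structured", validated: true };       // guardrails apply
  }
  // Conditional formatting, data validation, sorting, etc. fall through.
  return { tier: "execute_office_js", validated: false }; // raw power
}

console.log(routeOperation("set_cell_range").tier);     // "structured"
console.log(routeOperation("conditional_format").tier); // "execute_office_js"
```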

The Blind Agents Problem

I asked each agent: "Can you see what the spreadsheet looks like? The formatting, the colors, the chart layouts?"

Claude: "No. I work entirely from structured data representations. I cannot see colors, visual layouts, or the rendered appearance of charts."

Microsoft's agent: "No. I can't see images or compare visually."

Shortcut AI: "Yes."

Wait, what?

Shortcut has a `take_screenshot` tool that captures actual pixels from the workbook and sends them to a vision LLM (either gemini-3-flash-preview or claude-haiku-4-5-20251001, depending on routing). It can see formatting, colors, chart layouts, cell alignment, visual anomalies.

THE VISION GAP

Think about what blindness means in practice. Claude can tell you a cell has font color #0000FF (blue). It can't tell you the blue is nearly invisible against a dark background. It can create a chart with correct data. It can't tell you the chart is overlapping a table. It can apply formatting rules perfectly. It can't tell you the result looks wrong.

When I asked both Claude and Microsoft's agent what their #1 desired improvement would be, both gave the same answer: visual feedback. They know they're blind. Shortcut's agent said its #1 wish is even more vision, the ability to look at the workbook after every action, not just on demand.

This is going to be a defining capability for the next generation of agents. Not just Excel agents. Any agent that modifies visual output (websites, documents, presentations, dashboards) needs to see the result. Structured data tells you what's there. Vision tells you if it's right.

The Python Sandbox Bridge

Both Claude and Shortcut have Python sandboxes. Microsoft doesn't. This matters because some operations are dramatically easier in Python than in Excel formulas. Statistical analysis. PDF parsing. Data transformation. Complex calculations.

But here's the architectural challenge: the Python sandbox and the Excel environment are completely isolated from each other. They're two separate worlds.

THE TWO WORLDS

Shortcut has a clean bridge mechanism using a `store` variable:

1. `execute_code` (TypeScript) reads data from Excel, sets `store.data = someArray`

2. `bash_command` (Python) reads it via `from_store: {"data": "input.json"}` which materializes the data as a file at `/workspace/input.json`

3. Python processes it, writes output via `to_store: "result"` (saves stdout to store)

4. `execute_code` reads `store.result` and writes it back to Excel
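The four-step bridge above can be simulated end to end. The `store` object and the `/workspace/input.json` path are from the article; everything else stands in for the real sandbox plumbing:

```javascript
// Simulated store bridge between the spreadsheet world and the Python
// sandbox. `store` and the /workspace/input.json path are from the article;
// the rest is a stand-in for the real plumbing.
const store = {};     // shared key-value store between the two worlds
const workspace = {}; // stand-in for the sandbox filesystem

// 1. execute_code (TypeScript side) reads data from Excel into the store.
store.data = [120, 80, 100];

// 2. from_store materializes the data as /workspace/input.json for Python.
workspace["/workspace/input.json"] = JSON.stringify(store.data);

// 3. "Python" processes the file and writes its output back via to_store.
const input = JSON.parse(workspace["/workspace/input.json"]);
store.result = input.reduce((a, b) => a + b, 0) / input.length; // mean

// 4. execute_code reads store.result and writes it back to Excel.
console.log(store.result); // → 100
```

The agent is the only courier: neither side can reach the other directly.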

Claude does the same thing but less formally. The agent reads data using structured tools, passes it to the Python sandbox via `bash_code_execution`, gets the result, and writes it back using structured tools. Claude is the intermediary shuttling data between two worlds that can't talk to each other.

Microsoft has no sandbox yet. No Python, no pandas, no PDF parsing. If you ask it to process a PDF and put the data into Excel, it can't. If you ask it to run a statistical analysis more complex than what Excel formulas support, it can't. The sandbox is the single biggest capability gap between Microsoft and the other two.

An interesting detail: Shortcut's sandbox runs on GCS (Google Cloud Storage) infrastructure. The filesystem at `/workspace/`, `/skills/`, and `/memories/` is cloud-mounted, not local. This means no append operations. You can't use `>>` or `tee -a`. You have to read the entire file, modify it in memory, and overwrite. A small quirk that reveals the infrastructure underneath.

The Bloomberg Formula Trick

One of the cleverest patterns I found is how Claude handles financial data terminals. Bloomberg, FactSet, Capital IQ, and Refinitiv all have Excel add-ins that resolve proprietary formulas. Claude can't access these terminals directly. But it can write formulas that the user's own add-in will resolve.

The trick: Claude writes `=BDP("AAPL US Equity", "PX_LAST")` into a cell. If the user has Bloomberg Terminal installed, the Bloomberg add-in resolves the formula and fills in Apple's last price. Claude doesn't need Bloomberg access.

It just needs to know the formula syntax. Smart!

If the formula errors out (because the user doesn't have Bloomberg), Claude automatically falls back to web search. The source hierarchy is strict and explicit in the system prompt: only company IR pages, official press releases, SEC filings, and regulatory disclosures are approved. Yahoo Finance, Seeking Alpha, Macrotrends, Wikipedia, and "any website that is not the company itself or a regulatory filing system" are rejected by default. Claude's system prompt also contains detailed formula documentation for Bloomberg (`=BDP`, `=BDH`, `=BDS`), FactSet (`=FDS`, `=FDSH`), Capital IQ (`=CIQ`, `=CIQH`), and Refinitiv (`=TR`), including field codes, usage limits, and syntax rules.
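The write-then-fall-back pattern might look like this. The `=BDP` syntax is from the article; the cell model and the fallback trigger are assumed:

```javascript
// Sketch of the write-then-check pattern for terminal formulas. The =BDP
// syntax is from the article; the cell model and fallback are assumed.
function writeTerminalFormula(cell, readBack) {
  cell.formula = '=BDP("AAPL US Equity", "PX_LAST")';
  const computed = readBack(cell); // what Excel shows after the add-in runs
  if (computed === "#NAME?") {
    // No Bloomberg add-in installed: fall back to approved web sources
    // (IR pages, SEC filings) instead of leaving a broken formula behind.
    return { source: "web_search_fallback" };
  }
  return { source: "bloomberg_addin", value: computed };
}

const cell = {};
console.log(writeTerminalFormula(cell, () => "#NAME?").source);
// → "web_search_fallback"
```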

Shortcut supports 18+ financial data terminal syntaxes, each stored in dedicated skill files at `/skills/default/integrations/`: Bloomberg, FactSet, Capital IQ, Refinitiv, Morningstar, PitchBook, Preqin, AlphaSense/Tegus, Nasdaq Data Link, QuickFS, Calcbench, Visible Alpha, Moody's, ICE, and more. Microsoft's agent can also write Bloomberg formulas through raw Office.js, but without any of Claude's formula documentation or Shortcut's dedicated skill files. It can write `=BDP(...)` as a string to a cell, but it can't verify the add-in is installed, can't authenticate, and has no structured knowledge of terminal-specific syntax or field codes.

This pattern is more broadly interesting than it seems. It's an agent operating in an environment with other tools it can't directly control, writing instructions (formulas) that another system (Bloomberg) will execute. The agent is essentially programming another agent through the shared medium of the spreadsheet. We're going to see a lot more of this as agents start operating in environments populated by other agents.

The Self-Verification Loop

Financial models can't be wrong. A portfolio manager at a $2B fund who gets a wrong number from your AI agent doesn't come back. This makes verification architecture critical.

Each agent handles verification differently, and the differences reveal a lot about design maturity.

Claude: Verification baked into the tool response

When Claude writes formulas using `set_cell_range`, the tool automatically reads back the computed values and returns them as `formula_results`. Claude doesn't have to do anything extra. The verification data is right there in the response.

CLAUDE'S VERIFICATION (ZERO EXTRA COST)

The system prompt instructs Claude to check `formula_results` for errors like `#VALUE!`, `#REF!`, `#NAME?`, `#DIV/0!`, and unexpected values. This happens in the same round-trip as the write. No extra tool call. No extra latency. No extra tokens.
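The check itself is simple to sketch. The error list is from the article; the `formula_results` response shape is assumed:

```javascript
// Sketch of scanning formula_results for Excel error values. The error
// list is from the article; the response shape is assumed.
const EXCEL_ERRORS = ["#VALUE!", "#REF!", "#NAME?", "#DIV/0!"];

function findFormulaErrors(toolResponse) {
  // Hypothetical response: the tool echoes back computed values per cell.
  return Object.entries(toolResponse.formula_results)
    .filter(([, v]) => EXCEL_ERRORS.includes(v))
    .map(([addr, v]) => `${addr}: ${v}`);
}

const response = { formula_results: { B2: 1650000, B3: "#DIV/0!" } };
console.log(findFormulaErrors(response)); // → [ "B3: #DIV/0!" ]
```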

Shortcut: Manual verification cycle

Shortcut has to do it explicitly: write values, call `workbook.calculate()` to force Excel to recompute, then read back the values with a separate tool call to verify.

SHORTCUT'S VERIFICATION (MANUAL BUT BATCHABLE)

One to two tool calls, not three. The write, calculate, and read can share a single `execute_code` block since Shortcut's generic execution tool allows arbitrary code composition. Still manual and still requires the agent to choose to verify, but more efficient than separate tool calls.
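A hypothetical `execute_code` body batching the whole cycle in one call. `workbook.calculate()` and `sheet.setCell` are named in the article; the rest of the helper API shape is assumed:

```javascript
// Hypothetical Shortcut execute_code body: write -> calculate -> read-back
// in one batched call. workbook.calculate() is from the article; the rest
// of the helper API shape is assumed.
const verifyBlock = `
sheet.setCell("B2", { formula: "=B1*1.1" });
workbook.calculate();                    // force Excel to recompute
const check = sheet.getCell("B2").value; // read back the computed value
if (typeof check !== "number") {
  throw new Error("Formula did not evaluate: " + check);
}
`;
console.log(verifyBlock.includes("workbook.calculate()")); // → true
```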

Microsoft: Verify if you remember to!

Microsoft's agent can verify within the same tool call. The system prompt instructs a progressive pattern: load initial state, apply changes, verify results, return confirmation, all inside one `Excel.run` block. A write followed by `range.load('values'); await context.sync();` gives you the computed results without a separate tool call.
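The same-call read-back would look roughly like this in generated Office.js (an illustrative sketch of the pattern the system prompt asks for, not captured agent output):

```javascript
// Illustrative Office.js read-back inside a single Excel.run block: the
// progressive pattern the system prompt asks for, sketched by hand.
const officeJsProgram = `
await Excel.run(async (context) => {
  const range = context.workbook.worksheets
    .getActiveWorksheet().getRange("B2");
  range.formulas = [["=B1*1.1"]]; // apply changes
  range.load("values");           // verify results...
  await context.sync();           // ...in the same round trip
  return range.values;            // return confirmation
});`;
console.log(officeJsProgram.includes("context.sync()")); // → true
```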

In practice, though, verification is optional and behavioral. The agent admitted it doesn't always verify: simple operations often skip the read-back step, and there's an implicit efficiency pressure since longer code means more potential for bugs in the code itself. With aggressive batching, verification often gets deprioritized in favor of speed.

VERIFICATION APPROACHES COMPARED

Claude's approach is the most elegant because it makes the right thing the easy thing. The agent doesn't have to remember to verify. Verification happens automatically. This is the same principle as the overwrite protection: if a safety-critical behavior depends on the model choosing to do it, eventually it won't.

Memory, Simulation, and What's Broken

Shortcut AI is the most ambitious of the three agents. It ships features that Claude and Microsoft don't have. Some of them work. Some of them don't.

Persistent Memory: Real, but primitive

Both Claude and Microsoft's agent have zero memory across sessions. Close the chat, everything is gone.

Shortcut has persistent memory. It's real. I verified it. But it's also the simplest possible implementation: writing to files.

SHORTCUT'S MEMORY ARCHITECTURE

/memories/
├── MEMORY.md                      Curated preferences and context
├── 2026-02-21.md                  Daily session notes
└── traces/
    └── {workbook_id}/
        └── {date}/
            └── {thread_id}.json   Raw conversation trace

Three layers. `MEMORY.md` is curated, things the agent decides are worth remembering. Daily notes are automatic. Traces are raw JSON dumps of every conversation.

The system prompt instruction is beautifully pragmatic: "Your context is limited. If you want to remember something, WRITE IT TO A FILE." No fancy vector database. No embeddings. No semantic search. Just files. It works because the agent can read them back, and it's honest about the limitation it's solving.

Simulation: Real, and impressive

Shortcut can create a copy of your workbook on a headless server and run operations against it without touching your real data. The tool is `prepare_headless_session`. It creates a snapshot, gives a subagent read/write access to the copy, and the subagent can experiment freely.

I verified this with a live demo. The simulation agent wrote data to cell C1 on the copy. The real workbook was unaffected. This is genuinely useful for complex operations where you want to test before committing.

Neither Claude nor Microsoft has anything like this, so kudos to Shortcut!

Checkpoint/Restore: Ships, but broken

Shortcut claims checkpoint/restore capability, the ability to save a workbook state and roll back if something goes wrong. The API exists: `general.restoreCheckpoint(id)`. But during live testing, the agent admitted that checkpoint IDs are not surfaced back after creation. You can call the API, but you never get the ID you'd need to actually restore anything.

This is a feature that exists in the code but doesn't work in practice. I appreciate the honesty. Most products would have listed it as a feature on the marketing page without mentioning it's broken.

SEC EDGAR Pipeline: Surprisingly real

I was skeptical when Shortcut claimed "SEC EDGAR integration." Most agents just do a web search and scrape whatever comes up. But Shortcut has a dedicated pipeline:

SHORTCUT'S SEC EDGAR PIPELINE

Pre-built skill file at `/skills/default/sec-edgar/` with a dedicated Python conversion script. Playwright with real Chrome (not Chromium, SEC blocks headless Chromium with 403s) for HTML-to-PDF conversion. Why not parse the HTML directly? SEC filings contain inline XBRL (HTML with embedded XML tags) that creates noise for LLMs. Around 34% of filings have XBRL tagging errors, and the narrative content (MD&A, Risk Factors) isn't captured in XBRL at all. Chrome rendering resolves all of this into a clean visual representation, and the document reader subagent extracts from that. Pragmatic and clever.

Neither Claude nor Microsoft has anything comparable. Claude falls back to web search for SEC filings. Microsoft can't process PDFs at all.

But the SEC pipeline is just one example of something bigger.

Shortcut has a Skills library: a marketplace of community-created and default skills that encode domain expertise as reusable agent workflows. The default skills include SEC EDGAR, financial terminal integrations, PDF extraction, and PowerPoint creation. The community skills are where it gets interesting: `financial-modeling-skill` (254 installs), `formatting-refinement` (157 installs), `deck-builder` (120 installs), `lbo-best-practices` (99 installs), `model-grader` (97 installs), `excel-model-audit` (86 installs).

There are skills for amortization schedules under ASC 805, purchase price allocation, bank statement ETL, Chapter 11 bankruptcy time entries, and nonprofit NTEE code lookups.

Users can also upload their own templates, which I like!

This is the most concrete example I found of what "skills as moat" actually looks like. Each skill is a structured workflow with domain-specific knowledge that makes the agent better at a narrow task than a general-purpose agent could ever be. And the install counts show compounding adoption: the more users install a skill, the more validated it becomes, the more new users trust it.

The DCF Test: Same Prompt, Three Very Different Models

I gave each agent the same prompt on a blank workbook: "Create a detailed 10-year DCF valuation model for Apple (AAPL). I need professional-grade output with assumptions, revenue build-up, free cash flow projections, terminal value, and an implied share price." Same starting point. No additional guidance.

Shortcut: The analyst who asks before building

Shortcut didn't start building. It asked three questions. Sheet layout: single sheet or multi-tab? Revenue granularity: segment-level or top-line only? Terminal value: perpetuity growth, exit multiple, or both? Each question came with a recommendation and a rationale.

The second question was the most telling. Shortcut recommended segment-level revenue (iPhone, Mac, iPad, Wearables, Services) over a blended top-line because "Services is growing 2-3x hardware and carries ~70% gross margins vs ~36% for products. Modeling segments separately gives you a much more credible margin trajectory." That's not a generic suggestion. That's a real modeling insight that changes the output.

After the Q&A, plan mode kicked in. A structured todo list appeared with 7 items. The agent researched Apple's financials, then built section by section: assumptions, segment revenue with individual growth rates per segment per year, income statement with product vs. services COGS split, FCF, terminal value, sensitivity table. It used FY2025 actuals as the base year ($416B revenue). After building, it took screenshots of the finished model and ran them through a vision LLM to verify the formatting looked correct. It saved my preferences to memory for next time.

I opened the file and audited every formula. Zero errors. Every single formula independently verified correct. The sensitivity table is self-contained: each cell recalculates the entire DCF at its specific WACC and terminal growth rate, with proper discounting of both the FCF stream and terminal value. Standard financial border conventions throughout (thin top + double bottom for totals). The only issues are minor: no freeze panes, and interest expense is hardcoded at 0.7% of revenue rather than broken out as an explicit input.

The implied share price: $187 vs. $263 current. The segment build is the reason this number is more defensible than the other two. Services revenue growing faster at higher margins drives a margin expansion story that a blended top-line assumption completely misses.

Claude: The methodical auditor

Claude also asked before building: seven clarifying questions covering base year data, growth rates, WACC, terminal growth, margin assumptions, buyback modeling, and sensitivity tables. Each came with a sensible default. I said "go" and let it run with defaults.

What followed was a long sequence of individual operations. Six web searches to nail down Apple's exact financials from official filings. Multiple calls to create sheets and get sheet IDs. Then section by section: title, assumptions header, assumptions values, projection headers, year headers, growth rates, revenue row, gross margin, COGS, operating expenses, EBIT, FCF bridge, discount factors, terminal value, equity bridge, sensitivity table. Each section was a separate `set_cell_range` call with schema validation.

The sensitivity table was where the architecture showed its seams. Claude's structured tools write to specific ranges. Building a 7x7 sensitivity matrix where each cell contains a formula that recalculates the entire DCF at different WACC and terminal growth combinations required complex cross-sheet formulas. The first attempt failed. Claude had to clear the range and rebuild with a different formula approach using Python to generate the correct formulas. This is the kind of complexity that structured tools aren't optimized for: the tool validates each cell, but it can't help the agent reason about the formula logic itself.

At the end, Claude read back every key output to verify: Revenue $391B, EBIT $123B, Year 10 UFCF $166B, PV of FCFs $846B, PV of Terminal Value $1,022B, Enterprise Value $1,868B, Implied Price $118.43. Every number checked against expectations. The auto-verification from `formula_results` caught errors along the way, but the final read-back was the agent choosing to do a full audit pass, structural safety plus behavioral thoroughness.

I audited the file. All formulas correct. The sensitivity table uses elegant SUMPRODUCT array formulas across two sheets, all 49 cells verified. One issue: Claude defined an "Annual Share Buyback Rate = 2.5%" as a blue input cell but never referenced it in any formula. Shares stay flat at 15,408mm for 10 years. For a company that buys back $90B+ annually, this meaningfully understates per-share value. The input exists, the wiring doesn't. This is the kind of bug that structured tools can't catch: the tool validated the cell write perfectly, but the agent forgot to use the value it created.

Implied share price: $118 vs. $265 current. Used FY2024 as base year and blended top-line growth tapering from 5% to 3%. No segment build means no Services mix shift, which explains why the valuation is lower than Shortcut's.

Copilot: The fast builder who doesn't ask

Copilot didn't ask a single question. It went straight to building. No clarification on assumptions, no layout preferences, no methodology choice. It decided for you: top-line revenue, single sheet, perpetuity growth terminal value.

The build was fast. A handful of massive Office.js scripts, each one constructing an entire section of the model: assumptions and WACC calculation in one block, revenue projections in another, FCF build in another, terminal value and equity bridge in another. Each `Excel.run` block was a complete program that created headers, wrote values, applied formulas, and set formatting in a single transaction.

Copilot also added something the other two didn't: editorial commentary. After the model, it explained why Apple trades above DCF fair value, citing the $110B buyback program, 15%+ Services growth, ecosystem lock-in, and AI optionality. Useful context, but not what was asked for. It's a reminder that with no structured tools constraining the output, the agent decides what "professional-grade" means.

Then I opened the file and audited every formula. This is where the architecture differences become tangible.

The sensitivity table has a structural flaw. Each cell references `$B$77`, which is the sum of PV of FCFs computed at the *base* WACC of 9.04%. When the table varies WACC across rows, it only re-discounts the terminal value. The 10-year FCF stream stays frozen at the base WACC. This is a conceptual error: the table understates how sensitive the implied price is to WACC changes. A correct sensitivity table (like Shortcut's and Claude's) recalculates all 10 discount factors at each WACC. This would not pass a professional audit.
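
The flaw is easy to quantify. With illustrative numbers (a flat $100 FCF stream, not Apple's actuals), freezing the FCF leg at the base WACC visibly mutes the table's response to WACC changes:

```python
def pv_fcfs(fcfs, wacc):
    """Present value of a stream of free cash flows."""
    return sum(f / (1 + wacc) ** t for t, f in enumerate(fcfs, start=1))

def pv_terminal(fcfs, wacc, g):
    """Gordon-growth terminal value, discounted back over the forecast."""
    tv = fcfs[-1] * (1 + g) / (wacc - g)
    return tv / (1 + wacc) ** len(fcfs)

fcfs = [100.0] * 10          # illustrative, not Apple's numbers
base_wacc, g = 0.0904, 0.025

def correct_ev(wacc):
    # Both legs re-discounted at each WACC, as in a proper table.
    return pv_fcfs(fcfs, wacc) + pv_terminal(fcfs, wacc, g)

def frozen_ev(wacc):
    # The bug reproduced: FCF leg stuck at the base WACC, only the
    # terminal value leg re-discounted as WACC varies across rows.
    return pv_fcfs(fcfs, base_wacc) + pv_terminal(fcfs, wacc, g)

for w in (0.08, 0.0904, 0.11):
    print(f"WACC {w:.2%}: correct {correct_ev(w):7.1f}  frozen {frozen_ev(w):7.1f}")
```

At the base WACC the two agree, which is why the model looks right at a glance; away from the base, the frozen version overstates value when WACC rises and understates it when WACC falls, compressing the whole table.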

The methodology notes at the bottom of the model contradict the actual inputs. The notes say revenue growth is 5%/4%/3%, the model uses 6%/5%/3.5%. The notes say terminal growth is 2.5%, the model uses 3.0%. The notes say operating margin is 31.5%, the model uses 32%. Three mismatches. The model was likely adjusted after the notes were written, and without a verification pass, the inconsistency shipped.

There's also a formula error in the FCF growth calculation (cell L58's formula divides by the current year's FCF instead of the prior year's), the Bear/Bull scenario prices are hardcoded text strings rather than computed from the scenario assumptions above them, and the "Projection Period = 10" cell is formatted as a percentage, displaying as "1000.00%".

None of these are catastrophic individually. But collectively they paint a picture. Without auto-verification, without a Python sandbox to cross-check calculations, without vision to catch the formatting bug, errors accumulate. Copilot generates large Office.js programs that write entire sections at once. When a formula is wrong inside a 40-line script, nothing catches it. The tool executed successfully. The code ran. The values are just wrong.

This isn't a fundamental limitation. Microsoft owns Excel and the Office.js API. Adding structured tools with parameter validation, auto-verification that returns computed values after writes, a Python sandbox for cross-checking, these are engineering investments, not research problems. The gap between Copilot's current output quality and what the other two produce will be closed fast imo.
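
The "auto-verification that returns computed values after writes" reduces to a thin loop: write, read the recalculated value back, compare against what the agent expected. Everything in this sketch (the sheet stub, the function names) is hypothetical, not any vendor's actual API:

```python
class SheetStub:
    """Stand-in for a real spreadsheet backend (hypothetical API)."""
    def __init__(self):
        self.computed = {}
    def write(self, addr, formula):
        # A real backend would recalc the workbook; the stub evaluates the
        # arithmetic directly so the verification path can be exercised.
        self.computed[addr] = eval(formula.lstrip("="))  # demo only
    def read(self, addr):
        return self.computed[addr]

def verified_write(sheet, addr, formula, expected, tol=1e-6):
    """Write, then read the computed value back and flag mismatches."""
    sheet.write(addr, formula)
    got = sheet.read(addr)
    return {"cell": addr, "computed": got, "ok": abs(got - expected) <= tol}

sheet = SheetStub()
print(verified_write(sheet, "B5", "=391.0*0.46", expected=179.86))
# A wrong formula still executes fine, but fails the read-back check:
print(verified_write(sheet, "B6", "=391.0/0.46", expected=179.86))
```

This is the layer Copilot is missing: the second call's script "ran successfully" in exactly the sense described above, and only the read-back reveals that the number is wrong.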

Implied share price: $123 vs. $265 current. Similar to Claude's result because it also used blended top-line growth without a segment build.

What the test reveals

The three implied prices ($187, $118, $123) aren't right or wrong. They're outputs of different modeling choices. And the most important modeling choice, segment-level vs. top-line revenue, was only surfaced by the agent that asked before building.

THE DCF TEST: HOW EACH AGENT WORKS

The segment build matters, but the file audit matters more. A finance professional reviewing these three outputs would notice the segment granularity first. Then they'd start checking formulas. Shortcut's model passes a cell-by-cell audit. Claude's model passes with one oversight (the unused buyback input). Copilot's model has a broken sensitivity table, mismatched notes, a formula error, and decorative features that don't compute.

The quality gap maps directly to the architecture. Shortcut's plan mode caught the modeling decision upfront. Its vision verified the output. Its structured todo list kept the build organized. Claude's auto-verification caught formula errors in real time via `formula_results`, and the agent chose to do a full audit read-back at the end. Copilot generated large Office.js scripts with no verification layer, so a wrong formula inside a 40-line block simply ships: the script executes successfully and the incorrect output goes unnoticed.

Every architectural difference discussed in this article showed up in this one exercise. The agent with structured tools caught formula errors. The agent with vision caught formatting issues. The agent with memory will remember that I prefer segment-level builds next time. The agent without those capabilities was fastest but produced the least reliable output.

What This Means for Agent Design

These aren't just Excel agents. They're three different answers to five universal questions that every AI agent must answer:

THE FIVE QUESTIONS EVERY AGENT MUST ANSWER

These questions apply whether you're building an Excel agent, a coding agent, a customer support agent, or a research agent. The answers define your agent's personality: cautious or fast, granular or batched, safe or powerful.

I think about this differently because of what we build at Fintool. We output a lot of Excel today. But our bet is that the future is headless: the end user won't spend much time inside Excel or PowerPoint. These are delivery formats, not the workspace. That changes which tradeoffs matter. If the spreadsheet is the final product where users live and edit, tool-enforced safety and visual verification are critical. If the spreadsheet is an output artifact from a larger agentic workflow, token efficiency and raw speed matter more.

Here's my take after digging into these architectures:

Claude's tool-enforced safety is the most mature design. Overwrite protection that lives in the tool, not the prompt. Verification that happens automatically, not optionally. This is the approach that scales. When you're running millions of agent sessions, you can't rely on the model choosing to be careful every time. The architecture has to enforce it.
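
"Overwrite protection that lives in the tool" reduces to a check the model cannot skip: the write fails unless the target is empty or the caller explicitly opts in. A minimal sketch, with the API shape assumed rather than taken from Anthropic's actual tool:

```python
class OverwriteError(Exception):
    pass

def set_cells(sheet: dict, updates: dict, allow_overwrite: bool = False):
    """Tool-level guardrail: refuse to clobber existing data by default.

    `sheet` is a plain dict of cell -> value standing in for a workbook.
    Because the check runs inside the tool, a careless model call fails
    safely instead of silently destroying data, no matter what the
    prompt said.
    """
    occupied = [a for a in updates if sheet.get(a) not in (None, "")]
    if occupied and not allow_overwrite:
        raise OverwriteError(f"cells already contain data: {occupied}; "
                             "pass allow_overwrite=True to replace them")
    sheet.update(updates)

sheet = {"A1": "Revenue"}
set_cells(sheet, {"B1": 391035})              # empty target: fine
try:
    set_cells(sheet, {"A1": "oops"})          # occupied: blocked
except OverwriteError as e:
    print(e)
set_cells(sheet, {"A1": "Revenue ($mm)"}, allow_overwrite=True)  # explicit
```

The contrast with a prompt-level rule is the whole point: a prompt instruction is advice the model might ignore under pressure, while a tool-level exception is a hard stop that surfaces in the transcript.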

Shortcut's ambition points to where all agents are headed. Vision, persistent memory, simulation, multi-agent orchestration. These aren't gimmicks. An agent that can see its own output, remember your preferences, and test changes before committing them is fundamentally more capable than one that can't. Some of these features are half-baked today (checkpoint is broken, memory is file dumps). They won't be for long.

Microsoft's simplicity is a starting point, not a ceiling. Two tools, raw Office.js, the fastest path from prompt to output and the lowest token cost. The gap between where Copilot is today and where Claude's tooling is can be closed with engineering execution. Microsoft owns the platform and the API surface. If they invest in structured tools, tool-level safety, and a Python sandbox, they can get there fast. That's why I think they will win big.

The future is probably Claude's safety architecture with Shortcut's feature set. Tool-enforced guardrails plus vision plus memory plus simulation. We're not there yet. But the Excel agents I tested are closer to this future than most people realize, and the underlying engineering is more accessible than it looks.

So back to the question I started with: tech edge or distribution play?

Both, but not where you'd expect. The model is a commodity. All three agents run frontier models and the performance gap between them is negligible. The tool architecture (the schemas, validation layers, verification loops, overwrite protection) is impressive engineering and a moat today because most teams haven't built it yet. But it won't be for long. An elite team can build Claude's 14-tool architecture with auto-verification in a few months. The patterns are well understood. None of this is secret.

The real moat is everything above the harness. Skills (like Shortcut's marketplace of community-created workflows: LBO modeling, DCF best practices, model auditing, ASC 805 purchase price allocation, each encoding domain expertise that took years to develop). Persistent memory that learns your preferences over hundreds of sessions. User data that compounds: your formatting conventions, your model templates, your formula patterns. The agents that accumulate this context will be the ones users can't leave. The defensible layer is the one that gets better the more you use it.
