Data Cleaning with Claude: The Audit-First Approach

Telling Claude “clean this spreadsheet” is the obvious move. It’s also how data gets quietly wrecked.

The issue isn’t Claude. It’s the single-shot approach. One prompt, no review, immediate action. Fast, yes. Reliable? Only until it isn’t. And when it isn’t, you usually find out three steps later, after the cleaned file has already been passed downstream, merged into another dataset, or used to generate a report someone is now presenting to leadership.

That’s the real cost of skipping review. Not a minor inconvenience. A silent error that compounds.

The old way vs the right way

Standard approach: paste your file, ask Claude to fix it, assume the output is correct. That assumption is doing a lot of heavy lifting.

In practice, Claude might standardize date formats you didn’t know were inconsistent, strip whitespace from fields where that whitespace was meaningful, or merge records it reads as duplicates when they are actually distinct records with similar names. You won’t notice any of this until something downstream breaks. By then, the original is gone and you have no log of what changed.

The smarter move splits the job in two: audit first, implement second, with a human approval gate in between. No changes happen until you’ve seen exactly what will change.

One CRM export run through this system surfaced 34 undetected duplicate rows, seven different spellings of the same vendor name, and a statistical outlier that manual review had missed entirely. None of that was visible before the audit. The single-shot approach would have silently “fixed” some of it, silently broken the rest, and left no record of what changed or why.
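To make the vendor-name finding concrete, here is a minimal Python sketch of how spelling variants can be grouped for review without merging anything. The sample names, the suffix list, and the helper names (`normalize_key`, `group_variants`) are illustrative assumptions, not data or code from the CRM export described above.

```python
import re
from collections import defaultdict

def normalize_key(name: str) -> str:
    """Collapse case, punctuation, and a few company suffixes into a comparison key."""
    key = name.lower()
    key = re.sub(r"[^a-z0-9 ]+", " ", key)                 # punctuation becomes a space
    key = re.sub(r"\b(inc|llc|ltd|corp|co)\b", " ", key)   # assumed suffix list
    return " ".join(key.split())                           # squeeze repeated spaces

def group_variants(names):
    """Group names by comparison key; report only keys with more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize_key(name)].add(name)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}

# Hypothetical sample data
vendors = ["Acme Corp", "ACME Corp.", "acme corp", "Acme, Inc.",
           "Globex", "Globex LLC", "Initech"]
variants = group_variants(vendors)
for key, spellings in variants.items():
    print(key, "->", spellings)
```

The function only reports candidate groups. Whether "Globex" and "Globex LLC" are really the same vendor is a call for the human reviewer, which is the point of auditing before merging.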

The two-phase method gives you documentation. Every proposed change is logged in the audit report before anything touches the file. That’s the difference between a cleaning job and a cleaning job you can actually verify and defend.

⚙️ How the two-phase system works

Phase 1: Audit only. Claude reads the entire file and generates an inspection report. No edits. Just findings: inconsistent date formats, whitespace issues, duplicate rows, null values, mixed-case inconsistencies. For every proposed fix, it shows before-and-after examples so you know exactly what execution will look like. This report becomes your review document. You can share it with a team member, annotate it, or use it to decide which fixes are worth making at all. Some findings will surprise you. That surprise is the point.
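For a sense of what an audit-only pass produces, here is a rough local sketch in Python. It is not what Claude runs internally; it only illustrates the "findings, no edits" idea. The sample CSV, the column names, the two date patterns, and the function name `audit_rows` are all assumptions, and the sketch assumes every row has a value for every column.

```python
import csv
import io
import re
from collections import Counter

# Assumed date patterns; a real file may need more
DATE_PATTERNS = {
    "YYYY-MM-DD": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "MM/DD/YYYY": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
}

def audit_rows(rows, date_column):
    """Return findings only; the input rows are never modified."""
    findings = {"whitespace": [], "nulls": [], "duplicates": [], "date_formats": Counter()}
    seen = {}
    for i, row in enumerate(rows):
        for col, value in row.items():
            if value != value.strip():
                # Before/after pair so the reviewer sees what the fix would do
                findings["whitespace"].append((i, col, value, value.strip()))
            if value.strip() == "":
                findings["nulls"].append((i, col))
        # Candidate duplicates: rows identical after trimming and lowercasing
        key = tuple(v.strip().lower() for v in row.values())
        if key in seen:
            findings["duplicates"].append((seen[key], i))
        else:
            seen[key] = i
        date_value = row[date_column].strip()
        label = next((name for name, pattern in DATE_PATTERNS.items()
                      if pattern.match(date_value)), "other")
        findings["date_formats"][label] += 1
    return findings

# Hypothetical sample file
SAMPLE = """name,signup_date,city
Ada Lovelace,2024-01-05,London
 Ada Lovelace,2024-01-05,London
Grace Hopper,01/09/2024,
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = audit_rows(rows, "signup_date")
for kind, items in report.items():
    print(kind, items)
```

Duplicates are listed as row-index pairs rather than removed, so the reviewer decides whether two similar rows are one record or two.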

Phase 2: Implement after approval. You review the report. Approve, modify, or reject specific changes. Claude applies only what you signed off on, in sequence. This matters more than it sounds: when changes are applied in a logged sequence, you can trace any output value back to the exact transformation that created it. That’s not possible with a single-shot cleanup where everything happens at once with no paper trail.
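The traceability claim is easier to see in code. The sketch below applies approved fixes one at a time and logs every changed cell, so any output value can be traced back to the fix that produced it. In the workflow described here Claude does the applying; this is a local Python illustration of the same idea, and the fix IDs (`F1`, `F2`, `F3`), the `FIXES` registry, and the sample rows are hypothetical.

```python
import copy

def strip_whitespace(value):
    return value.strip()

def title_case(value):
    return value.title()

# Hypothetical registry: fix ID from the audit report -> (column, transformation)
FIXES = {
    "F1": ("name", strip_whitespace),
    "F2": ("city", title_case),
    "F3": ("name", title_case),
}

def apply_approved(rows, approved_ids):
    """Apply approved fixes in order on a copy and log every changed cell."""
    cleaned = copy.deepcopy(rows)   # the original rows stay untouched
    log = []
    for fix_id in approved_ids:
        column, transform = FIXES[fix_id]
        for i, row in enumerate(cleaned):
            before = row[column]
            after = transform(before)
            if after != before:
                row[column] = after
                log.append({"fix": fix_id, "row": i, "column": column,
                            "before": before, "after": after})
    return cleaned, log

original = [
    {"name": " ada lovelace", "city": "london"},
    {"name": "Grace Hopper", "city": "NEW YORK"},
]
# F3 was rejected during review, so it is simply not in the list
cleaned, log = apply_approved(original, ["F1", "F2"])
for entry in log:
    print(entry)
```

Because the rejected fix never runs, "ada lovelace" keeps its lowercase spelling, and the log shows exactly three cell changes, each tagged with the fix that made it.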

The approval gate is where most of the value lives. It forces a moment of deliberate review that single-prompt workflows skip entirely. Five minutes reading an audit report will catch things that hours of downstream debugging won’t.
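One way to carry the approval decisions forward is to generate the implementation instruction from them, so rejected items are always named rather than implied. This is a small Python sketch under assumptions: the finding labels, the `decisions` mapping, and the extra sentences around the core instruction are illustrative, not a prescribed prompt format.

```python
def build_phase2_prompt(decisions):
    """Build an implementation prompt from review decisions.

    `decisions` maps a finding label to "approve" or "reject".
    """
    unknown = [label for label, verdict in decisions.items()
               if verdict not in ("approve", "reject")]
    if unknown:
        raise ValueError(f"undecided findings: {unknown}")
    approved = [label for label, verdict in decisions.items() if verdict == "approve"]
    rejected = [label for label, verdict in decisions.items() if verdict == "reject"]
    if not approved:
        raise ValueError("nothing approved; there is no change to apply")

    lines = ["Apply only the approved changes from the audit report."]
    lines.append("Approved: " + "; ".join(approved) + ".")
    if rejected:
        # Name rejected items explicitly so they are not reinterpreted as approved
        lines.append("Do NOT apply: " + "; ".join(rejected) + ".")
    lines.append("Leave every other value exactly as it is.")
    return "\n".join(lines)

# Hypothetical review outcome
decisions = {
    "trim whitespace in name": "approve",
    "standardize signup_date to YYYY-MM-DD": "approve",
    "merge rows 12 and 47 as duplicates": "reject",
}
prompt = build_phase2_prompt(decisions)
print(prompt)
```

Raising on an undecided finding is a deliberate choice here: the gate only works if every item in the report gets an explicit verdict.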

The practical steps

1. 📋 Phase 1 prompt: “Audit this file. Do not change anything. Identify all data quality issues and show three before-and-after examples for each proposed fix.”
2. Review the inspection report carefully. Mark which changes to approve or reject. Pay particular attention to duplicate detection and any field that touches names, IDs, or dates. Those are the highest-risk transformations, and the ones most likely to look correct while being subtly wrong.
3. ✅ Phase 2 prompt: “Apply only the approved changes from the audit report.” If you rejected specific items, name them explicitly in the prompt so nothing slips through on re-interpretation.
4. Spot-check the cleaned file against the original. Pull ten rows at random, compare key fields side by side, confirm the approved changes landed correctly, and verify that nothing outside the approved scope shifted.
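The spot-check in the last step can be partly automated. The Python sketch below samples rows at random, compares the original and cleaned versions field by field, and separates changes in approved columns from changes anywhere else. It assumes both files have the same rows in the same order; if approved fixes removed duplicate rows, you would need to align rows by a stable ID first. The function name `spot_check` and the sample rows are hypothetical.

```python
import random

def spot_check(original, cleaned, approved_columns, sample_size=10, seed=0):
    """Compare a random sample of rows; split changes into in-scope and out-of-scope."""
    if len(original) != len(cleaned):
        raise ValueError("row counts differ; align rows by a stable ID before comparing")
    rng = random.Random(seed)   # fixed seed so the check is repeatable
    count = min(sample_size, len(original))
    indices = sorted(rng.sample(range(len(original)), count))
    in_scope, out_of_scope = [], []
    for i in indices:
        for column, before in original[i].items():
            after = cleaned[i].get(column)
            if after != before:
                target = in_scope if column in approved_columns else out_of_scope
                target.append((i, column, before, after))
    return in_scope, out_of_scope

# Hypothetical data: only whitespace trimming in "name" was approved
original = [
    {"id": "1", "name": " Ada", "amount": "10"},
    {"id": "2", "name": "Grace", "amount": "20"},
    {"id": "3", "name": "Alan ", "amount": "30"},
]
cleaned = [
    {"id": "1", "name": "Ada", "amount": "10"},
    {"id": "2", "name": "Grace", "amount": "2"},   # unapproved change
    {"id": "3", "name": "Alan", "amount": "30"},
]
in_scope, out_of_scope = spot_check(original, cleaned, approved_columns={"name"})
print("approved changes:", in_scope)
print("unexpected changes:", out_of_scope)
```

Anything in the second list is a change nobody signed off on, and it should be investigated before the file moves downstream.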

Four steps instead of one magic prompt and a prayer. The added time on a moderately complex file is roughly 10 to 15 minutes. The time saved from not debugging a silently corrupted dataset used in a live report is considerably more.

Where this works best

Structured tabular data: CRM exports, financial records, HR files. The more columns, the more rows, and the more people who have touched the file over time, the more this workflow earns its keep. Files that have been exported from one system and re-imported into another are particularly messy candidates. Naming conventions drift, formats shift, blank fields accumulate. That’s exactly the kind of noise the audit phase catches before it causes problems in analysis.

This workflow is not designed for unstructured text, or for regulated data with specific compliance requirements. If your file contains data covered by HIPAA or falls within the scope of a SOC 2 audit, review what you’re allowed to pass to an LLM before anything else. The workflow itself is sound; the data governance question is separate.

If spreadsheets pass through multiple hands before analysis, this workflow catches what manual review misses. Try the audit prompt on your next messy file. The report alone justifies the extra step.

The wrong way to use AI on your data: “Claude, clean this spreadsheet.”
by u/IntelligentSam5 in PromptEngineering
