I want to start by being honest about something: this experiment was my idea. I pitched it internally, I championed it, and when it started failing in spectacular ways I had to stand in front of the team and explain why. This post is the writeup I wish I'd read before embarking on it — a no-spin account of what actually happened when we decided to rip out a battle-tested ETL pipeline and hand the keys to a large language model.
We build data products at Ryshe. A substantial part of what we do involves ingesting messy, heterogeneous data from clients — CRMs, legacy databases, third-party vendor exports, scanned PDFs, free-text fields that someone thought were a good idea in 2009 — and transforming it into something clean, typed, and queryable. Our traditional pipeline had been doing this job reliably for years. It was ugly. It was a thousand lines of regex, pandas transforms, and hand-written schema validators. But it worked.
Then GPT-4o dropped, Claude 3 Opus followed, and suddenly everyone in the office was asking the same question: why are we spending all this engineering time writing transformation rules when we could just ask an LLM to figure it out?
Reader, we tried. Here is what happened.
The Old Pipeline: Honest About Its Sins
Before I describe what we replaced, let me be fair about what we were replacing. Our legacy ETL stack looked something like this:
Raw Ingestion Layer (S3 + Lambda triggers)
↓
Format Normalization (CSV/JSON/XLSX → Parquet via Glue)
↓
Field Extraction (hand-written Python rules per client schema)
↓
Type Coercion & Validation (Great Expectations suite, ~400 checks)
↓
Deduplication (fuzzy matching via recordlinkage library)
↓
Output to Redshift (dbt models on top)
The pipeline processed roughly 14 million records per day across 40-odd client data sources. End-to-end latency was around 4–8 minutes for a full batch. Cost ran about $220/day in compute — a number our CFO reminded me of on a regular basis but that we'd learned to live with.
The real pain wasn't cost. It was the maintenance tax. Every time a client changed their CRM export format — and they change it constantly, without warning, because the world is chaos — someone on my team spent half a day debugging failed transforms, updating field mappings, writing new regex patterns for phone numbers that now came in 17 different formats instead of 16. Over a year, we'd tracked roughly 340 hours of engineering time on pipeline maintenance alone. At fully-loaded eng costs, that's over $50,000 in labor, doing work a smart enough system should handle automatically.
That was my pitch. Three hundred forty hours. The LLM can read a new schema, figure out what maps to what, and handle the edge cases — no more manual rules. I genuinely believed it.
Why LLMs Seemed Like the Right Answer
To be clear, the reasoning wasn't crazy. Large language models are genuinely extraordinary at a few things that are core to data pipeline work:
- Reading unstructured text and understanding intent, not just pattern-matching against it
- Schema inference — look at 20 rows of data and tell me what these columns probably represent
- Format normalization — "March 4th, 2024" and "04-03-2024" and "20240304" should all become the same thing
- Field-level disambiguation — is "CUST_REF" a customer ID or a reference number? Context usually makes it obvious to a human. Could it be obvious to an LLM?
- Natural language fields — free-text address parsing, name extraction from mixed-format strings, note classification
We'd already seen GPT-4o handle some of these tasks impressively in one-off demos. The leap from "impressive demo" to "production pipeline" is where most LLM experiments die. We knew that going in. We thought we were being careful. We were not careful enough.
What We Built
We architected what we called the "LLM Transform Layer" — a drop-in replacement for stages 3 and 4 of our pipeline (field extraction and validation). The idea was to keep the raw ingestion and output layers unchanged, but replace the brittle rule-based middle with an LLM that could handle any schema thrown at it.
Raw Ingestion Layer (unchanged)
↓
Format Normalization (unchanged)
↓
[ NEW ] LLM Transform Layer
├── Schema Inference Agent (maps source fields → target schema)
├── Record Transform Agent (applies mapping, coerces types)
└── Validation Agent (flags anomalies, low-confidence extractions)
↓
Deduplication (unchanged)
↓
Output to Redshift (unchanged)
Each "agent" was a structured call to the model with a carefully engineered system prompt and JSON-mode output. We used Claude 3.5 Sonnet for the bulk of the work (cheaper, faster) and escalated ambiguous cases to Claude 3 Opus. We processed records in batches of 50 to reduce per-call overhead.
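Schematically, the batching and request assembly looked something like the sketch below. The prompt text and function names are illustrative stand-ins, and the actual provider call is omitted:

```python
import json
from typing import Iterator

BATCH_SIZE = 50  # records per model call, to amortize per-call overhead

# A condensed stand-in for the real, carefully engineered system prompt.
SYSTEM_PROMPT = (
    "You are a data transform agent. Map each source record to the target "
    "schema. Return a JSON array with one transformed record per input, "
    "each carrying a 'confidence' score between 0 and 1."
)

def batched(records: list[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Chunk records into fixed-size batches for per-call submission."""
    for i in range(0, len(records), size):
        yield records[i : i + size]

def build_request(batch: list[dict], target_schema: dict) -> dict:
    """Assemble one structured model call (the API call itself is elided)."""
    return {
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": json.dumps({"target_schema": target_schema,
                                   "records": batch}),
        }],
    }
```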
We ran a two-week shadow mode — the old pipeline ran in parallel, we compared outputs, we measured accuracy. Then we flipped the switch for a subset of client data. Here is where things got interesting.
What Worked Surprisingly Well
Unstructured Data Parsing
This was the genuine success story, and I want to give credit where it's due. Our legacy pipeline had a 63% success rate on free-text address fields — an embarrassingly low number that we'd tolerated for years because fixing it properly would have required a dedicated ML project. The LLM hit 91% accuracy on the same fields out of the box, no fine-tuning, no examples. It could read "apt 4b, 1200 w michigan ave, chicago il" and produce a correctly structured address object. It handled international formats, PO boxes, suite numbers embedded in weird strings. It was genuinely better than anything we'd built.
Similarly for note classification — we had a client with a "comments" field in their CRM that contained everything from complaint text to internal agent codes to full customer messages. Classifying these with regex was a nightmare. The LLM categorized them with 87% accuracy (verified against a human-labeled test set of 500 records), compared to our old heuristic approach's 54%. That's not a marginal gain. That's a different product.
Schema Inference on Novel Sources
When we onboarded a new client under the old system, a data engineer would spend 2–4 hours writing field mappings. With the LLM schema inference agent, we reduced that to about 20 minutes of human review and approval of a machine-generated mapping. The model would examine a sample of 100 rows and produce a mapping document like:
{
"source_field": "cust_nm",
"target_field": "customer.full_name",
"confidence": 0.96,
"reasoning": "Field contains space-separated strings consistent with personal names. Abbreviation 'nm' strongly suggests 'name'. Sample values confirm.",
"transform": "title_case"
}
The confidence scores were actually well-calibrated. High-confidence mappings (above 0.9) were correct 94% of the time in our validation set. We built a review UI where engineers only had to look at the low-confidence items. For a typical new client onboarding with 60–80 source fields, only 8–12 required human attention. We reduced onboarding time from half a day to under an hour. That part of the pitch was completely vindicated.
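The review routing itself is trivial once the model emits calibrated confidence. A minimal sketch, using the threshold from our validation numbers:

```python
REVIEW_THRESHOLD = 0.9  # mappings above this were correct 94% of the time

def route_mappings(mappings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split machine-generated field mappings into an auto-approved list
    and a human-review queue, based on the model's stated confidence."""
    auto, review = [], []
    for m in mappings:
        (auto if m["confidence"] >= REVIEW_THRESHOLD else review).append(m)
    return auto, review
```

The review UI then shows engineers only the second list, which for a typical onboarding was 8–12 items out of 60–80 fields.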
Handling Format Chaos
Date formats, phone formats, currency strings — the LLM handled the long tail of format variations far better than our regex library. We tested against a dataset of 10,000 phone numbers in the wild and hit 96% normalization accuracy vs. 79% for our legacy rules. For dates, it was 98% vs. 91%. The accuracy gap matters, but the difference in failure modes mattered more: our regex failed unpredictably on edge cases, while the LLM's failures were more evenly distributed and easier to catch.
What Broke Badly
Now we get to the part of the post I'm less proud of. Not because the problems were unforeseeable — several of them were obvious in retrospect — but because I moved too fast and didn't stress-test hard enough before giving this system client-facing responsibility.
Determinism: The Quiet Killer
Our old pipeline was boring in the best possible way. Given the same input, it produced the same output. Always. Every time. This is not a trivial property. It means you can replay batches, debug failures, audit outputs, and trust that reprocessing a record from three weeks ago will give you the same result as it did then.
The LLM pipeline was not deterministic, even with temperature set to 0. In practice, temperature=0 reduces but does not eliminate variation in modern models — hardware-level floating point differences, batching effects, and API-side changes mean you will occasionally get different outputs for identical inputs. In our testing, about 0.3% of records produced different outputs on reprocessing. That sounds small. At 14 million records per day, that's 42,000 records with non-reproducible behavior. Per. Day.
We only caught this because we built a reconciliation check into our output layer. But it raised a deeper problem: if a client asked us to explain why a particular record was transformed a specific way, the honest answer sometimes became "we asked an AI and this is what it said." That is not an acceptable audit trail for clients in regulated industries, and several of ours are.
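The reconciliation check itself can be as simple as fingerprinting each transformed record and diffing fingerprints across runs. A minimal sketch, with helper names that are mine rather than our production code:

```python
import hashlib
import json

def record_fingerprint(record: dict) -> str:
    """Stable hash of a transformed record; key order must not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def find_nonreproducible(first_run: dict, second_run: dict) -> list:
    """Given two runs keyed by record ID, return the IDs whose transformed
    output changed between runs -- the non-deterministic residue."""
    return [
        rid for rid, rec in first_run.items()
        if rid in second_run
        and record_fingerprint(rec) != record_fingerprint(second_run[rid])
    ]
```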
The Cost Explosion
I need to be specific here because the numbers are instructive.
Claude 3.5 Sonnet (our workhorse model) runs at approximately $3 per million input tokens and $15 per million output tokens at standard API pricing. Our average record, after formatting into a prompt with schema context and examples, consumed roughly 800 input tokens. Our batches of 50 records sent about 40,000 tokens per call. Output per batch averaged 6,000 tokens.
Do the math for 14 million records per day:
Input: 14,000,000 records × 800 tokens = 11.2B tokens/day
11.2B / 1M × $3.00 = $33,600/day
Output: 14,000,000 records × ~120 tokens = 1.68B tokens/day
1.68B / 1M × $15.00 = $25,200/day
Total API cost: ~$58,800/day
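The same arithmetic as a few lines of Python, using only the numbers above:

```python
# Reproducing the back-of-envelope calculation with the post's own figures.
RECORDS_PER_DAY = 14_000_000
INPUT_TOKENS_PER_RECORD = 800
OUTPUT_TOKENS_PER_RECORD = 120      # 6,000 output tokens / 50-record batch
SONNET_INPUT_PER_M = 3.00           # USD per million input tokens
SONNET_OUTPUT_PER_M = 15.00         # USD per million output tokens
LEGACY_COST_PER_DAY = 220.00        # legacy compute, USD/day

def daily_api_cost(records, in_tok, out_tok, in_price, out_price):
    input_cost = records * in_tok / 1_000_000 * in_price
    output_cost = records * out_tok / 1_000_000 * out_price
    return input_cost + output_cost

cost = daily_api_cost(RECORDS_PER_DAY, INPUT_TOKENS_PER_RECORD,
                      OUTPUT_TOKENS_PER_RECORD,
                      SONNET_INPUT_PER_M, SONNET_OUTPUT_PER_M)
# 33,600 input + 25,200 output = 58,800 USD/day, versus $220/day legacy
```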
Compared to our legacy compute cost of $220/day. We were spending 267 times more to process the same data. Our monthly infrastructure bill went from roughly $6,600 to $1.76 million. When I put that spreadsheet in front of our CFO, she stared at it for a long moment and then very quietly asked me to leave her office.
We had modeled the cost increase, but we'd been working from demo-scale assumptions. At demo scale with a few hundred records, the numbers felt manageable. Nobody had extrapolated properly to production volume. That is a failure of process on my part, and it cost us two weeks of scrambling to contain the damage.
We tried prompt compression — getting the average input down from 800 to 400 tokens through aggressive context trimming — and switched more processing to GPT-4o Mini ($0.15 per million input, $0.60 per million output), accepting a drop in accuracy. Even with those changes, the optimized cost was around $4,200/day. Still 19x more expensive than before, for comparable throughput.
Hallucinated Data: The One That Kept Me Up at Night
This is the one I find hardest to write about because the consequences were real.
We had a client in healthcare-adjacent data services. Their pipeline included patient intake forms with free-text fields — not PHI, but structured enough that accuracy mattered. During a batch run three weeks into the experiment, our validation agent failed to flag a set of records where the transform agent had hallucinated plausible-but-wrong values for a medication dosage field. The source field was ambiguous — it contained mixed-format entries — and the model generated a reasonable-looking normalized value that was, in about 40 cases out of a 12,000-record batch, simply wrong.
Our confidence thresholds caught 31 of those 40 cases. Nine slipped through to the output layer. We caught them in our downstream reconciliation check before they left our system, but only because that check happened to run before the client's scheduled data sync. The failure window was about 6 hours.
Nine records out of 12,000 is a 0.075% error rate. By some measures, that's acceptable. But these weren't formatting errors. The model had made up values that looked real. In a medical context, that is categorically different from a parsing failure. A parsing failure produces a null or an obvious error. A hallucinated value produces a plausible lie, which is the most dangerous kind of error a data system can produce.
We immediately pulled the LLM layer from any pipeline touching sensitive categorical fields. That decision was not difficult to make. What was difficult was realizing we should have made it before we started.
Latency at Scale
Our legacy pipeline processed 14 million records in a 4–8 minute batch window. The LLM pipeline, processing 50 records per API call with the parallelism we could safely run (rate limits are brutal at scale), completed the same volume in 47–90 minutes, depending on API response times. p99 latency for a single batch call was 8.2 seconds. Our downstream consumers expected data to be available within 15 minutes of raw ingestion. We blew that SLA consistently.
We tried every mitigation we could find: aggressive parallelism up to the rate limit ceiling, request coalescing, local caching of schema inference results so we didn't re-derive mappings on every batch. The caching helped enormously — schema inference dropped from 40% of our token spend to under 8% once we cached mapping results per source schema. But the fundamental latency problem remained. LLM API calls have a minimum floor that no amount of engineering can get below.
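The schema-inference cache amounts to memoizing the mapping per schema version. A simplified sketch in which the model call is a hypothetical stand-in:

```python
import hashlib
import json

# Cache inferred mappings per source schema, so inference runs once per
# schema version rather than once per batch.
_mapping_cache: dict[str, dict] = {}

def schema_key(column_names: list[str]) -> str:
    """A schema 'version' here is just a hash of its ordered column names."""
    return hashlib.sha256(json.dumps(column_names).encode()).hexdigest()

def get_mapping(column_names: list[str], infer_mapping_via_llm) -> dict:
    """Return the cached mapping, paying for the model call only on a
    cache miss (i.e., a new or changed source schema)."""
    key = schema_key(column_names)
    if key not in _mapping_cache:
        _mapping_cache[key] = infer_mapping_via_llm(column_names)  # paid call
    return _mapping_cache[key]
```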
The Hybrid Architecture We Settled On
After six weeks, two emergency late-night calls with the team, one very uncomfortable client conversation, and more Slack messages than I want to count, we landed on what I now think is the actually correct architecture. It's not "LLMs everywhere" and it's not "LLMs nowhere." It's surgical.
Raw Ingestion Layer (unchanged)
↓
Format Normalization (unchanged)
↓
┌─────────────────────────────────────────────────────┐
│ Hybrid Transform Layer │
│ │
│ LLM Path (async, low-volume, high-value): │
│ ├── Schema inference on new/changed sources │
│ ├── Free-text field parsing (address, notes, names)│
│ └── Ambiguous field disambiguation (flagged cases) │
│ │
│ Rules-Based Path (sync, high-volume, deterministic)│
│ ├── Structured field transforms (dates, phones) │
│ ├── Type coercion (numeric, boolean, enum) │
│ └── Required field validation │
│ │
│ Router: confidence-based dispatch │
└─────────────────────────────────────────────────────┘
↓
Validation Gate (Great Expectations, restored)
↓
Deduplication (unchanged)
↓
Output to Redshift (unchanged)
The key insight is that about 80% of our records are fully structured — dates, phone numbers, IDs, amounts — and the old rules-based transforms handled them perfectly well. There was never a reason to send those to an LLM. The LLM's value is entirely concentrated in the 20% of records that contain unstructured or ambiguous content.
For that 20%, we now run LLM processing asynchronously and cache aggressively. Schema inference happens once per source schema version and is stored. Free-text parsing results for common patterns are memoized. Our effective token consumption dropped to roughly 400 million input tokens per day — less than 4% of our peak consumption — at a cost of around $1,200/day. That's still 5.5x more expensive than the legacy compute cost, but the accuracy gains on unstructured fields justify it. We've quantified the value of that accuracy improvement at roughly $8,000/month in reduced manual data correction labor. The math works.
For sensitive categorical fields — anything where a hallucinated value would be worse than a null — the LLM is not in the loop. Full stop. Those go through deterministic rules with a human review queue for failures.
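Put together, the router reduces to a small dispatch function. This is an illustrative sketch: the field categories and the sensitive-field list below are invented for the example, not our actual configuration:

```python
# Field categories are illustrative; our real taxonomy is per-client.
STRUCTURED = {"date", "phone", "id", "amount", "boolean", "enum"}
FREE_TEXT = {"address", "notes", "name"}
SENSITIVE = {"dosage", "diagnosis_code"}  # hallucination worse than a null

def route_field(field_type: str, rule_confidence: float) -> str:
    """Confidence-based dispatch for the hybrid transform layer."""
    if field_type in SENSITIVE:
        return "rules+human_review"  # the LLM is never in this loop
    if field_type in FREE_TEXT:
        return "llm_async"           # high-value unstructured work
    if field_type in STRUCTURED and rule_confidence >= 0.9:
        return "rules"               # the deterministic ~80% path
    return "llm_async"               # ambiguous cases escalate to the LLM
```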
Lessons Learned, Numbered Because I'm Not Above Being Didactic
1. Model Cost Math at Demo Scale Is Lying to You
Do the production math before you build anything. Take your actual daily record volume, estimate your token consumption per record conservatively, and price it out. If the number is shocking, good — you should be shocked before you build, not after. We should have done this on day one. We did not.
2. Determinism Is a Feature, Not an Accident
If your data pipeline outputs need to be reproducible, auditable, or explainable, you need deterministic components. LLMs are not deterministic in the engineering sense of that word. Treat them as non-deterministic systems and design your architecture accordingly — which means keeping them out of any path where reproducibility is required.
3. Hallucination Risk Is Not Uniform
Hallucination risk scales with the ambiguity and sparsity of the input, and inversely with how checkable the output is. Free-text address parsing is low-risk: a hallucinated address is obviously wrong. Numeric field generation in a domain the model isn't confident about is high-risk: a hallucinated dosage value looks exactly like a real one. Map your data fields against this risk profile before deciding what an LLM touches.
4. Confidence Scores Are Meaningful — Use Them
The one thing we got right from the start was asking models to output explicit confidence scores and routing low-confidence records to human review. This is not optional. It's the only mechanism you have to contain blast radius when the model is wrong. A pipeline that sends everything through without confidence gating is an incident waiting to happen.
5. Schema Inference Is the Killer App for Data Pipelines
The one use case where LLMs delivered unambiguous ROI with no meaningful downside: inferring a mapping from a novel source schema to our target schema. This is exactly the kind of task that plays to a language model's strengths — pattern recognition, reasoning from context, handling ambiguity gracefully — and the cost profile is excellent because you do it once per schema, not once per record. If you're building data infrastructure and you're not using an LLM for onboarding-time schema inference, you are leaving real engineering hours on the table.
6. The Maintenance Tax Is Real, But So Is the Inference Tax
We replaced a high maintenance-tax system with a high inference-tax system, and in doing so we shifted cost from engineers to API bills. Whether that trade is worth it depends entirely on your scale, your data profile, and your cost structure. For us, in its original all-in form, it was not. In its hybrid form, it is. The answer is not universal.
Where We Are Now
Three months after starting the experiment and six weeks after completing the hybrid rollout, here is the honest scorecard:
- Unstructured field accuracy: up from 63% to 89% (address parsing, note classification, name extraction)
- New client onboarding time: down from 4 hours to 45 minutes
- Pipeline maintenance incidents: down 60% — the LLM handles most schema drift automatically now
- Daily infrastructure cost: $1,420/day (up from $220, but justified by labor savings and accuracy gains)
- End-to-end latency: 6–11 minutes (slightly higher than before due to async LLM processing, within SLA)
- Hallucination incidents reaching output: zero in the past 8 weeks (enforced through field-type routing and confidence gating)
I'm not embarrassed by where we landed. I'm a little embarrassed by how we got there. The experiment was valuable, the post-mortem was painful, and the architecture we have now is meaningfully better than what we started with. It's also very different from what I originally pitched.
The best AI integrations I've seen are ones where someone sat down and asked: "What is this model genuinely better at than my existing system, and what is it worse at?" Then built accordingly. The worst ones are the ones where someone asked: "What would happen if we replaced everything with AI?" and then found out.
We found out. I'd rather you learn it from a blog post.
If you're going through something similar — evaluating LLMs for production data infrastructure, trying to figure out where the boundaries are, or just trying to explain to a CFO why your AWS bill tripled last month — I'm genuinely happy to talk through it. This is exactly the kind of problem we work on at Ryshe, and we've now got the scar tissue to prove it.