The narrative around fine-tuning large language models has an appealing internal logic. You have a foundation model that knows everything but no one thing deeply. You have proprietary data that encodes your domain's nuance. Training the model on your data seems like an obvious path to a model that knows your domain as well as a foundation model knows everything else. The math seems right. The intuition seems solid.

In practice, for the overwhelming majority of production use cases, it doesn't work the way you expect — and the failure modes are expensive, subtle, and often invisible until you're already deep in.

I am not a fine-tuning skeptic on principle. There are real, valid applications where it's the right tool. But those applications are rarer than the discourse suggests, and the alternatives are more powerful than most teams appreciate. This post is an attempt to save you from making the mistakes I made — in detail, with real numbers, and with a framework for deciding when fine-tuning is actually warranted.

The Eight Experiments That Taught Me This

Between late 2023 and mid-2025, across two products and three client engagements, I ran or closely oversaw eight distinct fine-tuning efforts. Here is what each one was trying to accomplish and what actually happened:

  1. Domain-specific legal clause extraction. Goal: extract specific clause types from contracts with higher precision than a prompted base model. Outcome: marginal precision improvement (~4 points F1) at a cost of $6,200 in compute and six weeks of data labeling. A well-engineered few-shot prompt achieved equivalent precision in three days.
  2. Customer support tone adaptation. Goal: match a specific brand voice and deflection pattern. Outcome: the fine-tuned model learned the tone but catastrophically lost instruction-following reliability. Hallucination rate increased 38% compared to base model. Rolled back entirely.
  3. Code generation for a proprietary API. Goal: generate correct usage of internal SDK methods. Outcome: initial accuracy was strong, but the base model the client fine-tuned on was updated by the provider six weeks after deployment. The fine-tune became incompatible with the new base. $11,000 sunk and they restarted with a RAG approach.
  4. Medical terminology normalization. Goal: map clinical free text to ICD-10 codes with higher accuracy. This one partially worked — domain vocabulary did improve. But maintaining the training set as coding guidelines changed became a full-time burden. The model's effective accuracy degraded below baseline within four months of initial deployment without continuous retraining.
  5. Financial report summarization in a specific format. Goal: consistent structured output matching an internal template. Outcome: achievable in three hours with a well-crafted system prompt and output schema validation. No fine-tuning needed. This one never should have been scoped as a fine-tuning problem.
  6. Multi-language customer intent classification. Goal: classify support intent in six languages with higher accuracy than the base model. Outcome: The base model, with appropriate few-shot examples per language, performed within 2% accuracy of the fine-tuned version. The fine-tune cost $8,400 to reach that 2%.
  7. Embedding model fine-tuning for semantic search. This is the most honest success story in the list. Fine-tuning an embedding model (not a generative model) on domain-specific document pairs genuinely improved retrieval quality by a meaningful margin. I'll come back to this in the "when fine-tuning works" section.
  8. Reducing verbosity in generated responses. Goal: train the model to produce shorter, more direct answers. Outcome: achieved equally well through system prompt instruction plus a post-generation length check. Spent three weeks fine-tuning before someone on the team suggested trying the prompt approach first. That suggestion cost me thirty seconds to test.

The running total across these eight: approximately $40,000 in GPU compute, plus engineering hours that dwarf that number. Definitive successes: one (the embedding model). Partial successes with unacceptable maintenance cost: one. The remaining six: clear failures or unnecessary expenditures.

Why Fine-Tuning Fails: The Three Core Problems

1. The Data Problem

The most common reason fine-tuning efforts fail is that the training data is not what practitioners think it is. The amount of high-quality, labeled, domain-specific data required to meaningfully shift a foundation model's behavior is substantial — typically thousands of examples at minimum, and often tens of thousands for the kinds of nuanced behavioral changes teams are trying to achieve.

Most organizations dramatically overestimate the quality and quantity of the data they have. "We have years of customer support tickets" sounds like a rich training set. In practice, those tickets are inconsistently formatted, frequently incorrect, contain confidential information that needs to be scrubbed, and reflect historical behavior that may not represent what you want the model to do going forward. The labeling effort required to turn raw organizational data into a clean fine-tuning set is almost always significantly larger than initial estimates — and the quality of the resulting model is directly constrained by the quality of that labeling.

A rough heuristic: if you can't afford a dedicated person for at least six weeks doing nothing but data curation and labeling, you probably don't have the data infrastructure for a reliable fine-tune.
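The curation work itself starts with an audit. Below is a minimal sketch of the kind of quality gate worth running before any labeling effort begins; the field names ('input', 'label') and the length threshold are assumptions for illustration, not a standard:

```python
# Minimal training-set audit: duplicates, degenerate inputs, label skew.
# The 'input'/'label' keys and min_len threshold are illustrative assumptions.
from collections import Counter

def audit_training_set(examples: list[dict], min_len: int = 10) -> dict:
    seen = set()
    duplicates = 0
    too_short = 0
    label_counts = Counter()
    for ex in examples:
        key = ex['input'].strip().lower()
        if key in seen:
            duplicates += 1
        seen.add(key)
        if len(ex['input']) < min_len:
            too_short += 1
        label_counts[ex['label']] += 1
    return {
        'n': len(examples),
        'duplicates': duplicates,
        'too_short': too_short,
        'label_distribution': dict(label_counts),
    }
```

Running something like this on "years of customer support tickets" is usually the moment the six-week estimate becomes a six-month estimate.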

2. The Cost Structure Problem

Fine-tuning is not a one-time expense. This is the part that catches most teams by surprise. The compute cost of the initial training run is just the beginning. What follows is a recurring stream of costs: periodic retraining as data and requirements drift, evaluation infrastructure to catch degradation, and the hosting burden you take on once you leave the provider's API.

When you build on a foundation model's API, that provider absorbs the hosting, scaling, and infrastructure maintenance costs. The per-token rate includes all of that. When you fine-tune and host your own model, you've internalized those costs. For teams without ML platform infrastructure already in place, this is rarely a good trade.
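To make that trade concrete, here is a back-of-envelope break-even calculation. Every number you would feed it (API rate, GPU cost, sustained throughput, fixed monthly overhead) is a placeholder you'd replace with your own:

```python
# Back-of-envelope break-even: at what monthly token volume does
# self-hosting a fine-tuned model undercut an API? All inputs are
# placeholder assumptions - plug in your own numbers.

def monthly_break_even_tokens(
    api_cost_per_1k: float,      # blended $ per 1K tokens on the API
    gpu_cost_per_hour: float,    # self-hosted GPU $/hour
    tokens_per_second: float,    # sustained throughput of your deployment
    fixed_monthly: float = 0.0,  # monitoring, retraining amortization, etc.
) -> float:
    # Self-hosted cost per 1K tokens, assuming the GPU runs at capacity.
    hosted_cost_per_1k = gpu_cost_per_hour / (tokens_per_second * 3600) * 1000
    if api_cost_per_1k <= hosted_cost_per_1k:
        return float('inf')  # the API is cheaper at any volume
    # Volume at which fixed costs amortize below the API rate.
    return fixed_monthly / (api_cost_per_1k - hosted_cost_per_1k) * 1000
```

The instructive part is how rarely the math works out: with realistic fixed costs, the break-even volume lands in the billions of tokens per month, which most products never reach.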

3. Model Drift and Provider Updates

This is the least discussed failure mode and the one I've seen cause the most acute production incidents. Foundation model providers update their base models. Sometimes this is announced. Sometimes the update is minor enough that it doesn't trigger a formal version change. Either way, if you've fine-tuned on a specific base model version and that base is updated, your fine-tune may degrade in unpredictable ways.

More insidiously, even without provider updates, the distribution of inputs in production drifts over time. Your training data reflects historical patterns. Your users evolve. The result is that a fine-tuned model that performs well at launch can quietly degrade over months — and unlike a prompted model, where you can quickly adjust a system prompt, correcting a degraded fine-tune requires re-running the full training pipeline.

Model drift is the silent killer of fine-tuned deployments. You don't notice it until the degradation has already accumulated enough to cause visible user impact — and by then, the fix is weeks away, not hours.
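One cheap early-warning signal is to compare the embedding distribution of recent production inputs against your training set. A minimal sketch, assuming you already produce embeddings somewhere in your pipeline; the centroid comparison and the alert threshold are deliberate simplifications of real drift monitoring:

```python
# Minimal drift check: cosine distance between the centroid of recent
# production input embeddings and the training-set centroid. Real
# monitoring would track full distributions; the 0.05 threshold is an
# assumption to tune empirically.
import numpy as np

def drift_score(train_emb: np.ndarray, prod_emb: np.ndarray) -> float:
    # 0 = identical centroids; larger = more drift.
    a = train_emb.mean(axis=0)
    b = prod_emb.mean(axis=0)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos)

def drift_alert(train_emb: np.ndarray, prod_emb: np.ndarray,
                threshold: float = 0.05) -> bool:
    return drift_score(train_emb, prod_emb) > threshold
```

Even a crude check like this, run weekly, turns "weeks away" into "caught before users noticed."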

What Actually Works: Four Alternatives

Prompt Engineering (Seriously)

I need to spend time on this because the term "prompt engineering" has been debased by association with the worst kind of cargo-cult thinking — "just add 'think step by step' and it works great!" — to the point where serious practitioners sometimes dismiss it reflexively.

Rigorous prompt engineering is an actual discipline with measurable outcomes. It involves systematic evaluation of prompt variants against a fixed test set, structured documentation of what each prompt component contributes, and iterative refinement based on error analysis rather than intuition.

# Example: Structured prompt evaluation framework
from typing import Callable

def evaluate_prompt_variant(
    prompt_template: str,
    test_cases: list[dict],
    scorer: Callable[[str, str], float],
    model_fn: Callable[[str], str]
) -> dict:
    results = []
    for case in test_cases:
        prompt = prompt_template.format(**case['inputs'])
        output = model_fn(prompt)
        score = scorer(output, case['expected'])
        results.append({
            'input': case['inputs'],
            'output': output,
            'expected': case['expected'],
            'score': score
        })

    return {
        'mean_score': sum(r['score'] for r in results) / len(results),
        'min_score': min(r['score'] for r in results),
        'failures': [r for r in results if r['score'] < 0.7],
        'n': len(results)
    }

In the experiments I described above, five of the eight fine-tuning efforts were solved equivalently or better by systematic prompt engineering. The time investment was 5-10% of what the fine-tuning approach required. The maintenance cost was a fraction. And the iteration cycle was days, not weeks.

Before you open a fine-tuning job, you should be able to demonstrate that you've exhausted the prompt engineering path with the same rigor you'd apply to model training. Most teams haven't.

Few-Shot Learning

Few-shot learning is placing exemplars — examples of the input-output behavior you want — directly in the context window. This is not the same as fine-tuning, but it achieves a similar effect: the model sees demonstrations of desired behavior and generalizes from them.

The advantage is immediacy and flexibility. You can update your examples without any training. You can A/B test different example sets by changing a config file. You can add domain-specific examples without any infrastructure beyond your existing inference pipeline.

The limitation is context window size and cost. If you need fifty examples to adequately constrain the model's behavior, that's a lot of tokens per inference call. For high-volume workloads, the economics can tilt back toward fine-tuning. But for most use cases, five to fifteen carefully chosen examples — selected from real cases that cover the distribution of inputs you actually see — dramatically outperform the kind of zero-shot or poorly-prompted approaches teams often compare against when justifying a fine-tuning project.

# Good few-shot construction: cover the distribution, not just the easy cases
import numpy as np
from sklearn.cluster import KMeans

def build_few_shot_examples(
    training_data: list[dict],
    embed_fn,                 # callable: list[str] -> np.ndarray (one row per text)
    score_fn=None,            # callable: dict -> float, current model quality on a case
    n_examples: int = 10,
    strategy: str = 'diverse'
) -> list[dict]:
    if strategy == 'diverse':
        # Cluster the input embeddings and take the example nearest each
        # centroid, so the shots cover the input distribution rather than
        # just high-confidence / easy cases
        embeddings = embed_fn([d['input'] for d in training_data])
        km = KMeans(n_clusters=n_examples, n_init=10, random_state=0).fit(embeddings)
        nearest = [
            int(np.linalg.norm(embeddings - center, axis=1).argmin())
            for center in km.cluster_centers_
        ]
        return [training_data[i] for i in nearest]
    elif strategy == 'hard':
        # Select the cases the current model scores worst on
        return sorted(training_data, key=score_fn)[:n_examples]

Retrieval-Augmented Generation (RAG)

RAG is the most powerful alternative to fine-tuning for the specific problem of domain knowledge injection. The core insight is that you don't need to bake knowledge into model weights — you can retrieve it at inference time and inject it into the context.

This has several advantages over fine-tuning that compound over time. Your knowledge base can be updated without any retraining — add a document to your vector store and the model can access it on the next inference call. Retrieval is auditable — you can log exactly what context was retrieved and why, which is invaluable for debugging and for compliance in regulated industries. And the model's base reasoning and instruction-following capabilities are preserved, because you're not modifying its weights at all.

The failure modes of RAG are different from fine-tuning: retrieval quality, chunk sizing, embedding model choice, and context integration all require careful engineering. But they're engineering problems, not ML training problems — which means they're faster to iterate on and don't require GPU compute to fix.

# Minimal RAG pipeline skeleton
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

class RAGPipeline:
    def __init__(self, embedding_model: str, documents: list[str]):
        self.embedder = SentenceTransformer(embedding_model)
        self.documents = documents
        # Normalize so inner-product search behaves as cosine similarity
        embeddings = self.embedder.encode(documents, normalize_embeddings=True)
        self.index = faiss.IndexFlatIP(embeddings.shape[1])
        self.index.add(embeddings.astype(np.float32))

    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        q_emb = self.embedder.encode([query], normalize_embeddings=True).astype(np.float32)
        scores, indices = self.index.search(q_emb, top_k)
        return [self.documents[i] for i in indices[0]]

    def generate(self, query: str, llm_fn) -> str:
        context = '\n\n'.join(self.retrieve(query))
        prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
        return llm_fn(prompt)

For knowledge-intensive tasks — answering questions about internal documentation, generating content that references proprietary information, reasoning over company-specific data — RAG almost always outperforms fine-tuning at a fraction of the cost and with dramatically better maintainability.
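Of the engineering knobs mentioned above, chunk sizing is worth showing because it's so often left at a library default. A simple overlapping character-window chunker; the 800/100 defaults are illustrative, not recommendations, and you should tune them against your own corpus and retrieval metrics:

```python
# Overlapping character-window chunker. The chunk_size/overlap defaults
# are illustrative assumptions; production chunking usually also respects
# sentence or section boundaries.

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The point of the overlap is that a fact straddling a chunk boundary still appears whole in at least one chunk, which is exactly the kind of fix that takes an afternoon in RAG and a retraining cycle in a fine-tune.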

Constrained Decoding

Constrained decoding is the least well-known of these alternatives and, in specific scenarios, the most powerful. The idea is to constrain the model's output distribution at inference time — not by modifying weights, but by modifying the logit distribution over the vocabulary during generation.

Libraries like outlines and guidance make this practical. If you need guaranteed JSON output that conforms to a specific schema, constrained decoding will give you that guarantee with zero training. If you need the model to always choose from a finite set of labels, constrained decoding enforces that at the token level, eliminating an entire class of output format errors.

# Using outlines for guaranteed structured output
import json
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Define the schema the output MUST conform to
schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["purchase", "refund", "support", "inquiry"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "requires_escalation": {"type": "boolean"}
    },
    "required": ["intent", "confidence", "requires_escalation"]
}

# outlines takes the schema as a JSON string (or a Pydantic model)
generator = outlines.generate.json(model, json.dumps(schema))
result = generator("Classify this customer message: 'I was charged twice.'")
# e.g. {"intent": "refund", "confidence": 0.94, "requires_escalation": false}
# Guaranteed valid JSON conforming to the schema, every time, no
# post-processing needed - the field values themselves are model-dependent

Teams often reach for fine-tuning when they're trying to make model outputs more reliable and consistent. Constrained decoding solves that problem more directly, more cheaply, and with stronger guarantees.

When Fine-Tuning IS the Right Answer

After all of that skepticism, I want to be precise about the cases where fine-tuning is genuinely the right tool.

Embedding Model Fine-Tuning

I mentioned this above — it's the honest success in my list of experiments. Embedding models are different from generative models in a way that makes fine-tuning significantly more tractable. The task is simpler (learn a distance metric, not generate text), the data requirements are lower (contrastive pairs rather than full instruction-output examples), and the model size is smaller (training is cheaper). If you have a specific retrieval or semantic similarity task and off-the-shelf embeddings are underperforming, fine-tuning an embedding model is often worth the investment.

Extreme Latency or Cost Constraints

If you need to run inference at very high volume with strict latency requirements that frontier API models can't meet — and if a smaller, fine-tuned open-source model can meet those requirements — fine-tuning a smaller model is a legitimate path. The trade-off is real and the economics can work. The key question is whether you've genuinely exhausted the latency optimization options on the API side (caching, streaming, batching, quantized models) before committing to the fine-tuning path.

Behavioral Modification That Prompt Engineering Cannot Achieve

There are behaviors that are deeply baked into model weights through RLHF and constitutional AI training that prompting genuinely cannot reliably override. Consistently producing extremely terse output, always responding in a specific structured format without explicit instruction, systematically eliding certain types of hedging language — these are examples of behavioral modifications that are difficult to achieve through prompting alone and where fine-tuning has a legitimate role.

Air-Gapped or On-Premise Deployment

If you're operating in an environment where sending data to an external API is not possible — defense, certain healthcare contexts, high-security financial applications — you're already committed to running your own model. In that context, fine-tuning a local model isn't an alternative you're choosing over an API; it's part of the baseline requirement.

A Decision Framework

Before your team opens a fine-tuning job, work through these questions in order. Stop when you find a solution:

  1. Can a well-crafted system prompt plus few-shot examples achieve acceptable performance? Build an evaluation set. Spend a week on prompt engineering with rigorous measurement. If you hit your target, you're done.
  2. Is the core problem a knowledge problem? If yes, build a RAG pipeline before considering fine-tuning. Knowledge-intensive tasks almost always benefit more from retrieval than from baking facts into weights.
  3. Is the problem output format or reliability? If yes, evaluate constrained decoding before fine-tuning. Grammar-constrained generation with outlines or guidance will solve most format consistency problems without any training.
  4. Do you have more than 10,000 high-quality, labeled examples ready to use? If no, gather your data first. Starting a fine-tuning effort before the data is ready is the most common and most expensive mistake in this category.
  5. Have you budgeted for retraining? Not just the initial run — the quarterly retraining cycle that maintenance will require. If the honest answer is no, you're underestimating the total cost of ownership.
  6. Is embedding model fine-tuning the actual need? If your problem is retrieval quality rather than generation quality, fine-tuning an embedding model is significantly more tractable than fine-tuning a generative model.

If you've worked through all six questions and the answers still point to generative model fine-tuning, then you're in one of the legitimate cases. Go build it carefully, with evaluation infrastructure in place from day one and a clear retraining cadence defined before you launch.

What This Means for How You Should Think About LLM Strategy

The deeper lesson from this experience is about the epistemics of choosing ML approaches. There's a pattern I've observed across multiple organizations: teams decide to fine-tune before they've rigorously tested alternatives, because fine-tuning feels like the "serious" thing to do. Prompt engineering sounds too simple. RAG sounds like a workaround. Fine-tuning a model sounds like real ML work.

That intuition is wrong and it's costing people real money. The goal is not to do impressive-sounding ML work. The goal is to build systems that perform reliably in production at acceptable cost. The approach that achieves that goal with the least complexity and the fastest iteration cycle is the correct approach, regardless of how it sounds in a technical presentation.

The teams I've seen build the best LLM-powered products share a common trait: they are radically pragmatic about approach selection. They start with the simplest thing that could work. They measure rigorously. They upgrade to more sophisticated approaches only when the evidence demands it. Fine-tuning, when they use it, is the result of a process of elimination — not a first instinct.

Forty thousand dollars and eight experiments to learn this lesson is probably about twenty experiments fewer than it took some of my peers. I hope reading this saves you from needing to run any of them at all.