Optimize LLM Few-Shot Prompts: Reference vs. Inline

Repeating the same system prompt across 100 few-shot examples feels organized. Each example is self-contained. Clean, right? The model sees full context every time. You can read any one example in isolation and understand exactly what it’s supposed to do.

That’s the conventional move. And it quietly breaks down around batch 50.

A prompt engineer ran a controlled experiment across four LLMs, testing inline versus reference formatting at batch sizes of 3, 16, 50, and 100. The metric wasn’t output quality alone. It was index alignment: can the model correctly map each example to its own data when all examples share a repeated header block? The gap showed up exactly where you’d expect long-context issues to kick in. Not in the first half of the batch. In the back half, where repetition has already stacked up and the model is working harder just to keep its place.

Inline vs reference: what actually changes

The inline approach pastes the shared block into every example:

<example index="1">
<turn role="system">You are a helpful weather assistant. Be concise and accurate.</turn>
<turn role="user">What's the weather in Rome?</turn>
<turn role="assistant">18°C, light rain.</turn>
</example>
<example index="2">
<turn role="system">You are a helpful weather assistant. Be concise and accurate.</turn>
...

The reference approach declares the block once, then points to it everywhere else:

<shared id="sys">You are a helpful weather assistant. Be concise and accurate.</shared>

<example index="1">
<turn role="system" var="sys"/>
<turn role="user">What's the weather in Rome?</turn>
<turn role="assistant">18°C, light rain.</turn>
</example>
<example index="2">
<turn role="system" var="sys"/>
...

Same information, different structure. At 3 examples it barely matters. At 50 it starts to diverge in a very specific, measurable way. And the divergence isn’t random noise. It’s directional: inline gets worse as batch size grows, reference holds steady much longer.
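To make the structural difference concrete, here is a minimal Python sketch that renders the same examples both ways. The function names (render_inline, render_reference) and the (user, assistant) tuple shape are illustrative assumptions, not part of the original experiment; only the tag layout follows the snippets above.

```python
from xml.sax.saxutils import escape

# Shared block taken from the snippets above.
SYSTEM_PROMPT = "You are a helpful weather assistant. Be concise and accurate."


def render_inline(examples, system=SYSTEM_PROMPT):
    """Inline format: paste the shared system block into every example."""
    parts = []
    for i, (user, assistant) in enumerate(examples, start=1):
        parts.append(
            f'<example index="{i}">\n'
            f'<turn role="system">{escape(system)}</turn>\n'
            f'<turn role="user">{escape(user)}</turn>\n'
            f'<turn role="assistant">{escape(assistant)}</turn>\n'
            f'</example>'
        )
    return "\n".join(parts)


def render_reference(examples, system=SYSTEM_PROMPT, shared_id="sys"):
    """Reference format: declare the block once, then point to it by id."""
    parts = [f'<shared id="{shared_id}">{escape(system)}</shared>', ""]
    for i, (user, assistant) in enumerate(examples, start=1):
        parts.append(
            f'<example index="{i}">\n'
            f'<turn role="system" var="{shared_id}"/>\n'
            f'<turn role="user">{escape(user)}</turn>\n'
            f'<turn role="assistant">{escape(assistant)}</turn>\n'
            f'</example>'
        )
    return "\n".join(parts)


if __name__ == "__main__":
    batch = [("What's the weather in Rome?", "18°C, light rain.")] * 50
    # Character counts are only a rough proxy for token counts.
    print(len(render_inline(batch)), len(render_reference(batch)))
```

The inline output grows by one full copy of the shared block per example, while the reference output grows only by a short pointer tag.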

📊 What the data showed

The test used a sharp probe: each example had a unique random code embedded in its data but not in its visible text. One example’s output was corrupted (ALL CAPS). The model had to find the corrupted example by returning its code, which required accurate positional tracking. No text-search shortcuts available. The only path to the right answer was correctly mapping position to content across the entire batch.
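A probe along these lines might be built as in the sketch below. The post does not say exactly how the codes were attached, so this is one possible reading and an assumption throughout: each code lives in a separate index-keyed data table (a hypothetical <data> tag), so the model has to find the ALL CAPS example, note its index, and look up the matching code. The name build_probe is also hypothetical.

```python
import random


def build_probe(examples, seed=0):
    """Build one probe from a non-empty list of (user, assistant) pairs.

    Returns (probe_examples, data_table, expected_code).
    Assumption: codes sit in a separate table keyed by example index,
    not inside the example text itself.
    """
    rng = random.Random(seed)
    n = len(examples)

    # Draw n distinct 8-hex-digit codes; sort before shuffling so the
    # result depends only on the seed, not on set iteration order.
    unique = set()
    while len(unique) < n:
        unique.add(f"{rng.getrandbits(32):08x}")
    codes = sorted(unique)
    rng.shuffle(codes)

    # Corrupt exactly one assistant output by upper-casing it.
    # Note: this is only detectable if the output contains letters.
    corrupt_at = rng.randrange(n)
    probe_examples = [
        (user, assistant.upper() if i == corrupt_at else assistant)
        for i, (user, assistant) in enumerate(examples)
    ]

    # Hypothetical index-to-code table shown to the model next to the batch.
    data_table = "\n".join(
        f'<data index="{i}" code="{code}"/>'
        for i, code in enumerate(codes, start=1)
    )
    return probe_examples, data_table, codes[corrupt_at]
```

Because the code never appears next to the corrupted text, a model that loses track of which index it is looking at returns a valid-looking but wrong code.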

Index alignment accuracy by batch size:

Batch 3: reference 1.00 / inline 0.97
Batch 16: reference 1.00 / inline 0.97
Batch 50: reference 1.00 / inline 0.84
Batch 100: reference 0.91 / inline 0.88

Overall across all runs: reference 0.98 versus inline 0.91. On weaker models at batch 50, inline dropped to 0.75 while reference held at 1.00. The failures clustered at the end of large batches, and they weren’t refusals. They were confident wrong-index citations. The model picked the wrong example and committed to it. No hedging, no uncertainty signal, just a clean confident answer pointing at the wrong data. That’s the failure mode that actually hurts in production, because it’s invisible unless you’re checking outputs against ground truth.
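Since these failures only show up against ground truth, a small scorer helps. The sketch below is an assumption about how such a check could look, not the original harness: Trial and score are hypothetical names, and it splits accuracy by whether the corrupted example sat in the front or back half of its batch.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    batch_size: int
    corrupt_index: int    # 1-based position of the corrupted example
    expected_code: str    # ground-truth code for that example
    predicted_code: str   # code the model returned


def _accuracy(trials):
    """Fraction of exact code matches, or None for an empty group."""
    if not trials:
        return None
    hits = sum(t.predicted_code.strip() == t.expected_code for t in trials)
    return hits / len(trials)


def score(trials):
    """Overall accuracy plus a front-half / back-half breakdown."""
    front = [t for t in trials if t.corrupt_index <= t.batch_size / 2]
    back = [t for t in trials if t.corrupt_index > t.batch_size / 2]
    return {
        "overall": _accuracy(trials),
        "front_half": _accuracy(front),
        "back_half": _accuracy(back),
    }
```

A back_half score well below front_half would match the pattern described above, where confident wrong answers cluster at the end of large batches.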

The likely reason: inlining the same block into every example bloats each one. Past a certain batch size, the model’s sense of positional structure starts slipping. Referencing keeps each example lean, so the index stays easy to track all the way through.

Three steps to switch

1. Find any block that repeats across examples: a system prompt, instruction set, output schema, or shared constraint. If you can copy-paste it from example 1 into example 47 with zero edits, it belongs in a shared declaration.
2. Pull it out and declare it: <shared id="your_id">...</shared>. Put this at the top of your prompt, before the first example, so the model encounters the definition before any reference to it.
3. Replace every inline copy with a pointer: <turn role="system" var="your_id"/>. One find-and-replace pass in your template handles this for the whole batch.

That’s the full change. Smaller prompt, cleaner structure, better index tracking at scale. If you’re generating batch prompts programmatically, this is also a natural fit: declare shared blocks as constants in your code and render pointers in the loop instead of interpolating full strings.
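If you already have inline prompts lying around, the three steps can be sketched as a single pass over the text. This is a rough illustration under stated assumptions: system turns look exactly like <turn role="system">...</turn> as in the snippets above, and the helper name inline_to_reference is hypothetical.

```python
import re
from collections import Counter

# Matches an inline system turn in the tag layout used in this article.
SYSTEM_TURN = re.compile(r'<turn role="system">(.*?)</turn>', re.DOTALL)


def inline_to_reference(prompt, shared_id="sys"):
    """Hoist the most-repeated system block into one shared declaration."""
    # Step 1: find the block that repeats across examples.
    bodies = SYSTEM_TURN.findall(prompt)
    if not bodies:
        return prompt
    block, count = Counter(bodies).most_common(1)[0]
    if count < 2:
        return prompt  # nothing is repeated, so leave the prompt alone

    # Step 3: replace every identical inline copy with a pointer.
    pointer = f'<turn role="system" var="{shared_id}"/>'
    rewritten = prompt.replace(f'<turn role="system">{block}</turn>', pointer)

    # Step 2: declare the block once, before the first example.
    return f'<shared id="{shared_id}">{block}</shared>\n\n{rewritten}'
```

System turns whose text differs from the most common block are left inline, so a one-off example with its own instructions keeps them.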

When this matters most

Strong models like GPT-5.5 and Claude Opus 4.8 were near-perfect throughout, so for small batches with top-tier models the difference is marginal. The effect kicks in clearly when:

🔁 Your batch has 16 or more examples
You’re using faster or smaller models where long-context tracking degrades
Index accuracy is load-bearing, like batched evals or multi-example review pipelines
Token count is a real constraint and you need a clean way to compress

If you’re building eval harnesses, textual backpropagation setups, or any system where one LLM reviews batches of another LLM’s outputs, this is worth dropping in today. It also applies to any multi-turn few-shot setup where you’re feeding the same persona, formatting rules, or output schema into every example. One small format decision, zero quality tradeoff, and your model stops getting lost at the back half of a large batch.

If your prompt repeats the same text across many examples, reference it once instead of inlining — small experiment across 4 LLMs
by u/dmpiergiacomo in PromptEngineering
