Few-shot Learning Failures: 3 LLM Testing Pitfalls

64% accuracy at 4 examples. Back down to 33% at 8.

That’s what happened when a Reddit poster tested Gemini 3 Flash on route optimization. The poster ran 8 LLMs across 4 task types at 0, 1, 2, 4, and 8 shots and found three failure patterns that challenge the assumption almost every prompt engineer makes: more few-shot examples equals better performance.

These aren’t edge cases. They’re systematic failure modes. And if you’re not testing for them, you’re shipping prompts blind.

The three failure patterns

1. Peak regression: the model learns, then unlearns

Performance climbs, peaks, then collapses back to baseline. Gemini 3 Flash on route optimization: 33% at 0-shot, 64% at 4-shot, 33% at 8-shot. If you benchmarked only at the endpoints, you’d conclude that examples don’t help at all. That’s the wrong takeaway. The real finding is that 4 examples is the sweet spot for that model-task pair among the counts tested; testing only at 0 and 8 makes the peak invisible. A plausible mechanism: beyond a certain count, examples stop demonstrating a pattern and start constraining the model’s output space too tightly, pushing it to fit the examples rather than solve the problem.
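The check described above can be sketched in a few lines. This is a minimal illustration, not the poster’s tool: the function name `find_peak_regression` and the 10-point thresholds are assumptions made for the example, and only the three accuracy figures come from the post.

```python
def find_peak_regression(curve, min_gain=0.10, min_drop=0.10):
    """Flag a learn-then-unlearn accuracy curve.

    curve: dict mapping shot count -> accuracy in [0, 1].
    min_gain / min_drop: illustrative thresholds (assumed, not from the post).
    Returns a dict describing the peak, or None if no regression is found.
    """
    shots = sorted(curve)
    first, last = shots[0], shots[-1]
    peak_shot = max(shots, key=lambda s: curve[s])
    gain = curve[peak_shot] - curve[first]
    drop = curve[peak_shot] - curve[last]
    # A regression needs an interior peak that clearly beats both endpoints.
    if first < peak_shot < last and gain >= min_gain and drop >= min_drop:
        return {"peak_shot": peak_shot, "gain": round(gain, 2), "drop": round(drop, 2)}
    return None


# Figures quoted in the post for Gemini 3 Flash on route optimization.
gemini_3_flash = {0: 0.33, 4: 0.64, 8: 0.33}
print(find_peak_regression(gemini_3_flash))

# With only the endpoints measured, the same model looks flat.
print(find_peak_regression({0: 0.33, 8: 0.33}))
```

The second call returns `None`, which is the point: the regression is only detectable when an intermediate shot count is in the data.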

2. Model ranking reversal: your “best” model depends on your prompt design

Zero-shot benchmarks don’t predict few-shot performance. Gemini 2.5 Flash scored 20% at 0-shot and 80% at 8-shot. Gemini 3 Pro stayed flat at 60% across all shot counts. If you chose your model based on standard benchmarks, you may have picked the wrong one for your actual deployment conditions. Rankings flip when you add examples, and most teams never check. That flat 60% from Gemini 3 Pro also means something: consistent but capped. Depending on your task, a model with a higher ceiling at the right shot count beats a model that’s merely stable.
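One way to check for reversals is to compare every pair of models at every pair of shot counts and report where the winner changes. The sketch below assumes a simple nested-dict result format and a hypothetical helper name, `ranking_reversals`; the accuracy figures are the ones quoted above, restricted to the two shot counts the post reports for both models.

```python
from itertools import combinations


def ranking_reversals(results):
    """Find model pairs whose ordering flips between two shot counts.

    results: dict model name -> dict shot count -> accuracy.
    Returns a list of (shot_lo, winner_at_lo, shot_hi, winner_at_hi) tuples.
    """
    reversals = []
    for a, b in combinations(sorted(results), 2):
        # Only compare shot counts measured for both models.
        shared = sorted(set(results[a]) & set(results[b]))
        for lo, hi in combinations(shared, 2):
            diff_lo = results[a][lo] - results[b][lo]
            diff_hi = results[a][hi] - results[b][hi]
            # Opposite signs mean the leader changed; ties are ignored.
            if diff_lo * diff_hi < 0:
                winner_lo = a if diff_lo > 0 else b
                winner_hi = a if diff_hi > 0 else b
                reversals.append((lo, winner_lo, hi, winner_hi))
    return reversals


results = {
    "Gemini 2.5 Flash": {0: 0.20, 8: 0.80},
    "Gemini 3 Pro": {0: 0.60, 8: 0.60},
}
for lo, winner_lo, hi, winner_hi in ranking_reversals(results):
    print(f"{winner_lo} leads at {lo}-shot, {winner_hi} leads at {hi}-shot")
```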

3. Example selection collapse: “better” examples can make things worse

The poster compared hand-picked examples against TF-IDF-selected ones (automatically choosing, for each test case, the examples with the highest TF-IDF similarity, which measures term overlap rather than meaning). On route optimization, TF-IDF selection dropped GPT-OSS 120B from 50%+ to 35%. The method designed to find better examples made the model worse. High similarity between your examples and test cases can over-anchor the model and hurt its ability to generalize. More relevant-looking examples are not the same as more useful examples.
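To make the selection step concrete, here is a minimal pure-Python sketch of TF-IDF example selection. It is not the poster’s implementation: the whitespace tokenizer, the smoothed IDF formula, the function names, and the example pool are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so a term present in every document keeps a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_examples(pool, query, k):
    """Pick the k pool entries most similar to the query under TF-IDF cosine."""
    vectors = tfidf_vectors(pool + [query])
    query_vec = vectors[-1]
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(vectors[i], query_vec),
                    reverse=True)
    return [pool[i] for i in ranked[:k]]


# Hypothetical example pool, invented for this sketch.
pool = [
    "shortest route from depot to five stores",
    "shortest route from depot to three stores",
    "schedule two trucks with time windows",
    "cheapest route avoiding toll roads",
]
print(select_examples(pool, "shortest route from depot to eight stores", 2))
```

With this toy pool, the two selected examples are near-duplicates of each other and of the query. That is the over-anchoring risk in miniature: the selector maximizes surface overlap and never rewards variety.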

3 practical applications

📌 Test at intermediate shot counts, not just 0 and max. Run evaluations at 0, 2, 4, and 8 shots at minimum. The failure lives in the middle, not at the endpoints. If you skip intermediate counts, you miss the peak entirely and misread the performance curve. For tasks with complex structure (routing, multi-step reasoning, classification with edge cases), add a 6-shot checkpoint too.

📌 Benchmark models at the shot count you’ll actually deploy. If your production prompt uses 5 examples, evaluate candidate models at 5 examples. Zero-shot benchmarks don’t reliably predict few-shot behavior. Pick the model that performs best in the conditions you’re actually running it in.

📌 Validate automated example selection against hand-picked baselines first. If you’re using retrieval-augmented few-shot (dynamic example injection by similarity), compare it to a manually curated set before shipping. High similarity does not guarantee better output. Verify this assumption or it will break in production.
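The three practices above fit into one small evaluation loop. The sketch below is an assumption-heavy illustration: `model` is any callable that takes a prompt string and returns a string, `selector` is a pluggable function so a hand-picked baseline and an automated method can be run through the same sweep, and `fake_model` is a toy stand-in, not a real LLM call.

```python
def build_prompt(examples, question):
    """Format (question, answer) pairs followed by the open test question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def first_k(examples, question, k):
    """Baseline selector: the first k hand-ordered examples."""
    return examples[:k]


def sweep_shot_counts(model, examples, test_cases,
                      shot_counts=(0, 2, 4, 8), selector=first_k):
    """Return {shot_count: accuracy} using exact-match scoring."""
    curve = {}
    for k in shot_counts:
        if k > len(examples):
            continue  # not enough examples to fill this shot count
        correct = 0
        for question, expected in test_cases:
            prompt = build_prompt(selector(examples, question, k), question)
            if model(prompt).strip() == expected:
                correct += 1
        curve[k] = correct / len(test_cases)
    return curve


def fake_model(prompt):
    """Toy stand-in for an LLM: only answers correctly with 2 to 4 examples."""
    n_examples = prompt.count("Q: ") - 1
    return "4" if 2 <= n_examples <= 4 else "5"


examples = [(f"{i}+{i}", str(2 * i)) for i in range(1, 9)]
test_cases = [("2+2", "4")]
print(sweep_shot_counts(fake_model, examples, test_cases))
```

To compare selection methods, run the sweep twice with different `selector` arguments and the same `model`, then compare the two curves at the shot count you plan to deploy.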

Tips and pitfalls

This isn’t one researcher’s isolated finding. Tang et al. (2025) documented “over-prompting”, the pattern where performance peaks then declines as context grows. Chroma Research (2025) described “context rot”: adding more tokens can actively degrade output quality. The direction of the research is consistent.

The trap most teams fall into: optimize once, assume it holds, never retest. But these curves are model-specific and task-specific. There’s no universal sweet spot. The only way through is measurement.

One useful nuance from the community discussion: TF-IDF collapse happens because high similarity anchors the model too tightly to the examples instead of letting it reason from them. It’s a subtle but important difference between examples that demonstrate a pattern and examples that over-specify a case. A good hand-picked example set prioritizes diversity of pattern over similarity of surface form.

The original poster built an open-source tool (MIT license) to detect these patterns automatically. It tracks learning curves, flags collapse points, and compares example selection methods side by side. If you’re running any kind of systematic prompt evaluation, it’s worth adding to your workflow. You can find the link in the original Reddit thread.

The bottom line

Testing at 0-shot and max-shot tells you almost nothing about how your prompt will actually perform. The failure hides in between those two points, and it’s invisible until it’s live.

If you’re deploying prompts with few-shot examples and haven’t tested intermediate counts, you don’t know what you’re actually shipping. That’s a fixable gap, and now you know exactly where to look.

The full discussion is in r/PromptEngineering. If you’ve hit peak regression or example collapse on a specific model or task, drop it in the thread. The dataset is still small and more real-world cases would help the community.

Frequently Asked Questions

Q: How do you find the optimal shot count for your model-task pair?

There’s no universal answer; you have to test. Start at 0-shot, 2-shot, 4-shot, and 8-shot to find your peak before regression sets in. Track performance across these shot counts; you might discover your model hits peak accuracy at 2 examples but tanks at 8. The sweet spot is model-specific and task-specific, so systematic testing is your only reliable method.

Q: How can you tell if your model is pattern-matching your examples instead of actually reasoning?

Try this: if all your examples feel too clean and representative (same entity types, similar phrasing), you’re probably format-training the model. Add one deliberately messy or edge-case example and re-test. If performance drops significantly, the model was pattern-matching. If it stays the same or improves, the model is likely reasoning through the actual task.

Q: Why does automated example selection (like TF-IDF) sometimes perform worse than hand-picked examples?

High similarity makes examples look helpful, but it anchors the model to a very narrow pattern. When your real query diverges even slightly, the model fails. Hand-picked examples with intentional variety across edge cases and structures teach broader reasoning instead of narrow format-matching.

Q: Should I pick my model based on zero-shot benchmarks alone?

No. Model rankings can flip when examples are added. One model scored 20% at 0-shot but 80% at 8-shot, while another stayed flat at 60%. The “best” model for your task depends on how many examples you’ll use in production. Always benchmark your actual model-prompt combination, not just zero-shot performance.

Q: What happens if you add too many examples?

You risk “peak regression”: performance climbs with more examples, then collapses past a certain point. One case showed 64% (4-shot) → 33% (8-shot). If you only test the extremes, you’ll miss the sweet spot and wrongly conclude examples don’t help.

Adding few-shot examples can silently break your prompts. Here’s how to detect it before production.
by u/Rough-Heart-7623 in PromptEngineering