Optimize AI Prompts: Shorter, Structured Prompts Win for Coding

Leaked system prompts from five AI coding tools just got scored across four dimensions: clarity, specificity, structure, and robustness.

Replit scored 81.13 with roughly 2,000 tokens. v0 and Same.dev each run past 8,500 tokens. Both scored lower than Replit.

More tokens did not produce better prompts. That’s the whole story. Everything below is why it matters for yours.

The Breakdown

Replit’s prompt wins on structure (85) and clarity (83.5). It’s organized into clean tagged sections: <identity>, <capabilities>, <behavioral_rules>, <response_protocol>. Critical instructions are front-loaded. There’s a taxonomy of four action types with examples for each. No ambiguity. The model doesn’t have to infer what category an instruction belongs to because the prompt already decided that.

The others gave the model more words and fewer guardrails.

Bolt uses IMPORTANT 12 times and CRITICAL 8 times. Both words appear on security rules and on formatting guidelines. When everything is urgent, nothing is. A model processing that prompt has no reliable signal for which violations actually matter. It treats them all the same, which means it treats them all as slightly less important than they appear.

Lovable has a direct contradiction with no tiebreaker. One rule says default to discussion mode. Another says write code immediately on the first message. Two opposite instructions. No resolution logic. The model picks one and you don’t know which. That inconsistency is reflected in its score: 62.75 overall, the lowest in the dataset.

Same.dev tells the model to autonomously resolve the query and only terminate when the problem is solved. No stopping criterion for when the model can’t fully resolve the task. That’s a loop with no exit. In practice, it means the model either hallucinates a resolution or burns tokens chasing a dead end. Neither outcome is what the user wanted.

The robustness gap is the worst part. Every tool scored below 75 on robustness. Lovable hit 53.5. None of these prompts define what happens when a tool call fails, context is unavailable, or the user asks for something impossible. Replit came closest at 71, and even that leaves significant room. A robust prompt answers the question: what does the model do when it can’t do what you asked?
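One way to picture the missing exit is a harness-side loop with a step budget and an explicit fallback result. This is a minimal sketch, not something taken from any of these tools: `run_agent`, `AgentResult`, and the `step` callback are hypothetical names, and the prompt itself would still need to tell the model what to report when it is blocked.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentResult:
    status: str  # "done", "blocked", or "gave_up"
    detail: str


def run_agent(step: Callable[[int], Optional[str]], max_steps: int = 5) -> AgentResult:
    """Call `step` until it returns an answer, fails, or the budget runs out.

    `step` is a hypothetical stand-in for one model/tool round trip: it returns
    a final answer when the task is solved, None when more work is needed, and
    raises RuntimeError when a tool call fails.
    """
    for i in range(max_steps):
        try:
            answer = step(i)
        except RuntimeError as err:
            # Tool failure: surface it instead of retrying forever.
            return AgentResult("blocked", f"tool failed at step {i}: {err}")
        if answer is not None:
            return AgentResult("done", answer)
    # Budget exhausted: say so explicitly rather than inventing a resolution.
    return AgentResult("gave_up", f"no resolution after {max_steps} steps")
```

The point of the three statuses is that "could not finish" becomes a defined outcome the caller can handle, not something the model has to improvise.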

3 Things to Apply Right Now

🔹 Use tagged sections, not paragraphs. Replit’s structure score is 85 because every instruction belongs to exactly one labeled block. If your prompt is a wall of prose, split it: identity, constraints, output format, edge cases. Label each one. This isn’t just cosmetic. Labeled sections reduce ambiguity about which rules apply in which context. A model reading a tagged block knows exactly what kind of instruction it’s processing.

🔹 Write explicit tiebreakers. If two rules could ever fire at the same time, add a third rule that says which one wins. Lovable skipped this, and it scored lowest overall at 62.75. A simple approach: rank your constraint categories. Safety rules override style rules. Style rules override preference rules. One sentence of priority logic saves the model from guessing, and saves you from unpredictable output.

🔹 Define stopping conditions. Any prompt that tells an agent to keep going until a task is done also needs a definition of done when it can’t finish. Without it, you’re leaving that decision to the model. Add a line that specifies fallback behavior: what to surface, what to ask, what to return when the task can’t be completed as specified. That single addition would likely have improved Same.dev’s robustness score.
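The three points above can be sketched together as a small prompt builder. Everything here is illustrative: the section names, the rule text, and `build_prompt` are assumptions, not Replit's actual prompt. The sketch only shows the shape: one labeled block per instruction group, one explicit priority line, one fallback line.

```python
# Hypothetical section names and rule text, for illustration only.
SECTIONS = {
    "identity": "You are a coding assistant for a web IDE.",
    "constraints": "Never run destructive shell commands without confirmation.",
    "priority": "If rules conflict: safety rules override style rules; "
                "style rules override preference rules.",
    "output_format": "Reply with a short plan, then the code changes.",
    "fallback": "If the task cannot be completed as specified, stop, state what "
                "blocked you, and ask one clarifying question.",
}


def build_prompt(sections: dict[str, str]) -> str:
    """Wrap each instruction group in its own labeled tag, in insertion order."""
    blocks = [f"<{name}>\n{text}\n</{name}>" for name, text in sections.items()]
    return "\n\n".join(blocks)


prompt = build_prompt(SECTIONS)
print(prompt)
```

Keeping the sections in a dictionary makes it harder to add a rule without deciding which block it belongs to.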

Tips and Pitfalls

Pitfall: adding tokens instead of reorganizing. Same.dev and v0 have the longest prompts and mid-range scores. Length is not clarity. Before adding more instructions, ask whether restructuring what you already have would do more. In most cases, the information is already there. It just isn’t organized in a way the model can parse efficiently under real conditions.

Tip: audit your escalation words. Count how many times you’ve written IMPORTANT, CRITICAL, or MUST in your prompt. If it’s more than two or three, they’ve lost their weight. Reserve them for the instructions that actually need them. One CRITICAL that carries genuine force beats twelve that mean nothing in particular. If you have to emphasize everything, you’ve written a prompt with no hierarchy.
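The audit can be automated in a few lines. This is a minimal sketch under stated assumptions: the word list comes from the tip above, the sample text is made up, and only upper-case whole words are counted, so an ordinary lowercase "must" is ignored.

```python
import re
from collections import Counter

ESCALATION_WORDS = ("IMPORTANT", "CRITICAL", "MUST")


def count_escalations(prompt: str) -> Counter:
    """Count whole-word, upper-case occurrences of each escalation word."""
    pattern = r"\b(" + "|".join(ESCALATION_WORDS) + r")\b"
    return Counter(re.findall(pattern, prompt))


# Made-up sample text, not taken from any of the scored prompts.
sample = "IMPORTANT: use tabs. CRITICAL: never leak keys. IMPORTANT: be brief. You must test."
counts = count_escalations(sample)
print(counts)
```

If the total is more than two or three, that is the cue to decide which of those instructions actually deserve the emphasis.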

Pitfall: ignoring robustness. Every production prompt in this dataset failed here. The gap between Replit (71) and Lovable (53.5) on robustness is the largest dimension gap in the entire dataset. These are real products used by real teams at scale, and none of them fully defined failure modes. That should recalibrate your expectations about what “production-ready” actually means for a system prompt. Define what the model should do when it can’t complete the task, when a required input is missing, and when instructions conflict with user intent.

Tip: front-load the non-negotiables. Replit puts its absolute constraints early. Models tend to give more weight to instructions that appear early in the context. If a restriction matters, it goes first, not buried in paragraph seven after three paragraphs of context-setting. The non-negotiables earn the top of the document. Everything else follows.

Try It on Your Own Prompts

The scoring tool used here is PromptEval (prompt-eval.com). Free to use. The leaked prompt library is on GitHub at github.com/x1xhlol/system-prompts-and-models-of-ai-tools.

Run your current prompt through the scorer before your next edit. Don’t start with clarity or structure. Start with robustness. It’s the lowest score across every tool in this dataset, it’s the dimension most engineers skip, and it’s the one most likely to surface failures that only appear in production. Fix that number first.

Frequently Asked Questions

Q: How do you score a system prompt?

The scorer evaluates four things: clarity (can the model understand it?), specificity (are the rules concrete?), structure (is it organized well?), and robustness (does it handle edge cases?). Replit nailed this with clean tagged sections like <identity> and <capabilities>, plus critical rules front-loaded.

Q: Why does a shorter prompt actually score higher?

Because longer doesn’t mean clearer. Replit’s roughly 2,000-token prompt beats Same.dev’s 8,500+ tokens because the tighter budget forces tighter writing and leaves less room for contradictions or vague instructions. More tokens just give you more space to confuse the model.

Q: What mistakes should I avoid when writing system prompts?

Watch out for three things: contradictory instructions with no tiebreaker (like Lovable’s “DEFAULT TO DISCUSSION” vs. “write code first”), overusing words like “IMPORTANT” or “CRITICAL” (Bolt used these 20 times total), and unclear autonomy rules that could create loops. If rules conflict, be explicit about priority.

Q: How can I see the full scoring breakdown?

The author offers personalized subscores and recommendations via DM. The GitHub repo also has all five full prompts available for direct comparison.

I scored the leaked system prompts of 5 AI coding tools. Replit wins with the shortest prompt.
by u/noiteestrelada in PromptEngineering
