SkillForge: Evaluate & Improve Your Prompt Engineering Skills

Coding skills have always had a feedback loop. LeetCode, contests, rankings, difficulty tiers. You grind through two-sum problems, then move to trees, then graphs, then dynamic programming. You know exactly where you are in the hierarchy because the hierarchy is explicit. There are timed challenges, leaderboards, and thousands of problems organized by difficulty and topic. You can measure yourself against millions of other developers. The loop is tight: attempt, fail, understand why, retry, improve.

Prompting has… vibes.

You paste something in, it works or it doesn’t, you tweak a word, it works better, you share a screenshot on Twitter, people say “nice,” and you have no idea why it worked or whether you could reproduce it deliberately. There’s no rubric. No ranking. No curriculum. Just accumulated intuition that doesn’t transfer cleanly from one model to another, one use case to another, one team to another.

That’s the gap a builder is trying to close with SkillForge, an experimental platform that evaluates your prompt engineering approach across real challenges. Not just whether the output looks right, but how you got there: structure, reasoning, constraint handling, workflow thinking, communication clarity. The evaluation isn’t “did the AI say something helpful.” It’s “did you architect the interaction well.” That distinction matters more than most people realize, especially as the tasks you’re delegating to AI get more complex and the stakes around reliability go up.

The twist: The builder isn’t even sure “AI skill” is measurable in any meaningful way. They shipped it anyway to find out. That kind of honest uncertainty is exactly what makes this worth watching. Most platforms in this space sell you confidence. SkillForge sells you a question. And right now, the question is more valuable than the confidence, because anyone claiming they’ve solved prompt skill measurement is almost certainly oversimplifying it.

Here’s what SkillForge puts you through:

🎯 Pick a real-world challenge: force strict JSON outputs, reduce hallucinations, design multi-step workflows, defend against prompt injection. These aren’t toy problems. They’re the categories of failures that actually break production AI features and the ones most people never practice until something ships broken.
📝 Submit your approach and watch what gets scored: structure, reasoning, and communication clarity, not just your final output. You can’t paste a one-liner and hope for the best. You have to show your thinking, which is where most people realize they haven’t developed thinking yet, just outputs.
🔍 See where you actually stand through criteria-based feedback, not “regenerate and hope.” Vague feedback (“your prompt was unclear”) is useless. Specific feedback (“your constraint definitions were ambiguous in edge cases”) tells you something you can actually fix.
Identify your weak spots. Knowing which dimension you’re weak in is worth ten hours of random practice. You stop working on what feels comfortable and start working on what actually needs attention.
Iterate with something to measure against. This is the part most AI learning is missing. You can compare version A of your prompt against version B with actual criteria, not just a gut feeling about which output looks better.

Pro tip: The most underrated skill on platforms like this will be prompt injection defense. Most people practice generating good outputs. Almost nobody practices building prompts that resist adversarial inputs. That gap is real and it matters the moment you ship anything to production. Think about it this way: if your AI feature can be manipulated by a user typing “ignore previous instructions” into a form field, everything else you built well is irrelevant. The output quality doesn’t matter if the prompt can be hijacked. Start treating injection defense the way developers treat input sanitization: not as an advanced topic, but as a baseline you ship with every time.

Beyond injection, the other underrated area is multi-step workflow design. Single prompts are the easy part. Chaining prompts together in a way that accumulates context without losing reliability, where the output of step two actually uses what step one produced correctly, without hallucinating connections that don’t exist, that’s where the real skill gap lives. Most prompt engineers don’t discover this gap until they’re already in production and something has already gone wrong. SkillForge’s challenge set directly targets this, which is why it’s worth taking seriously even in its early state.

It’s early. The builder is actively asking what should be measured and what would make evaluations feel credible rather than arbitrary. If you’re serious about prompt engineering, that conversation is worth joining. The people who show up early to shape how a skill gets defined tend to understand it better than anyone who shows up later to consume the polished version. Right now the platform is small enough that your feedback actually lands.

Test your skills at SkillForge and tell them what’s missing. 🚀

I built an experimental platform to measure prompt engineering skills
by u/Sudden-Assistant-36 in PromptEngineering

Related: