Extracting entities from raw text without formatting errors gets much easier once you give AI a strict schema to follow. A poster in r/PromptEngineering figured this out and shared the exact template that makes it work.
The author’s core idea: rigid JSON constraints put AI into what they call “compliance mode.” Instead of interpreting what you want, it just follows rules. For anything machine-readable downstream, that’s exactly what you need.
Why This Works
Open-ended extraction requests give AI too much freedom. It might return bullet points one time, a paragraph the next, and a half-JSON half-English hybrid the time after. All readable. None parseable.
The reason this happens is that language models are trained to be helpful, not consistent. When you ask “extract the entities from this text,” the model tries to guess what helpful looks like. Sometimes that’s a numbered list. Sometimes it adds a preamble like “Sure! Here are the entities I found…” Sometimes it uses a JSON-ish format but quotes strings inconsistently or adds trailing commas that break your parser. Every variation is technically a reasonable response to a vague request.
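To make the failure modes concrete, here is a minimal sketch that runs a few response shapes through Python's standard json parser. The response strings are hand-written examples, not real model output, and the helper name is_parseable is just an illustration.

```python
import json

# Hypothetical responses, written by hand to mirror the variations described above.
responses = {
    "preamble": 'Sure! Here are the entities I found: {"entity_name": "Acme"}',
    "trailing_comma": '{"entity_name": "Acme", "category": "organization",}',
    "single_quotes": "{'entity_name': 'Acme'}",
    "strict_json": '{"entity_name": "Acme", "category": "organization", "importance_score": 7}',
}


def is_parseable(text: str) -> bool:
    """Return True only if the whole string is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


results = {name: is_parseable(text) for name, text in responses.items()}
for name, ok in results.items():
    print(f"{name}: {'parses' if ok else 'breaks the parser'}")
```

Only the last response survives json.loads. The other three are easy for a person to read, and all three stop a pipeline.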
The schema constraint approach removes that ambiguity layer by layer. Here’s what each part of the prompt does:
- “Your output must be in a valid JSON format” locks the structure
- “Follow this schema exactly” removes interpretation
- “If a field is missing, use ‘null’” handles edge cases before they become bugs
- “Do not include any conversational text” strips filler that breaks parsers
Each rule closes one more escape hatch, leaving the model little room to drift from the format you defined. The null fallback is particularly important: without it, the model might skip missing fields entirely, return an empty string, or invent a placeholder value. Each of those breaks downstream code in a different way. Specifying null gives you one consistent, checkable signal for “this data wasn’t there.”
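Here is a rough sketch of why a single null signal is easier to check than the alternatives. The sample record and the field_status helper are made up for illustration. Treating the quoted string "null" the same as a real JSON null is an assumption on my part: the prompt writes null in quotes, so a model could plausibly return it either way.

```python
import json

# Hand-written sample output; in JSON the fallback is the bare literal null.
raw = '{"entity_name": "Acme Corp", "category": null, "importance_score": 8}'
record = json.loads(raw)


def field_status(record: dict, field: str) -> str:
    """Classify a field as 'absent', 'null', 'empty', or 'present'."""
    if field not in record:
        return "absent"  # the model skipped the key entirely
    value = record[field]
    # Assumption: the quoted string "null" means the same thing as JSON null.
    if value is None or value == "null":
        return "null"  # the agreed signal for "this data wasn't there"
    if value == "":
        return "empty"  # ambiguous: missing data, or a real empty value?
    return "present"


statuses = {
    field: field_status(record, field)
    for field in ("entity_name", "category", "importance_score")
}
print(statuses)
```

With the null rule in place, downstream code only needs the one "null" branch. Without it, all four branches can show up in the same batch.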
Think of it like a form with required fields versus a blank text box. People filling out a form give you structured, comparable data. People given a blank box give you everything from three words to three paragraphs. Same question, wildly different outputs depending on the container you put it in.
🗂️ Prompt of the Day
Here is the exact prompt the original poster shared. Copy it as-is, replace the placeholder, and run it:
Extract the entities from the following text: [Insert Text]. Your output must be in a valid JSON format. Follow this schema exactly: {"entity_name": "string", "category": "string", "importance_score": 1-10}. If a field is missing, use 'null'. Do not include any conversational text.
Swap [Insert Text] with your source material. The three-field schema is intentionally minimal, which is part of why it works well as a baseline. Each field earns its place: entity_name is the extracted thing, category is its type, and importance_score gives you a signal for prioritization without requiring a separate scoring pass. You can extend the schema later, but starting here lets you validate that your pipeline handles the output correctly before you add complexity.
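One way to validate that your pipeline handles the output correctly is a small checker for the three fields. This is a sketch under a few assumptions that the original post does not spell out: importance_score is read as an integer from 1 to 10, null is accepted for any field, and extra keys count as a problem. The function name validate_entity is hypothetical.

```python
import json

EXPECTED_FIELDS = {"entity_name", "category", "importance_score"}


def validate_entity(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["expected a JSON object"]

    problems = []
    missing = EXPECTED_FIELDS - set(record)
    extra = set(record) - EXPECTED_FIELDS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")

    for field in ("entity_name", "category"):
        value = record.get(field)
        if value is not None and not isinstance(value, str):
            problems.append(f"{field} should be a string or null")

    score = record.get("importance_score")
    if score is not None:
        # bool is a subclass of int in Python, so rule it out explicitly.
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 1 <= score <= 10:
            problems.append("importance_score should be an integer from 1 to 10 or null")
    return problems


good = '{"entity_name": "Acme", "category": "organization", "importance_score": 7}'
bad = '{"entity_name": "Acme", "category": "organization", "importance_score": 11}'
print(validate_entity(good))
print(validate_entity(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to log everything that went wrong with a response before deciding whether to retry.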
If you are using this in code, consider wrapping the prompt in a function that accepts the schema as a parameter. That way you can swap schemas across different extraction tasks without rewriting the core instruction logic each time.
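A minimal sketch of that wrapper might look like this. The instruction text comes from the prompt above; the function name build_extraction_prompt and the second schema (SKILLS_SCHEMA) are hypothetical examples. The schema is passed as a plain string so the wording reaches the model unchanged.

```python
# The three-field schema from the original prompt, kept as a plain string.
BASE_SCHEMA = '{"entity_name": "string", "category": "string", "importance_score": 1-10}'


def build_extraction_prompt(text: str, schema: str = BASE_SCHEMA) -> str:
    """Fill the fixed instruction template with a source text and a schema."""
    return (
        f"Extract the entities from the following text: {text}. "
        "Your output must be in a valid JSON format. "
        f"Follow this schema exactly: {schema}. "
        "If a field is missing, use 'null'. "
        "Do not include any conversational text."
    )


# Hypothetical second schema for a different extraction task.
SKILLS_SCHEMA = '{"skill": "string", "years_required": "number"}'

default_prompt = build_extraction_prompt("Acme Corp opened an office in Lisbon")
skills_prompt = build_extraction_prompt(
    "We need 3+ years of Python experience", SKILLS_SCHEMA
)
print(skills_prompt)
```

The instruction sentences live in one place, so a wording change applies to every extraction task at once.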
Two Variations Worth Testing
The base prompt handles single entities cleanly. Two adjustments make it more powerful:
- Multiple entities: Wrap the schema in an array and add “return all entities found.” Output becomes a list instead of one object. This is the version you want for any text longer than a sentence, since most real content contains multiple entities. When you switch to array output, also add an instruction like “return an empty array if no entities are found” to handle edge cases where the source text has nothing worth extracting.
- Controlled vocabulary: Add “category must be one of: person, organization, location, product” to prevent the AI from inventing category names. Consistent strings make filtering downstream much easier. Without this, you end up with variations like “company,” “corporation,” “business,” and “firm” all meaning the same thing, which forces you to normalize before you can group or filter. A fixed list eliminates that problem at the source.
You can combine both variations in one prompt. The schema stays the same. You just add the array wrapper and the category constraint as additional instructions after the schema definition.
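Here is one possible way to combine the two variations, plus a check that flags categories outside the fixed list. The exact wording and ordering of the added instructions is my own guess at a reasonable phrasing, and the names build_multi_entity_prompt and off_vocabulary are hypothetical. The check assumes every array item is a JSON object.

```python
import json

ALLOWED_CATEGORIES = ("person", "organization", "location", "product")


def build_multi_entity_prompt(text: str) -> str:
    """Base prompt with the array wrapper and the category constraint added."""
    allowed = ", ".join(ALLOWED_CATEGORIES)
    return (
        f"Extract the entities from the following text: {text}. "
        "Your output must be in a valid JSON format. "
        'Follow this schema exactly: [{"entity_name": "string", '
        '"category": "string", "importance_score": 1-10}]. '
        "Return all entities found. "
        "Return an empty array if no entities are found. "
        f"category must be one of: {allowed}. "
        "If a field is missing, use 'null'. "
        "Do not include any conversational text."
    )


def off_vocabulary(raw: str) -> list[str]:
    """Return category values that fall outside the allowed list."""
    entities = json.loads(raw)
    if not isinstance(entities, list):
        raise ValueError("expected a JSON array")
    return [
        entity["category"]
        for entity in entities
        if entity.get("category") is not None
        and entity["category"] not in ALLOWED_CATEGORIES
    ]


# Hand-written sample response containing one invented category name.
sample = (
    '[{"entity_name": "Acme", "category": "company", "importance_score": 6}, '
    '{"entity_name": "Paris", "category": "location", "importance_score": 4}]'
)
print(off_vocabulary(sample))
```

Even with the constraint in the prompt, a check like this is cheap insurance: it tells you when a response slipped outside the list instead of letting a stray label reach your filters.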
Use Cases
This pattern is useful anywhere the output feeds into something that isn’t a human reader:
- Pulling product names and attributes from customer reviews, then loading them directly into a database or analytics tool without a manual cleanup step
- Extracting entities from articles for knowledge graphs or tagging pipelines, where consistent category labels determine how content gets surfaced
- Parsing job descriptions for structured skills data so you can compare requirements across hundreds of postings without reading each one
- Turning unstructured research notes into a consistent database format, which makes it possible to query across notes rather than just read them
If a human reads it and decides what happens next, you can be looser with format. If code reads it, you need schema constraints. That distinction is worth keeping in mind whenever you are designing a prompt. The audience shapes the format requirement more than the content does.
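As a rough illustration of the database use case, the sketch below loads a response straight into SQLite with no cleanup step in between. The response string, the table layout, and the product names are all invented for the example. It assumes the response has already passed whatever validation you run.

```python
import json
import sqlite3

# Hand-written sample response in the array format, with one null score.
raw = (
    '[{"entity_name": "TrailRunner 3", "category": "product", "importance_score": 9}, '
    '{"entity_name": "Acme Corp", "category": "organization", "importance_score": null}]'
)
entities = json.loads(raw)

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE entities (entity_name TEXT, category TEXT, importance_score INTEGER)"
)
# Named placeholders map each dict key to a column; JSON null becomes SQL NULL.
conn.executemany(
    "INSERT INTO entities VALUES (:entity_name, :category, :importance_score)",
    entities,
)
conn.commit()

products = conn.execute(
    "SELECT entity_name FROM entities WHERE category = ?", ("product",)
).fetchall()
unscored = conn.execute(
    "SELECT COUNT(*) FROM entities WHERE importance_score IS NULL"
).fetchone()[0]
print(products, unscored)
```

Because the keys in the response match the column names exactly, the insert needs no mapping code. That only holds as long as the model sticks to the schema.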
The original discussion is live on r/PromptEngineering if you want to see the full context or share your own variations.
“The ‘Syntactic Sugar’ Auditor for API Efficiency,” by u/Glass-War-2768 in r/PromptEngineering