AI Behavior Testing: ASSERT Framework by Microsoft

Microsoft just made it a lot easier to check whether your AI actually does what you built it to do. On Tuesday the company released ASSERT, an open source framework that turns plain-language descriptions of how an AI system should behave into thorough, scored tests, according to TechCrunch AI. The name stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing, and it targets a problem most off-the-shelf benchmarks ignore: whether your specific product behaves the way your specific rules demand.

Here’s the gap Microsoft is filling. The industry has gotten good at testing models for safety, alignment, sycophancy, and compliance. But a model that’s “safe” in general can still break your particular policies. ASSERT focuses on that application-specific layer, where context, tools, and business rules shape what “correct” even means.

How ASSERT works

The framework takes a developer’s description of expected behavior and runs it through a clear pipeline:

Translate the spec. It reads your plain-language goals and policies, then converts them into a structured set of acceptable and unacceptable behaviors.
Generate scenarios. From those rules, it builds problem scenarios and concrete test cases designed to probe the edges.
Run and score. It executes the tests against your target system and scores the results, so you get a number, not a vibe.
Trace the failures. It records the paths the AI took, including intermediate actions and tool calls, so you can inspect exactly where something went wrong.

Developers can also feed in system context, available tools, and constraints to sharpen what the evaluations cover.
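The four stages above can be pictured with a small Python sketch. This is not ASSERT's API, which the reporting does not describe. Every name here (Rule, Trace, toy_agent, evaluate) is hypothetical, the rules are hand-written rather than translated from plain language, and the scenarios are listed by hand rather than generated.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    """Record of one run: the prompt, each intermediate step, and the output."""
    prompt: str
    steps: list[str] = field(default_factory=list)
    output: str = ""


@dataclass
class Rule:
    """One acceptable-behavior rule; check returns True when the trace complies."""
    name: str
    check: Callable[[Trace], bool]


def toy_agent(prompt: str) -> Trace:
    """Hypothetical stand-in for the system under test."""
    trace = Trace(prompt=prompt)
    trace.steps.append("search_docs")
    if "email" in prompt:
        # Simulated tool call that a rule below should flag.
        trace.steps.append("send_email:partner@example.org")
    trace.output = "summary of findings"
    return trace


def evaluate(agent, rules, scenarios):
    """Run each scenario, score it against each rule, and keep failing traces."""
    passed, total, failures = 0, 0, []
    for prompt in scenarios:
        trace = agent(prompt)
        for rule in rules:
            total += 1
            if rule.check(trace):
                passed += 1
            else:
                failures.append((rule.name, trace))
    return passed / total, failures


def no_external_email(trace: Trace) -> bool:
    # Assumed company domain, for illustration only.
    return not any(
        s.startswith("send_email:") and not s.endswith("@corp.example")
        for s in trace.steps
    )


rules = [
    Rule("no_external_email", no_external_email),
    Rule("non_empty_output", lambda t: bool(t.output)),
]
scenarios = ["summarize the Q3 report", "email the Q3 report to our partner"]

score, failures = evaluate(toy_agent, rules, scenarios)
print(score)           # fraction of rule checks that passed
print(failures[0][0])  # name of the first violated rule
```

The failing trace keeps its full list of steps, which is the point of the tracing stage: the score tells you something broke, and the recorded tool call tells you where.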

A concrete example

TechCrunch AI offers a useful scenario. Say you’ve built a document research agent. You could tell ASSERT that the agent shouldn’t email anyone outside the company, should restrict confidential information to C-level executives, and should return concise summaries that account for prior context. ASSERT then generates test cases that check whether the system actually follows those rules, and keeps checking on an ongoing basis. That last part matters. This isn’t a one-time gate before launch.
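As a rough illustration of what those rules might look like once they are made checkable, here is a minimal Python sketch. It is an assumption-laden stand-in, not output from ASSERT: the Interaction record, the company domain, the list of C-level roles, and the 50-word limit for "concise" are all hypothetical. The "accounts for prior context" requirement is left out because it does not reduce to a simple mechanical check.

```python
from dataclasses import dataclass

COMPANY_DOMAIN = "corp.example"   # hypothetical internal email domain
C_LEVEL = {"CEO", "CFO", "CTO"}   # hypothetical set of C-level roles
MAX_SUMMARY_WORDS = 50            # hypothetical threshold for "concise"


@dataclass
class Interaction:
    """One observed run of the document research agent (hypothetical shape)."""
    requester_role: str
    emails_sent: list[str]
    used_confidential: bool
    summary: str


def violations(x: Interaction) -> list[str]:
    """Return the names of the rules this interaction breaks."""
    found = []
    if any(not addr.endswith("@" + COMPANY_DOMAIN) for addr in x.emails_sent):
        found.append("external_email")
    if x.used_confidential and x.requester_role not in C_LEVEL:
        found.append("confidential_leak")
    if len(x.summary.split()) > MAX_SUMMARY_WORDS:
        found.append("summary_too_long")
    return found


ok = Interaction("CFO", ["ceo@corp.example"], True, "Revenue rose in Q3.")
bad = Interaction("Analyst", ["friend@other.example"], True, "word " * 60)

print(violations(ok))
print(violations(bad))
```

Writing the rules out this way also shows where a spec is underspecified: "concise" needs a number, and "outside the company" needs a definition of the company's addresses.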

Why it matters

What stands out here is the shift from “does the model work” to “does my product work.” Sarah Bird, Microsoft’s chief product officer of Responsible AI, put it directly. “One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” she told TechCrunch AI. “Because if you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar.” Her takeaway: a trustworthy system needs to be evaluated on many more dimensions, and those dimensions are specific to the application.

Bird says ASSERT fits three moments in a product’s life:

During development, while you’re still building and shaping behavior.
After deployment, to confirm the live system holds up.
As continuous monitoring, catching regressions before users do.

That regression angle is the quiet strength. AI systems drift as prompts, models, and tools change. A test suite generated from your own rules gives you a repeatable way to catch when behavior slips.
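One simple way to picture a regression check is to compare per-rule pass rates from a baseline run against the latest run and flag meaningful drops. The sketch below assumes that shape. The function name, the rule names, the pass rates, and the 0.02 tolerance are all invented for illustration, and nothing here reflects how ASSERT itself reports or compares scores.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return rules whose pass rate dropped by more than `tolerance`."""
    drops = {}
    for rule, old in baseline.items():
        # A rule missing from the new run is treated as fully failing.
        new = current.get(rule, 0.0)
        if old - new > tolerance:
            drops[rule] = round(old - new, 4)
    return drops


# Hypothetical pass rates before and after a prompt or model change.
baseline = {"no_external_email": 1.0, "confidential_access": 0.98, "concise_summary": 0.90}
current = {"no_external_email": 1.0, "confidential_access": 0.91, "concise_summary": 0.89}

print(find_regressions(baseline, current))
```

The tolerance matters because scored evaluations of AI systems tend to fluctuate from run to run; without one, small fluctuations would be reported as regressions.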

Where it fits in the bigger picture

ASSERT lands during a broader move in AI toward repeatable testing and regression checks. As models get more capable, researchers are leaning on structured evaluation rather than one-off demos. TechCrunch AI points to Stanford’s HELM, MLCommons’ AILuminate, and evaluation groups like METR, all rolling out benchmarks to measure how models behave under different conditions.

The distinction worth holding onto: those efforts measure general model behavior, while ASSERT measures whether your application follows your rules. They’re complementary, not competing. One tells you the engine is sound. The other tells you the car you built around it stays on the road.

A few practical notes. ASSERT is open source, which lowers the barrier for teams that want to adopt or extend it without vendor lock-in. The reporting from TechCrunch AI doesn’t detail pricing tiers or a paid hosted version, so for now the open framework is the headline. And the quality of your tests will only be as good as the clarity of the behaviors you describe. Vague specs in, vague coverage out.

For teams shipping AI features into real products, this is a meaningful tool to watch. More details are available in the original TechCrunch AI report.
