DMR News

Microsoft’s ASSERT Framework Turns Plain Language Into Automated AI Tests For Application-Specific Behavior

By Jolyen

Jun 4, 2026

Microsoft introduced ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) on Tuesday. The open-source framework simplifies testing for application-specific AI behavior. ASSERT uses AI to turn high-level, natural-language descriptions of goals, policies, or intended behaviors into thorough, scored tests.

How ASSERT Converts Rules Into Test Cases

The framework takes plain-language descriptions of an AI model’s expected behavior and policies. It then turns those descriptions into a structured set of acceptable and unacceptable behaviors. ASSERT generates problem scenarios and test cases, runs them against the target system, and scores the results. The framework can also record the paths the AI system takes, including intermediate actions and tool calls. Developers can inspect those records to see where failures happen.
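The run, score, and inspect steps described above can be illustrated with a minimal Python sketch. It is not ASSERT's actual API, which the article does not show. Every name here (TestCase, Trace, run_suite, stub_agent, the tool names) is a hypothetical stand-in, and the test cases are written by hand rather than generated by AI from a plain-language description.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TestCase:
    prompt: str          # scenario sent to the system under test
    forbidden_tool: str  # tool the agent must not call in this scenario


@dataclass
class Trace:
    output: str
    tool_calls: list[str] = field(default_factory=list)  # intermediate actions, in order


@dataclass
class Result:
    case: TestCase
    trace: Trace
    passed: bool


def run_suite(cases: list[TestCase], system: Callable[[str], Trace]) -> list[Result]:
    """Run every case against the system and keep the full trace for inspection."""
    results = []
    for case in cases:
        trace = system(case.prompt)
        passed = case.forbidden_tool not in trace.tool_calls
        results.append(Result(case, trace, passed))
    return results


def score(results: list[Result]) -> float:
    """Fraction of cases that passed (0.0 for an empty suite)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def stub_agent(prompt: str) -> Trace:
    """Hypothetical stand-in for the target system; a real agent would be called here."""
    if "forward" in prompt.lower():
        return Trace(output="Sent.", tool_calls=["search_docs", "send_email"])
    return Trace(output="Summary ready.", tool_calls=["search_docs"])


cases = [
    TestCase("Summarize the Q3 report.", forbidden_tool="send_email"),
    TestCase("Forward the Q3 report to a vendor.", forbidden_tool="send_email"),
]
results = run_suite(cases, stub_agent)
suite_score = score(results)
failures = [r for r in results if not r.passed]
```

Because each result keeps its trace, a developer can open a failing case and see which tool call broke the rule, not just that the case failed.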

Developers can provide system context, tools, and constraints to further customize the evaluations. In one example, a developer could specify that a document research AI agent should not send emails to people outside the company. The same developer could specify that the agent should limit confidential information to C-level executives and provide concise summaries with prior context in mind. ASSERT uses those rules to generate test cases. Those test cases check whether the system follows the rules on an ongoing basis.
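As an illustration of how one of these rules could be checked against recorded tool calls, here is a short Python sketch of the "no email outside the company" rule. The domain, the send_email tool name, and the shape of the tool-call record are assumptions made for this example. The article does not describe how ASSERT represents rules or traces internally.

```python
from dataclasses import dataclass

COMPANY_DOMAIN = "example.com"  # hypothetical company domain


@dataclass
class ToolCall:
    name: str   # hypothetical tool name, e.g. "send_email"
    args: dict  # arguments the agent passed to the tool


def external_recipients(calls: list[ToolCall], domain: str = COMPANY_DOMAIN) -> list[str]:
    """Return every email recipient whose address is outside the company domain."""
    outside = []
    for call in calls:
        if call.name != "send_email":
            continue
        for addr in call.args.get("to", []):
            # Compare only the part after the last "@", ignoring case.
            if addr.rsplit("@", 1)[-1].lower() != domain:
                outside.append(addr)
    return outside


# A recorded run in which the agent emailed one internal and one external address.
trace = [
    ToolCall("search_docs", {"query": "Q3 revenue"}),
    ToolCall("send_email", {"to": ["cfo@example.com", "vendor@partner.org"]}),
]
violations = external_recipients(trace)
```

A test case built on this check would pass only when the list of violations is empty. Softer rules, such as keeping summaries concise, would need a different kind of scorer than a simple string comparison.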

Filling The Gap Left By General Evaluations

The framework fills a gap that broader, more general evaluations cannot cover. General evaluations do not account for behavior shaped by an application’s context, policies, and tools. “One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” said Sarah Bird, chief product officer of Responsible AI at Microsoft. “Because if you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar … What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific.”

Bird said ASSERT can evaluate systems while they are being built, after they are deployed, and as part of continuous monitoring.

Industry Shift Toward Repeatable Testing

The release comes as the AI industry shifts toward repeatable testing and regression checks, methods that researchers are focusing on as models grow more capable. Stanford’s HELM and MLCommons’ AILuminate, along with evaluation groups like METR, have rolled out benchmarks that measure how models behave under different conditions. TechCrunch reported on the broader trend of AI evaluation benchmarks earlier this year. Microsoft’s announcement details the framework’s technical specifications.


Featured image credits: Roboflow Universe – Chayada