Patronus AI secures $17M to tackle AI hallucinations and copyright issues.

As companies race to implement generative AI, concerns about the accuracy and safety of large language models (LLMs) threaten to derail widespread enterprise adoption. Stepping into the fray is Patronus AI, a San Francisco startup that just raised $17 million in Series A funding to automatically detect costly — and potentially dangerous — LLM mistakes at scale.

The round, which brings Patronus AI’s total funding to $20 million, was led by Glenn Solomon at Notable Capital, with participation from Lightspeed Venture Partners, former DoorDash executive Gokul Rajaram, Factorial Capital, Datadog, and several unnamed tech executives.

Founded by former Meta machine learning (ML) experts Anand Kannappan and Rebecca Qian, Patronus AI has developed an automated evaluation platform, billed as the first of its kind, that promises to identify errors like hallucinations, copyright infringement, and safety violations in LLM outputs. Using proprietary AI, the system scores model performance, stress-tests models with adversarial examples, and enables granular benchmarking, all without the manual effort that most enterprises rely on today.

“There’s a range of things that our product is actually really good at being able to catch, in terms of mistakes,” said Kannappan, CEO of Patronus AI, in an interview with VentureBeat. “It includes things like hallucinations, and copyright and safety-related risks, as well as a lot of enterprise-specific capabilities around things like style and tone of voice of the brand.”

The emergence of powerful LLMs like OpenAI’s GPT-4 and Meta’s Llama 3 has set off an arms race in Silicon Valley to capitalize on the technology’s generative abilities. But as the hype has grown, so has the number of high-profile model failures, from news site CNET publishing error-riddled AI-generated articles to drug discovery startups retracting research papers based on LLM-hallucinated molecules.

These public missteps only scratch the surface of broader issues endemic to the current crop of LLMs, Patronus AI claims. The company’s previously published research, including the “CopyrightCatcher” API released three months ago and the “FinanceBench” benchmark unveiled six months ago, reveals startling deficiencies in leading models’ ability to accurately answer questions grounded in fact.

For its “FinanceBench” benchmark, Patronus tasked models like GPT-4 with answering financial queries based on public SEC filings. Shockingly, the best-performing model answered only 19% of questions correctly after ingesting an entire annual report. A separate experiment with Patronus’ new “CopyrightCatcher” API found open-source LLMs reproducing copyrighted text verbatim in 44% of outputs.
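To make the two percentages above concrete, here is a minimal Python sketch of how such rates could be computed. It is not Patronus AI's method: the function names, the exact-match scoring rule, and the 8-word threshold for counting a passage as "verbatim" are all assumptions chosen for illustration, and real benchmarks typically use more careful grading.

```python
from difflib import SequenceMatcher


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of answers that match the reference (case-insensitive exact match).

    Exact match is a simplifying assumption; it is not how FinanceBench grades.
    """
    if not references:
        return 0.0
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def longest_shared_run(output: str, source: str) -> int:
    """Length, in words, of the longest word sequence the two texts share."""
    out_words = output.lower().split()
    src_words = source.lower().split()
    matcher = SequenceMatcher(None, out_words, src_words, autojunk=False)
    match = matcher.find_longest_match(0, len(out_words), 0, len(src_words))
    return match.size


def verbatim_rate(outputs: list[str], source: str, min_words: int = 8) -> float:
    """Share of outputs that copy at least `min_words` consecutive words.

    The 8-word threshold is hypothetical, not a value from CopyrightCatcher.
    """
    if not outputs:
        return 0.0
    flagged = sum(longest_shared_run(o, source) >= min_words for o in outputs)
    return flagged / len(outputs)
```

A model that answers 19 of 100 questions correctly would score 0.19 on the first function, and a model that copies a long passage in 44 of 100 outputs would score 0.44 on the second.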

Key Findings from Patronus AI Research:

FinanceBench Performance: The best-performing model answered only 19% of financial questions correctly.
CopyrightCatcher Results: Open-source LLMs reproduced copyrighted text verbatim in 44% of outputs.
Safety Risks: Open-source models gave unsafe responses more than 20% of the time in many high-priority areas of harm.

“Even state-of-the-art models were hallucinating and only got like 90% of responses correct in finance settings,” explained Qian, who serves as CTO. “Our research has shown that open-source models had over 20% unsafe responses in many high-priority areas of harm. And copyright infringement is a huge risk — large publishers, media companies, or anyone using LLMs needs to be concerned.”

While a handful of other startups like Credo AI, Weights & Biases, and Robust Intelligence are building tools for LLM evaluation, Patronus believes its research-first approach leveraging the founders’ deep expertise sets it apart. The core technology is based on training dedicated evaluation models that reliably surface edge cases where a given LLM is likely to fail.

“No other company right now has the research and technology at the level of depth that we have as a company,” Kannappan said. “What’s really unique about how we’ve approached everything is our research-first approach — that’s in the form of training evaluation models, developing new alignment techniques, publishing research papers.”

This strategy has already gained traction: several Fortune 500 companies in industries such as automotive, education, finance, and software use Patronus AI to deploy LLMs “safely within their organizations,” per the startup, though it declined to name specific customers. With the fresh capital, Patronus plans to scale up its research, engineering, and sales teams while developing additional industry benchmarks.

If Patronus achieves its vision, rigorous automated evaluation of LLMs could become table stakes for enterprises looking to deploy the technology, in the same way security audits paved the way for widespread cloud adoption. Qian sees a future where testing models with Patronus is as commonplace as unit-testing code.

“Our platform is domain-agnostic and so the evaluation technology that we build can be extended to any domain, whether that’s legal, healthcare, or others,” she said. “We want to enable enterprises across every industry to leverage the power of LLMs while having assurance the models are safe and aligned with their specific use case requirements.”

Still, given the black-box nature of foundation models and the near-endless space of possible outputs, conclusively validating an LLM’s performance remains an open challenge. By advancing the state of the art in AI evaluation, Patronus aims to accelerate the path to accountable real-world deployment.

“Measuring LLM performance in an automated way is really difficult and that’s just because there’s such a wide space of behavior, given that these models are generative by nature,” acknowledged Kannappan. “But through a research-driven approach, we’re able to catch mistakes in a very reliable and scalable way that manual testing fundamentally cannot.”

Patronus AI Key Research Findings

Research Initiative	Key Finding
FinanceBench	The best-performing model answered only 19% of financial questions correctly.
CopyrightCatcher	Open-source LLMs reproduced copyrighted text verbatim in 44% of outputs.
Safety Risks	Open-source models gave unsafe responses more than 20% of the time in many high-priority areas of harm.

With its research and evaluation technology, Patronus AI is positioning itself as a key player in the safe and effective deployment of LLMs across industries. As enterprises continue to adopt generative AI, reliable automated evaluation tools are becoming more important, and Patronus AI’s research-first approach is one attempt to meet that need.

The journey ahead for Patronus AI involves not only scaling its operations but also continuing to advance AI evaluation as generative models evolve. By setting high standards for safety and accuracy, Patronus AI aims to let enterprises use LLMs confidently without compromising on quality or security.

By Yasmeeta Oon
