AI chatbot and LLM accuracy claims: what should buyers ask?

Last reviewed June 5, 2026

When an AI assistant, coding tool, or LLM-powered product claims it performs at a professional level—as accurately as a lawyer, as reliably as an analyst, as precisely as a trained specialist—the FTC expects the vendor to have tested that claim before making it. This page shows what task-scope, output-quality, and professional-equivalence evidence buyers should request before relying on an AI chatbot or LLM capability claim.

Fastest path: copy one exact vendor sentence that matches this pattern, then open the checker. Add the public URL only if you want readable page context recorded alongside the wording. The result is an evidence-burden note you can reuse in vendor follow-up or internal review, not a verdict. Not sure what a result looks like? See a sample receipt.

What to verify before you rely on the claim

A task-scope definition naming which specific inputs, output types, and complexity levels were tested.
A comparison to qualified human performance using the same task set and success criteria.
Error rate, failure categories, and conditions where the AI output should not be relied on without human review.

Sources behind AI chatbot and LLM accuracy claims

FTC Operation AI Comply press release · September 25, 2024
Source for DoNotPay chatbot claims. FTC alleged the company promoted its AI chatbot as the 'world's first robot lawyer' without testing whether output matched human lawyer quality.
FTC blog: Keep your AI claims in check · February 27, 2023
FTC business guidance establishing that AI capability claims require competent and reliable evidence to exist before the claim is made, not after a complaint is filed.

Documented AI chatbot and LLM accuracy claims examples

"world's first robot lawyer — generate perfectly valid legal documents in no time"

Claim type: Accuracy / Performance

Source and date: FTC Operation AI Comply press release · September 25, 2024
Evidence signal: Professional-equivalence wording applied to a regulated task without a tested comparison to qualified human output.
Evidence gap: The FTC alleged that DoNotPay did not test whether its chatbot output matched the level of a human lawyer and did not hire or retain lawyers to verify quality. On the FTC's account, the claim was made without competent and reliable evidence.
Buyer question: Which qualified professionals reviewed the AI output, what percentage of tasks met that standard in testing, and under what task conditions?

Evidence map for AI chatbot and LLM accuracy claims

Claim pattern	Evidence needed	Buyer question
Performs at [professional] level or equivalent to [professional role]	Task scope and complexity tested, qualified professional used as comparison baseline, tested error rate, conditions where the AI fell short, and whether the comparison existed before the claim was published.	Which professional did the comparison involve, what tasks were included, and what happened when the AI output was wrong?
Generates accurate output or correct answers without human review	Accuracy definition, task type tested, sample size, error rate on out-of-distribution inputs, known failure categories, and whether a human reviewer is required before the output is acted on.	How is accuracy defined for this output type, and what is the error rate on the specific task type and complexity level we would use it for?
Handles complex or professional tasks end to end	Task coverage, excluded task types, escalation path for failures, human-in-the-loop requirements, and liability for incorrect outputs used without review.	Which task types require a human to review or approve the output before it is acted on, and where does the vendor contract limit liability for incorrect outputs?
AI trained on [domain] data for [domain] accuracy	Training data scope, domain coverage, out-of-domain performance, version used in current production, and update cadence.	Does the training data cover the specific topic, jurisdiction, or input format we would rely on, or is this a general model applied to a specialized task?
AI that improves over time or learns from feedback	Improvement definition, measurement method, baseline and endpoint, data source used for retraining, and whether customer inputs or outputs are used to update the shared model.	What data does the model train on after deployment, does that include our content or inputs, and how is improvement measured against a fixed benchmark?
LLM accuracy, AI answer correctness, or professional AI assistant reliability claim	Task-specific test set, answer-quality rubric, qualified reviewer comparison, model version, failure categories, unsupported-answer rate, and human review boundary.	What error rate appears on the exact task type we would use, and which outputs still require qualified review before action?

Evidence buyers need for AI chatbot and LLM accuracy claims

A task-scope definition naming which specific inputs, output types, and complexity levels were tested.
A comparison to qualified human performance using the same task set and success criteria.
Error rate, failure categories, and conditions where the AI output should not be relied on without human review.
A dated record showing the testing evidence existed before the capability claim was published.
A human review boundary specifying which outputs require a qualified person to check before use.
A model version and training data description that matches the current deployed product.
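
The items above can be tracked as a simple checklist during vendor review. The sketch below is illustrative only: the `EvidencePacket` class, its field names, and the sample values and dates are hypothetical, not an FTC format or any vendor's schema. It lists which items are still missing and flags a test record dated after the claim was published.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class EvidencePacket:
    """Hypothetical record of what a vendor supplied for one claim."""
    task_scope: Optional[str] = None
    human_comparison: Optional[str] = None
    error_rate_and_failures: Optional[str] = None
    human_review_boundary: Optional[str] = None
    model_version: Optional[str] = None
    test_record_date: Optional[date] = None
    claim_published_date: Optional[date] = None


def open_gaps(packet: EvidencePacket) -> list[str]:
    """Return the names of evidence items that are missing or out of order."""
    gaps = [f.name for f in fields(packet) if getattr(packet, f.name) is None]
    # The testing evidence should exist before the claim was published.
    if (
        packet.test_record_date is not None
        and packet.claim_published_date is not None
        and packet.test_record_date > packet.claim_published_date
    ):
        gaps.append("test_record_dated_after_claim")
    return gaps


# Made-up example: two items supplied, test record dated after the claim.
packet = EvidencePacket(
    task_scope="Small-claims demand letters, single jurisdiction",
    model_version="v2 (as stated by vendor)",
    test_record_date=date(2025, 3, 1),
    claim_published_date=date(2025, 1, 15),
)
print(open_gaps(packet))
```

Each name in the returned list maps to a follow-up question for the vendor. An empty list means only that the items were supplied, not that the evidence is adequate.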

Buyer questions for AI chatbot and LLM accuracy claims

Which task types, input formats, and complexity levels were included in the accuracy or capability test?
Was the AI output compared to a qualified professional using the same task, and what was the match or error rate?
What happens when the AI output is wrong—who is responsible, and is there a correction or escalation path?
Which task categories or input conditions are outside the tested scope of the claimed capability?
Does the capability claim apply to the current model version, or was it based on a prior version or a controlled test environment?
Is a human required to review or approve the AI output before it is used in our workflow?

Safer wording for AI chatbot and LLM accuracy claims

Reached [X]% match rate with [qualified reviewer type] on [specific task type] in [test conditions]; see [dated test record].
Generates draft [document type] for [professional type] review; output should be checked by a qualified [professional] before use.
Handles [specific task list] without human review in [defined scope]; tasks involving [named categories] require [reviewer type] approval.
Model version [X], trained on [named data scope], last updated [date]; accuracy outside that scope is not tested.
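
A match rate in the first template means little without its sample size. As a rough illustration, the sketch below computes a Wilson score interval for a reported match rate. The figures (46 matches out of 50 reviewed outputs) are invented for the example, and the Wilson interval is one common choice for a proportion, not a method this page or the FTC prescribes.

```python
import math


def wilson_interval(matches: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = matches / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical test: reviewers agreed with 46 of 50 AI outputs (92%).
low, high = wilson_interval(46, 50)
print(f"match rate 92%, plausible range {low:.1%} to {high:.1%}")
```

With only 50 samples the plausible range is wide, so a buyer can reasonably ask for the sample size behind any headline percentage before comparing it to a qualified reviewer's performance.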

AI chatbot and LLM accuracy claims questions

What evidence supports an LLM accuracy claim?: Ask for a task-specific test set, model version, prompt or tool setup, answer-quality rubric, sample size, error rate, failure categories, and whether a qualified reviewer checked the output before the claim was published.
Can a chatbot claim professional-level output without human review?: A professional-level claim carries a high evidence burden. Buyers should ask which qualified professionals reviewed the output, what tasks were included, what error rate appeared, and which outputs still require human review before use.
What failure records should buyers ask for on AI assistant output?: Ask for unsupported-answer examples, wrong-answer categories, out-of-scope inputs, escalation logs, correction workflow, and whether the same failures are monitored after deployment on customer-specific tasks.
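
If a vendor shares reviewed outputs, the failure records described above can be summarized per task type. This is a minimal sketch under stated assumptions: the record layout, task names, and outcome labels are hypothetical, and a real review would use the categories defined in the vendor's own rubric.

```python
from collections import Counter

# Hypothetical reviewed outputs as (task_type, outcome) pairs.
reviewed = [
    ("contract_summary", "correct"),
    ("contract_summary", "wrong_answer"),
    ("contract_summary", "correct"),
    ("demand_letter", "unsupported_answer"),
    ("demand_letter", "correct"),
    ("demand_letter", "out_of_scope"),
    ("demand_letter", "wrong_answer"),
]


def failure_summary(records):
    """Per task type: sample size, error rate, and failure-category counts."""
    totals = Counter(task for task, _ in records)
    failures = {}
    for task, outcome in records:
        if outcome != "correct":
            failures.setdefault(task, Counter())[outcome] += 1
    summary = {}
    for task, n in totals.items():
        cats = failures.get(task, Counter())
        summary[task] = {
            "n": n,
            "error_rate": sum(cats.values()) / n,
            "categories": dict(cats),
        }
    return summary


for task, row in failure_summary(reviewed).items():
    print(task, row)
```

Running the same summary on pre-deployment test records and on post-deployment logs shows whether the failure categories the vendor tested for are the ones that appear on customer-specific tasks.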

Method and limits

This guide draws from FTC enforcement actions filed under Section 5 of the FTC Act against companies making unsubstantiated AI chatbot and capability claims. The FTC standard requires competent and reliable evidence to exist before a claim is made—not after a complaint is filed. This page is not legal advice, a compliance certification, or a verdict on any specific AI assistant product.

Cite this page for source-backed evidence gaps and buyer questions, not as a truth finding, legal conclusion, compliance certification, company accusation, or company rating. If you have the exact vendor wording, open the "Check an AI assistant accuracy claim" checker and paste one sentence first. If a source has changed, or you have supporting evidence or a company response, send a private correction or source note.