AI chatbot and LLM accuracy claims: what should buyers ask?

Last reviewed May 30, 2026

When an AI assistant, coding tool, or LLM-powered product claims it performs at a professional level—as accurately as a lawyer, as reliably as an analyst, as precisely as a trained specialist—the FTC expects the vendor to have tested that claim before making it. This page shows what task-scope, output-quality, and professional-equivalence evidence buyers should request before relying on an AI chatbot or LLM capability claim.

Evidence buyers verify

  • A task-scope definition naming which specific inputs, output types, and complexity levels were tested.
  • A comparison to qualified human performance using the same task set and success criteria.
  • Error rate, failure categories, and conditions where the AI output should not be relied on without human review.

Sources this guide draws from

  1. FTC Operation AI Comply press release · September 25, 2024

    Source for DoNotPay chatbot claims. FTC alleged the company promoted its AI chatbot as the 'world's first robot lawyer' without testing whether output matched human lawyer quality.

  2. FTC business guidance · February 27, 2023

    FTC business guidance establishing that AI capability claims require competent and reliable evidence to exist before the claim is made—not after a complaint is filed.

Public claims with documented evidence gaps

"world's first robot lawyer — generate perfectly valid legal documents in no time"

Accuracy / Performance
Source and date
FTC Operation AI Comply press release · September 25, 2024
Evidence signal
Professional-equivalence wording applied to a regulated task without a tested comparison to qualified human output.
Evidence gap
The FTC alleged that DoNotPay did not test whether its chatbot output matched the level of a human lawyer and did not hire or retain lawyers to verify quality. According to the FTC, the claim was made without competent and reliable evidence.
Buyer question
Which qualified professionals reviewed the AI output, what percentage of tasks met that standard in testing, and under what task conditions?

Match each claim pattern to the evidence buyers need

Claim pattern
Performs at [professional] level or equivalent to [professional role]
Evidence needed
Task scope and complexity tested, qualified professional used as comparison baseline, tested error rate, conditions where the AI fell short, and whether the comparison existed before the claim was published.
Buyer question
Which professional did the comparison involve, what tasks were included, and what happened when the AI output was wrong?

Claim pattern
Generates accurate output or correct answers without human review
Evidence needed
Accuracy definition, task type tested, sample size, error rate on out-of-distribution inputs, known failure categories, and whether a human reviewer is required before the output is acted on.
Buyer question
How is accuracy defined for this output type, and what is the error rate on the specific task type and complexity level we would use it for?

Claim pattern
Handles complex or professional tasks end to end
Evidence needed
Task coverage, excluded task types, escalation path for failures, human-in-the-loop requirements, and liability for incorrect outputs used without review.
Buyer question
Which task types require a human to review or approve the output before it is acted on, and where does the vendor contract limit liability for incorrect outputs?

Claim pattern
AI trained on [domain] data for [domain] accuracy
Evidence needed
Training data scope, domain coverage, out-of-domain performance, version used in current production, and update cadence.
Buyer question
Does the training data cover the specific topic, jurisdiction, or input format we would rely on, or is this a general model applied to a specialized task?

Claim pattern
AI that improves over time or learns from feedback
Evidence needed
Improvement definition, measurement method, baseline and endpoint, data source used for retraining, and whether customer inputs or outputs are used to update the shared model.
Buyer question
What data does the model train on after deployment, does that include our content or inputs, and how is improvement measured against a fixed benchmark?

Evidence to request

  • A task-scope definition naming which specific inputs, output types, and complexity levels were tested.
  • A comparison to qualified human performance using the same task set and success criteria.
  • Error rate, failure categories, and conditions where the AI output should not be relied on without human review.
  • A dated record showing the testing evidence existed before the capability claim was published.
  • A human review boundary specifying which outputs require a qualified person to check before use.
  • A model version and training data description that matches the current deployed product.
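
Buyers who track vendor responses in a spreadsheet or script may find it useful to turn the list above into an explicit gap check. The following is a minimal sketch, not a prescribed method: the EvidenceRecord fields, the evidence_gaps function, and the sample values are all hypothetical names and data chosen for illustration. It assumes a test record dated on the same day as the claim counts as existing before the claim, which a buyer may want to treat more strictly.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class EvidenceRecord:
    # Hypothetical fields, one per evidence item in the list above.
    claim_published: date
    test_record_dated: Optional[date]    # None if the vendor gave no dated record
    tested_model_version: Optional[str]  # None if the vendor did not say
    deployed_model_version: str
    has_task_scope: bool
    has_human_comparison: bool
    has_error_rate: bool
    has_review_boundary: bool


def evidence_gaps(rec: EvidenceRecord) -> list[str]:
    """Return the evidence items the vendor has not yet supplied."""
    gaps = []
    if not rec.has_task_scope:
        gaps.append("task-scope definition")
    if not rec.has_human_comparison:
        gaps.append("comparison to qualified human performance")
    if not rec.has_error_rate:
        gaps.append("error rate and failure categories")
    # Assumption: a record dated after the claim does not support the claim.
    if rec.test_record_dated is None or rec.test_record_dated > rec.claim_published:
        gaps.append("dated test record predating the claim")
    if not rec.has_review_boundary:
        gaps.append("human review boundary")
    if rec.tested_model_version != rec.deployed_model_version:
        gaps.append("model version matching the deployed product")
    return gaps


# Illustrative values only; not drawn from any real vendor.
record = EvidenceRecord(
    claim_published=date(2024, 3, 1),
    test_record_dated=date(2024, 6, 15),
    tested_model_version="v1",
    deployed_model_version="v2",
    has_task_scope=True,
    has_human_comparison=False,
    has_error_rate=True,
    has_review_boundary=True,
)
for gap in evidence_gaps(record):
    print("Missing:", gap)
```

In this made-up case the check reports three gaps: no human comparison, a test record dated after the claim, and a tested version that differs from the deployed one. Each gap maps back to one of the questions to put in front of the vendor.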

Questions to put in front of the vendor

  • Which task types, input formats, and complexity levels were included in the accuracy or capability test?
  • Was the AI output compared to a qualified professional using the same task, and what was the match or error rate?
  • What happens when the AI output is wrong—who is responsible, and is there a correction or escalation path?
  • Which task categories or input conditions are outside the tested scope of the claimed capability?
  • Does the capability claim apply to the current model version, or was it based on a prior version or a controlled test environment?
  • Is a human required to review or approve the AI output before it is used in our workflow?
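
If the vendor shares per-task test results, the scope and error-rate questions above can be checked directly. The sketch below is one possible way to do that, assuming the vendor provides a simple log of (task type, correct or not) pairs. The task names, the results list, and the planned_use list are invented for illustration and do not describe any real product or test.

```python
from collections import defaultdict

# Hypothetical vendor test log: (task_type, output_was_correct).
results = [
    ("demand letter", True),
    ("demand letter", True),
    ("demand letter", False),
    ("small claims filing", True),
    ("small claims filing", False),
]

# Hypothetical list of task types the buyer intends to rely on.
planned_use = ["demand letter", "lease review"]


def error_rates_by_task(results):
    """Compute the fraction of incorrect outputs for each tested task type."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for task_type, correct in results:
        totals[task_type] += 1
        if not correct:
            errors[task_type] += 1
    return {task: errors[task] / totals[task] for task in totals}


rates = error_rates_by_task(results)

# Task types the buyer needs that never appeared in the vendor's test.
untested = [task for task in planned_use if task not in rates]

for task, rate in rates.items():
    print(f"{task}: {rate:.0%} error rate")
print("Outside tested scope:", untested)
```

A single headline accuracy figure can hide a high error rate on the one task type a buyer cares about, so the breakdown is grouped by task type rather than pooled. Any planned use that lands in the untested list is outside the scope of the vendor's claim.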

Wording boundaries to compare against

  • Reached [X]% match rate with [qualified reviewer type] on [specific task type] in [test conditions]; see [dated test record].
  • Generates draft [document type] for [professional type] review; output should be verified by a qualified [professional] before use.
  • Handles [specific task list] without human review in [defined scope]; tasks involving [named categories] require [reviewer type] approval.
  • Model version [X], trained on [named data scope], last updated [date]; accuracy outside that scope is not tested.
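
A stated match rate means little without the sample size behind it. As a rough illustration, the sketch below computes a 95% Wilson score interval for a match rate, a standard way to show how much uncertainty a small test leaves. The function name and the sample counts are assumptions made for this example, and the calculation treats each test item as an independent trial, which may not hold for a vendor's actual test set.

```python
import math


def wilson_interval(matches: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives about 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = matches / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Illustrative counts only: the same 90% match rate at two sample sizes.
small_low, small_high = wilson_interval(45, 50)
large_low, large_high = wilson_interval(450, 500)

print(f"45 of 50:   {small_low:.1%} to {small_high:.1%}")
print(f"450 of 500: {large_low:.1%} to {large_high:.1%}")
```

With 45 matches out of 50, the interval runs from roughly 79% to 96%, so a "90% match rate" claim on that sample is compatible with a true rate well below 90%. The same rate on 500 items gives a much narrower range, which is why the sample size belongs next to the percentage in the test record.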
