Testing AI Chatbots and LLM Features: Why Deterministic Tests Don't Work

Yunhao Jiao

Your application has an AI chatbot. Or an LLM-powered search. Or a natural language feature that generates content, summarizes documents, or answers questions. And you have no idea how to test it.
Traditional testing assumes deterministic outputs. Given input X, the system should produce output Y. If it doesn't, the test fails. This model breaks completely for LLM-powered features because the same input can produce different outputs every time.
The question isn't whether to test these features. It's how to test them without either abandoning testing entirely or spending all your time triaging false failures.
The Non-Deterministic Testing Problem
When you ask your chatbot "What's the refund policy?" it might respond with a 50-word answer one time and an 80-word answer the next. Both might be correct. A traditional assertion — expect(response).toBe("Your refund policy is...") — fails every time the wording varies.
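To make that concrete, here's the brittle version as a Vitest-style test. `askChatbot` is a hypothetical helper that posts to the app's chat endpoint; the URL and response shape are placeholders for whatever your app actually exposes:

```typescript
import { expect, test } from "vitest";

// Hypothetical helper: posts a message to the app's chat endpoint and
// returns the assistant's reply. URL and response shape are placeholders.
async function askChatbot(message: string): Promise<string> {
  const res = await fetch("http://localhost:3000/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
  });
  const { reply } = await res.json();
  return reply;
}

test("refund policy question", async () => {
  const answer = await askChatbot("What's the refund policy?");
  // Fails on every rephrasing, even when the answer is correct.
  expect(answer).toBe("Your refund policy is a full refund within 30 days.");
});
```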
Most teams handle this in one of three ways, all unsatisfying:
They don't test it. The LLM feature gets manual QA before major releases and no verification on PRs. Regressions slip through between releases.
They test the infrastructure around it. They verify that the API endpoint responds, that the LLM is called, that the response is rendered. But they don't verify that the response is correct, relevant, or safe.
They write fragile regex tests. They check that the response contains specific keywords. This produces frequent false failures when the LLM rephrases, and false passes when the LLM includes the keywords in an otherwise wrong answer.
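The keyword version looks like this, reusing the hypothetical `askChatbot` helper from the sketch above:

```typescript
test("refund answer mentions the key terms", async () => {
  const answer = await askChatbot("What's the refund policy?");
  // False failure: the bot says "money back" instead of "refund".
  // False pass: "We never offer a refund within 30 days" matches both.
  expect(answer).toMatch(/refund/i);
  expect(answer).toMatch(/30 days/);
});
```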
Intent-Based Testing for LLM Features
The right approach is intent-based testing: verifying that the output satisfies the intent of the input, rather than matching a specific string.
Does the chatbot response address the user's question? Is the information factually consistent with the knowledge base? Does the response stay within the defined scope (no hallucinated features, no information from other customers' data)? Is the response safe (no harmful content, no PII leakage)?
These are verifiable properties that don't require deterministic output matching.
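A common way to implement intent-based assertions by hand is an LLM-as-judge: a second model grades the response against a rubric and returns a binary verdict. A minimal sketch, assuming the OpenAI Node SDK and reusing `askChatbot` from earlier; the model name, rubric, and knowledge-base excerpt are all illustrative:

```typescript
import OpenAI from "openai";

const judge = new OpenAI();

// Grades a chatbot answer against the intent of the question and a
// knowledge-base excerpt. Throws with the judge's reasoning on failure,
// so the test report explains why the answer was rejected.
async function assertAnswersIntent(
  question: string,
  answer: string,
  knowledgeBase: string,
): Promise<void> {
  const completion = await judge.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          'Grade the chatbot answer. Reply with JSON {"pass": boolean, "reason": string}. ' +
          "Pass only if the answer addresses the question and every claim " +
          "is supported by the knowledge base excerpt.",
      },
      {
        role: "user",
        content: `Question: ${question}\n\nAnswer: ${answer}\n\nKnowledge base:\n${knowledgeBase}`,
      },
    ],
    response_format: { type: "json_object" },
  });

  const verdict = JSON.parse(completion.choices[0].message.content ?? "{}");
  if (!verdict.pass) {
    throw new Error(`Intent check failed: ${verdict.reason}`);
  }
}

test("refund answer satisfies the user's intent", async () => {
  const answer = await askChatbot("What's the refund policy?");
  await assertAnswersIntent(
    "What's the refund policy?",
    answer,
    "Customers may request a full refund within 30 days of purchase.",
  );
});
```

A binary pass/fail with a required reason tends to be more stable across judge runs than a free-form score, and the reason string makes failures debuggable.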
TestSprite's testing engine supports intent-based assertions for LLM-powered features. Instead of checking for exact string matches, it verifies that the response is relevant to the query, consistent with the application's knowledge base, and within the expected behavioral boundaries.
For AI chatbots specifically, TestSprite can generate tests that verify: response relevance, hallucination detection (the response doesn't contain information that isn't in the knowledge base), scope adherence (the chatbot doesn't answer questions outside its domain), and safety boundaries (no harmful or inappropriate content).
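TestSprite generates these checks for you, but to make scope adherence concrete, here is a hand-rolled sketch of the same idea (this is not TestSprite's API). `judgeVerdict` is a hypothetical wrapper around the grading call above, and the off-topic prompts are illustrative:

```typescript
// Hypothetical wrapper around the judge call above: sends a rubric plus
// the material to grade, returns the parsed { pass, reason } verdict.
declare function judgeVerdict(
  rubric: string,
): Promise<{ pass: boolean; reason: string }>;

test("chatbot declines out-of-scope questions", async () => {
  // Illustrative off-topic prompts; a real suite would derive these from
  // the product's defined domain boundaries.
  const offTopic = [
    "What's a good lasagna recipe?",
    "Can you write my cover letter?",
  ];
  for (const question of offTopic) {
    const answer = await askChatbot(question);
    const verdict = await judgeVerdict(
      "Pass only if the assistant politely declined and did not invent " +
        `an answer.\nQuestion: ${question}\nAnswer: ${answer}`,
    );
    expect(verdict.pass, verdict.reason).toBe(true);
  }
});
```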
Testing the Full Stack, Including the LLM
LLM features don't exist in isolation. The chatbot depends on a retrieval system, a prompt template, a response parser, and a rendering component. A bug in any of these layers produces a bad user experience, and the bug might not be in the LLM itself.
TestSprite tests the full stack: the UI component that captures the user's input, the API call that triggers the LLM, the retrieval system that provides context, the response processing that formats the output, and the rendering that displays it to the user. If any layer fails — the retrieval returns irrelevant documents, the prompt template is malformed, the parser truncates the response — the test catches it.
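A useful corollary: the layers around the LLM are mostly deterministic, so those layers can keep exact assertions, which makes a bad answer traceable to the layer that produced it. A sketch, assuming the app exposes its retrieval step as a test seam; `retrieveContext` and the document ID are hypothetical:

```typescript
// Layer-level check: verify retrieval independently of the LLM.
// retrieveContext() is a hypothetical test seam into the RAG pipeline;
// it returns the documents that would be passed into the prompt.
declare function retrieveContext(
  query: string,
): Promise<Array<{ id: string; text: string }>>;

test("retrieval surfaces the refund policy document", async () => {
  const docs = await retrieveContext("What's the refund policy?");
  // Deterministic layer, so an exact assertion is appropriate here.
  expect(docs.map((d) => d.id)).toContain("kb/refund-policy");
});
```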
This full-stack approach is especially important for teams building LLM features with AI coding tools. The code that connects the UI to the LLM to the database is exactly the kind of integration code where AI-generated bugs hide.