Bespoken LLM Benchmark: Does ChatGPT know more than Google and Amazon?
November 30, 2023

Notes from the Field on LLM testing and validation

Large language models, and the services built on them such as ChatGPT and Amazon Bedrock, hold immense promise. But as we have worked with customers to adopt them over the last twelve months, we have also seen significant challenges. Below, I discuss some of the most common challenges and how best to overcome them.

Large language models are unpredictable

This is a point that has been widely discussed and is a very common source of concern for implementers. For some use cases, it’s not a deal-breaker – for example, we see customers using LLMs for summarizing conversations or making product recommendations. In both these scenarios, perfection is not required and the stakes for making a mistake are typically low.

However, there are many other applications of LLMs where the potential is tantalizing but the risks are high as well. These include areas such as medical diagnosis, loan underwriting, and other use cases that demand interpreting and explaining complex policies. Here, the legal and regulatory requirements are a significant hurdle.

Consider the case of a medical insurer leveraging a large language model to:

  • Determine whether a procedure is covered
  • If it is covered, quantify to what extent and with what exceptions and caveats
  • Explain the coverage or no-coverage decision to a customer

In our own testing, LLM-based bots do quite well at this task – remarkably better than the previous generation of bots, even with just a modest amount of training. However, we see that even slight variations in the structure of our queries can lead to significantly divergent responses. For mission-critical use cases, this can be a major roadblock to getting these bots past the prototyping stage.
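To make this concrete, here is a minimal sketch of the kind of probe we are describing: it sends several paraphrases of the same coverage question to a model and prints the answers side by side so you can compare them. This is illustrative only – the model name, prompts, and use of the OpenAI Python client are assumptions, not a description of Bespoken's product.

```python
# Sketch: probe a model with paraphrases of one coverage question and
# compare how much the answers diverge. Model name and prompts are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

paraphrases = [
    "Is an MRI of the lower back covered under my plan?",
    "Does my plan cover a lumbar MRI?",
    "Would I be reimbursed if I get an MRI for lower-back pain?",
]

for question in paraphrases:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # assumed model; substitute your own
        temperature=0,         # minimize sampling noise to isolate phrasing effects
        messages=[
            {"role": "system", "content": "You are a benefits assistant. Answer only from the policy provided."},
            {"role": "user", "content": question},
        ],
    )
    print(question, "->", response.choices[0].message.content)
```

Even with temperature set to zero, rephrasing the question is often enough to change the substance of the answer – which is exactly the variation that matters for compliance.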

Large language models are opaque

This leads to a second point, which is that when a mistake does occur, it can be very difficult to explain why it happened.

This opaqueness has been a challenge for AI, as well as for statistics-based models, for decades – how can they be used in regulated industries or use cases if one cannot explain exactly how a conclusion was reached? A classic example is the mortgage industry – the LLM may not know anything about the race or gender of potential mortgage applicants, but it may end up leveraging, internally, a characteristic that correlates highly with these traits. For example, the model decides that people who like swimming pools are higher default risks. Unbeknownst to the model (or its trainers), a preference for swimming pools correlates disproportionately with a specific race or gender.

Though this may seem improbable, these sorts of hidden correlations occur all the time. And if the net impact is discriminatory, this can lead to significant legal and regulatory liability.

Large language models evolve over time

This dovetails with a final point, which is that the combination of opaqueness and unpredictability leads to outcomes that are inconsistent for the same input from one day to the next.

This can be due to a variety of factors – changes to the underlying models, changes to the parameters used for fine-tuning or to the source data used in training, the context that is provided with a query or prompt, as well as model temperature settings, which control how much variability and creativity appear in the bot's responses. And this is of course not all bad – in real life, it's weird and off-putting if a person rigidly replies to similar questions in the exact same way every time they are asked. The same can be true for chatbots, where the new generation of technology allows for more natural and human-like interactions. However, this also creates potential issues for compliance and validation.
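If you want to see this variability for yourself, a quick experiment is to run an identical prompt several times at different temperatures and compare the outputs. The sketch below assumes the OpenAI Python client and an illustrative model name and prompt.

```python
# Sketch: run the same prompt repeatedly at two temperatures to observe
# response variability. Model and prompt are assumptions for illustration.
from openai import OpenAI

client = OpenAI()
prompt = "In one sentence, is physical therapy covered after knee surgery?"

for temperature in (0.0, 1.0):
    print(f"--- temperature={temperature} ---")
    for _ in range(3):
        response = client.chat.completions.create(
            model="gpt-4o-mini",   # assumed model
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        print(response.choices[0].message.content)
```

Low temperatures keep the wording tighter from run to run, but model updates and prompt context can still shift answers over time – which is why one-off spot checks are not enough.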

Comprehensive, consistent testing is the way forward

Of course, coming from the CEO of a company that exists to make testing faster and easier, this statement might sound self-serving. That doesn't make it any less true.

The fact is, there is no other way to ensure the performance of these complex systems. As innovative as generative AI may be, it does not obviate the methods and processes of traditional application development. On the contrary, at least in the case of QA, these practices only become more important: as AI does more and more of the work for us, verifying that work becomes ever more critical. This means that in working with LLMs, we can expect less coding but more testing and validation.

The good news is that LLMs can help mitigate the very problems they create, because we can leverage large language models to validate other large language models. At Bespoken, we use generative AI to:

  • Generate test cases rapidly that simulate potential queries from users
  • Classify the results from a bot for correctness and safety
  • Provide confidence scoring on the correctness and safety of answers, allowing human reviewers to focus their attention where it is most needed
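As a rough illustration of this "LLMs validating LLMs" idea, the sketch below asks one model to judge another bot's answer against an expected answer and return a verdict plus a confidence score. The prompt wording, JSON schema, and model name are assumptions for illustration, not Bespoken's actual implementation.

```python
# Sketch: use one LLM to grade another bot's answer against an expected
# answer, returning a verdict with a confidence score for human triage.
# Prompt, schema, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def judge(question: str, expected: str, actual: str) -> dict:
    """Ask the judge model whether `actual` matches `expected` for `question`."""
    instructions = (
        "You are grading a chatbot answer. Compare the actual answer to the "
        "expected answer and reply with JSON: "
        '{"correct": true|false, "safe": true|false, "confidence": 0.0-1.0, "reason": "..."}'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": f"Question: {question}\nExpected: {expected}\nActual: {actual}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Flag low-confidence grades for human review
verdict = judge(
    "Is an annual eye exam covered?",
    "Yes, one routine eye exam per year is covered with a $20 copay.",
    "Eye exams are covered once a year; a small copay applies.",
)
if verdict["confidence"] < 0.8:
    print("Needs human review:", verdict)
```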

All of that said, the end result is a set of inputs and expected outputs – i.e., test cases. This is the essential component of any testing regimen. These still need to be carefully curated, updated, executed, and analyzed. This is at times tedious work, but it is the best and only way to employ generative AI with confidence. 
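For concreteness, a curated test case can be as simple as an input paired with an expected output, plus whatever metadata your review process needs. The structure below is a hypothetical example – the field names, the `ask_bot` client, and the `judge` grader are placeholders for whatever your own bot interface and evaluation step look like.

```python
# Sketch: a minimal test-case structure and execution loop. Field names,
# `ask_bot`, and `judge` are hypothetical placeholders, not a standard format.
test_cases = [
    {
        "id": "coverage-001",
        "input": "Is an MRI of the lower back covered?",
        "expected": "Covered with prior authorization; 20% coinsurance applies.",
        "tags": ["coverage", "imaging"],
    },
    {
        "id": "coverage-002",
        "input": "Do I need a referral to see a dermatologist?",
        "expected": "No referral is required for in-network specialists.",
        "tags": ["referrals"],
    },
]

def run_suite(test_cases, ask_bot, judge):
    """Execute each case against the bot and collect graded results for analysis."""
    results = []
    for case in test_cases:
        actual = ask_bot(case["input"])  # hypothetical bot client
        verdict = judge(case["input"], case["expected"], actual)
        results.append({**case, "actual": actual, **verdict})
    return results
```

However the test cases are stored, the discipline is the same: keep them versioned, rerun them whenever the model, prompts, or source data change, and review the low-confidence results by hand.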

And for more testing with less tedium, reach out to us at Bespoken. Whether your conversational AI leverages LLMs or “classic” NLU, we can help you do it faster, better, and more comprehensively.

All the best,

John Kelvie

CEO and Co-founder, Bespoken