November 21, 2023 in Blog

The Paradox of AI and Testing

TL;DR Large language models hold immense promise. But as we have worked with customers over the last twelve months to adopt them, we also see significant challenges. Here are some of the common questions and obstacles we see so far, as well as how customers are addressing them.

One of the remarkable aspects of AI is how it has redistributed work – while it has made some things far simpler, it has created complexity in other areas that can be daunting to the point of being unapproachable. Testing is one such area.

Is this because there is less to test and validate? Not in the case of generative AI, where the potential scenarios to validate are theoretically infinite.

Even when dealing with speech recognition, a far less complex subject than LLMs, we see a hesitation on the part of customers to really dig in and thoroughly verify the behavior of their systems. Yet everyone we talk to will gladly acknowledge that there are far more use cases and test cases to verify than in traditional software. And the ROI of thorough testing and training can be immense.

For example, in the case of speech recognition, we have to consider:

  • The accent and dialect of the speaker
  • The environment where the user is speaking (people talking in the background, traffic sounds if someone is in the car, a noisy fan, etc.)
  • The quality of the speaker’s microphone and connection
  • The phrasing of how the speaker chooses to reply

All of these factors can turn even simple interactions into a multifaceted dilemma for quality assurance.
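
To make the multiplication concrete, here is a minimal sketch of how the factors above combine into test scenarios. The factor values are hypothetical placeholders chosen for illustration, not a list we recommend or use:

```python
from itertools import product

# Hypothetical factor values, for illustration only.
FACTORS = {
    "accent": ["US English", "UK English", "Indian English"],
    "environment": ["quiet room", "background talk", "car traffic", "noisy fan"],
    "audio_quality": ["good mic", "poor mic", "weak connection"],
    "phrasing": ["yes", "yeah", "that's right"],
}


def build_test_matrix(factors):
    """Return one scenario dict per combination of factor values."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


matrix = build_test_matrix(FACTORS)
print(len(matrix))  # 3 * 4 * 3 * 3 = 108 scenarios for a single prompt
```

Even with only three or four values per factor, a single prompt already yields 108 combinations, which is how a modest application reaches thousands of scenarios.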

It also means that when we are talking about speech recognition (and AI generally), we are not talking about a simple pass/fail binary when it comes to test cases. Instead, we are looking at probabilities (e.g., “this utterance is correctly accepted 95.6% of the time”).
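
One way to express a probabilistic test is to run the same utterance many times and compare the acceptance rate to a threshold. The sketch below assumes each trial has already been reduced to a true/false result; the function names, the 95% threshold, and the trial counts are illustrative assumptions:

```python
def acceptance_rate(results):
    """Fraction of trials in which the utterance was correctly accepted."""
    if not results:
        raise ValueError("need at least one trial")
    return sum(results) / len(results)


def meets_threshold(results, threshold=0.95):
    """A test 'passes' when the acceptance rate reaches the threshold."""
    return acceptance_rate(results) >= threshold


# Hypothetical outcome of 1,000 runs: 956 accepted, 44 rejected.
trials = [True] * 956 + [False] * 44
rate = acceptance_rate(trials)
print(f"{rate:.1%}")                    # 95.6%
print(meets_threshold(trials))          # True
print(meets_threshold(trials, 0.97))    # False
```

The same set of results passes or fails depending on the threshold chosen, so the threshold itself becomes part of the test definition.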

Compare this to something like adding a button on a web page – confirming it is working correctly is just a matter of clicking on it. Did it work? Test passes, the end.

And so one of the paradoxes of AI is that we see people unsure of where to start with their own testing regimen. This uncertainty leads to procrastination and wishful thinking – many adopters simply hope that the AI gods will make their problems disappear.

And they may – we have seen it happen. Or they may make it worse – we’ve seen that happen many times too. A more certain, less wishful solution is to start testing now, even if it’s just a little bit, and not let the perfect become the enemy of the good. On paper, perhaps there are 10,000 scenarios to verify within an IVR application, maybe even more if all factors are considered. But how much more peace of mind will your team have if you start by automating the 10 or 20 most essential ones? That foundation, once established, can then be built on to improve the system and extend the breadth of test coverage.
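
As a rough sketch of picking that first batch, one option is to rank scenarios by how often callers actually hit them. The `weekly_calls` field, the scenario names, and the numbers are hypothetical; any measure of importance your team trusts would work in its place:

```python
def pick_essential(scenarios, limit=20):
    """Return the `limit` scenarios with the highest call volume."""
    ranked = sorted(scenarios, key=lambda s: s["weekly_calls"], reverse=True)
    return ranked[:limit]


# Hypothetical IVR scenarios and volumes, for illustration only.
scenarios = [
    {"name": "reset PIN", "weekly_calls": 1800},
    {"name": "change address", "weekly_calls": 300},
    {"name": "check balance", "weekly_calls": 4200},
    {"name": "report lost card", "weekly_calls": 950},
]

essential = pick_essential(scenarios, limit=2)
print([s["name"] for s in essential])  # ['check balance', 'reset PIN']
```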

In summary, don’t get trapped by the paradox of testing and AI – get started 🙂 Drop us a line to do it today.

All the best,

John Kelvie

CEO and Co-founder, Bespoken
