close
close

To fully realize the value of GenAI, companies need scalable testing

To fully realize the value of GenAI, companies need scalable testing

But how do you scale your T&E practices when even the product developers admit that AI tools are a black box?

The conventional approach to T&E does not work for AI

“The evaluation of traditional AI models is systematic, analytical, and deeply rooted in statistics,” Ittner says. “You have a system that predicts tomorrow’s weather, and you run a statistical evaluation to see how often it gets it right or wrong.” But the open-ended nature of many applications of AI presents a unique challenge. For example, if you’re developing an application of AI focused on housing questions, you can fact-check a single result, but there’s no way to systematically ensure that every answer is 100 percent correct. “You put text in and you get text back,” Ittner says. “You can’t use statistical analysis to confirm that the model is working as expected because the inputs and outputs are too variable.”

In other words, even if you could scale T&E to meet the challenges of the AI ​​generation, you could never know if and when you were successful by simply increasing the volume of testing. “You need to use your expertise and knowledge of the use case to identify the biggest risks and then put the system to the test in different areas that you want to test,” says Ittner.

Until now, T&E has relied on human red teaming to develop and implement AI, in which testers simulate potential attacks to identify vulnerabilities in the system. It’s a best practice regularly used in cybersecurity and related fields. But Steven Mills, chief AI ethics officer at BCG, says that while red teaming can be an effective means of testing AI applications, it’s simply not enough. As companies move from prototypes and consumer products to enterprise-wide implementations, the challenges are increasing exponentially. “Human testers are critical, but manual red teaming will never be enough to meet the demands of widespread AI implementation,” says Mills.

The data scientists and engineers at BCG X saw a way forward: a T&E system that was flexible, intuitive, and powerful enough to comprehensively test and evaluate emerging AI systems. The only problem was that BCG X couldn’t find such a solution on the market.

So they set about creating it.

ARTKIT: Human-driven automation for T&E

Imagine if you could increase the efforts of human testers by a thousandfold on every invocation of an AI system. Then you could go to market with a product you believe in – a product that has already proven its business value and safety in the testing and trial process.

It’s an ambitious vision, but BCG X was determined to make it a reality. Their first challenge was to develop a toolkit that would accommodate the company’s current processes. “The toolkit had to run thousands of tests quickly while integrating seamlessly with existing systems,” says Ittner.

Because the product also had to be user-friendly, the team opted for a simple and flexible design that allows developers to quickly create customized T&E pipelines. “It’s like Lego bricks,” says Ittner. “You want to build your test bench from the elements very quickly, and we give the developers everything they need to do that.”