How Do You Test AI Systems?

Everyone who has worked on an application development project knows that you don’t simply push code and content into production, to your customers, employees, or stakeholders, without first testing it to make sure it’s not broken or dead on arrival. Quality Assurance (QA) is such a core part of technology and business delivery that it’s an essential component of any development methodology. You build. You test. You deploy. And the best way to do all this is in an agile fashion, in small, iterative chunks, so you can respond to the continuously changing needs of the customer. Surely AI projects are no different. There are iterative design, development, testing, and delivery phases, as we’ve discussed in our previous content on AI methodologies.

However, AI operationalization is different from traditional deployment in that you don’t just put machine learning models into “production”. Models keep evolving and learning, so they must be continuously managed. Furthermore, models can end up on a wide variety of endpoints, each with different performance and accuracy metrics. AI projects are unlike traditional projects even with regard to QA. Simply put, you don’t do QA for AI projects the way you do QA for other projects, because what we test, how we test, and when we test are all significantly different.

Testing and quality assurance in the training and inference phases of AI

Those experienced with machine learning model training know that testing is a core element of making AI projects work. You don’t simply develop an AI algorithm, throw training data at it, and call it a day. You have to verify that the trained model does a good enough job of classifying or regressing data, generalizing sufficiently without overfitting or underfitting. This is done with validation techniques, setting aside a portion of the training data to be used during the validation phase. In essence, this is a sort of QA testing in which you make sure that the algorithm, the data, the hyperparameter configuration, and the associated metadata all work together to provide the predictive results you’re looking for. If you get it wrong in the validation phase, you’re supposed to go back, change the hyperparameters, and rebuild the model, perhaps with better training data if you have it. After this is done, you use other set-aside testing data to verify that the model indeed works as it is supposed to. While these are all aspects of testing and validation, they happen during the training phase of the AI project, before the AI model is put into operation.
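The train, validate, and test loop described above can be sketched roughly as follows. This is a minimal illustration, not a prescribed method: it assumes scikit-learn is installed, and the Iris dataset, the K-Nearest Neighbors model, the candidate values of k, and the split sizes are all stand-ins chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset; a real project would load its own training data.
X, y = load_iris(return_X_y=True)

# Set aside 20% as the final test set, untouched until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
# Carve a validation set out of what remains (25% of 120 rows = 30 rows).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Validation phase: try several hyperparameter values, keep the best one.
best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# Rebuild with the chosen setting, then verify once on the set-aside test data.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
test_acc = final_model.score(X_test, y_test)

print(f"chosen k={best_k}, validation accuracy={best_val_acc:.3f}, "
      f"test accuracy={test_acc:.3f}")
```

The point of keeping the test set separate is that it plays no part in choosing the hyperparameters, so the final score is an honest check rather than a number the tuning loop has already seen.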

Even in the training phase, we’re testing a few different things. First, we need to make sure the AI algorithm itself works. There’s no sense in tweaking hyperparameters and training the model if the algorithm is implemented wrong. In reality, however, there’s little reason for a poorly implemented algorithm, because most of these algorithms are already baked into the various AI libraries. If you need K-Means Clustering, different flavors of neural networks, Support Vector Machines, or K-Nearest Neighbors, you can simply call that library function in Python’s scikit-learn or whatever your tool of choice is, and it should work. There’s just one way to do the math! ML developers should not be coding those algorithms from scratch unless they have a really good reason to do so. If you’re not coding them from scratch, there’s very little to test as far as the actual code goes: assume that the algorithms have already passed their tests. In an AI project, QA will never be focused on the AI algorithm itself or the code, assuming it has all been implemented as it is supposed to be.

This leaves two things to be tested in the training phase for the AI model itself: the training data and the hyperparameter configuration data. We have already addressed the latter: hyperparameter settings are tested through validation methods, including K-fold cross-validation and other approaches. If you are doing any AI model training at all, you should know how to do validation, and it will help determine whether your hyperparameter settings are correct. Knock another activity off the QA task list.
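As a rough sketch of K-fold cross-validation used to check hyperparameter settings, the example below runs a small grid search. It assumes scikit-learn is available; the dataset, the Support Vector Machine, and the candidate values in the grid are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Illustrative dataset standing in for real training data.
X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each candidate setting is trained on 4 folds and
# scored on the held-out fold, five times, and the scores are averaged.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical grid of hyperparameter settings to compare.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid=param_grid, cv=folds)
search.fit(X, y)

print("best settings:", search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
```

Because every row is used for validation exactly once, the averaged score is less sensitive to one lucky or unlucky split than a single hold-out set would be.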

All that remains, then, for QA of the AI model is testing the data itself. But what does that mean? It means not just data quality, but also completeness. Does the training data adequately represent the reality you’re trying to generalize? Have you inadvertently included any informational or human-induced bias in your training data? Are you skipping over things that work in training but will fail during inference because the real-world data is more complex? QA for the AI model here has to do with making sure that the training data includes a representative sample of the real world and eliminates as much human bias as possible.
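A data audit of this kind might start with something as simple as the sketch below, which checks completeness and compares label proportions against what you believe the real world looks like. Everything here is hypothetical: the `audit_training_data` function, the loan-style rows, the expected shares, and the 10-point tolerance are assumptions made for illustration.

```python
from collections import Counter


def audit_training_data(rows, label_key, expected_share, tolerance=0.10):
    """Return a list of plain-language findings about the training rows."""
    findings = []

    # Completeness: count rows in which any field is missing.
    incomplete = sum(1 for row in rows if any(v is None for v in row.values()))
    if incomplete:
        findings.append(f"{incomplete} of {len(rows)} rows have missing values")

    # Representativeness: compare each label's share of the training data
    # with the share expected in the real world.
    counts = Counter(row[label_key] for row in rows)
    for label, share in expected_share.items():
        observed = counts.get(label, 0) / len(rows)
        if abs(observed - share) > tolerance:
            findings.append(
                f"label {label!r}: observed share {observed:.2f}, "
                f"expected about {share:.2f}"
            )
    return findings


# Hypothetical training rows: 8 approvals, 2 denials, one missing income.
rows = [{"income": 50_000 + i, "outcome": "approved"} for i in range(8)]
rows += [{"income": 30_000, "outcome": "denied"},
         {"income": None, "outcome": "denied"}]

# Assumed real-world mix; in practice this comes from domain knowledge.
findings = audit_training_data(
    rows, "outcome", expected_share={"approved": 0.5, "denied": 0.5}
)
for finding in findings:
    print(finding)
```

A check like this will not detect every form of bias, but it turns "is the data representative?" into a question that can be asked automatically each time the training set changes.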

Outside of the machine learning model, the other aspects of the AI system that need testing are external to the model. You need to test the code that puts the AI model into production: the operationalization component of the AI system. This can happen before the AI model is put into production, but then you’re not actually testing the AI model. Instead, you’re testing the systems that use the model. If the model itself fails while you are testing the other code that uses it, there is a problem with either the training data or the configuration somewhere. You should have picked that up when you were testing the training data and doing validation as discussed above.
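One way to test the surrounding code without testing the model is to swap in a stub that returns fixed predictions. The sketch below assumes a hypothetical `ScoringService` wrapper and a `StubModel` with a `predict_proba` method; neither comes from a specific library, and the feature names are invented for the example.

```python
class StubModel:
    """Stand-in for a trained model; always returns the same probabilities."""

    def predict_proba(self, rows):
        return [[0.2, 0.8] for _ in rows]


class ScoringService:
    """Hypothetical operational wrapper that feeds requests to a model."""

    def __init__(self, model, feature_names):
        self.model = model
        self.feature_names = feature_names

    def score(self, payload):
        # Reject requests that lack a feature the model was trained on.
        missing = [name for name in self.feature_names if name not in payload]
        if missing:
            raise ValueError(f"missing features: {missing}")
        # Convert the payload to the ordered numeric row the model expects.
        row = [float(payload[name]) for name in self.feature_names]
        proba = self.model.predict_proba([row])[0]
        return {"label": int(proba[1] >= 0.5), "confidence": max(proba)}


service = ScoringService(StubModel(), feature_names=["age", "income"])

# The wrapper should accept numeric strings and return a well-formed result.
result = service.score({"age": 41, "income": "52000"})
print(result)

# The wrapper should fail loudly on incomplete input.
try:
    service.score({"age": 41})
    rejected = False
except ValueError:
    rejected = True
print("incomplete request rejected:", rejected)
```

Because the stub's output never changes, any failure in these checks points at the integration code rather than at the model.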

To do QA for AI, you need to test in production

If you’ve followed along so far, you know that a properly validated, well-generalizing system that uses representative training data and algorithms from an already-tested and proven source should produce the expected results. But what happens when you don’t get those expected results? Reality is obviously messy. Things happen in the real world that don’t happen in your test environment. We did everything we were supposed to do in the training phase and our model met expectations, yet it’s not passing in the “inference” phase, when the model is operationalized. This means we need a QA approach for dealing with models in production.

Problems that arise with models in the inference phase are almost always issues of data, or mismatches between the data the model was trained on and real-world data. We know the algorithm works. We know that our training data and hyperparameters were configured to the best of our ability. That means that when models are failing, we have data or real-world mismatch problems. Is the input data bad? If the problem is bad data, fix it. Is the model not generalizing well? Is there some nuance of the data that needs to be added to further train the model? If the answer is the latter, we need to go through a whole new cycle of developing an AI model, with new training data and hyperparameter configurations, to get the right level of fit to that data. Regardless of the issue, organizations that operationalize AI models need a solid approach for keeping close tabs on how the models are performing and for version-controlling which ones are in operation.
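One way to keep tabs on a mismatch between training data and real-world data is to compare the distribution of a feature at training time with what the model sees in production. The sketch below uses the Population Stability Index (PSI), a technique chosen here for illustration and not one the discussion above prescribes. The `psi` function, the equal-width binning, the synthetic data, and the 0.25 alert level (a commonly quoted rule of thumb, not a standard) are all assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic feature values: what the model was trained on...
training = [float(v) for v in range(100)]
# ...and two hypothetical production samples.
live_stable = list(training)
live_shifted = [v + 40.0 for v in training]

stable_score = psi(training, live_stable)
shifted_score = psi(training, live_shifted)
print(f"stable sample PSI:  {stable_score:.3f}")
print(f"shifted sample PSI: {shifted_score:.3f}")

ALERT_LEVEL = 0.25  # assumed threshold; tune for your own data
print("retraining review needed:", shifted_score > ALERT_LEVEL)
```

A score near zero suggests production inputs still look like the training data, while a large score is a prompt to investigate whether the input data is bad or the model needs a new training cycle.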

This is resulting in the emergence of a new field of technology called “ML ops”, which focuses not on building or developing models but on managing them in operation. ML ops is concerned with model versioning, governance, security, iteration, and discovery: basically, everything that happens after the models are trained and developed and while they are out in production.
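To make the versioning part concrete, here is a toy in-memory registry that records which model versions exist and which one is live. It is only a sketch under stated assumptions: `ModelVersion`, `ModelRegistry`, and their fields are hypothetical names invented for this example, and a real ML ops platform would add persistent storage, access control, and audit trails.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelVersion:
    """Metadata needed to trace a model back to how it was built."""
    name: str
    version: int
    training_data_hash: str
    hyperparameters: dict
    validation_score: float
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class ModelRegistry:
    """Hypothetical registry tracking versions and the one in operation."""

    def __init__(self):
        self._versions = {}  # model name -> list of ModelVersion
        self._live = {}      # model name -> live version number

    def register(self, name, training_data_hash, hyperparameters,
                 validation_score):
        versions = self._versions.setdefault(name, [])
        entry = ModelVersion(name, len(versions) + 1, training_data_hash,
                             dict(hyperparameters), validation_score)
        versions.append(entry)
        return entry

    def promote(self, name, version):
        # Only a registered version may be put into operation.
        if not any(v.version == version for v in self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._live[name] = version

    def live(self, name):
        number = self._live[name]
        return next(v for v in self._versions[name] if v.version == number)


registry = ModelRegistry()
registry.register("churn", "hash-of-january-data", {"k": 5}, 0.91)
registry.register("churn", "hash-of-february-data", {"k": 7}, 0.93)

registry.promote("churn", 2)   # put the newer model into operation
registry.promote("churn", 1)   # roll back if it misbehaves in production
print("live version:", registry.live("churn").version)
```

Recording the training data hash and hyperparameters alongside each version means that when a model fails in production, you can tell exactly what it was trained on and roll back to a version you trust.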

AI projects are unique in that they revolve around data. Data is the one thing in testing that is guaranteed to keep growing and changing. As such, you need to think of AI projects as continuously growing and changing too. This should give you a new perspective on QA in the context of AI.
