Testing AI systems – possible or impossible?
What feelings does the term artificial intelligence (AI) evoke in you? For some, a future without artificially intelligent systems is no longer imaginable. In this world, artificial intelligence helps autonomous driving achieve its breakthrough, reliably makes medical diagnoses, and autonomous robots care for our elderly.
For many other people, these are precisely the worst visions of the future. They fear that artificial intelligence will make us humans superfluous, leave millions unemployed and gain power over us.
The following article addresses artificial intelligence from a neutral point of view. Artificial intelligence is neither good nor evil, but simply a technology that can be used to solve certain problems.
First, we consider the definition of AI and current fields of AI applications. After that, we discuss challenges and possible solutions when testing artificial intelligence systems.
What is artificial intelligence?
The ISO/IEC 2382:2015 defines artificial intelligence as follows:
“Artificial Intelligence (AI) is a branch of computer science devoted to developing data processing systems that perform functions normally associated with human intelligence, such as reasoning, learning, and self‐improvement.”
Wikipedia provides the following explanation:
“Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality.”
These definitions are not really clear and leave room for many interpretations. Since the umbrella term “artificial intelligence” is not clearly definable, it becomes even fuzzier with further subcategories. To help us find our way around, we have nevertheless summarised the terms compactly in the following overview:
Strong artificial intelligence (AI)
A strong artificial intelligence (AI) can think in a similar way to a human being. That includes strategic as well as emotional intelligence.
Here we can clearly say that such systems do not exist today and will not exist in the foreseeable future. However, these systems have been stimulating people’s imagination for decades. Accordingly, there are many books, films and pieces of basic research on this, but nothing more!
Weak artificial intelligence (AI)
All AI systems that are actually used and developed today correspond to weak artificial intelligence (AI). This means that a machine performs a precisely defined task faster, better and more efficiently. This is implemented in practice with machine learning (ML). Most AI systems currently in use are based on machine learning. Other technologies for weak AI are at the research stage.
Machine learning (ML)
Wikipedia explains machine learning (ML) as follows:
“Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so.”
Within machine learning, a distinction is made between the following three types:
- Supervised machine learning
- Unsupervised machine learning
- Reinforcement machine learning
Supervised machine learning (ML)
The following figure shows the basic elements of supervised machine learning. A model is trained with known training data until an acceptable quality of the AI system’s results is achieved.
The domain of supervised learning is predictions and recommendations. Examples from IT software are:
- The prediction of specific customer behaviour, e.g. whether to grant a loan
- Forecasting which products will be bought and by whom
- Categorising mails as spam or not
- Online reading recommendations or displaying the right advertisements for internet users
In the embedded software sector, the automotive industry is experimenting broadly with supervised machine learning. High-quality, correct environment recognition by a car is a key element for further steps towards highly automated and autonomous driving.
Today, supervised machine learning already delivers very precise predictions for an almost endless number of use cases.
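To make the train/predict split concrete, here is a minimal, self-contained sketch of supervised learning: a 1-nearest-neighbour classifier for the spam example above. The feature choice (number of links, number of exclamation marks) and the data are purely illustrative, not taken from any real system.

```python
# Minimal sketch of supervised learning: a 1-nearest-neighbour
# classifier "trained" on labelled examples. Hypothetical spam
# features: [number of links, number of exclamation marks].

def euclidean(a, b):
    """Distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train(features, labels):
    """Training for 1-NN is simply memorising the labelled data."""
    return list(zip(features, labels))

def predict(model, sample):
    """Label a new sample with the label of its closest training point."""
    _, label = min(model, key=lambda pair: euclidean(pair[0], sample))
    return label

# Labelled training data: [links, exclamations] -> spam / ham
X_train = [[9, 7], [8, 5], [0, 1], [1, 0]]
y_train = ["spam", "spam", "ham", "ham"]

model = train(X_train, y_train)
print(predict(model, [7, 6]))  # close to the spam examples -> "spam"
print(predict(model, [0, 0]))  # close to the ham examples  -> "ham"
```

Real projects use far richer features and models, but the pattern is the same: known labels during training, predictions on unseen data afterwards.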
Unsupervised machine learning (ML)
This method is mainly used to divide data into clusters according to their similarity. The unknown data are provided to the algorithm, which then builds the clusters and provides the results to the user. The following diagram illustrates this process:
Clusters are built for shapes with three edges, shapes with four edges, and all others. IT software projects using unsupervised machine learning include:
- Marketing: similar customers are assigned to specific target groups.
- Medicine: disease patterns are grouped for the application of suitable therapies.
- Generic application in image processing: Images are grouped into predefined clusters.
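The clustering step can be illustrated with a minimal k-means sketch. The one-dimensional data and the two starting centroids below are purely illustrative; note that the algorithm never sees any labels.

```python
# Minimal sketch of unsupervised learning: k-means clustering in one
# dimension. The algorithm receives unlabelled data and groups it by
# similarity; no "correct" answer is ever provided.

def kmeans(data, centroids, iterations=10):
    """Assign each point to the nearest centroid, then recompute each
    centroid as the mean of its cluster; repeat a fixed number of times."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in data:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Unlabelled measurements that happen to form two groups
data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.4]
centroids, clusters = kmeans(data, centroids=[0.0, 5.0])
print(centroids)   # approximately [1.0, 10.1]
```

The number of clusters and the starting centroids are inputs chosen by the user, which is one reason why validating such systems is harder than it first appears.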
Reinforcement machine learning (ML)
Both supervised and unsupervised machine learning have their limitations in solving unknown complex problems. The learned target parameters are fairly static.
Reinforcement machine learning seems more suitable for enabling machines to make complex, independent decisions.
In reinforcement machine learning, the results of an algorithm are fed back via an evaluation matrix that is complex in practice. In a new situation, the algorithm then uses the experience from previous actions to make a decision. Over several application cycles, decisions can thus be improved, and well-founded decisions can be made for initially unknown scenarios. The following figure schematically shows the structure of reinforcement machine learning:
Probably the best-known application example is Google’s AI algorithm that beat the world’s best Go players. However, it is also worth noting that, beyond the implementation of classical control algorithms, reinforcement machine learning has few real applications outside the gaming sector.
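The feedback loop described above can be illustrated with a minimal tabular Q-learning sketch. The corridor environment, learning rate, discount factor and exploration rate below are illustrative choices, not taken from any real application.

```python
# Minimal sketch of the reinforcement-learning feedback loop: tabular
# Q-learning on a toy corridor. The agent starts at position 0 and is
# rewarded only for reaching position 3; the reward feeds back into
# the value table that drives future decisions.
import random

random.seed(0)
GOAL, ACTIONS = 3, (-1, +1)                  # move left or right
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}

def step(state, action):
    """Environment: clamp to the corridor, reward only at the goal."""
    nxt = max(0, min(GOAL, state + action))
    return nxt, (1.0 if nxt == GOAL else 0.0)

for _ in range(500):                         # training episodes
    state = 0
    while state != GOAL:
        # explore sometimes, otherwise exploit current knowledge
        action = (random.choice(ACTIONS) if random.random() < 0.2
                  else max(ACTIONS, key=lambda a: Q[(state, a)]))
        nxt, reward = step(state, action)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += 0.5 * (reward + 0.9 * best_next
                                     - Q[(state, action)])
        state = nxt

# After training, the greedy policy moves right in every state
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)
```

Even in this toy, the "correct" behaviour emerges only from accumulated experience, which hints at why verifying real reinforcement-learning systems is so difficult.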
Testing artificial intelligence systems
If we now take a closer look at the three practice-relevant AI technologies presented above from a tester’s point of view, then each of these systems contains the following elements:
Design and structure of current industrial systems
Before we dig deeper into the testing of these new AI systems, let’s take a look at current, existing electronic control units in cars and aerospace. Here you find the following elements:
Today, essential properties of such control units are determined by complex and extensive configuration data. This makes it possible to implement changes to the functional behaviour of the control unit quickly and simply. This is enormously helpful, especially for the tuning of individual components at the vehicle level. The same applies to some systems used in civil aerospace.
In addition to these important advantages of highly configurable systems, there are also significant disadvantages. The functional variance of such systems increases exponentially compared to non-configurable systems. A complete test is therefore obviously no longer feasible. In the aerospace industry, there are solutions for the most safety-relevant systems to prove that the system is fully verified. However, the effort required for this is not economically justifiable for most other systems.
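A quick back-of-the-envelope calculation illustrates why exhaustive testing of highly configurable systems is infeasible; the figure of 20 boolean parameters is a deliberately modest, hypothetical example.

```python
# Worked illustration of the exponential functional variance of
# configurable systems: even a modest number of independent boolean
# configuration parameters yields an untestable number of variants.

boolean_params = 20
combinations = 2 ** boolean_params
print(combinations)              # 1048576 distinct configurations

# Even at one automated test run per second, covering a single test
# case across all configurations would take roughly twelve days.
seconds = combinations           # one second per configuration
days = seconds / 86400
print(round(days, 1))            # 12.1
```

Real ECUs have far more than 20 parameters, many of them non-boolean, so the real numbers are dramatically worse; this is what motivates risk-based selection instead of exhaustive coverage.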
The source code itself can be divided into the following two areas:
- Mathematical algorithms
- Data processing
Testing and specifying mathematical algorithms is complex and error-prone in today’s classic ECUs. A complete test is practically impossible.
In many ECUs, however, the proportion of software that implements the mathematical algorithms is small(er) compared to software that “only” does data processing (e.g. formatting, copying data, error handling, controlling HW output signals, etc.).
The functionality of data processing software can be described quite well in requirements. The completeness of tests can be demonstrated quite well.
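As a sketch of why data-processing software maps well onto requirements-based testing, consider a hypothetical conversion requirement and the test cases derived directly from it. The scaling, offset and clamping values below are invented for illustration, not taken from a real ECU specification.

```python
# Hypothetical requirement: "Temperatures shall be converted from raw
# sensor counts (0.1 degC per count, offset -40 degC) and clamped to
# the range -40..125 degC." The rule maps directly to checkable tests.

def convert_temperature(raw_counts):
    """Convert a raw sensor value to degrees Celsius and clamp it."""
    celsius = raw_counts / 10 - 40.0
    return max(-40.0, min(125.0, celsius))

# Requirement-derived test cases, including boundary values
assert convert_temperature(0) == -40.0       # lower boundary
assert convert_temperature(400) == 0.0       # 400 counts -> 0 degC
assert convert_temperature(5000) == 125.0    # clamped upper boundary
```

Because the expected output is fully specified for every input, test completeness can be argued over input ranges and boundaries, which is exactly what is missing for trained models.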
How do these conclusions impact the testing of AI systems?
When comparing AI systems (supervised/unsupervised machine learning systems) and conventional systems, we realise that data and mathematical algorithms are an essential part of both systems. Both topics are already a huge and partly unsolved challenge for testing of conventional systems.
Since data and mathematical algorithms play a key role in AI systems, artificial intelligence systems amplify the testing challenges that already exist today:
- Validation of the correctness of the system being developed
- Accuracy, complexity and completeness of mathematical algorithms
- Exponential, functional variance of data-driven systems
- Design and production of automated test systems at integration and system level
AI systems increase the possibilities of solving complex problems. On the one hand, this is a positive development. On the other hand, the proof of the correctness of a solution, i.e. its validation, is becoming more and more demanding. However, for supervised and unsupervised machine learning, this does not mean that the systems cannot be validated and verified. The following strategies can support the testing of AI systems:
Proposal to ensure data quality
The training data play a crucial role in ensuring that the system correctly implements the intended task. The aerospace standard DO-200B “Standards for Processing Aeronautical Data” makes no reference to AI systems, but it does offer transferable approaches for how data quality can also be achieved for AI systems.
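One transferable idea is to gate every record through defined integrity checks before it may enter the training set. The following sketch assumes a hypothetical record format and rule set; DO-200B itself prescribes no such code.

```python
# Minimal sketch of data quality gating for training data: each record
# must pass defined integrity checks before it is accepted. The record
# format (label plus two numeric features) is a hypothetical example.

def validate_record(record):
    """Return a list of findings; an empty list means the record passes."""
    findings = []
    if record.get("label") not in {"spam", "ham"}:
        findings.append("unknown label")
    features = record.get("features")
    if not isinstance(features, list) or len(features) != 2:
        findings.append("wrong feature dimension")
    elif not all(isinstance(x, (int, float)) for x in features):
        findings.append("non-numeric feature")
    return findings

records = [
    {"label": "spam", "features": [9, 7]},       # passes all checks
    {"label": "unknown", "features": [1]},       # two findings
]
clean = [r for r in records if not validate_record(r)]
print(len(clean))   # 1
```

The important point is that the acceptance rules are written down and checkable, so the quality of the training data becomes an auditable property rather than an assumption.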
Strategies for providing full proof of functional correctness
As we have already noted, complete testing is no longer possible due to the complexity and functional scope. However, what has proven enormously helpful and efficient in aerospace are additional, targeted analyses of the combination of algorithms, data and planned tests. Even if this methodology can no longer be expected to be complete for AI systems, major gaps can be identified. The information can then be used to implement a systematic, risk-based testing approach.
The ultimate end of manual testing
In addition to the methodology, the test benches also play an important role. Complex conventional systems are often still tested manually. For AI systems, it is obvious that this no longer works. Without a high level of test automation of system and integration tests, we will not be able to test an AI system in a meaningful way. The good news is that there is still a lot of potential here. Automating tests, especially at the higher levels, has been very expensive so far. There are two main reasons for this:
- The demand for such systems is relatively low so far
- A real standardisation of test system components has not yet taken place.
The need for AI system testing may help to overcome both issues.
Methodological improvements are possible
The testing of AI systems will hopefully also lead us to finally question some of the established methods and processes critically and to design improvements. The specification, and thus also the testing, of mathematical algorithms has always been an almost unsolved problem, yet requirements engineering has hardly improved in years. There is certainly still methodological potential which, if used, could make a good contribution to testing AI systems.
Specifics of testing reinforcement machine learning systems
AI systems based on reinforcement machine learning can certainly also benefit from the above-mentioned validation and verification approaches. However, these systems differ fundamentally from the other two methods of machine learning in one respect: there is no longer a meaningful, comprehensible or predictable “right” or “wrong” when reinforcement machine learning is used. The systems are designed to make independent decisions, based on a virtually infinite body of experience accumulated after only a short runtime. In other words, these systems are designed to make unpredictable decisions. The unavoidable consequence is that they are practically unverifiable, at least with today’s validation and verification methods commonly used in practice.
We have to design completely new criteria for dealing with such systems. To do this, we should seek an active exchange with the games industry, because that is where we have the most experience with such systems.
Summary
Three types of machine learning dominate industrial AI applications:
- Supervised machine learning
- Unsupervised machine learning
- Reinforcement machine learning
All three types belong to weak AI. A strong AI, which would be able to emulate strategic and emotional intelligence, among other things, does not exist.
Testing the weak AIs mentioned above is certainly possible. However, the challenges known from testing conventional systems become more important when we want to test AI systems.
From today’s point of view, only the testing of systems that implement reinforcement-based machine learning seems impossible, since it is no longer possible to clearly define a “right” or “wrong” for the system’s behaviour.
Since the acceptance of AI systems as a whole will presumably depend heavily on their effectiveness, it is imperative to validate and verify this effectiveness before putting the systems on the market. In order to be able to master the challenges discussed here, we will have to focus much more on test engineering in the future than we have done in the past.
Related HEICON Blog posts
- Requirement Engineering 2.0 – Approaches how the method needs to be enhanced!
- Risk-based testing: Method for identifying the right test cases
- Testing AI based systems (Vector Informatik: Coders Kitchen)
Are you ready for a HEICON status workshop to analyse the testing strategy of your AI system? Then send a mail to: info[at]heicon-ulm.de or call +49 (0) 7353 981 781.