AI-enabled software systems (AIS) are prevalent in a wide range of applications, including the visual tasks of autonomous systems that are extensively deployed in the automotive, aerial, and naval domains. Hence, it is crucial for humans to evaluate a model’s intelligence before an AIS is deployed in safety-critical environments, such as public roads.
In this paper, we assess the visual intelligence of an AIS by measuring how completely it perceives the primary concepts of a domain and the variants of those concepts. For instance, is the visual perception of an autonomous detector mature enough to recognize instances of pedestrian (a concept in the automotive domain) in Halloween costumes? An AIS becomes more reliable once the model’s ability to perceive a concept is expressed in a human-understandable language. For instance, is a pedestrian in a wheelchair mistakenly recognized as a pedestrian on a bike because the domain concepts bike and wheelchair are both associated with the shared feature wheel?
We answer questions of this kind by implementing a generic process within a framework, called B-AIS, which systematically evaluates AIS perception against the semantic specifications of a domain while treating the model as a black box. Semantics is the meaning and understanding of words in a language and is therefore more comprehensible to humans than the pixel-level visual information of an AIS. B-AIS processes these heterogeneous artifacts so that they become comparable, and leverages the results of the comparison to reveal AIS weaknesses in a human-understandable language. The evaluations of B-AIS on two vision tasks, pedestrian detection and aircraft detection, showed an F2 measure of 95% and 85%, as well as 45% and 72%, respectively, in the dataset and the model for the detection of pedestrian and aircraft variants.
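For reference, the F2 measure is commonly defined as the F-beta score with beta = 2, which weights recall more heavily than precision. The following is a minimal sketch under that standard definition. The function name `f_beta` and the example counts are illustrative assumptions and are not taken from B-AIS or its evaluation.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from true positives, false positives, and false negatives.

    Assumes the standard definition:
        F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    With beta = 2 (the F2 measure), recall counts more than precision.
    """
    if tp == 0:
        # No correct detections: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, for illustration only:
# 9 detected variants, 6 false alarms, 1 missed variant.
score = f_beta(tp=9, fp=6, fn=1)
print(f"F2 = {score:.3f}")
```

Because beta = 2 puts more weight on recall, a missed concept variant (a false negative) lowers the score more than a false alarm does, which is consistent with the safety-critical setting described above.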