Two Dogs Passing a Frisbee: Why Machine Learning Struggles to Explain This Image


Natural Language Processing (NLP) is a branch of AI research that focuses on how machines understand human language. It is vital to many machine learning projects, as it allows us to understand and quantify how well humans can communicate with machines. Although NLP has taken great strides recently, the auto-generated caption on the photo above shows that we are further from common-sense AI than some might think. An AI system captioned it: “Two dogs are throwing frisbees at each other.” Measuring the common-sense reasoning abilities of our algorithms is important because it reveals the gap between how a machine learning model performs during training and how it performs in the real world. A recent paper from researchers at USC suggests that caption-generating AI systems still lack the common sense to produce logical sentences about real-world situations.

On one benchmark used to measure the NLP ability of deep learning models, an algorithm reached the 90% accuracy level; in other words, the model gave results similar to a person's 90% of the time. This kind of performance might convince some people that we are accelerating into a world of humanlike AI faster than we thought possible. According to the authors of this paper, however, that might not be the case.

Rather than testing machine learning models on discriminative common sense, which has been the focus of previous common-sense benchmarks, the team used a generative assessment. Discriminative problems often use multiple-choice questions and a large dataset to test the reasoning ability of a machine. A simple question like “Where do adults use staplers?” can usually be answered correctly given a set of possible answers (for example: a. “In an office” or b. “In the forest”). A generative benchmark, on the other hand, asks these models to generate a natural-language interpretation of a problem. This is where the dog and the frisbee come in.
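To make the difference concrete, here is a minimal Python sketch of how the two kinds of test might be scored. It is only an illustration, not the metric used in the paper: the function names, the placeholder "model" that always picks the first option, and the crude prefix-based concept check are all assumptions made up for this example.

```python
def discriminative_accuracy(questions, choose):
    """Fraction of multiple-choice questions where the model picks the labeled answer."""
    correct = sum(
        1 for q in questions if choose(q["question"], q["options"]) == q["answer"]
    )
    return correct / len(questions)


def concept_coverage(sentence, concepts):
    """Fraction of the given concepts mentioned in a generated sentence.

    Hypothetical, deliberately crude check: a concept counts as mentioned
    if any word in the sentence starts with it (so "dogs" matches "dog").
    """
    words = [w.strip('.,!?"').lower() for w in sentence.split()]
    hits = sum(1 for c in concepts if any(w.startswith(c.lower()) for w in words))
    return hits / len(concepts)


# Discriminative test: the answer only has to be selected from a list.
QUESTIONS = [
    {
        "question": "Where do adults use staplers?",
        "options": ["In an office", "In the forest"],
        "answer": "In an office",
    }
]


def pick_first(question, options):
    """Placeholder standing in for a real model: always picks the first option."""
    return options[0]


print(discriminative_accuracy(QUESTIONS, pick_first))

# Generative test: the model must write the sentence itself.
caption = "Two dogs are throwing frisbees at each other"
print(concept_coverage(caption, ["dog", "frisbee", "throw"]))
```

Note that the implausible caption still mentions every concept, so a simple word-overlap check cannot tell it apart from a sensible sentence. Judging generated text means asking whether the scene described could actually happen, which is much harder to automate than checking a multiple-choice answer.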

The team had some of the world’s top NLP algorithms generate a description of an image of a dog catching a frisbee. Most children would be able to explain this photo easily, with something along the lines of “a person threw a frisbee and the dog caught it”. Even with the massive amount of data that these algorithms were trained on, they were not able to generate a logical narrative or explanation of the photograph. Machines sometimes produced output such as “a dog throws a frisbee at a football player” or “two dogs are throwing frisbees at each other”.

This study confirms what many AI experts have been saying about machine common sense: we are further from general intelligence than it looks. Although our current deep text prediction algorithms may appear to be as good as humans at producing natural-language text, they are actually mimicking the inputs they were trained on rather than reasoning the way humans do. The implications of this study go beyond NLP. The field of robotics, where agents must interact physically with the real world, is especially relevant. As Yuchen Lin, a contributor to this paper, put it: “Robots need to understand natural scenarios in our daily life before they make reasonable actions to interact with people”. To make AI truly human-level at generating captions, researchers will need to look beyond existing techniques and try new paradigms, such as self-supervised learning and causal model learning.

If you would like to learn more about other paradigms and new approaches to common-sense AI, please visit our website, where you can read about recent developments in resilient intelligence.