Faced with a tougher Turing test, chatbots begin to show their limits

User: Siri, call me an ambulance.

Siri: Okay, from now on I'll call you "an ambulance."

Apple quickly fixed this mistake after Siri first appeared in 2011. But a new contest shows that computers still lack the basic common sense needed to avoid this kind of embarrassing confusion.

The contest, called the "Winograd Schema Challenge," is a variant of the Turing test proposed by Hector Levesque, a computer scientist at the University of Toronto, Canada. Its name pays tribute to Terry Winograd, a professor at Stanford University and a pioneer in the field of artificial intelligence.

Left: Hector Levesque; right: Terry Winograd

For more than 60 years, researchers have used the Turing test to evaluate a machine's ability to think like a human, but that yardstick is showing its age. Many traditional Turing-test questions are too simple to really probe a computer's intelligence, and the test is in urgent need of an upgrade. The "Winograd Schema Challenge," which began in 2014, is an improvement on the Turing test that requires an artificial intelligence to answer common-sense questions about sentence comprehension.

For example, the challenge includes a question like this: "The city councilmen refused the demonstrators a permit because they feared violence." A human reader works out the logic of the sentence from context without effort. For a computer, however, it is hard to figure out who "they" refers to: the city councilmen or the demonstrators?

A typical Winograd Schema Challenge question has the following key parts:

First, there are two noun phrases of the same semantic type (in this question, the councilmen and the demonstrators).

Second, there is an ambiguous pronoun that could refer to either of those two noun phrases (in this question, "they").

Third, there is a special word. When it is replaced with an alternative word, the referent of the ambiguous pronoun changes. (In this question, if "feared" is replaced with "advocated," then "they" no longer refers to the councilmen but to the demonstrators.)

The computer is then asked what the ambiguous pronoun refers to and is given two options to choose from. In other words, the computer faces a two-way choice.
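The parts listed above can be sketched as a small data structure. This is only an illustration, not the contest's actual file format: the class name `WinogradSchema` and its fields are hypothetical, and the example uses the classic English wording of this question, in which the alternative special word is "advocated."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradSchema:
    """Hypothetical container for one Winograd schema question."""
    template: str      # sentence with a {special} placeholder
    pronoun: str       # the ambiguous pronoun
    candidates: tuple  # the two candidate referents
    answers: dict      # special word -> correct referent

    def sentence(self, special: str) -> str:
        """Fill the special word into the sentence template."""
        return self.template.format(special=special)

    def is_correct(self, special: str, guess: str) -> bool:
        """Check a guessed referent for the given special word."""
        return self.answers[special] == guess


schema = WinogradSchema(
    template=("The city councilmen refused the demonstrators a permit "
              "because they {special} violence."),
    pronoun="they",
    candidates=("the city councilmen", "the demonstrators"),
    answers={
        "feared": "the city councilmen",
        "advocated": "the demonstrators",
    },
)

# Swapping the special word flips the correct referent of "they".
for word, referent in schema.answers.items():
    print(schema.sentence(word), "->", referent)
```

Storing both variants in one record makes the point of the design visible: the sentences differ by a single word, so surface statistics alone give little help in choosing the referent.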

Statistically, even random guessing would be correct about 45% of the time. Yet the best score in the contest was only 48%. After all their careful "thinking," the computers did only slightly better than blind guessing, which is rather embarrassing.
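With exactly two options on every question, random guessing would be expected to score 50%, so a chance level below that would arise if some questions offered more than two candidate answers. The sketch below shows the arithmetic only. The function name `chance_accuracy` and the question counts are assumptions chosen for illustration, not the contest's actual question mix.

```python
from fractions import Fraction


def chance_accuracy(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts maps "number of options per question" to
    "number of questions with that many options".
    """
    total = sum(option_counts.values())
    # A question with k options is guessed correctly with probability 1/k.
    expected_correct = sum(Fraction(n, k) for k, n in option_counts.items())
    return expected_correct / total


# Hypothetical question mixes (not the real contest data).
two_option_only = chance_accuracy({2: 60})
mixed = chance_accuracy({2: 40, 3: 20})

print(float(two_option_only))   # 0.5
print(round(float(mixed), 3))   # 0.444
```

Against a baseline of this kind, a score of 48% is only a few percentage points above what guessing alone would produce.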

The two best-performing teams were one led by Liu Quan of the University of Science and Technology of China and one led by Nicos Issak of the Open University of Cyprus.

The challenge offers a prize of 25,000 USD, but winning it requires an accuracy above 90%. So even the two teams with the best results missed out on the prize money.

Gary Marcus, a psychologist at New York University and one of the advisers to the competition, said: "It is no surprise that the machines performed only slightly better than chance." That is because giving computers common sense is extremely difficult. Entering such knowledge by hand would take an almost inconceivable amount of time, and it is also very hard for computers to learn real-world knowledge through statistical methods. Many of the entrants in this challenge tried to combine hand-coded grammar understanding with basic real-world knowledge.

It was also conspicuous that Google and Facebook did not take part, even though researchers at both companies have repeatedly hinted that they have made great progress in natural language understanding. "Those two companies could have waltzed in, scored 100 percent, and proudly shown it off to the world. But if that had happened, I would have been very shocked," Marcus said.

Researchers at big companies such as Google, Facebook, Amazon, and Microsoft are turning their attention to natural language understanding. They are using the latest machine learning methods, especially "deep learning" neural networks, to develop smarter, more responsive chatbots and personal assistants. As chatbots and voice assistants become more common, and with the tremendous progress in image and speech recognition, it is easy to get the impression that machines are already very good at understanding language. The reality is less encouraging, or at least the results of this competition are far from satisfactory.

Both of the top teams in this competition used state-of-the-art machine learning methods. Liu Quan's team, which included researchers from York University in Toronto and the National Research Council of Canada, used deep learning to train computers to recognize the relationship between events, for example learning from thousands of articles how "playing basketball," "swimming," and "getting injured" relate to one another. After the contest, Liu Quan's team claimed that its accuracy could reach 60% after patching a flaw in how its system handled the contest questions. Leora Morgenstern, one of the organizers, said that even if this result is confirmed, it is still far below human accuracy.

The message in these results is an important one. "Once AI starts to support conversations, these problems will be exposed. For example, if I say, 'I want to buy a case for my guitar, so it must be very sturdy,' does 'it' refer to the case or the guitar?" said Charlie Ortiz, a senior researcher at Nuance, a company that develops and sells speech recognition and imaging software.

As smart home devices and wearables become more common, common-sense reasoning will become more important. Marcus said: "When you ask your watch a question, you don't want it to offer 50 options and make you swipe through the screen to pick one." And when you start talking to your car or your watch, you want to avoid typing. Carrying on a connected, multi-turn conversation is demanding: people naturally and repeatedly refer back to things mentioned earlier, often ambiguously, and this is a problem that computers urgently need to solve.

There is still a long way to go before computers truly understand us.

Via MIT Technology Review