Give a Robot a Fish

A sign on a door located on the ninth floor of UCLA’s Boelter Hall reads, “Beware of Robot.” Inside, stationed at the center of the room, is Tony. He stands more than 5 feet tall with a black torso, dark red rolling base, two large arms, an Internet router on his back and an Xbox One Kinect mounted on his head.

Assembled part by part over the past year and costing more than $60,000, Tony has been programmed to open doors, fold clothes and assemble furniture. Surrounding him is a team of researchers who aim to eventually give him human-level cognition.

It’s an ambitious goal, but the UCLA Center for Vision, Cognition, Learning and Autonomy, or VCLA, where Tony lives, specializes in the intersection of cognition, artificial intelligence and vision. Over the years, the lab has received millions of dollars in research grants to develop intelligent computer systems that learn the way humans do.

Not everyone shares the lab’s enthusiasm for creating advanced artificial intelligence. SpaceX and Tesla Motors founder Elon Musk has called AI the biggest existential threat to humanity. Physicist Stephen Hawking said it could spell the end of the human race. Microsoft founder Bill Gates said he agrees with Musk and doesn’t understand why some people aren’t concerned. Despite this caution, Microsoft has business incentives to invest heavily in AI research, and Musk funds AI research as a means of self-defense.

The field of AI makes strides each year, and each high-profile milestone brings a new wave of fear and anxiety – Deep Blue beating world chess champion Garry Kasparov in 1997, Watson beating Jeopardy! record-holder Ken Jennings in 2011 and, most recently, AlphaGo beating professional Go player and world champion Lee Sedol in 2016.

Other scientists are less concerned. Andrew Ng, a Stanford University computer science associate professor and leading researcher known for his work at Google and the online learning platform Coursera, has famously said that worrying about killer AI is like worrying about overpopulation on Mars. In other words, it might be a problem someday, but it’s too far off to think about realistically.

In terms of developing intelligent robots, fellow Stanford associate professor Fei-Fei Li said, “We are closer to a washing machine than a Terminator.” Oren Etzioni, CEO of the Allen Institute for Artificial Intelligence, dispelled the notion that mastering even a complex game like Go was a harbinger of hyperintelligent AI, noting that AlphaGo can’t play chess. While high-profile AIs can be trained to do one task well, they aren’t yet capable of taking on different tasks.

The shallowness of deep learning

One of the current major trends in AI involves using massive amounts of data and a technique called “deep learning.” The term has become an industry buzzword, and the technique fuels some of the recent advances in AI, including AlphaGo, and powers a number of popular technologies, from Google’s search algorithms to Facebook’s news feed.

Deep learning and machine learning algorithms rely on having lots of data. Feed an algorithm millions of examples, and it can train a computer system to do certain tasks, such as classifying images or playing games like chess. Deep learning is built on “neural networks,” models loosely inspired by neurons in the human brain. But Song-Chun Zhu, director of VCLA and professor of statistics and computer science, argued that there are several limitations to this approach.

“Current machine learning is based on a certain model that is very much like a black box,” Zhu said. “Most people don’t understand why it works. If you read the papers, they can’t explain it.”

Deep learning models require massive amounts of data and break down when given only a small number of examples, Zhu said. Additionally, a deep learning system trained on one task cannot generalize well to new tasks. And while it can produce accurate results, the system cannot explain how it arrived at them.

Due to this inability to adapt, Zhu argued that learning from big data does not amount to the natural intelligence humans have.

“Humans use small data. We only use a few examples and then we got it,” Zhu said. “It’s a mystery how we learn from (a small amount of) data or sometimes even zero data.”

Photo Credit: Austin Yu

Pursuing natural intelligence

Zhu’s lab seeks an alternative approach to building intelligence. The group uses neural networks as well, but its primary technique borrows from another field of AI – natural language processing, or NLP – which aims to train computers to understand human text and language.

This approach is analogous to how grammar is often taught. In grade school, students learn to identify the parts of a sentence – noun, verb, verb phrase and so on – and to create sentence diagrams, like the one in Figure 1.

NLP is concerned with doing this automatically with computer programs, a process called “parsing.” The generated sentence diagram is called a “parse tree.” If the computer understands the rules of how sentences are composed (i.e., grammar), then it can attempt to parse the sentence.
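To make the idea concrete, here is a toy recursive-descent parser for a miniature grammar. The grammar, lexicon and example sentence below are simplified inventions for illustration only – they are not drawn from VCLA’s systems:

```python
# Toy illustration of parsing: a tiny context-free grammar and a
# recursive-descent parser that builds a parse tree. The grammar
# and vocabulary are invented for this example.

GRAMMAR = {
    "S":  [["NP", "VP"]],        # sentence -> noun phrase + verb phrase
    "NP": [["Det", "N"]],        # noun phrase -> determiner + noun
    "VP": [["V", "NP"], ["V"]],  # verb phrase -> verb (+ noun phrase)
}
LEXICON = {
    "Det": {"the", "a"},
    "N":   {"dog", "ball"},
    "V":   {"chased", "ran"},
}

def parse(symbol, tokens, pos):
    """Try to parse `symbol` starting at tokens[pos].
    Returns (tree, next_pos) on success, or None on failure."""
    if symbol in LEXICON:  # terminal category: match one word
        if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
            return (symbol, tokens[pos]), pos + 1
        return None
    for production in GRAMMAR[symbol]:  # try each grammar rule in order
        children, cur = [], pos
        for child_symbol in production:
            result = parse(child_symbol, tokens, cur)
            if result is None:
                break
            subtree, cur = result
            children.append(subtree)
        else:  # every child of this production matched
            return (symbol, children), cur
    return None

tree, end = parse("S", "the dog chased the ball".split(), 0)
print(tree)
# parse tree, as nested tuples:
#   ('S', [('NP', [('Det', 'the'), ('N', 'dog')]),
#          ('VP', [('V', 'chased'),
#                  ('NP', [('Det', 'the'), ('N', 'ball')])])])
```

A visual grammar works on the same principle, except the “words” are image regions and the rules describe how objects and scenes are composed.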

VCLA’s approach to computer vision is to define a visual grammar and to parse images and videos with it. Figure 2 is an example image and a possible parse tree.

Courtesy of VCLA

Parsing scenes and images in this way is central to the lab’s approach, reflected in VCLA’s logo: a diagram of a parse tree. Deep learning, with help from a large data set, could similarly label what is in the picture and what is going on, but the lab’s image-parsing technique encodes more information about the relationships between the entities, such as spectators and players in a soccer game. In theory, the image parsing appears better for understanding the scene, but what about in practice?

Last one standing

In 2011, VCLA’s image-parsing approach was put in direct competition with deep learning. The Defense Advanced Research Projects Agency, or DARPA, an agency within the U.S. Department of Defense responsible for funding scientific research, issued a challenge under its Mathematics of Sensing, Exploitation and Execution, or MSEE, program.

The task was to analyze several hours of video shot from different cameras and create a system that could answer human questions, such as “What are people doing?” and “How many people are standing at this time?”

DARPA accepted proposals from nine teams of researchers from universities such as Carnegie Mellon University, UC Berkeley and the Massachusetts Institute of Technology. Teams received grants of up to $6.23 million and were given four years to deliver such a system.

By the end, only VCLA remained in the competition and successfully delivered on its proposal; the other teams were either disqualified by DARPA for missing deadlines or dropped out voluntarily because the task proved too difficult.

The challenge illustrates a paradox in AI and computer science: what’s hard for computers is often easy for humans, and what’s easy for computers is often hard for humans. Identifying what people are doing in a video is an easy task even for a child; getting a computer to do it is an incredibly hard problem. A person who can rapidly multiply large numbers in their head might be considered a math genius, while a computer doing the same thing is just serving a basic function. And tasks intuitive to humans, such as common sense reasoning about physical properties of the world (e.g. “If I tilt this table, the cup of water on it may spill”) and recognizing intent from action (e.g. “She walked to the fridge because she was hungry”), remain unsolved problems for computers.

VCLA’s mission is to resolve these difficult problems through a unified, mathematically sound theory for all aspects of human intelligence, including reasoning and learning, that it can use to build intelligent systems.

“He is really the main person right now in the world who is expanding the ideas of computer vision to try to encompass the serious issues of interactions with artificial intelligence,” said David Mumford, a Fields Medal recipient and Zhu’s doctoral adviser from his time at Harvard University.

Gang Hua, a computer vision scientist at Microsoft Research, said he has been following Zhu’s work for years, and has been particularly impressed with VCLA’s deviation from the conventional approach to artificial intelligence.

“The most popular thing in the community may not be the most advanced thing,” Hua said. “I think his group has always been a little bit ahead of the game.”

Zhu’s colleagues, such as Mumford and Harvard collaborator Vahid Tarokh, said they agree with his critiques that deep learning does not generalize well, is dependent on having lots of data and is not well-understood. This view is somewhat controversial – given the success and popularity that deep learning has enjoyed in academia and industry – and puts Zhu outside the mainstream.

However, deep learning has still proven successful in a variety of applications, such as speech recognition, which have made their way into commercial products. Carey Nachenberg, an adjunct professor of computer science at UCLA who has given talks on campus about deep learning, said that he believes the neural network approach is still the most promising in achieving human-level cognition.

“I’d say deep learning is fairly well-understood,” Nachenberg said. “People understand how it works and it’s not a perfect model, but it’s a model of how real neurons work.”

Nachenberg also said that the amount of data required for the deep learning approach does in fact model human learning, because humans also take in huge amounts of visual input from birth and learn from large data sets.

Because of the momentum and widespread acceptance of deep learning, victories like the one VCLA had in the MSEE project provide significant institutional validation. The intelligent systems designed in Zhu’s lab are – in the perspective of the students involved – a step ahead.

“Most people are still working on detecting objects while we’re moving forward on determining intention,” said Yixin Zhu, a doctoral student and researcher in the lab.

Following the success of the MSEE project, DARPA approved another four-year, $5.23 million grant for VCLA focusing on human-robot collaboration work that will run through 2019. The Office of Naval Research has also awarded Zhu two Multidisciplinary University Research Initiative grants, totaling $20 million from 2010 to 2020.

As a result, he’s the principal investigator of nearly $35 million in joint research grants this decade, an almost unheard-of sum for a statistics professor at a public university. His research group now boasts more than 40 researchers, including undergraduates, graduate students and professors.

Despite the success, Nishant Shukla, a doctoral student and researcher in the lab, said he feels like their group is the underdog.

“Currently, industry is embracing neural networks for arbitrary tasks,” Shukla said. “It’s not that we’re totally against it, it’s just there are a lot of parameters learned in these neural networks – all these numbers being learned. And as a human you don’t care about those numbers.”

The ideas and methods from the lab’s research have not been widely adopted and have yet to gain the kind of widespread recognition that deep learning enjoys. Results using neural networks and deep learning are published by others to much fanfare, but Zhu said he sees those as ultimately short-term plays.

“What you heard from the news, (that) Google did this, Microsoft did that, Intel did that ... (those are) short-term things. They are using techniques that have been invented in universities 10 years ago, 20 years ago,” Zhu said. “But what we are doing is what will happen in 10 years, 20 years – the natural intelligence.”

Grants like DARPA’s provide him with more tangible tools to execute this long-term vision.

For its research, the lab employs virtual reality devices, tactile gloves and the same physics simulation software used by Disney for its animated films. The most eye-catching piece of technology, however, is Tony.

Tony the autonomous robot

Graphic Credit: Austin Yu

Getting Tony to do things is not easy. In the same way that common sense reasoning is difficult for computers, even state-of-the-art robots have difficulty accomplishing simple tasks. One of Tony’s most impressive accomplishments to date is opening the door to a mini-fridge, taking out a can of soda and handing it to a human.

Photo Credit: Austin Yu

Programming a robot to understand the mechanics that humans take for granted, such as knowing the appropriate amount of energy to use and the proper grip orientation necessary for opening a door, is a complex research project.

As limited as Tony might seem, he represents VCLA’s commitment to tackling the problem of integrating AI, vision and robotics by building a system that can view the real world, understand what it sees and act on that understanding. Tony takes the lab’s intensely abstract and theoretical goal – finding a unified framework and representation of human intelligence – and brings it to life. Moreover, Tony grounds the work in reality, forcing any theoretical model of intelligence that the lab decides on to be programmable down to robot actions.

One project that demonstrates the end-to-end nature and philosophy of the lab is Shukla’s research project, which aims to teach Tony how to learn from human demonstration. The task he started with is folding clothes.

In the lab, students were recorded folding shirts at different angles and in different ways. The video input was parsed and translated into graphical structures that encode knowledge about time, space and causation. After several demonstrations, the robot learned the concept of what it means to fold.

After being trained exclusively on videos of humans folding shirts, Tony was presented with a pair of pants to fold, a test of whether it could generalize what it had learned. Not only did the robot fold clothing it had never seen before, but it also chose to fold clothes in ways it hadn’t yet seen, based on the principles of folding it had learned.

This approach is also used in several of the lab’s other projects. The input is some form of human demonstration, which can be videos of humans folding clothes, shaking hands, choosing a seat or even exploring a virtual environment. The video input, using computer vision tools and the model developed by the lab, is parsed into a representation meant to be understood by humans graphically and by robots through code. Once the system successfully demonstrates understanding of the simple task, the task is generalized.
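As a rough sketch of what such a representation might look like – the node names, attributes and relations below are invented for illustration and are not VCLA’s actual data structures – a parsed demonstration could be stored as a graph of action nodes linked by temporal edges:

```python
# Hypothetical sketch of a parsed demonstration: action nodes
# linked by temporal ("then") edges. All names here are invented
# for illustration, not taken from the lab's model.
from dataclasses import dataclass, field

@dataclass
class Node:
    action: str    # e.g. "grasp", "fold"
    objects: tuple  # objects the action involves
    edges: list = field(default_factory=list)  # (relation, Node) pairs

    def then(self, nxt):
        """Add a temporal-ordering edge and return the next node,
        so calls can be chained."""
        self.edges.append(("then", nxt))
        return nxt

# A three-step folding demonstration encoded as a chain of actions
grasp = Node("grasp", ("shirt", "left_sleeve"))
fold = Node("fold", ("left_sleeve", "torso"))
release = Node("release", ("left_sleeve",))
grasp.then(fold).then(release)

def sequence(node):
    """Walk the temporal chain to recover the action sequence."""
    steps = [node.action]
    for relation, nxt in node.edges:
        if relation == "then":
            steps += sequence(nxt)
    return steps

print(sequence(grasp))  # ['grasp', 'fold', 'release']
```

A structure like this is readable by a human as a diagram and by a robot as executable steps, which is the dual role the lab’s representation is meant to play.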

Now that Tony is capable of learning how to fold clothes, Shukla intends to make it capable of manipulating objects more generally. The goal is for Tony to be intelligent enough to watch instructional cooking videos and replicate them.

The lab’s goals for Tony extend beyond being able to learn and complete tasks. Tianfu Wu, a research assistant professor of statistics at UCLA and a supervisor in the lab, said one of the other major projects in the lab is designing vision systems capable of observing “dark matter.”

In VCLA’s work, dark matter – a term borrowed from physics – refers to what cannot be directly observed but can be inferred to exist. Humans can look at videos or photos of other people and infer motivations, moods, social norms and plans – things that are not visible but exist. A simple example Zhu has used is a scene with a ketchup bottle placed upside down. The “dark matter” here is the implied goal of making it easier to squeeze ketchup out of the bottle.

Designing an intelligence capable of that kind of reasoning is ambitious.

“This has been something that, well, not many people have had the guts to try,” Mumford said.

Wu said neither Zhu nor his lab is interested in small problems. Steven Holtzen, a graduate student who has been with the lab for more than two years, said that’s what makes the lab both challenging and exciting to work in.

“One of the things that’s interesting about this lab is that it’s not afraid to ask really intense questions,” Holtzen said. “It means there’s a pretty high bar for what counts as meaningful here.”