Overview: Are the sceptics right? Limits and potentials of deep learning in robotics
June 23, 2016
John McCormac discusses his takeaways from the RSS 2016 Workshop ‘Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics’ and highlights interesting themes and topics from the discussion.
As a PhD student in the Dyson Robotics Lab at Imperial, I was delighted to be given the chance by my supervisors Professor Andy Davison and Dr Stefan Leutenegger to attend RSS 2016. The conference began with an extremely interesting workshop on the role of Deep Learning in robotics. This post is by no means an exhaustive overview of that workshop, but I hope to highlight some of the themes and topics of the discussion that I personally found most interesting.
Where are the sceptics?
Overall it was an extremely interesting workshop, but I (and others) started asking ‘Where are the sceptics?’ Of the 150-200 audience members and panellists, only about 5-10 raised their hands when, at the end of the day, Pieter Abbeel directly asked the audience how many were sceptics. Many of those in the sceptics’ camp, when questioned, still appeared to endorse deep learning as an important tool in the roboticist’s toolbox. One source of scepticism appeared to be the lack of guaranteed behaviour in particular situations (discussed at more length in the ‘Open problems’ section below), and closely tied to that is the lack of interpretable uncertainty measures in many areas of DL (some solutions to this issue are already being worked on).
The dividing line between camps was obviously a fuzzy one, and some attempts were made to clarify the meaning of the term ‘sceptic.’ Larry Jackel offered an interesting objective measure of a sceptic, based on a famous bet he won against Vladimir Vapnik (there were actually two bets, and he lost the other one): whether, in ten years, anybody in their right mind would be using neural nets. If anybody in the audience fell within this ‘sceptic’ camp, they did not openly voice that opinion.
A more expansive interpretation of the ‘non-sceptic’ camp came from Pieter when pressed by Oliver Brock. Pieter’s stance appeared to be that deep learning provides a new toolkit with the potential to be a stepping stone beyond simply the ‘next task’ towards a more general form of robotic intelligence. As a measure of deep learning’s success, he pointed in particular to the sheer number of breakthroughs in artificial intelligence within the past five years. Some worried that this phase of optimism was potentially another boom before a bust (think AI winter), but Pieter argued that a bust was much less likely now, given that large companies already use these tools directly to generate revenue (he specifically mentioned Google’s ad targeting system).
The prevailing mood was a pragmatic approach to DL: the roboticist’s task is to provide priors and model constraints that structure problems which can then be tackled with deep learning. Oliver Brock drew a comparison between the current state of deep learning in robotics and alchemy (he even described himself as a ‘proud alchemist’): everybody has their own black-magic methods of designing architectures. He argued that the field as a whole should be aware of this and should strive towards chemistry, or the ‘periodic table of robotics,’ and that researchers should keep in mind, and highlight in papers, the general applicability of their approach to other problem areas. Ashutosh Saxena’s vision of what this combination of networks and priors might look like in a robotic setting was a principled factor graph approach in which each factor can be a learned RNN. He believes that forming this canonical factor graph structure also allows people to share components of different subtasks, both through time for a single robot and between different robots.
Dieter Fox was very enthusiastically in favour of closely coupling deep architectures with prior models. He said he had been converted from sceptic to enthusiast by the robustness of techniques for hand pose estimation. He argued that although purely model-based approaches hold the promise of confidence bounds, in practical problems these are often very difficult to achieve, and the resulting algorithms are themselves very brittle. He highlighted ICP’s requirement of a good initialisation as an example of this brittleness in hand tracking, and said that single-frame detection with a CNN appeared much more robust, in this domain at least. He did not weigh in heavily on Ben Upcroft’s question of whether tracking was even relevant anymore or could be replaced entirely by detection, but instead focused again on the benefits of tighter coupling between model-based approaches and DL.
John Leonard began the introduction by saying that he had very much been a sceptic until five months earlier, when he saw the work Larry Jackel has been doing with NVIDIA on autonomous driving (raw pixels to control commands), and stated that one ‘should ignore deep learning at their peril’. He also highlighted the importance of the work that has already gone into traditional probabilistic inference in robotics, which he said could continue to play an important role in research to come.
Where is the training data?
Many talks touched upon the theme of where people get the training signal for these networks. A plethora of synthetic datasets were in use for various experiments and validations. For RL tasks, Pieter Abbeel highlighted OpenAI Gym as a platform for synthetic RL robotic challenges.
Experimentation on synthetic toy datasets before moving on to more complex simulations and real-world data was the approach favoured by many. The work on Progressive Neural Networks by Raia Hadsell from Google DeepMind was a perfect example of this progression. The work tackles continual learning by stacking columns of features learned on previous tasks and making them available to future tasks (with locked weights to ensure the network does not catastrophically forget), while adding new columns with capacity to learn attributes specific to the current task. The work began by training on Atari games such as Pong to explore and experiment with the architecture. It then progressed to a simulated 3D robotic arm control problem, because generating orders of magnitude more training data, and running large hyperparameter searches, is far easier in the synthetic setting than for a real arm. The features from this synthetically trained network were then forced to bridge the ‘reality gap’ (Raia’s term) to a real robotic arm task, where the system learned faster and performed better than a comparable system without the pretraining and PNN architecture.
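To make the column-stacking idea concrete, here is a minimal forward-pass sketch of a progressive network in Python with NumPy. It is my own illustration, not the DeepMind implementation: it assumes small fully connected columns, and the class name `ProgressiveNet`, the layer sizes, the lateral adapter matrices `U` and the initialisation scale are all hypothetical choices. Each new column receives its predecessors’ hidden activations through lateral connections, while earlier columns are frozen so that only the newest column would be handed to an optimiser (the training loop itself is omitted).

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


class ProgressiveNet:
    """Columns of small MLPs; column k reads the hidden layers of columns 0..k-1."""

    def __init__(self, sizes, seed=0):
        self.sizes = sizes                      # e.g. [obs_dim, hidden, hidden, n_actions]
        self.rng = np.random.default_rng(seed)
        self.columns = []

    def add_column(self):
        # Lock every existing column so learning the new task cannot overwrite old features.
        for col in self.columns:
            col["frozen"] = True
        n_prev = len(self.columns)
        W, U = [], []
        for l, (n_in, n_out) in enumerate(zip(self.sizes[:-1], self.sizes[1:])):
            W.append(self.rng.standard_normal((n_out, n_in)) * 0.1)
            # Lateral adapters from layer l of each earlier column (none from the raw input).
            U.append([self.rng.standard_normal((n_out, n_in)) * 0.1
                      for _ in range(n_prev)] if l > 0 else [])
        self.columns.append({"W": W, "U": U, "frozen": False})

    def forward(self, x, column=-1):
        acts = []  # acts[k][l]: activation of column k at layer l (layer 0 is the input)
        for col in self.columns:
            h = [np.asarray(x, dtype=float)]
            for l, W in enumerate(col["W"]):
                z = W @ h[l]
                for j, U_j in enumerate(col["U"][l]):
                    z = z + U_j @ acts[j][l]    # lateral connection from earlier column j
                last = l == len(col["W"]) - 1
                h.append(z if last else relu(z))
            acts.append(h)
        return acts[column][-1]

    def trainable_parameters(self):
        # Only the unfrozen (newest) column's weights would be given to an optimiser.
        return [p for col in self.columns if not col["frozen"]
                for p in col["W"] + [u for row in col["U"] for u in row]]
```

Because a frozen column never receives input from later columns, its outputs for an old task are unchanged when a new column is added, which is the sense in which this architecture cannot catastrophically forget.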
Work presented by Theophane Weber, also from DeepMind, on Attend, Infer, Repeat: Fast Scene Understanding with Generative Models aimed to model 2D and 3D scenes in an object-centric latent space using Variational Auto-Encoder style generative models. It too started with toy datasets (multi-MNIST) before moving to synthetic 3D scenes, and it has very recently been tested on real images with a corresponding 3D model.
The projects from Dieter Fox’s lab followed a similar approach. Work by Connor Schenck on modelling fluids within a scene began by training on images from a Blender fluid simulation before progressing to the real world. To move to the real-world setting, a very cool and clever method was used for collecting ground-truth labels: the fluid was heated and the scene was recorded with a thermal camera, giving a robust pixel-wise labelling of fluid vs. everything else in the scene.
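The thermal labelling trick can be sketched in a few lines of NumPy. This assumes (my assumption, not a detail from the talk) a static thermal camera and a background thermal frame captured before the heated fluid is poured; the function name `label_fluid` and the 5 degree threshold are hypothetical.

```python
import numpy as np


def label_fluid(thermal, background, min_rise_c=5.0):
    """Per-pixel labels from a thermal frame: 1 = heated fluid, 0 = everything else.

    thermal, background: 2-D arrays of temperatures in degrees C from the same
    static thermal camera; `background` is captured before the fluid is poured.
    min_rise_c: hypothetical threshold on the temperature rise above background.
    """
    rise = np.asarray(thermal, dtype=float) - np.asarray(background, dtype=float)
    return (rise > min_rise_c).astype(np.uint8)


# Toy example: a 3x3 scene at 20 C with heated fluid (45 C) in the centre column.
background = np.full((3, 3), 20.0)
thermal = background.copy()
thermal[:, 1] = 45.0
labels = label_fluid(thermal, background)
```

Because the labels live in the thermal camera’s image, they would still need to be mapped into the colour camera’s frame before serving as ground truth for a network that sees RGB images.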
Dieter also showed some very interesting work using DynamicFusion as a source of dynamic correspondences. He used this sort of training data to train siamese networks that map points on a 3D dynamic model of a person into an embedded feature space in which corresponding points lie close together, and then used the network’s embedding of a new video to fit the old DynamicFusion model to the new scene. He also stressed that in the robotic domain temporal information can provide an excellent supervisory signal for ‘free’; his example was the SE(3) physics prediction work by Arunkumar Byravan, which is trained on information from the future. The same approach was taken in the work presented by Ingmar Posner on End-to-End Tracking and Semantic Segmentation Using RNNs, where the future provided the training signal for tracking objects through occlusions. This paper went on to win the best paper award.
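One common way to train such a siamese embedding is the margin-based contrastive loss, sketched below with NumPy together with a nearest-neighbour lookup for transferring correspondences to a new frame. This is an assumption about the general technique, not a description of the loss Dieter’s group used; the function names and the margin value are hypothetical, and in practice the embeddings would come from the two weight-sharing branches of the network.

```python
import numpy as np


def contrastive_loss(emb_a, emb_b, is_match, margin=1.0):
    """Margin-based contrastive loss over N pairs of D-dimensional embeddings.

    emb_a, emb_b: (N, D) arrays, one row per pixel/point in each of two frames.
    is_match: (N,) array, 1 where the pair is a known correspondence, else 0.
    """
    is_match = np.asarray(is_match, dtype=float)
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    pull = is_match * d ** 2                                     # draw matches together
    push = (1.0 - is_match) * np.maximum(margin - d, 0.0) ** 2   # separate non-matches
    return 0.5 * float(np.mean(pull + push))


def nearest_model_point(query_emb, model_emb):
    """For each query embedding (N, D), index of the closest model embedding (M, D)."""
    d2 = ((query_emb[:, None, :] - model_emb[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```

Non-matching pairs already farther apart than `margin` contribute nothing, so training effort concentrates on correspondences the embedding still confuses.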
For large-scale data collection in the domain of autonomous cars, John Leonard was also excited about Tesla’s announcement that it had collected hundreds of millions of miles of driving data by recording camera and sensor inputs (as well as, I believe, the driver’s actions). It was this sort of data that NVIDIA and Larry Jackel used to produce their deep-learned ‘pixel-to-torque’ autonomous driving system.
Open problems for DL in robotics
One problem that was often touched upon was incorporating deep-learned systems into real-world robotic products, whether in domestic robotics, autonomous driving, or any other mass-produced robotic application. There was some heated debate on this. The majority of people (all of the panel: Dieter Fox, Pieter Abbeel, Raia Hadsell, Ashutosh Saxena, Walter Scheirer, Larry Jackel, and Oliver Brock) felt that, since enumerating every possible scenario in testing is impossible, some set of standard failure tolerances would have to be set (e.g. number of crashes per million miles). Scheirer drew an analogy to fault tolerances in a factory, where you rely on statistics in testing. Nicholas Roy (who called DL ‘hyperparameterised function approximators’ every time he spoke, which became something of a running joke at the workshop) was very outspoken on this matter and seemed to favour a more model-based approach, as taken in airplane engineering and safety. Many argued, however, that a similar approach would not appear feasible for automobiles, where the environment has the potential to be much more complicated, and that given the huge number of avoidable human fatalities, deploying these systems would possibly be of great benefit even without the same level of rigid testing.
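The ‘crashes per million miles’ style of tolerance raises a statistical question: how much failure-free testing is needed before such a claim is credible? The sketch below is my illustration rather than anything proposed at the workshop. It assumes crashes follow a Poisson process with a constant rate; under that assumption, observing zero crashes in n miles bounds the rate below -ln(1 - c)/n with confidence c. The function names are hypothetical.

```python
import math


def crash_free_miles_needed(max_crashes_per_million, confidence=0.95):
    """Miles that must be driven with zero crashes to claim, at the given confidence,
    that the true crash rate is below `max_crashes_per_million` (Poisson model)."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    rate_per_mile = max_crashes_per_million / 1e6
    return -math.log(1.0 - confidence) / rate_per_mile


def rate_upper_bound_per_million(crash_free_miles, confidence=0.95):
    """Upper confidence bound on crashes per million miles after zero crashes in
    `crash_free_miles` miles (the inverse of the function above)."""
    return -math.log(1.0 - confidence) / crash_free_miles * 1e6


miles = crash_free_miles_needed(1.0)   # about 3.0 million miles at 95% confidence
```

For example, demonstrating fewer than one crash per million miles at 95% confidence requires roughly three million crash-free miles (the ‘rule of three’), and any crashes observed along the way push the requirement higher.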
Walter Scheirer’s talk, which was all about the fragility of CNNs, was an important data point on the question of robustness. I have often encountered this in my own research: for example, I give a CNN a picture containing a cat and a dog, and it returns that the image most likely shows a hair dryer. The network was, of course, trained for single-label image classification, but this lack of robustness can sometimes be quite disappointing. Scheirer borrowed ideas from psychophysics to evaluate neural networks, testing how robust they were to image blurring and injected occlusions. Many state-of-the-art networks produced terrible results under degradations that did not seem severe from a human perspective, and he therefore argued that much of the hype about ‘super-human’ performance really comes from bad metrics that fail to evaluate the robustness of the algorithms. Pieter highlighted this issue as an important next step for the practical usability of DL: a robust switch that can flag when a network is no longer equipped to deal with the current environmental ‘distribution’, which he argued was likely to arrive within the next five years.
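A psychophysics-style evaluation amounts to sweeping a degradation parameter and recording accuracy at each level, much like a psychometric curve. The sketch below assumes 2-D greyscale images as NumPy arrays and an arbitrary classifier callable; the box blur (used instead of a Gaussian to stay NumPy-only), the square occluder, the toy classifier and all function names are my own illustrative choices, not Scheirer’s protocol.

```python
import numpy as np


def box_blur(img, radius):
    """Mean-filter a 2-D greyscale image with a (2r+1)x(2r+1) window, edge-padded."""
    img = np.asarray(img, dtype=float)
    if radius == 0:
        return img.copy()
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)


def occlude(img, frac, rng):
    """Black out a randomly placed square covering roughly `frac` of the image area."""
    out = np.asarray(img, dtype=float).copy()
    h, w = out.shape
    side = min(int(round(np.sqrt(frac * h * w))), h, w)
    if side > 0:
        y = rng.integers(0, h - side + 1)
        x = rng.integers(0, w - side + 1)
        out[y:y + side, x:x + side] = 0.0
    return out


def robustness_curve(classify, images, labels, levels, degrade):
    """Accuracy of `classify` at each degradation level (a psychometric-style curve)."""
    curve = []
    for level in levels:
        correct = [classify(degrade(img, level)) == y for img, y in zip(images, labels)]
        curve.append(sum(correct) / len(correct))
    return curve


def toy_classify(im):
    # Hypothetical stand-in for a CNN: "bright" (1) if mean intensity exceeds 0.5.
    return int(im.mean() > 0.5)


images, labels = [np.ones((8, 8)), np.zeros((8, 8))], [1, 0]
rng = np.random.default_rng(0)
blur_curve = robustness_curve(toy_classify, images, labels, [0, 1, 2], box_blur)
occ_curve = robustness_curve(toy_classify, images, labels, [0.0, 0.9],
                             lambda im, f: occlude(im, f, rng))
```

Plotting the returned accuracies against blur radius or occluded fraction gives a curve on which two networks with identical clean-image accuracy can still differ sharply, something a single headline accuracy figure hides.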
Among other open problems, Pieter highlighted an interesting one in reinforcement learning: the hierarchical nature of learning combined with sparse rewards. If your agent spends a long time searching the action space and finally figures out how to walk, only to walk into something dangerous or off a cliff, the resulting negative reward would affect the whole hierarchy of actions beneath it: do you give up on walking entirely?
Some other problems highlighted in the panel discussion were dealt with by quite funny one-word retorts. One person questioned whether these systems could surpass human-level performance if they were trained primarily on human-guided data, to which Oliver Brock replied ‘AlphaGo.’ Another asked whether online training was an important area of research, and whether we should care that a network’s weights are trained offline and may never learn more as they move on. Raia responded with ‘Yes.’ She did then elaborate that this is the core challenge of much of the RL research they are working on at Google DeepMind.
The optimistic future for DL in robotics
Pieter ended with a ‘crazy’ thought that paints a very exciting picture of the future of deep learning in robotics. He argued that our brains are not significantly more evolved than when the primary tasks demanded of them were simply to find food, make shelter, etc., and that many of the amazing advances in intelligence possibly just ‘fell out’ of this task set. This thought underlined a sense of optimism that deep learning still has a lot of mileage in robotics, and that the solutions found there could have much greater implications for artificial intelligence in general.