This post is by Daniel Shank, a Senior Data Scientist at Talla. He recently gave a talk at The Machine Learning Conference in San Francisco on Neural Turing Machines. A recording and full transcript for his talk can be found below.
Full transcript and slides for his talk:
Hi everyone, I’m Daniel Shank, a Senior Data Scientist at Talla, and I’m here to talk about an exciting new machine learning architecture called Neural Turing Machines.
First, I'm going to give a very general overview of what the architecture is, then discuss why it's important and how it's relevant to the future of machine learning, and then talk about why we don't use these things everywhere right now: if they work so well, why aren't we putting them into production immediately? Finally, I'm going to talk about a follow-up architecture to Neural Turing Machines that was published in Nature quite recently and provides some exciting extensions to this model.
What are Neural Turing Machines, and why do we care about them? Well, in order to explain why they're so important, we first have to explain what a normal Turing machine is.
A Turing machine is a simple model of a computer. Like modern computers, it encapsulates the idea of having an external memory as well as some sort of processor. Essentially, a Turing machine consists of a tape with symbols written on it and a device that can read and move up and down the tape. Based on what it reads on the tape, it can decide to move in a different direction, write a new symbol, erase a symbol, and so on.
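To make those mechanics concrete, here is a minimal Turing machine sketch in Python. The transition table and the bit-flipping example are hypothetical, purely for illustration; they are not from the talk.

```python
# A minimal Turing machine: the transition table maps
# (state, symbol) -> (new_state, symbol_to_write, move_direction).

def run_turing_machine(tape, transitions, state="start", pos=0, max_steps=100):
    """Run until the machine halts (no rule applies) or max_steps is reached."""
    tape = dict(enumerate(tape))  # sparse tape: position -> symbol
    for _ in range(max_steps):
        symbol = tape.get(pos, "_")  # "_" stands for the blank symbol
        if (state, symbol) not in transitions:
            break  # halt: no rule applies
        state, write, move = transitions[(state, symbol)]
        tape[pos] = write
        pos += 1 if move == "R" else -1
    return "".join(tape[i] for i in sorted(tape))

# Example machine: flip every bit, moving right until the blank symbol.
flip = {
    ("start", "0"): ("start", "1", "R"),
    ("start", "1"): ("start", "0", "R"),
}
print(run_turing_machine("1011", flip))  # -> "0100"
```

The key point for what follows: every decision here (which symbol to write, which way to move) is a hard, discrete choice.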
What's a Neural Turing Machine? In short, a Neural Turing Machine is a kind of neural network, but it draws inspiration from the architecture of a Turing machine to try to learn tasks that computers perform very well but that machine learning models don't. Essentially, it consists of a neural network controller, the analog of the device that reads the tape (the processor, if you will), as well as an external memory that is persistent across all the input that it reads. Like an LSTM or other related models, it's a recurrent neural network, which means it can read variable-length input, and, unlike a Turing machine, it can be fed continuous input and provide output as it goes.
The key idea here is that Neural Turing Machines are basically differentiable Turing machines, and this is important because many of the algorithms we run every day on a computer are actually very hard for a neural network to learn, because computers work in absolutes. They work in zeros and ones, in either/or, in integers. Most neural network models, and most machine learning in general, don't do that. They work with real numbers and with smooth curves, and that makes them a great deal easier to train, because when you see the output they provide, you can easily trace back how to improve them by tweaking the parameters. This is very hard when you have a sharp function like an XOR or an AND gate in a CPU. Neural Turing Machines take all of the functions of a basic Turing machine and find smooth analogues. So essentially, instead of moving left or right on a tape, they might decide to move slightly left or slightly right, and this lets you do some pretty incredible things.
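The "move slightly left or slightly right" idea can be sketched as a soft shift: the head keeps a probability distribution over memory positions and moves it by circularly convolving it with a distribution over shift offsets, roughly as in the NTM addressing mechanism. The weights below are made up for illustration.

```python
import numpy as np

def soft_shift(weights, shift_dist):
    """Circularly convolve an attention distribution over memory slots with
    a distribution over shift offsets (-1, 0, +1). Everything stays smooth,
    so gradients can flow through the 'move' decision."""
    out = np.zeros(len(weights))
    for offset, s in zip((-1, 0, 1), shift_dist):
        out += s * np.roll(weights, offset)
    return out

# A head focused on slot 2 that "mostly" shifts right:
w = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
print(soft_shift(w, [0.0, 0.1, 0.9]))  # most of the mass moves to slot 3
```

Instead of jumping to exactly one slot, 90% of the attention lands one slot to the right and 10% stays put, which is the smooth analogue of a discrete head move.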
On the list of things that they can perform, we have some exciting ones, as well as some not immediately exciting ones …
… But the short of it is that they can learn simple algorithms. Consider taking input and copying it. This seems really mundane, but it's actually something that's very hard for a conventional neural network to do, because doing it reliably requires learning an algorithm. Neural Turing Machines can take inputs and outputs and learn algorithms that map from one to the other. This is quite exciting because it is essentially a step toward replacing a programmer. We're not there yet, but they're really cool. It means that once they have learned an algorithm, they can take a given input and extrapolate, based on that algorithm, to input of any length. You'll see in a second why that's cool. They're also very good at language modeling. Those of you not familiar with language modeling may know it as autocomplete: guessing a word from its context in a sentence or document. And they also perform quite promisingly on Facebook's bAbI dataset, which is designed to encourage researchers to improve the general cognitive inference capabilities of neural networks.
This is an example of a Neural Turing Machine performing a copy-and-repeat task. What it has learned to do is to take a relatively short sequence and repeat it some number of times. The targets are up above, and the outputs are below; as you can see, it eventually starts to make a mistake, but overall it looks pretty good.
In contrast, compare this to an LSTM, which is actually a very powerful neural network model: it tends to fall over pretty quickly. The reason is that it is learning something, but not exactly an algorithm. It tries to attack the whole problem at once, so it doesn't realize that the same thing it did the first two times is what it's supposed to do every time after that.
A fun example of something they can do is that they can actually recognize balanced parentheses. This is particularly fun because it's an algorithm that involves using a stack: essentially, you keep track of left parentheses as they come in and then try to match them up with the right parentheses. This is something that a neural network might be able to do, but it would accomplish it in a more statistical manner; Neural Turing Machines can actually learn to do it much the way a human programmer might.
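For reference, here is the classic stack-based check that a Neural Turing Machine effectively learns to imitate (a plain Python sketch, not from the talk):

```python
def balanced(s):
    """Check balanced parentheses with an explicit stack:
    push on '(', pop on ')', and require an empty stack at the end."""
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            if not stack:
                return False  # a ')' with no matching '('
            stack.pop()
    return not stack  # leftover '(' means unbalanced

print(balanced("(()())"))  # True
print(balanced("(()"))     # False
```

The interesting part is that the NTM's external memory can play the role of the stack, rather than the network having to approximate this behavior statistically.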
Alright, we're going to get to the bAbI dataset and why we care so much. The bAbI dataset is essentially a series of stories followed by questions, and the questions are all designed to require some sort of common-sense inference ability to answer. The example that I have here is a simple one where you're trying to reason about position. The key insight here is that if you're asking where someone is, they're where they last went. It seems very obvious, but it can actually get a bit more complicated, involving people picking up objects or moving around between rooms. The stories also involve reasoning about relative sizes of objects and so on. The idea behind this is that a system that could perform well at all these different kinds of tasks would get close to being a more general knowledge inference system, which is what everyone has wanted to design, basically, forever.
These are just some results on bAbI. A note here: this is essentially bAbI with training wheels. Given each of the stories, like the one I showed you earlier, the training set actually gives a hint as to what in the story is important for answering the question. It turns out that this is what makes the problem very difficult: not only do you have to make logical inferences, you have to realize what is relevant. And you can see that Neural Turing Machines are actually very competitive here, with many subtasks of bAbI reaching over ninety-five percent accuracy, which is considered the benchmark for passing a given area.
They can do all these cool things: they can learn algorithms, they can figure out what a small story means, but why don't we use them all the time? Well, they come with a couple of problems, precisely because their architecture is so unusual.
First off, once you've specified the general architecture, there are still a lot of decisions to make when implementing it. For instance, how many vectors do you read and write at a given time step for every given unit of input? These decisions are very important; it's not simply a matter of improving your accuracy. If you fail to get them right, it's very likely that the model will never reach a reasonable result. The number of parameters is quite large, which puts a great deal of stress on your RAM, for one thing, and they don't benefit so much from GPU acceleration, which, as we just saw, is a very important part of machine learning. The reason is that they are so sequential: they're harder to parallelize because everything they do is based on the previous input, so it's harder to break the computation into easily distributed pieces. And they're hard to train, which is a bit different from the sum of everything I've just mentioned.
They tend to be very numerically unstable, partly because of the tasks they're designed to perform. Because they're learning algorithms, they don't tend to make small mistakes; they make very large ones. If you make a mistake in an algorithm, all of your output will be incorrect. What that means is that, when you're training them, they have a hard time finding the necessary algorithm. Most neural networks, if you feed them enough data and give them enough time, come up with some kind of answer; Neural Turing Machines frequently get stuck. They'll often learn, for instance, to just reproduce one value that they repeat over and over and over again. And this is because using memory is hard. Not only do they have to learn to remember what's necessary for solving something later on, they have to learn not to accidentally forget it, which is an extra layer of complexity. So, in order to get around this, you end up using some smart optimization techniques, which are common for recurrent neural networks, but you sort of have to throw the kitchen sink at them to get them to work. All this adds up to them being kind of hard to use on an everyday basis.
There are ways to combat numerical instability. One very common trick, especially with LSTMs, is gradient clipping. Gradient clipping essentially says that no matter how responsible we think a given parameter is for a bad result, we limit the degree to which we change it. This helps prevent a single bad result from wiping out our parameters: when the machine makes a mistake, we don't completely throw out everything that we've learned.
Loss clipping is essentially an extension of gradient clipping. It's the same basic idea: Neural Turing Machines can be very far from their objective, so we put a cap on how much we change the parameters in total, on what the total loss can be. We frequently combine these two methods. Essentially, we need to bound the size of changes to the parameters in several different senses.
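A rough sketch of the two clipping ideas, using NumPy; the threshold values here are illustrative, not values from the talk:

```python
import numpy as np

def clip_gradients(grads, max_norm=10.0):
    """Global-norm gradient clipping: if the overall gradient norm exceeds
    max_norm, rescale every gradient so one bad batch cannot wipe out the
    parameters."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads

def clip_loss(loss, max_loss=100.0):
    """Loss clipping: cap the loss itself so an extreme error produces a
    bounded update rather than a catastrophic one."""
    return min(loss, max_loss)

big = [np.array([30.0, 40.0])]   # gradient with norm 50
print(clip_gradients(big)[0])    # rescaled down to norm 10 -> [6. 8.]
print(clip_loss(1e6))            # capped at 100.0
```

Note the direction is preserved in both cases; only the magnitude of the change is bounded, which matches the "don't throw out everything we've learned" intuition.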
Alright, another interesting one is Graves' RMSprop. RMSprop is an extension of the common backpropagation training setup, which you may be familiar with; it's the key to training all neural networks these days. RMSprop is a system for smoothing out gradients: essentially, over a sequence of data points, you take a kind of running average of the effect on any given parameter. Graves' RMSprop is a variation of this where, instead of just taking an average, it keeps a running estimate of the variance. It normalizes the update so that extreme values of the loss don't decay the parameters too much. It's a very smart thing to do, and it's also particularly interesting because, in order to find this out, people essentially had to dig through Alex Graves' papers. If you're just implementing normal RMSprop, it usually doesn't work very well. Thankfully, there are some alternatives.
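A minimal sketch of the update, assuming the formulation from Graves' 2013 paper on generating sequences with RNNs: keep running estimates of both the gradient and the squared gradient, and divide by the resulting variance estimate. The hyperparameter values below are illustrative.

```python
import numpy as np

def graves_rmsprop_step(param, grad, state, lr=1e-4, rho=0.95,
                        momentum=0.9, eps=1e-4):
    """One Graves-style RMSprop update. `state` holds the running E[g^2],
    the running E[g], and the previous momentum-carrying delta."""
    n, g_avg, delta = state
    n = rho * n + (1 - rho) * grad ** 2        # running estimate of E[g^2]
    g_avg = rho * g_avg + (1 - rho) * grad     # running estimate of E[g]
    # n - g_avg**2 approximates the gradient's variance, so extreme,
    # high-variance gradients get scaled down rather than blowing up.
    delta = momentum * delta - lr * grad / np.sqrt(n - g_avg ** 2 + eps)
    return param + delta, (n, g_avg, delta)

p = np.zeros(3)
state = (np.zeros(3), np.zeros(3), np.zeros(3))
p, state = graves_rmsprop_step(p, np.array([1.0, -1.0, 0.5]), state)
print(p)  # each parameter moves opposite to its gradient's sign
```

The variance-normalization is the difference from plain RMSprop, which divides by the running mean of squared gradients alone.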
The Adam optimizer is common and bundled with most machine learning frameworks these days, though honestly, you frequently have to try multiple algorithms to figure out which one will help you out of your rut. Like Graves' RMSprop, it basically smooths the gradients. It's a little more complicated, so I won't address exactly how it works here, but in general, it's a good fallback.
Another thing is paying attention to how you initialize values, particularly the memory. Some people have actually taken advantage of the bias the initial memory provides to teach the Neural Turing Machine to learn more quickly. But in general, this is a problematic feature of the architecture: because what starts out in memory biases the rest of the calculations, bad starting values can completely sink the model. Many of the other techniques here just help performance in the general neural network sense, but in this case, if you don't do it right, the model very likely won't converge and you won't get a reasonable result, so you frequently have to try different starting values in order to optimize these properly.
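A sketch of two common starting points for the memory matrix; the function name and the specific values are hypothetical, for illustration only:

```python
import numpy as np

def init_memory(n_slots, width, scheme="constant", seed=0):
    """Initialize an NTM-style memory matrix.

    'constant': a tiny uniform value, keeping early content-based reads
    near-uniform and unbiased. 'random': a noisy start, which can bias
    the model's first reads and, if the draw is bad, sink training."""
    if scheme == "constant":
        return np.full((n_slots, width), 1e-6)
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.5, size=(n_slots, width))

M = init_memory(128, 20)
print(M.shape)  # (128, 20)
```

The point is that, unlike weight initialization in an ordinary network, the memory's starting contents are read directly by the controller at the first time step, so they act like an extra input.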
Another trick, and it's a fun one, is curriculum learning, which is just feeding in easier data first. You start with sequences of up to length, let's say, five. When the model does well at that, you feed it length 10, and at some point you reach, let's say, 20. At that point you can be fairly confident that it will generalize and, as we saw earlier, copy sequences of length 100 or even larger.
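The curriculum schedule can be sketched as a simple loop; the lengths, threshold, and the `train_step`/`eval_accuracy` helpers below are hypothetical stand-ins for whatever training harness you're using:

```python
import random

def curriculum(train_step, eval_accuracy, lengths=(5, 10, 20),
               threshold=0.95, steps_per_round=1000):
    """Feed short copy sequences first, advancing to longer sequences
    only once accuracy at the current length clears the threshold."""
    for length in lengths:
        while True:
            for _ in range(steps_per_round):
                # generate a random copy-task sequence of at most `length`
                seq = [random.randint(0, 7)
                       for _ in range(random.randint(1, length))]
                train_step(seq)
            if eval_accuracy(length) >= threshold:
                break  # graduate to the next length
```

One design note: the evaluation gate matters; simply ramping length on a fixed timetable tends to move on before the model has actually learned the shorter case.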
Now I’m going to move on to talking about a recent extension. Very exciting; it came out in Nature a couple weeks ago, if I recall correctly. It extends neural Turing machines in some interesting ways and helps get around some of these difficulties.
Differentiable neural computers, the more recent model, are essentially Neural Turing Machines with a couple of modifications. They dropped the index-shift-based addressing I was talking about, the moving up and down through the memory or traversing a tape. Instead, they search directly for a given vector in the memory based on something they're currently seeing. They also have the ability to allocate and deallocate memory, which, if you've ever worked with a lower-level programming language, should make sense. In the same way that it's difficult, when programming, to keep track of regions of memory so you don't make mistakes, it makes things a lot easier to be able to flag certain regions of the memory as off-limits so that you don't accidentally delete them later, and this helps with the optimization. They also have the ability to natively remember sequences of input. Neural Turing Machines, when they're learning to copy, learn processes by which they write all their input in order into memory and then go back to the beginning and read it out. Differentiable neural computers instead have a sort of temporal memory where they can think back to the last thing they did, and the thing before that, and the thing before that, which means they can essentially walk back through a linked list of everything they need to do.
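The content-based lookup that replaces index shifting can be sketched as cosine similarity against a key vector, followed by a sharpened softmax, roughly as in the DNC paper; the sharpness value below is illustrative.

```python
import numpy as np

def content_address(memory, key, beta=5.0):
    """Score each memory slot by cosine similarity to a key vector, then
    turn the scores into attention weights with a softmax sharpened by
    beta. Higher beta focuses harder on the best-matching slot."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    scores = memory @ key / norms          # cosine similarity per slot
    e = np.exp(beta * scores)
    return e / e.sum()                     # attention distribution

M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.7, 0.7]])
w = content_address(M, np.array([1.0, 0.0]))
print(w.argmax())  # slot 0 matches the key best
```

Because the result is a smooth distribution rather than a hard index, the model can be trained by gradient descent to "search for a given vector" exactly as described above.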
This is a summary of the differentiable neural computer architecture. On the left are diagrams standing in for the variable-length input and output, and the read and write heads are what allow it to pull from the memory; here, it's pulling in two read vectors at a time and writing one. The memory is as I was describing, with the addition of the temporal links along the right. If you trace those arrows backwards, you'll see how the model can learn to walk back through and repeat things that it has recently either read or written.
(Audience question) Not so much. That's actually kind of the reason this is hard. Essentially, if you're looking at the parameters for a given algorithm, what you'll find is that nearby parameters, if you think of it as a big space, are not necessarily going to give you similar algorithms; they'll just be junk. The models kind of have to learn algorithms in pieces. For instance, they'll learn that it's helpful to remember everything they see as a first step toward learning to copy something; then they learn that outputting it in various ways is helpful, which gets closer; then they learn that outputting it exactly is what's wanted. So it's kind of hard to talk about confidences. Good question.
(Audience question) Alright, so the question was, "Are there confidences on the algorithms that are learned?" which, if I understand correctly, is essentially asking: can you say this is the correct algorithm with some probability? Not so much; I said no, basically, but yeah…
Here we have a diagram that shows a differentiable neural computer in operation, also performing a copy task. The green bars are when it's writing into its memory, and the pink are when it's reading out. Reading along the left (this is time, as you feed in a sequence), what's happening is that it frees up an area of memory because it's going to need to write to it, and then allocates it; the bars at the bottom show that activity. Then, when it's done and actually starts to write out, it says, "We're done; we can free this." You get this alternating pattern of "I need this memory. Okay, done with this memory. I need this memory…" and it actually learns to do that, which makes things a great deal easier and is part of what makes these so exciting.
An example of something that sets them apart a bit is logical inference tasks combined with a sort of fuzzy reasoning. What you see here is a family tree, and they feed the model segments of the tree, essentially saying, "Mary is the daughter of Jodie," or "Simon is Steve's son." Then they ask, "How is Freya related to Fergus?" (or "How is Fergus related to Freya?"). The answer is maternal great-uncle, so they're able to fill in the blanks here. There are some interesting examples you can see if you look at the paper, such as learning to find shortest paths in graphs by traversing the London subway system and things like that. I don't have a slide for that here, but if you're interested in that stuff, I highly recommend you go take a look: http://www.nature.com/nature/journal/v538/n7626/abs/nature20101.html
They also do very well at bAbI. The field has come a great deal farther on bAbI in the past year or so; most state-of-the-art models are now passing all but a handful of areas, and these are doing very well. What you'll notice here is that they're passing all but two areas, and they're competitive with some of the models coming out of Facebook, which were designed for this particular dataset (it's their dataset). This is very exciting; as you can tell, the remaining failure areas have been whittled down to only a few, and this is a great leap forward in getting neural networks to learn tasks that we've been unable to touch before.
Thank you, everyone. There are some references here; I can only really speak to the Theano, Lasagne, and Go implementations. I would recommend the last paper on the list, which is a way of getting these things to work using a great deal less memory, which is getting closer to actually using them in production.