Most Common Neural Net PyTorch Mistakes

In mid-2018, Andrej Karpathy, director of AI at Tesla, packed quite a bit of sage PyTorch wisdom into a 279-character tweet.

most common neural net mistakes: 1) you didn’t try to overfit a single batch first. 2) you forgot to toggle train/eval mode for the net. 3) you forgot to .zero_grad() (in pytorch) before .backward(). 4) you passed softmaxed outputs to a loss that expects raw logits. ; others? 🙂

This post goes point by point to see how these mistakes can manifest in a PyTorch code sample. You can follow along by checking out the code in this GitHub repository:

Tools used in this write-up

  • PyTorch – the open source deep learning framework by Facebook. It’s spreading like wildfire in academia, and it’s so brilliant that TensorFlow switched to eager execution by default in version 2, so it looks and feels more like PyTorch code.
  • MissingLink – a deep learning platform. We’ll only be using the experiment management aspect. When the training script runs, the web dashboard is used to visualize our metrics and progress.

Common mistake #1: you didn’t try to overfit a single batch first.

Andrej says we should overfit a single batch. Why? Well, when you overfit a single batch – you’re making sure the architecture works. I’ve wasted HOURS training on a giant dataset, just to find out it’s only 50% accurate because of a minor bug. The results you’ll get are a good guess for the optimal performance of your architecture when it perfectly memorizes the input.

Maybe that optimal performance is zero, because an exception gets thrown mid-way through. But that’s fine because we find it out quickly and fix it. To summarize, here’s why you should start out by over-fitting on a small subset of your dataset:

  • Uncover silly bugs
  • Estimate the best possible loss/accuracy of current architecture
  • Fast iteration to improve the aforementioned.

In PyTorch datasets, you’re usually iterating over a data loader. Your first attempt might be to index the train_loader.

But you’ll immediately see an error because DataLoaders want to support network streaming and other scenarios in which indexing might not make sense. So they don’t have the __getitem__ method which makes the [0] operation fail. Your next try might be to convert the loader to a list that does support indexing.
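Both attempts can be sketched as follows. This is a minimal illustration, assuming a small synthetic TensorDataset in place of the real training set; the dataset, batch size, and variable names are made up for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for a real training set: 100 samples, 4 features each.
dataset = TensorDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))
train_loader = DataLoader(dataset, batch_size=10)

# Attempt 1: index the loader directly. DataLoader has no __getitem__,
# so Python raises a TypeError.
try:
    train_loader[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

# Attempt 2: convert to a list first. Indexing now works, but list()
# walks the whole loader and keeps every batch in memory.
all_batches = list(train_loader)
first_batch = all_batches[0]
num_batches_loaded = len(all_batches)
```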

But that means you’re going to evaluate the entire dataset which is going to burn your time and memory. So what else can we try?

In a Python for-loop, when you type this:

You effectively get this:

Python calls the “iter” function to create an iterator, and then calls “next” on it repeatedly to get each item, until a StopIteration is raised and the loop is done. So in that loop we’ll just call next, next, next… To emulate that behavior but just get the first item we could use this:
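The expansion described above can be sketched in plain Python. The list `items` is just a placeholder for any iterable.

```python
items = ["a", "b", "c"]

# What you type:
seen_by_for = []
for item in items:
    seen_by_for.append(item)

# Roughly what Python does for you:
seen_by_hand = []
iterator = iter(items)
while True:
    try:
        item = next(iterator)
    except StopIteration:
        break  # the iterator is exhausted, so the loop ends
    seen_by_hand.append(item)

# The same machinery, but asking for the first item only:
first = next(iter(items))
```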

We call “iter” to get an iterator, but we only call the “next” function ONCE. Notice I assign the next-iter result into a variable called “first” for clarity. I call this the “next-iter” trick. In the following snippet you can see the complete example with a train data loader:
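Applied to a data loader, the trick might look like this. It is a sketch under the assumption of a small synthetic TensorDataset; the real repository uses its own dataset and loader settings.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the real training data.
dataset = TensorDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))
train_loader = DataLoader(dataset, batch_size=10, shuffle=True)

# Only one batch is ever loaded; the rest of the dataset is untouched.
first_batch = next(iter(train_loader))
inputs, targets = first_batch
```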

This is how to modify the loop to utilize the next-iter trick:

You can see above and in line 204 how I repeat the first_batch 50 times to make sure I over-fit to a specific degree. In the repo you’ll notice I’ve left a few different levels of over-fitting (e.g. using the same data for train and test, or testing on the real test data) commented out so you can try out the various degrees and inspect the results.
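A self-contained sketch of such a loop is shown here. The tiny model, the random data, and the Adam settings are assumptions for illustration, not the code from the repository; the point is only that the loop iterates over the same batch 50 times instead of over the loader.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical data and model, small enough to run in a second.
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 3, (256,)))
train_loader = DataLoader(dataset, batch_size=16)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

first_batch = next(iter(train_loader))

losses = []
model.train()
# Instead of `enumerate(train_loader)`, loop over one batch repeated 50 times.
for batch_idx, (data, target) in enumerate([first_batch] * 50):
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

If the loss on that single batch does not go down, something in the pipeline is likely broken.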

Common mistake #2: you forgot to toggle train/eval mode for the net.

Why does PyTorch care when we’re training the model versus when we’re evaluating it? The biggest reason is drop-out. This technique randomly removes neurons during training. The idea is that this removes crutches the network is relying on.
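The difference between the two modes can be seen on a bare nn.Dropout layer. This is a small sketch with made-up input values: in train mode roughly half of the entries are zeroed and the survivors are scaled by 1/(1-p), while in eval mode the layer passes its input through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # entries are either 0 or 1 / (1 - 0.5) = 2

dropout.eval()
eval_out = dropout(x)   # identity: nothing is dropped

fraction_dropped = (train_out == 0).float().mean().item()
```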

Imagine if the red neurons on the right were the only ones contributing to correct results. Once we remove the red neurons, it forces the other neurons to train and learn how to be accurate without the reds. This drop-out improves performance on the eventual test – but it negatively affects performance during training because the network is limping. Keep this point in mind as we run the script and look at the accuracy on the MissingLink dashboard.

In this specific example there seems to be a drop in accuracy about every 50 iterations.

If we inspect the code – we see we do set the training mode immediately in the train function at line 125.

The problem is a bit tricky to notice. On line 148 we call the test function.

Inside the test function we set the mode to eval! That means that if we hit the test function during training – we will be stuck in eval mode until the next time the train function is called during the next epoch. As a result, drop-out is only active for a short stretch at the start of every epoch, which causes the performance dip we saw in the chart.

The fix is easy – we move the model.train() one line down and into the training loop. The ideal is to have the mode set as close as possible to the inference step, to avoid forgetting to set it. With the fix, our chart looks a bit more reasonable, without the spikes. Notice how training accuracy is lower than validation accuracy because drop-out is taking place.
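The shape of the fix can be sketched like this. The functions `evaluate` and `train_epoch`, the model, and the `test_every` parameter are hypothetical names for this example and do not mirror the repository line by line.

```python
import torch
from torch import nn

def evaluate(model, data):
    model.eval()  # switches dropout off; the mode sticks after we return
    with torch.no_grad():
        return model(data)

def train_epoch(model, optimizer, batches, test_every=2):
    modes_seen = []
    for batch_idx, (data, target) in enumerate(batches):
        model.train()  # set the mode right before the forward pass
        modes_seen.append(model.training)
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()
        if batch_idx % test_every == 0:
            evaluate(model, data)  # leaves the model in eval mode
    return modes_seen

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(0.5), nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(5, 4), torch.randint(0, 2, (5,))) for _ in range(6)]
modes_seen = train_epoch(model, optimizer, batches)
```

Because model.train() runs on every iteration, a call to evaluate in the middle of the epoch can no longer leave later batches training in eval mode.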

Common mistake #3: you forgot to .zero_grad() (in pytorch) before .backward()

When calling “backward” on the “loss” tensor, you’re telling PyTorch to go back up the graph from the loss, and calculate how each weight affects the loss. That’s the gradient for each node of the computational graph. Using this gradient we can update the weights: it tells us in which direction, and roughly how much, to increase or decrease every weight in the graph.

This is what it looks like in PyTorch code. The “step” method there at the end will update the weights based on results from the “backward” step. What might not be obvious from this code is that “backward” adds to the stored gradients instead of overwriting them. If we keep doing this over many batches, the accumulated gradient will explode and the step we take will keep growing and growing.
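The accumulation is easy to see on a single scalar weight. In this toy sketch (the values are made up) the gradient of 3 * w with respect to w is always 3, yet the stored .grad keeps growing until it is reset.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3.0 * w).backward()
grad_after_first = w.grad.item()   # 3.0

(3.0 * w).backward()               # added on top of the previous gradient
grad_after_second = w.grad.item()  # 6.0

w.grad.zero_()                     # what zero_grad does for every parameter
(3.0 * w).backward()
grad_after_reset = w.grad.item()   # 3.0 again
```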

To avoid the step growing uncontrollably, we use the zero_grad method.

This might feel a bit overly explicit, but it does grant precise control over the gradients. One way to make sure you didn’t fudge this is to always have these three functions go together:

  • zero_grad
  • backward
  • step
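In a training loop the trio might sit together like this. The linear model, random batches, and SGD settings are placeholders chosen for the sketch.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(3)]

grad_norms = []
for data, target in batches:
    optimizer.zero_grad()   # 1. clear gradients left over from the last batch
    loss = nn.functional.cross_entropy(model(data), target)
    loss.backward()         # 2. compute gradients for this batch only
    grad_norms.append(model.weight.grad.norm().item())
    optimizer.step()        # 3. update the weights using those gradients
```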

In our code sample, this is what it looks like when we don’t zero_grad at all (comment out line 135). At first the network improves, but the accumulated gradients eventually explode and the updates become more and more garbage until the network is useless in the end.

This is what it looks like when you zero_grad right after you call backward. Nothing is happening because we just erased the gradients, so the weights aren’t getting updated. The only variation left comes from dropout.

I think it might’ve made sense to reset the gradients automatically every time the step method is called. But for now I’m happy with this iconic trio.

One reason to keep zero_grad out of backward is that you may want to call backward several times for each call to step(). For example, if you can only fit one sample into memory per batch, a single gradient would be too noisy, so you’d want to aggregate a few gradients per step. Another reason could be to call backward on different parts of the graph – but in that case you might as well add the losses up and call backward on the sum.
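Gradient accumulation of this kind can be sketched as follows. The model, the data, and the name `accumulation_steps` are assumptions for the example. Dividing each loss by the number of accumulated steps makes the summed gradient match what one full batch with a mean loss would produce.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

data = torch.randn(4, 4)
target = torch.randn(4, 1)
accumulation_steps = 4

# Reference: gradient of the mean loss over the full batch of 4 samples.
optimizer.zero_grad()
loss_fn(model(data), target).backward()
full_batch_grad = model.weight.grad.clone()

# Same gradient, built up one sample at a time.
optimizer.zero_grad()
for i in range(accumulation_steps):
    sample, label = data[i:i + 1], target[i:i + 1]
    loss = loss_fn(model(sample), label) / accumulation_steps
    loss.backward()  # adds this sample's share to the stored gradient
accumulated_grad = model.weight.grad.clone()

optimizer.step()     # one weight update for all four samples
```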

Common mistake #4: you passed softmaxed outputs to a loss that expects raw logits.

Logits are the activations of the last fully connected layer, and softmax is those same activations after normalization. Below I charted real number examples from the code we’re using. The top list is the logits; you can see some are positive and some are negative. The second list is the log-softmaxed values, which are all negative. Looking at the bar chart – they’re practically the same; the only difference is a constant offset. But because of this subtle difference – all the math breaks apart.
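The relationship can be checked numerically. The logits below are made-up values, not the ones from the chart: log_softmax subtracts the same constant (the log-sum-exp of the logits) from every entry, which is why the two lists look alike yet every log-softmaxed value ends up at or below zero.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5, -3.0])  # hypothetical raw outputs

probs = F.softmax(logits, dim=0)          # positive, sums to 1
log_probs = F.log_softmax(logits, dim=0)  # log of probs, never above 0

# The gap between the two lists is one constant shared by all entries.
offset = logits - log_probs
log_sum_exp = torch.logsumexp(logits, dim=0)
```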

But why is it such a common mistake? In the PyTorch official MNIST example, look at the forward method. Right before the end you see we have the last fully connected layer, self.fc2, and then a log_softmax. This is the PyTorch MNIST example that every beginner will go through and see the log_softmax.

But when you look at the official PyTorch ResNet or AlexNet models, you’ll see they don’t have a softmax at the end. They send out the raw fully connected layer – they send out the logits.

This difference is not clarified in the docs. If for example you look at the nll_loss function, there’s no mention whether its input should be logits or a softmax. Your only hope is the code sample in which you see – nll_loss takes in a log_softmax.
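The pairing can be verified directly. In this sketch with random logits, F.cross_entropy on raw logits gives the same value as F.nll_loss on log_softmax outputs, while feeding already-softmaxed values into cross_entropy runs without any error and quietly produces a different loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 3)           # hypothetical raw network outputs
target = torch.randint(0, 3, (5,))

# Two correct pairings that compute the same loss:
loss_ce = F.cross_entropy(logits, target)
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), target)

# The mistake: cross_entropy applies log_softmax internally, so the
# outputs get normalized twice. No exception is raised.
loss_wrong = F.cross_entropy(F.softmax(logits, dim=1), target)
```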


If you prefer a more visual and audible version of this post, check out the webinar that covers exactly that. The webinar was called out by Andrej Karpathy himself in a blog post where he covers a detailed methodology for training neural networks.

PyTorch and deep learning are excitingly powerful and with that power come great gotchas. Hopefully these notes help you in your own projects. I’m proud to be part of such an open community where knowledge and expertise are shared so freely by even the most elite practitioners. If you’d like to get more where that came from – connect with me on Twitter.
