TensorFlow Speech Recognition: Two Quick Tutorials

TensorFlow allows you to build neural network models to recognize spoken words. These models typically use the Recurrent Neural Network (RNN) architecture, which processes inputs organized as a sequence.


TensorFlow supports several RNN variants, including static RNN, which requires a uniform length for all input sequences; dynamic RNN, which accepts inputs of different lengths; and static bidirectional RNN.
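To make the idea of sequence processing concrete, here is a minimal pure-Python sketch of a one-unit recurrent network. It is not TensorFlow code: the function names and the weights `w_in`, `w_rec`, and `bias` are hypothetical values chosen for illustration. It shows how the same loop handles sequences of different lengths, and how a bidirectional pass pairs a forward state with a backward state at each time step.

```python
import math


def rnn_forward(sequence, w_in=0.5, w_rec=0.8, bias=0.0):
    """Run a one-unit recurrent network over a sequence of floats.

    Returns the hidden state after every time step.
    The weights are illustrative constants, not trained values.
    """
    hidden = 0.0
    states = []
    for x in sequence:
        # The new state depends on the current input and the previous state.
        hidden = math.tanh(w_in * x + w_rec * hidden + bias)
        states.append(hidden)
    return states


def bidirectional_rnn(sequence):
    """Pair each forward state with the backward state for the same time step."""
    forward = rnn_forward(sequence)
    # Process the reversed sequence, then flip the states back into time order.
    backward = rnn_forward(sequence[::-1])[::-1]
    return list(zip(forward, backward))


short_clip = [0.1, 0.4]
long_clip = [0.1, 0.4, -0.3, 0.9, 0.2]

# The same function handles both lengths: one state per input step.
print(len(rnn_forward(short_clip)), len(rnn_forward(long_clip)))
print(bidirectional_rnn(short_clip))
```

A framework implementation replaces the scalar weights with trained matrices, but the step-by-step dependence on the previous state is the same.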


In this page:

  • What is speech recognition?
  • Quick Tutorial #1 – Simple audio recognition
  • Quick Tutorial #2 – Speech recognition examples for several RNN models


What is Speech Recognition?

Speech recognition software is a program trained to receive the input of human speech, decipher it, and turn it into readable text. This software filters words, digitizes them, and analyzes the sounds they are composed of. The digital representation of these sounds undergoes mathematical analysis to interpret what is being said.


Speech recognition applications include call routing, voice dialing, voice search, data entry, and automatic dictation.


Speech recognition software and deep learning

Traditionally, speech recognition models relied on classification algorithms to estimate the distribution of possible sounds (phonemes) for each audio frame.


Today, thanks to deep learning, neural networks are used to perform isolated word recognition, phoneme classification, audiovisual speech recognition, speaker adaptation, and audio-visual speaker recognition.


How it works

Speech recognition software uses Natural Language Processing (NLP) and deep learning neural networks to break the speech down into components that it can interpret. It converts these components into a digital state and analyzes segments of content. The software trains on a dataset of known spoken words or phrases, and makes predictions on the new sounds, forming a hypothesis about what the user is saying. It then transcribes the spoken words into text.


However, recognizing sound is not enough. To be useful, speech recognition software needs to know, for instance, the difference between proper names and regular words (“Cook” in “James Cook” is a name), and to differentiate between homophones (words with the same pronunciation but distinct meanings).


Thus, a challenge of speech recognition is creating an intelligent process not only to ‘hear’ speech but also to interface and reason over sources of knowledge and hierarchical relationships that make up ideas in the real world.

Quick Tutorial #1: Building a Speech Recognition Network that Recognizes Ten Different Words

Let’s take a look at how to build a basic speech recognition network in TensorFlow, which can recognize ten distinct words.


This tutorial shows how to develop a model that can classify a one-second audio clip as one of the following:


“silence”, “unknown”, “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, “go”
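Note that the list contains twelve labels: the ten target words plus “silence” and “unknown”. The sketch below shows, under stated assumptions, how a classifier's raw output could be turned into one of these labels. The scores in `example_scores` are invented for illustration, and `softmax` and `top_label` are hypothetical helper names, not part of the TensorFlow example.

```python
import math

# The twelve output labels listed above: ten words plus two catch-all classes.
LABELS = ["silence", "unknown", "yes", "no", "up", "down",
          "left", "right", "on", "off", "stop", "go"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(scores):
    """Return the most probable label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Invented scores for one audio clip, one score per label.
example_scores = [0.1, 0.3, 4.2, 1.1, 0.0, 0.2, 0.1, 0.4, 0.3, 0.2, 0.1, 0.5]
label, prob = top_label(example_scores)
print(label, round(prob, 3))
```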


To build the voice classification model, follow the steps below. They are summarized here; for the full tutorial, see the TensorFlow documentation.

1. Training

To begin, go to the TensorFlow source tree and run:

python tensorflow/examples/speech_commands/

The script begins by downloading the Speech Commands dataset, which is made up of over 105,000 WAVE audio files of individuals saying thirty distinct words. The archive is over 2GB, so the download may take a while, but you should see progress logs, and this is a one-off step. Once the download is complete, you will see the following logging information:

I0730 16:53:44.766740   55030] Training from step: 1
I0730 16:53:47.289078   55030] Step #1: rate 0.001000, accuracy 7.0%, cross entropy 2.611571


This indicates that the initialization process is complete and the training loop has started. You will see the outputs for each training step. For example, the following line appears after 100 steps:

I0730 16:54:41.813438 55030] Saving to "/tmp/speech_commands_train/conv.ckpt-100"


This saves the currently trained weights to a checkpoint file. If your training script is interrupted, you can look for the last saved checkpoint and then restart the script with:


--start_checkpoint=/tmp/speech_commands_train/conv.ckpt-100 as a command line argument to start from that point.
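Finding the last saved checkpoint by eye gets tedious after a long run. The following sketch picks the highest step number from a list of file names and builds the flag shown above. The helper names are hypothetical, and the file suffixes in `names` (such as `.index` and `.meta`) are an assumption about what the training directory contains.

```python
import os
import re

# Matches names such as "conv.ckpt-100" or "conv.ckpt-100.index".
CKPT_PATTERN = re.compile(r"^(?P<prefix>.+\.ckpt)-(?P<step>\d+)")


def latest_checkpoint(filenames):
    """Return the checkpoint name with the highest step, or None if none match."""
    best_step, best_name = -1, None
    for name in filenames:
        match = CKPT_PATTERN.match(os.path.basename(name))
        # Compare steps as integers so that 1200 ranks above 400.
        if match and int(match["step"]) > best_step:
            best_step = int(match["step"])
            best_name = f"{match['prefix']}-{match['step']}"
    return best_name


def restart_flag(train_dir, filenames):
    """Build the --start_checkpoint argument for the most recent checkpoint."""
    checkpoint = latest_checkpoint(filenames)
    if checkpoint is None:
        return None
    return f"--start_checkpoint={train_dir}/{checkpoint}"


# Assumed directory listing; in practice use os.listdir(train_dir).
names = ["conv.ckpt-100.index", "conv.ckpt-400.meta",
         "conv.ckpt-1200.index", "checkpoint"]
print(restart_flag("/tmp/speech_commands_train", names))
```

Comparing the step as an integer matters: a plain string sort would place `conv.ckpt-400` after `conv.ckpt-1200`.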


2. Confusion Matrix

This information is logged after four hundred steps:

I0730 16:57:38.073667   55030] Confusion Matrix:
 [[258   0   0   0   0   0   0   0   0   0   0   0]
 [  7   6  26  94   7  49   1  15  40   2   0  11]
 [ 10   1 107  80  13  22   0  13  10   1   0   4]
 [  1   3  16 163   6  48   0   5  10   1   0  17]
 [ 15   1  17 114  55  13   0   9  22   5   0   9]
 [  1   1   6  97   3  87   1  12  46   0   0  10]
 [  8   6  86  84  13  24   1   9   9   1   0   6]
 [  9   3  32 112   9  26   1  36  19   0   0   9]
 [  8   2  12  94   9  52   0   6  72   0   0   2]
 [ 16   1  39  74  29  42   0   6  37   9   0   3]
 [ 15   6  17  71  50  37   0   6  32   2   1   9]
 [ 11   1   6 151   5  42   0   8  16   0   0  20]]


To interpret the confusion matrix, you first need to understand the labels being used, which in this case are “silence”, “unknown”, “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, and “go”.


Each column represents the set of samples that were predicted to be a given label. So, the first column contains all the audio clips that were predicted to be silence, the second all those predicted to be unknown words, the third “yes”, and so on. Each row represents clips by their correct, ground truth labels.
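The reading rules above can be checked directly against the logged matrix. This sketch computes per-label recall (the diagonal cell divided by its row total), the overall accuracy, and the largest off-diagonal cell, which identifies the most frequent mistake. The function names are illustrative and not part of the training script.

```python
LABELS = ["silence", "unknown", "yes", "no", "up", "down",
          "left", "right", "on", "off", "stop", "go"]

# Rows are true labels, columns are predicted labels (copied from the log above).
CONFUSION = [
    [258, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [7, 6, 26, 94, 7, 49, 1, 15, 40, 2, 0, 11],
    [10, 1, 107, 80, 13, 22, 0, 13, 10, 1, 0, 4],
    [1, 3, 16, 163, 6, 48, 0, 5, 10, 1, 0, 17],
    [15, 1, 17, 114, 55, 13, 0, 9, 22, 5, 0, 9],
    [1, 1, 6, 97, 3, 87, 1, 12, 46, 0, 0, 10],
    [8, 6, 86, 84, 13, 24, 1, 9, 9, 1, 0, 6],
    [9, 3, 32, 112, 9, 26, 1, 36, 19, 0, 0, 9],
    [8, 2, 12, 94, 9, 52, 0, 6, 72, 0, 0, 2],
    [16, 1, 39, 74, 29, 42, 0, 6, 37, 9, 0, 3],
    [15, 6, 17, 71, 50, 37, 0, 6, 32, 2, 1, 9],
    [11, 1, 6, 151, 5, 42, 0, 8, 16, 0, 0, 20],
]


def recall_per_label(matrix):
    """Fraction of each true label's clips (one row) that were predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all clips."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def most_common_mistake(matrix, labels):
    """Return (true label, predicted label, count) for the largest off-diagonal cell."""
    cells = [(matrix[r][c], r, c)
             for r in range(len(matrix))
             for c in range(len(matrix))
             if r != c]
    count, r, c = max(cells)
    return labels[r], labels[c], count


for name, recall in zip(LABELS, recall_per_label(CONFUSION)):
    print(f"{name:8s} recall = {recall:.2f}")
print(f"overall accuracy = {overall_accuracy(CONFUSION):.1%}")
print(most_common_mistake(CONFUSION, LABELS))
```

At this early stage of training the model already recognizes silence perfectly, while most spoken words are still frequently predicted as “no” (the fourth column).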


3. Validation

After the confusion matrix, you should see a line like this:

I0730 16:57:38.073777 55030] Step 400: Validation accuracy = 26.3% (N=3093)


We recommend separating your data set into three categories. The largest (in this case roughly 80% of the data) is used for training, a smaller validation set (10%) is set aside for assessing accuracy during training, and a test set (10%) is used to assess accuracy once training is complete.


The training script automatically divides the data set into these three categories, and the logging line seen above depicts the accuracy of the model when it is run on the validation set.
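If you need to reproduce such a split for your own data, one common approach is to hash each file name so that a clip always lands in the same set, even as new files are added. The sketch below is an illustration of that technique under stated assumptions, not a description of how the training script works internally; `assign_split` and the file names are hypothetical.

```python
import hashlib
from collections import Counter


def assign_split(filename, validation_pct=10, testing_pct=10):
    """Deterministically assign a file to 'training', 'validation' or 'testing'.

    The SHA-1 hash of the name is mapped to a bucket from 0 to 99, so the
    same file name always receives the same assignment.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < validation_pct:
        return "validation"
    if bucket < validation_pct + testing_pct:
        return "testing"
    return "training"


# Hypothetical file names, used only to show the resulting proportions.
clips = [f"yes/clip_{i:04d}.wav" for i in range(5000)]
split_counts = Counter(assign_split(name) for name in clips)
print(split_counts)
```

Because the assignment depends only on the file name, no clip can drift from the training set into the validation or test set between runs, which would otherwise inflate the reported accuracy.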


To better understand training, validation and test sets and the concepts of over-fitting and under-fitting in neural networks, see our in-depth guide to Neural Network Bias.


4. Final model

After a few hours of training, the script should have finished all 18,000 training steps. A final confusion matrix is displayed, with an accuracy score reflecting how well the model did when applied to the testing set.


With the default settings, you should achieve an accuracy of 85%-90%.


5. Run the model in an Android app

Audio recognition is useful on mobile devices, so we will export the trained model to a compact form that is simple to work with on mobile platforms. Run this command:


python tensorflow/examples/speech_commands/ \
--start_checkpoint=/tmp/speech_commands_train/conv.ckpt-18000 \


Quick Tutorial #2: Speech Recognition with Sequence-To-Sequence Neural Networks

Let’s take a look at a more advanced speech recognition example that uses sequence-to-sequence (seq-to-seq) networks; see our in-depth guide on Recurrent Neural Networks. In this type of neural network, both the input and the output are sequences of signals, which makes it well suited to spoken words.


Pannous have provided a set of models with code examples which illustrate how to perform speech recognition using seq-to-seq neural networks.


Installing the set of models:


  • Clone the code via GitHub
  • Get the prerequisites for PyAudio by running requirements portaudio from
  • Install PyAudio by running pip install pyaudio


Architecture examples and code in the package


Demo type: Simple spoken digit recognition demo, with 98% accuracy
Import steps:
  1. import tflearn
  2. import pyaudio
  3. import speech_data
  4. import numpy
Getting started:
  # Training Step: 544  | total loss: 0.15866
  # | Adam | epoch: 034 | loss: 0.15866 - acc: 0.9818 -- iter: 0000/1000
Full code example:
  # Classification
  # Overfitting okay for now

Demo type: Simple speaker recognition demo, with 99% accuracy (on digits sample)
Import steps:
  1. import os
  2. import tflearn
  3. import speech_data as data
Getting started:
  # | Adam | epoch: 030 | loss: 0.05330 - acc: 0.9966 -- iter: 0000/1000
  # 'predicted speaker for 9_Vicki_260 : result = ', 'Vicki'
Full code example:
  # Classification
  # demo_file = "8_Vicki_260.wav"

Demo type: Densely Connected Convolutional Neural Networks example
Import steps:
  1. import tensorflow as tf
  2. import layer
  3. import speech_data
  4. from speech_data import Source,Target
Getting started / Full code example:
  # BASELINE toy net
  # Densely Connected Convolutional Networks  # advanced ResNet
  # CHOOSE MODEL ARCHITECTURE



Running Speech Recognition at Scale on TensorFlow with MissingLink

In this article, we explained how to create Recurrent Neural Networks to perform speech recognition in TensorFlow. When you start working on RNN projects and running large numbers of experiments, you’ll run into some practical challenges:

Tracking experiment progress, source code, and hyperparameters across multiple RNN experiments. To find the optimal model you will have to run hundreds or thousands of experiments over time, and managing them will become a hassle.

Running experiments across multiple machines: for bidirectional RNNs operating on long data sequences and processing audio, real projects will require running experiments on multiple machines and GPUs. Provisioning these machines and distributing the work among them will consume valuable time.

Managing training data: RNN projects focused on audio and speech recognition can have very large datasets of gigabytes, terabytes, or more. Moving data between training machines will take time and slow you down when trying to run multiple experiments.

MissingLink is a deep learning platform that can help you automate RNN experiments on TensorFlow, so you can concentrate on building winning speech recognition experiments. Sign up for free to see how easy it is.
