Important announcement: MissingLink has shut down.

OpenCV Deep Learning

OpenCV (Open Source Computer Vision Library) is an open source library that helps users carry out computer vision tasks. Version 3.3 of OpenCV, released in 2017, overhauled its deep neural network module. Today, organizations running Convolutional Neural Networks (CNNs) and other neural network-based computer vision architectures are using OpenCV.

This article describes OpenCV and its deep learning module. We’ll also outline the role of OpenCV in deep learning for computer vision and walk you through the OpenCV deep learning execution process, including step-by-step OpenCV tutorials.

What is OpenCV?

OpenCV (Open Source Computer Vision Library) is an open source library used to perform computer vision tasks. It offers over 2500 computer vision algorithms, ranging from classic statistical algorithms to modern machine learning-based techniques such as neural networks. OpenCV boasts a community of almost 50,000 developers and over 18 million downloads.

OpenCV is used by large companies such as Google, Yahoo, Microsoft and Intel, as well as research bodies, governments, startups and individual users. While it used to be difficult to learn and use, usability and documentation are gradually improving.

OpenCV applications include:

  • Detecting and recognizing faces
  • Identifying objects
  • Classifying human actions in videos
  • Tracking camera movements
  • Tracking moving objects
  • Extracting 3D models of objects
  • Producing 3D point clouds from stereo cameras
  • Stitching images together to produce an image of a scene
  • Finding similar images from an image database
  • Removing red eyes from images
  • Following eye movements
  • Recognizing scenery and adding markers to enable augmented reality (AR)

Systems, languages and frameworks supported

  • Supports Windows, Linux, Android and Mac OS
  • Provides interfaces in C, C++, Python, Java and MATLAB
  • Can use MMX and SSE instructions; CUDA and OpenCL support is in development
  • Supports TensorFlow, Torch/PyTorch and Caffe (Keras is partially supported via conversion to TensorFlow)

What is the Deep Learning Module in OpenCV?

In 2017, OpenCV released version 3.3 with an overhauled Deep Neural Network module. OpenCV is now widely used to run Convolutional Neural Networks (CNNs) and other neural network-based computer vision architectures.

Let’s clarify the role of OpenCV in a deep learning computer vision project:

  • OpenCV is not used to train the neural networks—you should do that with a framework like TensorFlow or PyTorch, and then export the model to run in OpenCV.
  • OpenCV is used to take a trained neural network model, prepare and preprocess images for it, apply it to the images and output results. You can also use it to combine neural networks with other computer vision algorithms available in OpenCV.

OpenCV deep learning execution process:

  1. Load a model from disk.
  2. Pre-process images to serve as inputs to the neural network.
  3. (run other computer vision algorithms on the input images if necessary)
  4. Pass the image through the network and obtain output classifications.
  5. (run other computer vision algorithms on the outputs if necessary)
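
The five steps above might look roughly like this with the `cv2.dnn` module. This is a minimal sketch, not code from either tutorial: the function name `classify_image`, the 224×224 input size, the scale factor and the model file paths are all assumptions that depend on how your network was trained. `cv2` is imported inside the function so that the `top_k` helper can be tried on its own.

```python
import numpy as np


def top_k(scores, k=5):
    """Return (class_index, score) pairs for the k highest scores."""
    scores = np.asarray(scores, dtype=np.float32).ravel()
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]


def classify_image(image_path, model_path, config_path=""):
    """Hypothetical helper: run one image through a pre-trained classifier."""
    import cv2  # imported here so top_k works without OpenCV installed

    # 1. Load a model from disk (the format is inferred from the file extension).
    net = cv2.dnn.readNet(model_path, config_path)

    # 2. Pre-process the image into a 4D blob (batch, channels, height, width).
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    blob = cv2.dnn.blobFromImage(
        image, scalefactor=1.0 / 255, size=(224, 224), swapRB=True, crop=False
    )

    # 4. Pass the blob through the network and read the class scores.
    net.setInput(blob)
    scores = net.forward()

    # 5. Post-process: keep only the most likely classes.
    return top_k(scores, k=5)
```

Steps 3 and 5 are where other OpenCV algorithms (filtering, contour detection, tracking and so on) would be slotted in around the network call.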

In the remainder of this article, we’ll summarize two excellent tutorials that will help you learn to use OpenCV with deep neural networks, using the Mask R-CNN and classic CNN architectures.

OpenCV Tutorial #1 - Mask R-CNN in OpenCV

Mask R-CNN is an extension of the Faster R-CNN architecture (see our in-depth guide on using Faster R-CNN with TensorFlow). It works by identifying Regions of Interest (ROI) within an image and then focusing the classification process on those regions. This is a deep learning image segmentation technique.

The Mask R-CNN process is as follows:

  1. Input an image and ground-truth bounding boxes for Regions of Interest
  2. Extract the feature map
  3. Apply an ROI align method (more accurate than the ROI pooling in the original Faster R-CNN architecture), branching into two processes:
    1. Fully connected layers terminating with class labels and bounding box predictions
    2. A full convolutional process resulting in a “mask” that defines the shape of the object identified in the Region of Interest
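
To see why ROI align (step 3) is described as more accurate than ROI pooling, here is a simplified numpy sketch; it is not the actual Mask R-CNN implementation. ROI pooling snaps a fractional sample location to the nearest feature-map cell, while ROI align reads it with bilinear interpolation. The helper names and the toy feature map are illustrative assumptions.

```python
import numpy as np


def sample_nearest(feature_map, y, x):
    """ROI-pooling-style lookup: quantize the location to a grid cell."""
    return float(feature_map[int(round(y)), int(round(x))])


def sample_bilinear(feature_map, y, x):
    """ROI-align-style lookup: interpolate between the four nearest cells."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return float((1 - dy) * top + dy * bottom)


# Toy feature map whose value at (y, x) is 10 * y + x.
ys, xs = np.mgrid[0:4, 0:4]
feature_map = (10 * ys + xs).astype(np.float64)

nearest = sample_nearest(feature_map, 1.25, 2.25)
aligned = sample_bilinear(feature_map, 1.25, 2.25)
```

On this linear toy map the interpolated value equals the underlying function at the fractional location, while the quantized lookup returns the value of the neighbouring grid cell. That small misalignment is what matters when predicting pixel-level masks.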

The following steps are summarized; see the full tutorial by Adrian Rosebrock. The tutorial uses OpenCV and Mask R-CNN to classify objects within images, using the COCO dataset with 90 image classes.

Prerequisite: Before following this and the other tutorials, install OpenCV on your workstation.

  1. Load COCO dataset and colors into OpenCV

Load the COCO dataset class labels that we used to train the Mask R-CNN network (remember, networks used in OpenCV must be pre-trained).

# The labels file name is cut off in the source ("..."); substitute your own.
labelsPath = os.path.sep.join([args["mask_rcnn"], "..."])
LABELS = open(labelsPath).read().strip().split("\n")

  2. Load the Mask R-CNN Model and pass an image

Get the Mask R-CNN weights produced during training, along with the model configuration, and load the model:

# Both file names are cut off in the source ("..."); substitute your own.
weightsPath = os.path.sep.join([args["mask_rcnn"], "..."])
configPath = os.path.sep.join([args["mask_rcnn"], "..."])
net = cv2.dnn.readNetFromTensorflow(weightsPath, configPath)

Load an image into a blob and do a pass through the neural network. Run ROI alignment and get the bounding box coordinates, then for each object in the image, perform pixel-wise image segmentation.

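image = cv2.imread(args["image"])
(H, W) = image.shape[:2]

# Convert the image to a blob and set it as the network input
blob = cv2.dnn.blobFromImage(image, swapRB=True, crop=False)
net.setInput(blob)

# Time the forward pass that returns the boxes and the masks
start = time.time()
(boxes, masks) = net.forward(["detection_out_final", "detection_masks"])
end = time.time()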
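
For intuition about what the blob holds, here is a rough numpy approximation of the layout `cv2.dnn.blobFromImage` produces when no resizing, cropping or mean subtraction is requested. It is a sketch for illustration, not OpenCV's implementation; the function name `image_to_blob` and the random test image are assumptions.

```python
import numpy as np


def image_to_blob(image, scalefactor=1.0, swap_rb=True):
    """Approximate blobFromImage without resize, crop or mean subtraction.

    image: H x W x 3 array in OpenCV's BGR channel order.
    Returns a float32 array shaped (1, 3, H, W).
    """
    data = image.astype(np.float32)
    if swap_rb:
        data = data[:, :, ::-1]            # BGR -> RGB
    data = data * scalefactor              # optional intensity scaling
    data = np.transpose(data, (2, 0, 1))   # HWC -> CHW
    return np.ascontiguousarray(data[np.newaxis, ...])  # add the batch axis


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
blob = image_to_blob(image)
```

The batch and channel axes come first because that is the layout the network's input layer expects.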

  3. Loop over objects in the image and extract pixel-wise segmentation for each object

For each object extracted in the ROI align stage, get the predicted class, and if confidence is high enough, compute the bounding box coordinates relative to the size of the image.

for i in range(0, boxes.shape[2]):
    # Class ID and confidence of the current detection
    classID = int(boxes[0, 0, i, 1])
    confidence = boxes[0, 0, i, 2]
    if confidence > args["confidence"]:
        clone = image.copy()
        # Scale the normalized box to the image dimensions
        box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
        (startX, startY, endX, endY) = box.astype("int")
        boxW = endX - startX
        boxH = endY - startY
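
The snippets that follow refer to `mask` and `roi`, which this summary does not define. One plausible way to derive them is sketched here in plain numpy under stated assumptions: the network's low-resolution mask for the detected class is resized to the box size and thresholded, and `roi` is the matching crop of the image. The nearest-neighbour `resize_nearest` helper stands in for `cv2.resize`, and the 0.3 threshold, array sizes and sample values are illustrative, not taken from the tutorial.

```python
import numpy as np


def resize_nearest(array, new_h, new_w):
    """Nearest-neighbour resize of a 2D array (stand-in for cv2.resize)."""
    rows = np.arange(new_h) * array.shape[0] // new_h
    cols = np.arange(new_w) * array.shape[1] // new_w
    return array[rows[:, None], cols[None, :]]


H, W = 160, 200                         # image height and width
image = np.zeros((H, W, 3), dtype=np.uint8)

# A detection row stores the box as fractions of the image size.
normalized_box = np.array([0.25, 0.5, 0.75, 0.875])
box = normalized_box * np.array([W, H, W, H])
(startX, startY, endX, endY) = box.astype("int")
boxW, boxH = endX - startX, endY - startY

# Assumed low-resolution soft mask for one object (values in [0, 1]).
soft_mask = np.zeros((15, 15), dtype=np.float32)
soft_mask[5:10, 5:10] = 0.9

# Resize to the box size, then threshold to a boolean mask.
mask = resize_nearest(soft_mask, boxH, boxW) > 0.3
roi = image[startY:endY, startX:endX]
```

Because `mask` and `roi` share the same height and width, the mask can index directly into the region of interest.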

Now convert the mask from a boolean to an integer with values 0-255, and show the extracted ROI with its mask:

if args["visualize"] > 0:
    visMask = (mask * 255).astype("uint8")
    instance = cv2.bitwise_and(roi, roi, mask=visMask)
    cv2.imshow("ROI", roi)
    cv2.imshow("Mask", visMask)
    cv2.imshow("Segmented", instance)

Extract only the masked region of the ROI, randomly select a color to visualize it, and create a transparent overlay:

roi = roi[mask]
color = random.choice(COLORS)
blended = ((0.4 * color) + (0.6 * roi)).astype("uint8")
clone[startY:endY, startX:endX][mask] = blended
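
The overlay arithmetic can be checked on a tiny array. In this sketch (sizes, colour and pixel values are made up for illustration) the slice `clone[startY:endY, startX:endX]` is a numpy view, so assigning through the boolean mask writes straight into `clone`.

```python
import numpy as np

clone = np.full((4, 4, 3), 50, dtype=np.uint8)   # uniform grey test image
startX, startY, endX, endY = 1, 1, 3, 3          # 2 x 2 bounding box

# Boolean mask over the box: only the top-left pixel belongs to the object.
mask = np.array([[True, False],
                 [False, False]])

roi = clone[startY:endY, startX:endX]            # view into clone
masked_pixels = roi[mask]                        # shape (1, 3): selected pixels

color = np.array([250, 0, 0])                    # assumed overlay colour
blended = ((0.4 * color) + (0.6 * masked_pixels)).astype("uint8")

# Write the blended pixels back into the original image through the view.
clone[startY:endY, startX:endX][mask] = blended
```

Only the masked pixel changes; every other pixel in the box and the rest of the image keep their original values.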

Finally, draw the bounding box on the image, together with the predicted label and probability, and show the output image:

color = [int(c) for c in color]
cv2.rectangle(clone, (startX, startY), (endX, endY), color, 2)
text = "{}: {:.4f}".format(LABELS[classID], confidence)
cv2.putText(clone, text, (startX, startY - 5),
    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

cv2.imshow("Output", clone)
# Keep the window open until a key is pressed
cv2.waitKey(0)

  4. Run the OpenCV code and visualize object segmentation on an image

Here is a command you can use to run the OpenCV code above and generate a visualization of the image, where your_script.py is a placeholder for the file containing the code:

$ python your_script.py --mask-rcnn mask-rcnn-coco --image images/example_01.jpg
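
The snippets above read their settings from an `args` dictionary. A minimal way to build it with `argparse` is sketched below. The `--mask-rcnn` and `--image` flags come from the command above; the `--confidence` and `--visualize` flags are inferred from the keys the code reads, and their defaults are assumptions.

```python
import argparse


def build_parser():
    ap = argparse.ArgumentParser(description="Mask R-CNN with OpenCV")
    ap.add_argument("--mask-rcnn", required=True,
                    help="directory holding the model files and class labels")
    ap.add_argument("--image", required=True,
                    help="path to the input image")
    ap.add_argument("--confidence", type=float, default=0.5,
                    help="minimum detection confidence (assumed default)")
    ap.add_argument("--visualize", type=int, default=0,
                    help="set above 0 to show each ROI and mask (assumed default)")
    return ap


# argparse turns "--mask-rcnn" into the key "mask_rcnn".
args = vars(build_parser().parse_args(
    ["--mask-rcnn", "mask-rcnn-coco", "--image", "images/example_01.jpg"]
))
```

In a real script, call `parse_args()` with no arguments so the values are read from the command line.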

See the full tutorial for all the code and details on how to do the same thing for video streams in OpenCV.

OpenCV Tutorial #2 - OpenCV CNN for Sign Language Recognition

This tutorial uses a classic Convolutional Neural Network to classify images of sign language letters from the MNIST sign language dataset. OpenCV is used to apply the pre-trained CNN, and Google Colab serves as the deep learning environment.

Test data will be live streaming video from a webcam – our model will identify letters in sign language based on live footage.

These steps are summarized—see the full tutorial by Arshad Kazi.

  1. Load the dataset

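Download the MNIST sign language dataset here, load it into Colab, and visualize some of the images:

import pandas as pd

# File names and the "label" column are assumptions; adjust to your upload
train = pd.read_csv("sign_mnist_train.csv")
test = pd.read_csv("sign_mnist_test.csv")
y_train, X_train = train["label"], train.drop(columns=["label"])
y_test, X_test = test["label"], test.drop(columns=["label"])

display(X_train.head(n=2))
display(X_test.head(n=2))

  2. Preprocess images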

Create images from the X_train and X_test pixel values by reshaping each row into a 28×28 array. The labels are one-hot encoded over 26 classes.

X_train = np.array(X_train.iloc[:,:])
X_train = np.array([np.reshape(i, (28,28)) for i in X_train])
X_test = np.array(X_test.iloc[:,:])
X_test = np.array([np.reshape(i, (28,28)) for i in X_test])
num_classes = 26
y_train = np.array(y_train).reshape(-1)
y_test = np.array(y_test).reshape(-1)
y_train = np.eye(num_classes)[y_train]
y_test = np.eye(num_classes)[y_test]
X_train = X_train.reshape((27455, 28, 28, 1))
X_test = X_test.reshape((7172, 28, 28, 1))
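
The `np.eye(num_classes)[labels]` idiom used above builds one-hot vectors by picking rows of an identity matrix. A small sketch with made-up labels and pixel data:

```python
import numpy as np

num_classes = 26
labels = np.array([0, 2, 25])              # made-up integer class labels

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes)[labels]

# Flat 784-pixel rows become 28 x 28 single-channel images.
flat_pixels = np.zeros((3, 784), dtype=np.uint8)   # stand-in for CSV rows
images = flat_pixels.reshape((-1, 28, 28, 1))
```

Passing `-1` as the first dimension lets numpy infer the number of images, which avoids hard-coding the dataset size.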

  3. Build, compile and train the model

We use a CNN model with two Conv2D and MaxPooling layers, followed by fully connected layers. Define the model in Keras, compile it and check accuracy.

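from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

classifier = Sequential()
classifier.add(Conv2D(filters=8, kernel_size=(3,3),strides=(1,1),padding='same',input_shape=(28,28,1),activation='relu', data_format='channels_last'))
# MaxPooling and Flatten layers follow the description in the text;
# the 2x2 pool size is an assumption
classifier.add(MaxPooling2D(pool_size=(2,2)))
classifier.add(Conv2D(filters=16, kernel_size=(3,3),strides=(1,1),padding='same',activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2,2)))
classifier.add(Flatten())
classifier.add(Dense(128, activation='relu'))
classifier.add(Dense(26, activation='softmax'))

classifier.compile(optimizer='SGD', loss='categorical_crossentropy', metrics=['accuracy'])
classifier.fit(X_train, y_train, epochs=50, batch_size=100)

accuracy = classifier.evaluate(x=X_test,y=y_test,batch_size=32)
print("Accuracy: ",accuracy[1])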
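
To see how the layer shapes line up, the arithmetic can be traced by hand. This sketch assumes 'same' padding with stride 1 for the convolutions, as in the code above, and 2×2 max pooling after each convolution; the pooling size is an assumption, since the summary does not state it.

```python
def conv_same(shape, filters):
    """'same' padding with stride 1 keeps height and width, sets channels."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, pool=2):
    """Non-overlapping pooling divides height and width by the pool size."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)


shape = (28, 28, 1)                 # one grayscale input image
shape = conv_same(shape, 8)         # (28, 28, 8)
shape = max_pool(shape)             # (14, 14, 8)
shape = conv_same(shape, 16)        # (14, 14, 16)
shape = max_pool(shape)             # (7, 7, 16)

flattened = shape[0] * shape[1] * shape[2]   # inputs to the first Dense layer
```

Under these assumptions the first fully connected layer receives 784 values per image.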

  4. Download the model to input it into OpenCV

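Use these commands to download the trained model to your computer.

classifier.save('CNNmodel.h5')
# `drive` is assumed to be an authenticated PyDrive GoogleDrive instance
weights_file = drive.CreateFile({'title' : 'CNNmodel.h5'})
weights_file.SetContentFile('CNNmodel.h5')
weights_file.Upload()
drive.CreateFile({'id': weights_file.get('id')})

  5. Define inputs from a webcam in OpenCV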

Create a “window” in OpenCV to take the input from our webcam. The input should be converted to 28×28 grayscale, because this is how we trained our model.

Here is how to capture the image from the webcam:

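def main():
    # Open the default webcam once, before the frame loop
    cam_capture = cv2.VideoCapture(0)
    while True:
        # read() returns a success flag and the captured frame
        _, image_frame = cam_capture.read()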

Crop, convert to grayscale, blur and resize:

        # crop_image is a helper that is not defined in this summary
        im2 = crop_image(image_frame, 300,300,300,300)
        image_grayscale = cv2.cvtColor(im2, cv2.COLOR_BGR2GRAY)
        image_grayscale_blurred = cv2.GaussianBlur(image_grayscale, (15,15), 0)
        im3 = cv2.resize(image_grayscale_blurred, (28,28), interpolation = cv2.INTER_AREA)

Expand dimensions to 1x28x28x1:

        im4 = np.resize(im3, (28, 28, 1))
        im5 = np.expand_dims(im4, axis=0)
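
The shape changes in this step can be traced on a dummy frame. In the sketch below the `crop_image` helper is a hypothetical implementation (the summary does not show the tutorial's version), and a blank array stands in for a webcam frame.

```python
import numpy as np


def crop_image(image, x, y, width, height):
    """Hypothetical helper: cut a width x height window whose corner is (x, y)."""
    return image[y:y + height, x:x + width]


frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # stand-in for a webcam frame
window = crop_image(frame, 300, 300, 300, 300)     # (300, 300, 3)

# Pretend the window was already converted to grayscale and resized to 28 x 28.
im3 = np.zeros((28, 28), dtype=np.uint8)

im4 = np.resize(im3, (28, 28, 1))    # add the channel axis
im5 = np.expand_dims(im4, axis=0)    # add the batch axis
```

The final shape, one image of 28×28 pixels with one channel, matches the `input_shape=(28,28,1)` the model was trained with, plus a leading batch dimension.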

  6. Generate predictions!

To predict an alphabet letter from an input image, we’ll use integers rather than alphabet letters (1 = A, 2 = B, etc).

Pass the input image into the classifier:

def keras_predict(model, image):
    data = np.asarray( image, dtype="int32" )
    pred_probab = model.predict(data)[0]

The final Softmax layer yields a probability for each alphabet letter; select the letter with the highest probability:

    pred_class = list(pred_probab).index(max(pred_probab))
    return max(pred_probab), pred_class
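
The selection step can be tried without a trained model. In this sketch the raw scores are made up, and `softmax` is written out by hand only to show what the network's final layer computes.

```python
import numpy as np


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def pick_class(probabilities):
    """Return (highest probability, class index), like keras_predict above."""
    index = int(np.argmax(probabilities))
    return float(probabilities[index]), index


raw_scores = np.array([1.0, 3.0, 0.5, 2.0])   # made-up scores for four classes
probabilities = softmax(raw_scores)
best_probability, best_class = pick_class(probabilities)
```

`np.argmax` does the same job as `list(pred_probab).index(max(pred_probab))` without converting the array to a list.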

That’s it! Your model can now be used to read sign language in live video footage.

Running CNN and R-CNN with OpenCV in the Real World

In this article we explained how to use OpenCV to run pre-trained deep learning models, specifically Convolutional Neural Networks (CNNs), on images and live video footage.

When you start working on computer vision projects, processing and generating predictions for real images, audio and video, you’ll run into some practical challenges:

  • Tracking experiments

    You need to track experiment progress, source code, and hyperparameters across multiple computer vision experiments. CNNs can have many different architectures and modifications, and testing each variation requires running and tracking large numbers of experiments.

  • Running experiments across multiple machines

    Computer vision algorithms are computationally intensive to train, and also to apply to large numbers of images using OpenCV. Most projects will require multiple machines or GPU hardware. Provisioning these machines and distributing experiments efficiently can be difficult.

  • Managing training datasets

    OpenCV projects usually involve live video, and training sets can get huge, from gigabytes up to petabytes of data. Copying this data to training machines, and replacing it each time you tweak your dataset and neural network, can be very time consuming.

MissingLink is a deep learning platform that can help you automate these operational aspects of neural networks, so you can concentrate on building winning experiments and running them with OpenCV.
