One of the byproducts of our digitally transformed world is the accumulation of large quantities of data. Online transactions, medical records, social media posts, emails, instant messages, and connected sensors are just a few examples of the kinds of data being captured and stored on a daily basis.
Scientists and research organizations have been exploring how to leverage big data for artificially intelligent applications since the 1970s. Nonetheless, until fairly recently, the big data issues for enterprises remained how to store it cost effectively, how to retrieve it efficiently when needed, and how to protect it from unauthorized access. The growth of the cloud opened up a whole new realm of cost-effective data storage and retrieval solutions, but big data was still largely perceived by enterprises as a passive asset that did not contribute significantly to their bottom lines.
Eager to extract value from their big data, enterprises enthusiastically embraced the emerging applications that leveraged advanced data science methods in order to gain actionable business intelligence and marketing insights. Today, machine learning and deep learning have gone mainstream and are disrupting the way we use big data to accelerate business outcomes and improve the quality of our lives.
In this article, we examine in depth how machine learning in general, and deep learning in particular, are transforming big data from a quantitative concept—something that’s measured in terabytes or zettabytes—into a qualitative concept that’s measured in the value it brings to businesses and our daily lives.
Big Data: The Exploding Datasphere
Much has been written about the exponential growth of data, a phenomenon illustrated clearly in this graph from IDC’s April 2017 report, Data Age 2025: The Evolution of Data to Life-Critical.
IDC also notes that not only the velocity but also the type of data being created is undergoing a transformation. The relative share of entertainment and non-entertainment image and video content in the global datasphere is shrinking in light of the growing importance of productivity-driven data (such as enterprise files on PCs or servers, log files, metadata, etc.) and embedded data (from wearables, IoT devices, autonomous systems, and so on).
The rise of cloud storage (vs. storage on local devices or servers) has already made big data more accessible to a wide range of advanced digital applications, including marketing and business intelligence, human resources, healthcare, manufacturing, and smart cities, to mention but a few. IDC forecasts that the quantity of data subject to analysis will grow by a factor of 50 between now and 2025, and the amount of that analyzed data that will serve AI-based cognitive systems will multiply by a factor of 100.
If in 2015 we each engaged in a little more than 200 data-driven interactions per day on average, IDC expects that number to grow more than 20-fold by 2025, to 4,700+ interactions per day. Additionally, the need for super-reliable, real-time big data processing and analysis will increase as AI applications such as self-driving cars and medical decision support systems become more life-critical.
Extracting Value from Data
The challenge over the coming years is how to extract value effectively from huge and diverse data sets. AI-based cognitive systems that leverage big data are one important way to meet this challenge, with the potential to disrupt how we do business and how we conduct our daily lives.
Artificial intelligence is evolving rapidly, with many advances in how big data can be most effectively leveraged in order to build robust systems.
Classic machine learning algorithms rely on human-mediated feature engineering in order to build their models. If, for example, the task is to have a computer identify in real time whether a cat is present in an image or field of view, the data scientist will identify a robust set of spatial and geometric features that can be used to determine the edges that define a cat. Classic machine learning algorithms are fast to run, and statistical tests such as chi-squared make it relatively easy to understand the value of each feature. However, you would need a data scientist with domain expertise relevant to your specific dataset (in this case, a cat specialist who understands the most important features for cat identification). It might be very hard, as well as expensive, to find a data scientist who can handle your unique dataset.
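To make the chi-squared point concrete, here is a minimal sketch in plain Python that scores binary features against a binary label. The feature names (`has_pointy_ears`, `is_outdoors`) and the eight toy samples are invented for illustration and do not come from any real dataset. A real project would more likely rely on a statistics library than on hand-rolled code.

```python
def chi_squared(feature, label):
    """Chi-squared statistic for two binary (0/1) sequences of equal length."""
    n = len(feature)
    # observed[f][y] counts samples with feature value f and label y
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect if feature and label were independent
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: 1 = yes, 0 = no
is_cat          = [1, 1, 1, 1, 0, 0, 0, 0]
has_pointy_ears = [1, 1, 1, 0, 0, 0, 0, 1]
is_outdoors     = [1, 0, 1, 0, 1, 0, 1, 0]

scores = {
    "has_pointy_ears": chi_squared(has_pointy_ears, is_cat),
    "is_outdoors": chi_squared(is_outdoors, is_cat),
}

# Rank features from most to least associated with the label
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

On this toy data, `has_pointy_ears` scores 2.00 and `is_outdoors` scores 0.00, which suggests the first feature carries some information about the label while the second carries none. With only eight samples the result is not statistically meaningful; the sketch only shows how such a score lets a data scientist compare hand-engineered features.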
Deep learning and artificial neural networks, on the other hand, have less of a need for domain expertise. The promise of big data and deep learning is automated feature engineering, whereby neural networks iteratively review the training data and learn to arrive at their own criteria for deciding whether or not a cat is present in the image. Rather than relying on manually extracted features, neural networks extract meaning from complicated data by learning trends and features that in many cases are not intuitive to humans. In our cat detection example, the first layer might detect simple shapes, the second layer specific features such as ears or eyes, and the third layer faces and categories. Detection of these features is automated and can save time and human resources when building a model. Your goal as a data scientist is to find the deep neural network architecture that is most relevant for your dataset and to adjust the architecture and hyperparameters to the specific problem at hand.
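The hyperparameter adjustment mentioned above can be sketched as a simple search loop. To stay short and dependency-free, the example below uses a single logistic unit trained by gradient descent as a stand-in for a deep network, and it searches over only one hyperparameter, the learning rate. The data points, the candidate learning rates, and all function names are assumptions made for illustration.

```python
import math


def predict(weights, bias, x):
    """Probability that x belongs to class 1, from a single logistic unit."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(data, learning_rate, epochs=200):
    """Fit the unit with batch gradient descent; the weights are learned, not hand-set."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in data:
            error = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(data)
    return weights, bias


def log_loss(data, weights, bias):
    """Average cross-entropy loss; lower is better."""
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, bias, x), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


# Hypothetical toy data: (two numeric inputs, label)
train_set = [([0.9, 0.1], 1), ([0.8, 0.3], 1), ([0.7, 0.2], 1),
             ([0.2, 0.8], 0), ([0.1, 0.9], 0), ([0.3, 0.7], 0)]
validation_set = [([0.85, 0.2], 1), ([0.15, 0.75], 0)]

# Candidate hyperparameter values (an arbitrary illustrative grid)
learning_rates = [0.01, 0.1, 1.0]

# Train once per candidate and score each model on held-out data
results = {}
for lr in learning_rates:
    weights, bias = train(train_set, lr)
    results[lr] = log_loss(validation_set, weights, bias)

best_lr = min(results, key=results.get)
best_weights, best_bias = train(train_set, best_lr)
print("validation loss per learning rate:", results)
print("selected learning rate:", best_lr)
```

With a real deep network the same loop shape applies, but each training run is far more expensive and the search usually spans several hyperparameters and architecture choices at once, which is one reason experiment tracking matters so much.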
Training a deep learning model requires very large sets of tagged data, as well as considerable processing power, typically provided by graphics processing units (GPUs). Deep neural networks can also leverage diverse types of data across multidimensional layers that represent a nested hierarchy of related concepts, in which the answer to one question leads to another set of related questions.
Deep learning, therefore, is a very promising AI approach for extracting maximum value from big data. Enterprises that have access to unique labeled data sets will have a distinct competitive edge. However, there are still considerable challenges in training deep learning algorithms, as described below.
- Data selection and cleansing: The data driving the algorithms and decisions must be relevant, comprehensive, and of high quality. Once the appropriate data sets have been aggregated, data scientists have to detect and then correct or remove corrupt, incomplete, or inaccurate records. Although there are data scrubbing tools to assist in this process, it still requires a great deal of human intervention.
- Data tagging: In supervised deep learning, the training data has to be tagged, or labeled, with the correct target answer. Although it is possible to purchase tagged training data sets, enterprises that have unique training data must undertake the tedious and error-prone process of manual tagging.
- Data versioning: Because data changes have a significant impact on model performance, it is imperative to painstakingly track source data and metadata versions in order to evaluate the impact of data changes on model performance from experiment to experiment.
- Experiment versioning: In addition to managing the data sets themselves, deep learning experiments also require careful orchestration and tracking of many other elements such as logs, code, hyperparameters, and compute resources.
- Running the experiments: Training typically involves a highly iterative and workload-intensive series of experiments. A great deal of time and effort is spent copying the right data to the right target machine, as well as scaling training machines up and down to meet dynamic compute requirements.
- Documentation and evaluation: Careful documentation is a prerequisite for being able to learn from each experiment cycle and plan the next iteration of machine + data + parameters in order to improve the inference and prediction models.
In short, training deep learning models is a highly iterative process that requires constant and careful attention to data cleansing and versioning, to planning and scaling workload-intensive experiments, and to documenting and evaluating results in order to improve the inference and prediction models from experiment to experiment. Today these tasks are primarily manual, making them time-consuming, frustrating and error-prone. Frameworks that could automate these processes would dramatically enhance the productivity of deep learning workflows.
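As a rough illustration of what automating data versioning and experiment documentation could look like, the sketch below fingerprints a dataset with a content hash and stores that fingerprint next to the hyperparameters and metrics of a run. It uses only the Python standard library. The helper names (`dataset_version`, `experiment_record`), the record layout, and the metric values are hypothetical and do not represent any particular product's API.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short fingerprint that changes whenever any record changes.

    Records are serialized with sorted keys and then sorted as strings,
    so the fingerprint does not depend on record order or key order.
    """
    digest = hashlib.sha256()
    for line in sorted(json.dumps(r, sort_keys=True) for r in records):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


def experiment_record(records, hyperparameters, metrics):
    """Bundle everything needed to compare this run with later ones."""
    return {
        "data_version": dataset_version(records),
        "hyperparameters": hyperparameters,
        "metrics": metrics,
    }


# Hypothetical tagged data: the second version corrects one label
v1 = [{"image": "img_001.jpg", "label": "cat"},
      {"image": "img_002.jpg", "label": "dog"}]
v1_reordered = list(reversed(v1))
v2 = [{"image": "img_001.jpg", "label": "cat"},
      {"image": "img_002.jpg", "label": "cat"}]

# Metric values below are placeholders, not measured results
run_a = experiment_record(v1, {"learning_rate": 0.1, "epochs": 20}, {"val_accuracy": 0.0})
run_b = experiment_record(v2, {"learning_rate": 0.1, "epochs": 20}, {"val_accuracy": 0.0})

# Same hyperparameters, different data: the record makes that visible
print(json.dumps([run_a, run_b], indent=2))
```

Because the fingerprint depends only on content, reordering the records leaves the version unchanged, while correcting a single label produces a new one. That makes it possible to tell whether a change in model performance between two runs came from the data or from the hyperparameters.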
Accelerating Deep Learning Through Automation
In order to overcome the challenges described in the previous section, there is a need for data science automation tools that can accelerate continuous deep learning training and deployment processes.
Thankfully, frameworks have emerged that relieve data scientists of the need to directly implement the complex mathematical computations that underlie the deep learning training and inference algorithms. Some of the better-known frameworks are:
- TensorFlow: An open-source software library for high-performance numerical computation.
- Keras: A high-level, user-friendly neural network API written in Python, used to quickly model and run deep learning prototypes on TensorFlow and similar open-source deep learning libraries such as Microsoft’s Cognitive Toolkit (CNTK) or Theano.
- Apache MXNet: A library that provides optimized numerical computation for distributed ecosystems and multi-GPU training.
- Caffe: A deep learning framework that encourages experimentation, with models and optimization defined by configuration rather than hard-coding.
- Apache Spark: An open-source platform that scales compute across numerous nodes and supports streaming data, making it relevant for stream training or inference.
By shielding teams from the math that is “under the hood” of deep learning algorithms and by providing open-source neural network architectures, these frameworks have certainly made deep learning more accessible. However, to better equip data scientists to meet the growing demand for smart, AI-based solutions that thrive on big data, there is still a need for comprehensive deep learning automation frameworks like MissingLink.ai that support continuous integration and deployment through:
- Managing and tracking data in order to save time exploring which data is needed for training, copying data to the training machines, etc.
- Managing resources across multiple simultaneous experiments.
- Automated scaling of training experiments.
- Auto-documenting experiment results so that researchers can quickly evaluate and plan the next iteration.
A Final Note
We live in an exciting time in which we’re finally learning how to harness the big data we generate for real-life, real-time applications that optimize how we sell, purchase, produce, drive, maintain our health, and much more. There are more and more tools that free data scientists from time-consuming manual workflows and promote scalability as they train and deploy ever more complex deep learning models. To this end, MissingLink provides a scalable, robust deep learning management platform that automates and accelerates key processes such as data aggregation and cleansing, data and experiment versioning, documentation, and on-demand provisioning of compute resources. Visit our website to learn more.