Introducing Snorkel

How this Tiny Project Solves One of the Major Problems in Real World Machine Learning Solutions

Building high-quality training datasets is one of the most difficult challenges of machine learning solutions in the real world. Disciplines like deep learning have helped us build more accurate models but, to do so, they require vastly larger volumes of training data. Now, saying that effective machine learning requires a lot of training data is like saying that “you need a lot of money to be rich”. It’s true, but it doesn’t make it less painful to get there. In many of the machine learning projects we work on at Invector Labs, our customers spend significantly more time collecting and labeling training datasets than building machine learning models. Last year, we came across a small project created by artificial intelligence (AI) researchers from Stanford University that provides a programming model for the creation of training datasets. Ever since, Snorkel has become a regular component of our machine learning implementations.

If we think about the traditional process for building a training dataset, it involves three major steps: data collection, data labeling and feature engineering. From the complexity standpoint, data collection is fundamentally trivial, as most organizations understand what data sources they have. Feature engineering is getting to the point where it is 70%-80% automated using algorithms. The real effort is in the data labeling stage.

Labeling training data often involves domain experts manually processing large semi-structured or unstructured datasets. This is typically known as strong supervision labeling; it tends to produce very high-quality datasets but is cost-prohibitive for most companies. Alternatively, weak supervision labeling relies on programmable heuristics that produce noisy labels. One of the most popular weak labeling techniques is distant supervision, in which the records of an external knowledge base are heuristically aligned with data points to produce noisy labels. While less accurate, weak labeling techniques are more feasible from the cost perspective.
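
To make distant supervision concrete, here is a minimal sketch in plain Python. The knowledge base, the names, the sentences, and the function `distant_label` are all made up for illustration, and a real pipeline would rely on an entity tagger rather than substring matching.

```python
# Hypothetical knowledge base of known married couples (illustrative data only).
KNOWN_SPOUSES = {
    frozenset({"Alice Smith", "Bob Smith"}),
    frozenset({"Carol Jones", "Dan Jones"}),
}

def distant_label(sentence, person_a, person_b):
    """Return 1 if both people are mentioned and the pair is in the knowledge
    base, otherwise 0 (abstain). The label is noisy: co-occurrence does not
    guarantee that the sentence actually expresses the relationship."""
    both_mentioned = person_a in sentence and person_b in sentence
    pair_known = frozenset({person_a, person_b}) in KNOWN_SPOUSES
    return 1 if both_mentioned and pair_known else 0

examples = [
    ("Alice Smith and Bob Smith attended the gala.", "Alice Smith", "Bob Smith"),
    ("Alice Smith met Erin Lee in Berlin.", "Alice Smith", "Erin Lee"),
]
noisy_labels = [distant_label(s, a, b) for s, a, b in examples]
print(noisy_labels)
```

The first sentence receives a positive label even though it never states that the two people are married, which is exactly the kind of noise that weak supervision has to tolerate.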

In our experience at Invector Labs, large training datasets are often a combination of strong and weak supervision methods. While that might sound obvious conceptually, doing it effectively in real-world machine learning scenarios can become nothing short of a nightmare. In 2016, AI researchers from Stanford University introduced a new paradigm known as data programming that allows data engineers to express weak supervision strategies and generate probabilistic training labels representing the lineage of the individual labels. The ideas behind data programming were incredibly compelling but lacked a practical implementation.

Enter Snorkel

Snorkel is the first implementation of Stanford’s data programming paradigm for weak supervision training models. Snorkel uses a set of programmable labeling functions to express different weak supervision strategies and then generates a model based on the effectiveness of the different strategies. In that context, Snorkel streamlines the job of a data engineer by creating a weak supervision model that can be used to build an effective training dataset.

From the functional standpoint, the Snorkel project was created with three fundamental goals as a guideline:

1) Bring All Sources to Bear: The system should enable users to opportunistically use labels from all available weak supervision sources.

2) Training Data as the Interface to ML: The system should model label sources to produce a single, probabilistic label for each data point and train any of a wide range of classifiers to generalize beyond those sources.

3) Supervision as Interactive Programming: The system should provide rapid results in response to user supervision. We envision weak supervision as the REPL-like interface for machine learning.

Labeling functions are the core concept behind Snorkel. Conceptually, a labeling function receives a candidate object as an input and produces an output indicating whether the object matches a specific labeling criterion. For instance, a labeling function might check whether a specific chemical can be the cause of a particular disease.
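
A labeling function of that kind could look roughly like the sketch below. It does not use Snorkel's own candidate classes: `Candidate`, `text_between`, `lf_causal_cue` and the cue-word list are hypothetical stand-ins chosen for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a candidate: a sentence plus the chemical and
# disease mentions found in it.
Candidate = namedtuple("Candidate", ["sentence", "chemical", "disease"])

# Illustrative cue words; a real project would curate this list with experts.
CAUSAL_CUES = ("induce", "induced", "induces", "cause", "caused", "causes")

def text_between(c):
    """Return the lowercased text between the chemical and the disease mention,
    or an empty string if the chemical does not come first."""
    start = c.sentence.index(c.chemical) + len(c.chemical)
    end = c.sentence.index(c.disease)
    return c.sentence[start:end].lower() if start <= end else ""

def lf_causal_cue(c):
    """Vote 1 when a causal cue word sits between the two mentions, else 0 (abstain)."""
    words = text_between(c).split()
    return 1 if any(cue in words for cue in CAUSAL_CUES) else 0

c1 = Candidate("Prolonged aspirin use caused gastric bleeding in the cohort.",
               "aspirin", "gastric bleeding")
c2 = Candidate("Patients with gastric bleeding were not given aspirin.",
               "aspirin", "gastric bleeding")
print(lf_causal_cue(c1), lf_causal_cue(c2))
```

A single heuristic like this is neither complete nor always right, which is why Snorkel is designed to combine many of them.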

The Snorkel workflow is divided into three main stages:

I. Writing Labeling Functions: In this phase, data engineers author labeling functions that express various weak supervision sources such as patterns, heuristics, external knowledge bases, and more.

II. Modeling Accuracies and Correlations: After the labeling functions are ready, Snorkel learns a generative model which estimates specific accuracies and correlations. The generative model is essentially a re-weighted combination of the user-provided labeling functions.

III. Training the Discriminative Model: In this stage, Snorkel produces a set of probabilistic labels that can be used to train a machine learning model.
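
The idea of a "re-weighted combination" in stage II can be sketched in a few lines. This is not Snorkel's generative model, which learns accuracies and correlations without ground truth. The sketch assumes the accuracies are already known, that labeling functions vote independently, and that votes are encoded as +1, -1 or 0 (abstain); `combine_votes` and the numbers are illustrative.

```python
import math

def combine_votes(votes, accuracies):
    """Turn one candidate's votes into P(label = +1) with an
    accuracy-weighted vote. Each labeling function contributes
    log(a / (1 - a)) in the direction of its vote; abstentions add nothing."""
    weights = [math.log(a / (1.0 - a)) for a in accuracies]
    log_odds = sum(w * v for w, v in zip(weights, votes))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Assumed accuracies for three labeling functions (illustrative values).
assumed_accuracies = [0.9, 0.6, 0.6]

# One row per candidate, one column per labeling function.
label_matrix = [
    [1, -1, -1],  # one accurate function disagrees with two weak ones
    [1, 1, 0],    # two functions agree, one abstains
    [0, 0, 0],    # nobody votes
]
marginals = [combine_votes(row, assumed_accuracies) for row in label_matrix]
print(marginals)
```

In the first row a simple majority vote would pick the negative label, while the weighted combination still favors the positive label because the dissenting function is assumed to be far more accurate. The row with no votes stays at 0.5.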

The three main stages of the Snorkel workflow are illustrated in the following figure:

Snorkel in Action

Using Snorkel is relatively easy. Let’s take an example that tries to extract spouse relationships from news articles. The first step is to create a Snorkel session:

import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

After that, we need to create a series of labeling functions that detect possible spouse relationships as shown in the following code:

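from snorkel.lf_helpers import get_between_tokens
spouses = {'spouse', 'wife', 'husband', 'ex-wife', 'ex-husband'}
def LF_husband_wife(c):
    # Vote 1 if a spouse-related word appears between the two person mentions
    return 1 if len(spouses.intersection(get_between_tokens(c))) > 0 else 0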

The next step is to use the labeling functions as part of a labeler using the following code:

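from snorkel.annotations import LabelAnnotator
# Collect every labeling function defined so far
LFs = [LF_husband_wife]
labeler = LabelAnnotator(lfs=LFs)
np.random.seed(1701)
# %time is a Jupyter/IPython magic; drop it when running a plain script
%time L_train = labeler.apply(split=0)
L_train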
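
The resulting label matrix has one row per candidate and one column per labeling function. A quick sanity check at this point is how often each function actually votes. The sketch below uses a plain list of lists as a stand-in for the matrix; `lf_coverage`, `count_unlabeled` and the toy values are hypothetical and not part of Snorkel.

```python
def lf_coverage(label_matrix):
    """Fraction of candidates on which each labeling function casts a nonzero vote."""
    n_rows = len(label_matrix)
    n_lfs = len(label_matrix[0])
    return [
        sum(1 for row in label_matrix if row[j] != 0) / n_rows
        for j in range(n_lfs)
    ]

def count_unlabeled(label_matrix):
    """Number of candidates that received no vote from any labeling function."""
    return sum(1 for row in label_matrix if all(v == 0 for v in row))

# Toy matrix: 4 candidates, 2 labeling functions, 0 means "no vote".
toy_matrix = [
    [1, 0],
    [0, 0],
    [1, 1],
    [0, 1],
]
coverage = lf_coverage(toy_matrix)
unlabeled = count_unlabeled(toy_matrix)
print(coverage, unlabeled)
```

Low coverage or many unlabeled candidates usually means more labeling functions are needed before training the generative model.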

At this point, we train a generative model on the label matrix produced by the labeling functions:

from snorkel.learning import GenerativeModel

gen_model = GenerativeModel()
gen_model.train(L_train, epochs=100, decay=0.95, step_size=0.1 / L_train.shape[0], reg_param=1e-6)

Finally, we generate the probabilistic training labels which can be used to train a specific classification model.

train_marginals = gen_model.marginals(L_train)
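
To show how probabilistic labels can drive a downstream classifier, here is a minimal sketch of a one-feature logistic regression trained directly on soft labels. It is not Snorkel code: the feature values, the soft labels and `train_noise_aware` are assumptions for illustration, and the loss is the cross-entropy against the probabilistic label rather than against a hard 0/1 label.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_noise_aware(features, soft_labels, epochs=500, step_size=0.5):
    """Fit weight w and bias b by gradient descent on the cross-entropy
    between the prediction and the probabilistic label p. The gradient of
    that loss with respect to the logit is simply (prediction - p)."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, p in zip(features, soft_labels):
            err = sigmoid(w * x + b) - p
            grad_w += err * x / n
            grad_b += err / n
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b

# Hypothetical single feature per candidate and the marginals assigned to it.
features = [-2.0, -1.0, 1.0, 2.0]
soft_labels = [0.1, 0.3, 0.7, 0.9]

w, b = train_noise_aware(features, soft_labels)
print(sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))
```

Because uncertain candidates carry labels near 0.5, they pull on the model less than confident ones, which is the practical benefit of keeping the labels probabilistic.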

Snorkel is a relatively new project but one that brings a tremendous amount of value to machine learning implementations. By using probabilistic, weak supervision labeling, Snorkel can help reduce the time and effort that it takes to create effective training datasets. Snorkel works seamlessly with popular Python-based deep learning frameworks such as TensorFlow, PyTorch or Caffe2. Currently, Snorkel is part of the ambitious DAWN Project by Stanford University and the code is available on GitHub.


Introducing Snorkel was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.