Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models \(2.8\times\) faster and increase predictive performance an average \(45.5\%\) versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to \(1.8\times\) speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides \(132\%\) average improvements to predictive performance over prior heuristic approaches and comes within an average \(3.60\%\) of the predictive performance of large hand-curated training sets.
In the last several years, there has been an explosion of interest in machine learning-based systems across industry, government, and academia, with an estimated spend this year of $12.5 billion. A central driver has been the advent of deep learning techniques, which can learn task-specific representations of input data, obviating what used to be the most time-consuming development task: feature engineering.
These learned representations are particularly effective for tasks like natural language processing and image analysis, which have high-dimensional, high-variance input that is impossible to fully capture with simple rules or hand-engineered features. However, deep learning has a major upfront cost: these methods need massive training sets of labeled examples to learn from, often tens of thousands to millions, to reach peak predictive performance. Such training sets are enormously expensive to create, especially when domain expertise is required. For example, reading scientific papers, analyzing intelligence data, and interpreting medical images all require labeling by trained subject matter experts (SMEs). Moreover, we observe from our engagements with collaborators like research laboratories and major technology companies that modeling goals such as class definitions or granularity change as projects progress, necessitating re-labeling.

Some big companies are able to absorb this cost, hiring large teams to label training data. Other practitioners utilize classic techniques like active learning, transfer learning, and semi-supervised learning to reduce the number of training labels needed. However, the bulk of practitioners are increasingly turning to some form of weak supervision: cheaper sources of labels that are noisier or heuristic. The most popular form is distant supervision, in which the records of an external knowledge base are heuristically aligned with data points to produce noisy labels. Other forms include crowdsourced labels and rules or heuristics for labeling data.
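To make the notion of a heuristic labeling source concrete, the following is a minimal sketch, in plain Python, of two such rules for a hypothetical chemical-disease relation task: one aligns candidate pairs against an external knowledge base (the distant supervision setting), and one fires on a simple textual pattern. The task, the function names, the label convention, and the KNOWN_PAIRS knowledge base are illustrative assumptions, not details taken from this paper or from Snorkel's interface.

import re

# Illustrative label convention (an assumption for this sketch):
# 1 = positive, -1 = negative, 0 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

# Hypothetical external knowledge base of known chemical-disease pairs,
# standing in for the distant supervision setting described above.
KNOWN_PAIRS = {("aspirin", "reye syndrome")}

def lf_distant_supervision(chemical, disease):
    # Vote positive if the pair appears in the knowledge base; otherwise abstain.
    if (chemical.lower(), disease.lower()) in KNOWN_PAIRS:
        return POSITIVE
    return ABSTAIN

def lf_causes_pattern(sentence):
    # A hand-written rule: the word "causes" in the sentence suggests a
    # positive label; otherwise abstain.
    return POSITIVE if re.search(r"\bcauses\b", sentence.lower()) else ABSTAIN

Rules of this kind are cheap to write, but each one covers only part of the data and can be wrong, which motivates the need to combine many of them.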
While these sources are inexpensive, they often have limited accuracy and coverage. Ideally, we would combine the labels from many weak supervision sources to increase the accuracy and coverage of our training set. However, two key challenges arise in doing so effectively. First, sources will overlap and conflict, and to resolve their conflicts we need to estimate their accuracies and correlation structure, without access to ground truth.
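As a point of reference for this first challenge, the simplest way to combine overlapping sources is an unweighted majority vote, which ignores exactly the per-source accuracies and correlations that need to be estimated. The sketch below shows only this naive baseline for intuition, under the 1/-1/0 label convention assumed in the earlier sketch; it is not the denoising approach the system actually uses.

from collections import Counter

def majority_vote(votes):
    # votes: the labels emitted by each source for a single data point,
    # using the assumed convention 1 = positive, -1 = negative, 0 = abstain.
    counts = Counter(v for v in votes if v != 0)
    if not counts:
        return 0  # every source abstained
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return 0  # sources conflict with no clear winner
    return ranked[0][0]

# Example: three sources vote positive, one negative, one abstains.
# majority_vote([1, 1, -1, 0, 1]) returns 1, regardless of how reliable
# each individual source actually is.

Because this baseline weights a highly accurate source and a noisy one identically, resolving conflicts well requires estimating each source's accuracy and the correlations among sources, without ground truth labels.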