A system for quickly generating training data with weak supervision

Overview


Programmatically Build and Manage Training Data

Announcement

The Snorkel team is now focusing their efforts on Snorkel Flow, an end-to-end AI application development platform based on the core ideas behind Snorkel—you can check it out here or join us in building it!

The Snorkel project started at Stanford in 2015 with a simple technical bet: that it would increasingly be the training data, not the models, algorithms, or infrastructure, that decided whether a machine learning project succeeded or failed. Given this premise, we set out to explore the radical idea that you could bring mathematical and systems structure to the messy and often entirely manual process of training data creation and management, starting by empowering users to programmatically label, build, and manage training data.

To say that the Snorkel project succeeded and expanded beyond what we had ever expected would be an understatement. The basic goals of a research repo like Snorkel are to provide a minimum viable framework for testing and validating hypotheses. Four years later, we’ve been fortunate to do not just this, but to develop and deploy early versions of Snorkel in partnership with some of the world’s leading organizations like Google, Intel, Stanford Medicine, and many more; author over thirty-six peer-reviewed publications on our findings around Snorkel and related innovations in weak supervision modeling, data augmentation, multi-task learning, and more; be included in courses at top-tier universities; support production deployments in systems that you’ve likely used in the last few hours; and work with an amazing community of researchers and practitioners from industry, medicine, government, academia, and beyond.

However, we realized increasingly–from conversations with users in weekly office hours, workshops, online discussions, and industry partners–that the Snorkel project was just the very first step. The ideas behind Snorkel change not just how you label training data, but so much of the entire lifecycle and pipeline of building, deploying, and managing ML: how users inject their knowledge; how models are constructed, trained, inspected, versioned, and monitored; how entire pipelines are developed iteratively; and how the full set of stakeholders in any ML deployment, from subject matter experts to ML engineers, are incorporated into the process.

Over the last year, we have been building the platform to support this broader vision: Snorkel Flow, an end-to-end machine learning platform for developing and deploying AI applications. Snorkel Flow incorporates many of the concepts of the Snorkel project with a range of newer techniques around weak supervision modeling, data augmentation, multi-task learning, data slicing and structuring, monitoring and analysis, and more, all of which integrate in a way that is greater than the sum of its parts–and that we believe makes ML truly faster, more flexible, and more practical than ever before.

Moving forward, we will be focusing our efforts on Snorkel Flow. We are extremely grateful for all of you that have contributed to the Snorkel project, and are excited for you to check out our next chapter here.

Getting Started

The quickest way to familiarize yourself with the Snorkel library is to walk through the Get Started page on the Snorkel website, followed by the full-length tutorials in the Snorkel tutorials repository. These tutorials demonstrate a variety of tasks, domains, labeling techniques, and integrations that can serve as templates as you apply Snorkel to your own applications.
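
To get a feel for the workflow before diving into the tutorials, here is a minimal sketch: a few toy labeling functions applied to a toy DataFrame, followed by a LabelModel fit. The data and labeling function names are invented for illustration; the imports follow the Snorkel v0.9 API.

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEG, POS = -1, 0, 1

@labeling_function()
def lf_contains_great(x):
    return POS if "great" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_awful(x):
    return NEG if "awful" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_exclamation(x):
    return POS if "!" in x.text else ABSTAIN

df_train = pd.DataFrame({"text": ["A great product!", "An awful experience", "Fine, I guess"]})

# Apply the LFs to get an (n_examples, n_lfs) label matrix; -1 means abstain
applier = PandasLFApplier([lf_contains_great, lf_contains_awful, lf_exclamation])
L_train = applier.apply(df_train)

# Fit the label model and produce probabilistic training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L_train)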

Installation

Snorkel requires Python 3.6 or later. To install Snorkel, we recommend using pip:

pip install snorkel

or conda:

conda install snorkel -c conda-forge

For information on installing from source and contributing to Snorkel, see our contributing guidelines.

Details on installing with conda

The following example commands give some more color on installing with conda. These commands assume that your conda installation is Python 3.6, and that you want to use a virtual environment called snorkel-env.

# [OPTIONAL] Create and activate a virtual environment called "snorkel-env"
conda create --yes -n snorkel-env python=3.6
conda activate snorkel-env

# We specify PyTorch here to ensure compatibility, but it may not be necessary.
conda install pytorch==1.1.0 -c pytorch
conda install snorkel==0.9.0 -c conda-forge

A quick note for Windows users

If you're using Windows, we highly recommend using Docker (you can find an example in our tutorials repo) or the Linux subsystem. We've done limited testing on Windows, so if you want to contribute instructions or improvements, feel free to open a PR!

Discussion

Issues

We use GitHub Issues for posting bugs and feature requests — anything code-related. Just make sure you search for related issues first and use our Issue templates. We may ask for contributions if a prompt fix doesn't fit into the immediate roadmap of the core development team.

Contributions

We welcome contributions from the Snorkel community! This is likely the fastest way to get a change you'd like to see into the library.

Small contributions can be made directly in a pull request (PR). If you would like to contribute a larger feature, we recommend first creating an issue with a proposed design for discussion. For ideas about what to work on, we've labeled specific issues as help wanted.

To set up a development environment for contributing back to Snorkel, see our contributing guidelines. All PRs must pass the continuous integration tests and receive approval from a member of the Snorkel development team before they will be merged.

Community Forum

For broader Q&A, discussions about using Snorkel, tutorial requests, etc., use the Snorkel community forum hosted on Spectrum. We hope this will be a venue for you to interact with other Snorkel users — please don't be shy about posting!

Announcements

To stay up-to-date on Snorkel-related announcements (e.g. version releases, upcoming workshops), subscribe to the Snorkel mailing list. We promise to respect your inboxes — communication will be sparse!

Twitter

Follow us on Twitter @SnorkelAI.

Comments
  • connectionError while parse_corpus

    from snorkel.parser import CorpusParser
    cp = CorpusParser(doc_parser, sent_parser)
    %time corpus = cp.parse_corpus(session, 'News Training')
    ---------------------------------------------------------------------------
    ConnectionError                           Traceback (most recent call last)
    <ipython-input-5-277d2c9f9bed> in <module>()
          2 
          3 cp = CorpusParser(doc_parser, sent_parser)
    ----> 4 get_ipython().magic(u"time corpus = cp.parse_corpus(session, 'News Training')")
    
    /usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
       2156         magic_name, _, magic_arg_s = arg_s.partition(' ')
       2157         magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
    -> 2158         return self.run_line_magic(magic_name, magic_arg_s)
       2159 
       2160     #-------------------------------------------------------------------------
    
    /usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
       2077                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
       2078             with self.builtin_trap:
    -> 2079                 result = fn(*args,**kwargs)
       2080             return result
       2081 
    
    <decorator-gen-59> in time(self, line, cell, local_ns)
    
    /usr/local/lib/python2.7/dist-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
        186     # but it's overkill for just that one bit of state.
        187     def magic_deco(arg):
    --> 188         call = lambda f, *a, **k: f(*a, **k)
        189 
        190         if callable(arg):
    
    /usr/local/lib/python2.7/dist-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
       1178         else:
       1179             st = clock2()
    -> 1180             exec(code, glob, local_ns)
       1181             end = clock2()
       1182             out = None
    
    <timed exec> in <module>()
    
    /home/yejunbin/Github/snorkel/snorkel/parser.pyc in parse_corpus(self, session, name)
         38                     break
         39             corpus.append(doc)
    ---> 40             for _ in self.sent_parser.parse(doc, text):
         41                 pass
         42         if self.max_docs is not None:
    
    /home/yejunbin/Github/snorkel/snorkel/parser.pyc in parse(self, doc, text)
        274     def parse(self, doc, text):
        275         """Parse a raw document as a string into a list of sentences"""
    --> 276         for parts in self.corenlp_handler.parse(doc, text):
        277             yield Sentence(**parts)
    
    /home/yejunbin/Github/snorkel/snorkel/parser.pyc in parse(self, document, text)
        211         if isinstance(text, unicode):
        212             text = text.encode('utf-8', 'error')
    --> 213         resp = self.requests_session.post(self.endpoint, data=text, allow_redirects=True)
        214         text = text.decode('utf-8')
        215         content = resp.content.strip()
    
    /usr/local/lib/python2.7/dist-packages/requests/sessions.pyc in post(self, url, data, json, **kwargs)
        520         """
        521 
    --> 522         return self.request('POST', url, data=data, json=json, **kwargs)
        523 
        524     def put(self, url, data=None, **kwargs):
    
    /usr/local/lib/python2.7/dist-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
        473         }
        474         send_kwargs.update(settings)
    --> 475         resp = self.send(prep, **send_kwargs)
        476 
        477         return resp
    
    /usr/local/lib/python2.7/dist-packages/requests/sessions.pyc in send(self, request, **kwargs)
        594 
        595         # Send the request
    --> 596         r = adapter.send(request, **kwargs)
        597 
        598         # Total elapsed time of the request (approximately)
    
    /usr/local/lib/python2.7/dist-packages/requests/adapters.pyc in send(self, request, stream, timeout, verify, cert, proxies)
        485                 raise ProxyError(e, request=request)
        486 
    --> 487             raise ConnectionError(e, request=request)
        488 
        489         except ClosedPoolError as e:
    
    ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=12345): Max retries exceeded with url: /?properties=%7B%22annotators%22:%20%22tokenize,ssplit,pos,lemma,depparse,ner%22,%20%22outputFormat%22:%20%22json%22%7D (Caused by ProtocolError('Connection aborted.', BadStatusLine("''",)))
    
    opened by yejunbin 34
  • Training data for training a NER model

    Hi,

    I have two questions about the ways of using the snorkel:

    1. I am trying to train a Stanford CoreNLP NER model to pick out all the software names and vendor names from the comments and descriptions of software (raw text). For this I tried doing some manual labeling, which is very time consuming, but I can clearly see the improvement in accuracy as I keep increasing the training data. So I am trying to use Snorkel to produce some training data for me, but it seems (in the tutorials) that Snorkel already uses CoreNLP NER models and generates training data for extracting relations between two entities. Is there a way to use Snorkel to create training data for extracting entities rather than their relations?

    2. I have also used DeepDive to extract relations between entities, and when I skim through the tutorials I am not able to find much difference between DeepDive and Snorkel. Is Snorkel the Python version of DeepDive?

    Q&A 
    opened by arjasethan1 27
  • ImportError: No module named 'snorkel'

    Hello all,

    I am trying to execute the gene tagging example and I am stuck on step 1: Obtain and parse input data. I am trying to import functions from the snorkel.parser module but keep getting an error as follows:

    from snorkel.parser import HTMLDocParser causes the following error:

    ImportError                               Traceback (most recent call last)
    in ()
    ----> 1 from snorkel.parser import HTMLDocParser

    ImportError: No module named 'snorkel'

    opened by Geldren1 19
  • Add preliminary spacy v3 support.

    Description of proposed changes

    Just add the new wrapper parameter to support spaCy v3.

    Related issue(s)

    #1701: preliminary spaCy v3 support.

    Need help on these? Just ask! Hello, I just reviewed the spaCy v3 wrapper, added the parameter exclude, and wrote the docs for the changed parameters exclude and disable. I am not sure what other functions Snorkel uses from spaCy; please help me confirm any other needed changes. Thanks.

    • [*] I have read the CONTRIBUTING document.
    • [*] I have updated the documentation accordingly.
    • [*] I have added tests to cover my changes.
    • [ ] I have run tox -e complex and/or tox -e spark if appropriate.
    • [ ] All new and existing tests passed.
    no-stale 
    opened by yinxiangshi 18
  • negative predictions for data point with positive labels

    Hi, I am using Snorkel for a search result quality prediction project. I have quite a few cases where the predictions do not quite make sense to me. There are false negative results where the prediction is 0 (negative), but if you look at the labels assigned by each labeling function, they are either 1 (positive) or -1 (abstain). There are nearly no 0 (negative) labels at all. Could anyone help clarify this for me please? Thanks, Peng

    bug 
    opened by pengdumle 17
  • LabelModel produces equal probability for labeled data

    Issue description

    I am using Snorkel to create binary text classification training examples with 9 labeling functions. However, I find that some data points trained with the label model receive a probabilistic label with equal probabilities (i.e. [0.5 0.5]), even though they only receive labels from one class from the labeling functions (e.g. [-1 0 -1 -1 0 -1 -1 -1 -1], so only class 0 or ABSTAIN). Why is that?

    Besides, I find that setting verbose=True when defining the LabelModel does not print the logging information.

    Lastly, if producing the label [0.5 0.5] is normal behavior, then such data points should also be removed when filtering out unlabeled data points, because they do not contribute to the training (of a classifier), and if the classifier does not support probabilistic labels (e.g. fastText), using argmax will always lead to class 0 (which is undesired).

    Code example/repro steps

    Below I show my code of defining LabelModel:

    from snorkel.labeling.model import LabelModel  # import added for completeness

    print('==================================================')
    print('Training label model...')
    label_model = LabelModel(cardinality=2, verbose=True)
    label_model.fit(L_train=L_train, n_epochs=10000, lr=0.001, log_freq=100, seed=1)
    print('Done!')
    

    Below I show some of the logs I print:

    Check results for data in the training set:  a_text_in_the_training_set_but_I_removed_here
            * Output of L_train for this data point is: [-1  0 -1 -1  0 -1 -1 -1 -1]
            * Output of probs_train for this data point is: [0.5 0.5]
            * Output of probs_train_filtered for this data point is: [0.5 0.5]
            * Output of preds_train_filtered for this data point is: 0
    

    Expected behavior

    I expect that if a data point receives labels from only a single class, it should not get equal probabilities for both classes after the label model is trained.
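
    A possible workaround (a sketch, assuming tied points may simply be dropped; not an official resolution from this thread): LabelModel.predict in Snorkel v0.9 accepts a tie_break_policy argument, so tied [0.5 0.5] points can be made to abstain and then filtered out together with unlabeled ones.

    # Sketch: ties become -1 under tie_break_policy="abstain", so they can
    # be dropped alongside genuinely unlabeled points. Assumes label_model
    # and L_train from the snippet above.
    preds_train = label_model.predict(L_train, tie_break_policy="abstain")
    keep = preds_train != -1
    L_train_kept, preds_train_kept = L_train[keep], preds_train[keep]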

    System info

    • How you installed Snorkel (conda, pip, source): pip
    • Build command you used (if compiling from source):
    • OS: Windows
    • Python version: 3.7
    • Snorkel version: 0.9
    • Versions of any other relevant libraries:
    bug 
    opened by jamie0725 16
  • Using snorkel for Image Classification/Detection

    I am working on a binary classifier/detector involving images: true if a particular object is present in the image and false if it isn't. Is there an example of making Snorkel work on image data? Thanks.
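
    One pattern, sketched here with hypothetical attribute names (this is not an official Snorkel image API): labeling functions receive arbitrary data points, so they can be written over precomputed image attributes or off-the-shelf detector outputs.

    from snorkel.labeling import labeling_function

    ABSTAIN, PRESENT, ABSENT = -1, 1, 0

    # Sketch: `x` is assumed to carry precomputed attributes (e.g. the score
    # of an off-the-shelf object detector); the attribute names are hypothetical.
    @labeling_function()
    def lf_detector_confident(x):
        return PRESENT if x.detector_score > 0.9 else ABSTAIN

    @labeling_function()
    def lf_image_too_small(x):
        return ABSENT if x.width * x.height < 32 * 32 else ABSTAIN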

    Q&A 
    opened by anilmaddala 16
  • Question: Has anyone used snorkel for tabular numerical data?

    I have a very large sample of tabular data, mainly numerical fields, where each line is an example I would like to try to label. Looking through the documentation and examples, I don't see a way to use the tool in this manner, or at least to easily get the data into a usable format. Does anyone know if this can be or has been done? Any thoughts? The concept seems similar, but your target audience was text-based labeling. Just wondering if it could be adapted. Thanks! Also, great work. Heard about the package on the O'Reilly Data Show.
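
    It can be adapted, at least in principle: labeling functions just see a row, so numeric fields work the same way text does. A minimal sketch, with hypothetical column names (amount, income):

    from snorkel.labeling import labeling_function

    ABSTAIN, ANOMALY, NORMAL = -1, 1, 0

    # Sketch: rows come from a pandas DataFrame; `amount` and `income`
    # are hypothetical numeric columns.
    @labeling_function()
    def lf_large_amount(x):
        return ANOMALY if x.amount > 10_000 else ABSTAIN

    @labeling_function()
    def lf_modest_ratio(x):
        return NORMAL if x.amount / max(x.income, 1) < 0.1 else ABSTAIN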

    Q&A 
    opened by matt256 15
  • Multi-Label classification

    Thank you for the very interesting work!

    I was wondering whether Snorkel is planning to extend its features to multi-label classification, e.g. for a semantic role labeling task?

    Thanks a lot

    feature request 
    opened by Wronskia 13
  • stable version of snorkel

    I would like to bring Snorkel to the conda user community, and it would be great to learn whether a release of the latest/stable master branch is coming anytime soon. The gzipped tarball of the most recent release (i.e., 0.7-beta) is 150 commits behind and a year old.

    any thoughts?

    opened by adbedada 12
  • New Website

    • Temporary redesign and restructure for Workshop
    • Highlight Snorkel captures Augment, Slice, and Label
    • Old links to Blogs, Papers, and Use Cases maintained
    • Overview Image for SFs
    opened by paroma 12
  • Dependency need update

    Issue description

    On Anaconda, Snorkel requires numpy=1.19.5; on Apple M-series chips, NumPy at that version cannot work. I am testing whether Snorkel works with Apple Metal GPU acceleration.

    Code example/repro steps

    import numpy
    Traceback (most recent call last):
      File "/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/__init__.py", line 22, in <module>
        from . import multiarray
      File "/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/multiarray.py", line 12, in <module>
        from . import overrides
      File "/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/overrides.py", line 7, in <module>
        from numpy.core._multiarray_umath import (
    ImportError: dlopen(/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-darwin.so, 0x0002): Library not loaded: '@rpath/libcblas.3.dylib'
      Referenced from: '/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-darwin.so'
      Reason: tried: '/Users/jhj/anaconda3/envs/snorkel/lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/../../../../libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/../../../../libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/bin/../lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/bin/../lib/libcblas.3.dylib' (no such file), '/usr/local/lib/libcblas.3.dylib' (no such file), '/usr/lib/libcblas.3.dylib' (no such file)
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/__init__.py", line 140, in <module>
        from . import core
      File "/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/__init__.py", line 48, in <module>
        raise ImportError(msg)
    ImportError:
    
    IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
    
    Importing the numpy C-extensions failed. This error can happen for
    many reasons, often due to issues with your setup or how NumPy was
    installed.
    
    We have compiled some common reasons and troubleshooting tips at:
    
        https://numpy.org/devdocs/user/troubleshooting-importerror.html
    
    Please note and check the following:
    
      * The Python version is: Python3.9 from "/Users/jhj/anaconda3/envs/snorkel/bin/python"
      * The NumPy version is: "1.19.5"
    
    and make sure that they are the versions you expect.
    Please carefully study the documentation linked above for further help.
    
    Original error was: dlopen(/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-darwin.so, 0x0002): Library not loaded: '@rpath/libcblas.3.dylib'
      Referenced from: '/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-darwin.so'
      Reason: tried: '/Users/jhj/anaconda3/envs/snorkel/lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/../../../../libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/lib/python3.9/site-packages/numpy/core/../../../../libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/bin/../lib/libcblas.3.dylib' (no such file), '/Users/jhj/anaconda3/envs/snorkel/bin/../lib/libcblas.3.dylib' (no such file), '/usr/local/lib/libcblas.3.dylib' (no such file), '/usr/lib/libcblas.3.dylib' (no such file)
    

    System info

    • How you installed Snorkel (conda, pip, source): conda
    • OS: MacOS 12.6
    • Python version: 3.9
    • Snorkel version: 0.9.9

    Additional context

    I did the following test:

    If I install NumPy before installing Snorkel, NumPy works.

    After installing Snorkel, NumPy is downgraded to 1.19.5, and then it no longer works.

    opened by yinxiangshi 2
  • Bump pyspark from 3.1.3 to 3.2.2

    Bumps pyspark from 3.1.3 to 3.2.2.

    Commits
    • 78a5825 Preparing Spark release v3.2.2-rc1
    • ba978b3 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases
    • 001d8b0 [SPARK-37554][BUILD] Add PyArrow, pandas and plotly to release Docker image d...
    • 9dd4c07 [SPARK-37730][PYTHON][FOLLOWUP] Split comments to comply pycodestyle check
    • bc54a3f [SPARK-37730][PYTHON] Replace use of MPLPlot._add_legend_handle with MPLPlot....
    • c5983c1 [SPARK-38018][SQL][3.2] Fix ColumnVectorUtils.populate to handle CalendarInte...
    • 32aff86 [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecu...
    • be891ad [SPARK-39551][SQL][3.2] Add AQE invalid plan check
    • 1c0bd4c [SPARK-39656][SQL][3.2] Fix wrong namespace in DescribeNamespaceExec
    • 3d084fe [SPARK-39677][SQL][DOCS][3.2] Fix args formatting of the regexp and like func...
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • probabilistic weight for each labelling function?

    Does Snorkel provide a probabilistic weight for each labeling function, accounting for the predicted label? For example, suppose we have three labeling functions resulting in a particular label output of 1. Does Snorkel provide the weights corresponding to each labeling function contributing to that particular prediction? If yes, can you please point me in the right direction?
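
    One pointer that may help (hedged, since this thread itself did not include an answer): LabelModel in Snorkel v0.9 exposes its learned per-LF weights via get_weights(), which returns one estimated accuracy per labeling function.

    # Sketch: inspect the learned weight (estimated accuracy) of each LF.
    # Assumes a fitted LabelModel; the LF names below are hypothetical.
    weights = label_model.get_weights()  # shape: (num_labeling_functions,)
    for name, w in zip(["lf_a", "lf_b", "lf_c"], weights):
        print(f"{name}: estimated accuracy {w:.3f}")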

    opened by arfeen93 1
  • Update documentation to emphasize the need to avoid broad LabelingFunctions for LabelModel

    Issue description

    I tried Snorkel on five different problems and while the framework does work... MajorityLabelVoter and / or MajorityClassVoter outperformed the LabelModel on every problem. This problem has been noted in the past. In frustration, I asked around and found the secret to using Snorkel from two different people:

    Snorkel's LabelModel doesn't work with broad coverage LabelingFunctions

    This is the key to making the LabelModel work. You have to write fairly narrow LabelingFunctions that can work in combination to achieve the coverage you need. A 50% coverage LF will break the LabelModel.

    For me, this method was good for a 3% performance bump across two classification problems.
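
    To make the distinction concrete, a sketch of "broad" versus "narrow" labeling functions for a toy spam task (these LFs are invented to illustrate the claim above, not taken from the issue):

    from snorkel.labeling import labeling_function

    ABSTAIN, HAM, SPAM = -1, 0, 1

    # Broad LF: fires on roughly half the data; per this issue, LFs like
    # this can degrade the LabelModel's accuracy estimation.
    @labeling_function()
    def lf_broad_short_text(x):
        return HAM if len(x.text) < 200 else ABSTAIN

    # Narrow LFs: high precision, low coverage; several of them combined
    # reach the coverage that the single broad LF had.
    @labeling_function()
    def lf_check_out_my(x):
        return SPAM if "check out my" in x.text.lower() else ABSTAIN

    @labeling_function()
    def lf_subscribe(x):
        return SPAM if "subscribe" in x.text.lower() else ABSTAIN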

    Big Question

    Where do I make this contribution? This will make Snorkel much more useful to its users, many of whom have had the same frustrations that I did. I would like it to get picked up on the website, and the tutorials haven't been updated in a while. Is that still the right place for this PR? https://github.com/snorkel-team/snorkel-tutorials

    cc @ajratner @henryre

    opened by rjurney 2
  •  AttributeError: split not found

    Issue description

    In the logistic regression model, when I run the code I get an AttributeError.

    Code example/repro steps

    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Note: stop_words, tokenizer, tokenizer_porter, preprocessor, X_train,
    # and y_train are defined elsewhere in the original notebook.

    tfidf = TfidfVectorizer(strip_accents=None,
                            lowercase=False,
                            preprocessor=None)
    
    param_grid = [{'vect__ngram_range': [(1, 1)],
                   'vect__stop_words': [stop_words, None],
                   'vect__tokenizer': [tokenizer, tokenizer_porter],
                   'vect__preprocessor': [None, preprocessor],
                   'clf__penalty': ['l1', 'l2'],
                   'clf__C': [1.0, 10.0, 100.0]},
                  {'vect__ngram_range': [(1,1)],
                   'vect__stop_words': [stop_words, None],
                   'vect__tokenizer': [tokenizer, tokenizer_porter],
                   'vect__preprocessor': [None, preprocessor],
                   'vect__use_idf':[False],
                   'vect__norm':[None],
                   'clf__penalty': ['l1', 'l2'],
                   'clf__C': [1.0, 10.0, 100.0]},
                  ]
    
    lr_tfidf = Pipeline([('vect', tfidf),
                         ('clf', LogisticRegression(random_state=0))])
    #The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid
    gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                               scoring='accuracy',
                               cv=5,
                               verbose=1,
                               n_jobs=-1)
    gs_lr_tfidf.fit(X_train, y_train)
    

    After running this code I get the split not found error:

    AttributeError: split not found

    no-stale 
    opened by WithIbadKhan 3
  • LabelModel should support loading sparse matrices

    Problem I Want To solve

    I've found it easy to generate millions of labels with labeling functions, but loading them into Snorkel is hard. The problem is the conversion to the augmented format and (for training) the calculation of the O matrix.

    Describe the solution you'd like

    In addition to letting the user load the full label matrix (n_docs, n_funcs), we can let the user load the indicator matrix (n_docs, n_funcs * n_labels) in sparse format, e.g. the user would input a list of tuples (doc_id, func_id * num_labels + label_id) to populate a sparse matrix. This makes the L_aug.T @ L_aug calculation cheap, and saves lots of time and memory building the indicator matrix.

    Torch supports sparse matrices, so we could even do training and inference without the memory hassle of the dense L matrix.

    Example:

    I calculate and store the label functions in SQL, so it's easy to generate that list of tuples.
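
    A sketch of what the proposed loading path could look like (the helper below is hypothetical, not part of Snorkel, and uses scipy for the sparse indicator matrix):

    import numpy as np
    from scipy.sparse import csr_matrix

    def indicator_from_tuples(tuples, n_docs, n_funcs, n_labels):
        """Build the sparse (n_docs, n_funcs * n_labels) indicator matrix
        from (doc_id, func_id, label_id) tuples; hypothetical helper."""
        rows = [doc_id for doc_id, _, _ in tuples]
        cols = [func_id * n_labels + label_id for _, func_id, label_id in tuples]
        data = np.ones(len(rows), dtype=np.int8)
        return csr_matrix((data, (rows, cols)), shape=(n_docs, n_funcs * n_labels))

    L_ind = indicator_from_tuples([(0, 0, 1), (0, 2, 0), (1, 1, 1)],
                                  n_docs=2, n_funcs=3, n_labels=2)
    O = (L_ind.T @ L_ind) / L_ind.shape[0]  # overlap matrix stays sparse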

    Caveat

    This would make modelling dependencies between LFs harder, but since _create_tree is degenerate, that doesn't seem to be an issue in practice.

    Describe alternatives you've considered

    The other alternative is some "big-data" solution, but that's a lot of friction for something I can do so simply.

    Additional context

    I'm implementing this anyway for my own fun; happy to contribute it back if there's interest.

    feature request help wanted no-stale 
    opened by talolard 3