nlp_classify_tokens
This component is mostly stable but breaking changes could still be made outside of major version releases if a fundamental problem with the component is found.
Performs token classification using a Hugging Face 🤗 NLP pipeline with an ONNX Runtime model.
Introduced in version v1.11.0.
# Common config fields, showing default values
label: ""
nlp_classify_tokens:
  name: "" # No default (optional)
  path: /path/to/models/my_model.onnx # No default (required)
  aggregation_strategy: SIMPLE
  ignore_labels: []

# All config fields, showing default values
label: ""
nlp_classify_tokens:
  name: "" # No default (optional)
  path: /path/to/models/my_model.onnx # No default (required)
  enable_download: false
  download_options:
    repository: KnightsAnalytics/distilbert-NER # No default (required)
    onnx_filepath: model.onnx
  aggregation_strategy: SIMPLE
  ignore_labels: []
Token Classification
Token classification assigns a label to individual tokens in a sentence. This processor runs token classification inference against batches of text data, returning a set of classified entities for each input. This component uses Hugot, a library that provides an interface for running Open Neural Network Exchange (ONNX) models and transformer pipelines, with a focus on NLP tasks.
Currently, Bento only implements:
What is a pipeline?
From HuggingFace docs:
A pipeline in 🤗 Transformers is an abstraction referring to a series of steps that are executed in a specific order to preprocess and transform data and return a prediction from a model. Some example stages found in a pipeline might be data preprocessing, feature extraction, and normalization.
Although only models in ONNX format are supported, most standard ML libraries make exporting existing formats to ONNX straightforward. For more on this, check out the ONNX conversion docs. Alternatively, HuggingFace Optimum offers easy model conversion.
Examples
Named Entity Recognition
Extract entities like persons, organizations, and locations from text.
pipeline:
  processors:
    - nlp_classify_tokens:
        path: "KnightsAnalytics/distilbert-NER"
        aggregation_strategy: "SIMPLE"
        ignore_labels: ["O"]
# In: "John works at Apple Inc. in New York."
# Out: [
#   {"Entity": "PER", "Score": 0.997136, "Index": 0, "Word": "John", "Start": 0, "End": 4, "IsSubword": false},
#   {"Entity": "ORG", "Score": 0.985432, "Index": 3, "Word": "Apple Inc.", "Start": 14, "End": 24, "IsSubword": false},
#   {"Entity": "LOC", "Score": 0.972841, "Index": 6, "Word": "New York", "Start": 28, "End": 36, "IsSubword": false}
# ]
Custom Entity Extraction
Extract entities with no aggregation to see individual token classifications.
pipeline:
  processors:
    - nlp_classify_tokens:
        path: "KnightsAnalytics/distilbert-NER"
        aggregation_strategy: "NONE"
        ignore_labels: ["O", "MISC"]
# In: "Microsoft was founded by Bill Gates."
# Out: [
#   {"Entity": "B-ORG", "Score": 0.991234, "Index": 0, "Word": "Microsoft", "Start": 0, "End": 9, "IsSubword": false},
#   {"Entity": "B-PER", "Score": 0.987654, "Index": 4, "Word": "Bill", "Start": 25, "End": 29, "IsSubword": false},
#   {"Entity": "I-PER", "Score": 0.976543, "Index": 5, "Word": "Gates", "Start": 30, "End": 35, "IsSubword": false}
# ]
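The entities returned by this processor can be refined by later processors in the same pipeline. The sketch below assumes the result is available as a structured JSON array with the field names shown in the outputs above, and uses a Bloblang mapping processor to keep only high-confidence entities. The 0.98 threshold and the output keys label and text are illustrative choices, not part of this component.

```yaml
pipeline:
  processors:
    - nlp_classify_tokens:
        path: /path/to/models/my_model.onnx
        aggregation_strategy: SIMPLE
        ignore_labels: ["O"]
    # Assumption: the previous processor's output is a JSON array of entity objects.
    # Keep entities scoring at least 0.98 and reduce each one to its label and text.
    - mapping: |
        root = this.filter(e -> e.Score >= 0.98).map_each(e -> {"label": e.Entity, "text": e.Word})
```

Applied to the Named Entity Recognition output above, this threshold would keep John and Apple Inc. and drop New York, whose score is 0.972841.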
Fields
name
Name of the hugot pipeline. Defaults to a random UUID if not set.
Type: string
path
Path to the ONNX model file, or directory containing the model. When downloading (enable_download: true), this becomes the destination and must be a directory.
Type: string
# Examples
path: /path/to/models/my_model.onnx
path: /path/to/models/
enable_download
When enabled, attempts to download an ONNX Runtime compatible model from the HuggingFace repository specified in download_options.repository.
Type: bool
Default: false
download_options
Options used to download a model directly from HuggingFace. Before the model is downloaded, the remote repository is validated to ensure it contains both an .onnx file and a tokenizers.json file.
Type: object
download_options.repository
The name of the HuggingFace model repository.
Type: string
# Examples
repository: KnightsAnalytics/distilbert-NER
repository: KnightsAnalytics/distilbert-base-uncased-finetuned-sst-2-english
repository: sentence-transformers/all-MiniLM-L6-v2
download_options.onnx_filepath
Filepath of the ONNX model within the repository. Only needed when multiple .onnx files exist.
Type: string
Default: "model.onnx"
# Examples
onnx_filepath: onnx/model.onnx
onnx_filepath: onnx/model_quantized.onnx
onnx_filepath: onnx/model_fp16.onnx
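As a sketch of how the download fields fit together, the config below assumes that path points at a writable directory and reuses the repository and file names from the examples above. It has not been verified against a live download.

```yaml
pipeline:
  processors:
    - nlp_classify_tokens:
        # With enable_download set to true, path is the download destination
        # and must be a directory rather than a file.
        path: /path/to/models/
        enable_download: true
        download_options:
          repository: KnightsAnalytics/distilbert-NER
          # Only needed when the repository holds more than one .onnx file.
          onnx_filepath: model.onnx
        aggregation_strategy: SIMPLE
        ignore_labels: ["O"]
```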
aggregation_strategy
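The aggregation strategy to use for the token classification pipeline. As the examples above show, SIMPLE merges adjacent tokens of the same entity into a single result (for example "Apple Inc." as ORG), while NONE returns each token separately with its raw label (for example B-PER and I-PER).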
Type: string
Default: "SIMPLE"
Options: SIMPLE, NONE.
ignore_labels
Labels to ignore in the token classification pipeline. Entities carrying any of these labels are left out of the output.
Type: array
Default: []
# Examples
ignore_labels:
- O
- MISC