API reference¶
Dataset loading utilities¶
- seqlearn.datasets.load_conll(f, features, n_features=65536, split=False)¶
Load CoNLL file, extract features on the tokens and vectorize them.
The CoNLL file format is a line-oriented text format that describes sequences in a space-separated format, separating the sequences with blank lines. Typically, the last space-separated part is a label.
Since the space-separated parts are usually tokens (and maybe things like part-of-speech tags) rather than feature vectors, a function must be supplied that does the actual feature extraction. This function has access to the entire sequence, so that it can extract context features.
A sklearn.feature_extraction.FeatureHasher (the “hashing trick”) is used to map symbolic input feature names to columns, so this function does not remember the actual input feature names.
Parameters: f : {string, file-like}
Input file.
features : callable
Feature extraction function. Must take a list of tokens l that represents a single sequence and an index i into this list, and must return an iterator over strings that represent the features of l[i].
n_features : integer, optional
Number of columns in the output.
split : boolean, default=False
Whether to split lines on whitespace beyond what is needed to parse out the labels. This is useful for CoNLL files that have extra columns containing information like part of speech tags.
Returns: X : scipy.sparse matrix, shape (n_samples, n_features)
Samples (feature vectors), as a single sparse matrix.
y : np.ndarray, dtype np.string, shape n_samples
Per-sample labels.
lengths : np.ndarray, dtype np.int32, shape n_sequences
Lengths of sequences within (X, y). The sum of these is equal to n_samples.
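A minimal usage sketch (the file name train.conll, the two-column word/label layout, and the feature names are assumptions for illustration); the features callable receives the whole sequence and an index, so it can emit context features as well:

    from seqlearn.datasets import load_conll

    def features(sequence, i):
        # Yield feature strings for the token at position i. The whole
        # sequence is available, so neighbouring tokens can be used too.
        token = sequence[i]
        yield "word=" + token.lower()
        yield "isupper=%s" % token.isupper()
        if i > 0:
            yield "prev=" + sequence[i - 1].lower()
        if i + 1 < len(sequence):
            yield "next=" + sequence[i + 1].lower()

    # X is a sparse matrix of hashed features, y the label array, and
    # lengths the number of samples in each sequence.
    X_train, y_train, lengths_train = load_conll("train.conll", features)

With split=True, each element of the sequence would instead be a list of whitespace-separated fields (e.g. word and part-of-speech tag) rather than a single string.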
Evaluation and model selection¶
- class seqlearn.evaluation.SequenceKFold(lengths, n_folds=3, n_iter=1, shuffle=False, random_state=None, yield_lengths=True)¶
Sequence-aware (repeated) k-fold CV splitter.
Uses a greedy heuristic to partition input sequences into sets with roughly equal numbers of samples, while keeping the sequences intact.
Parameters: lengths : array-like of integers, shape (n_samples,)
Lengths of sequences, in the order in which they appear in the dataset.
n_folds : int, optional
Number of folds.
n_iter : int, optional
Number of iterations of repeated k-fold splitting. The default value is 1, meaning a single k-fold split; values >1 give repeated k-fold with shuffling (see below).
shuffle : boolean, optional
Whether to shuffle sequences.
random_state : {np.random.RandomState, integer}, optional
Random state/random seed for shuffling.
yield_lengths : boolean, optional
Whether to yield lengths in addition to indices/masks for both training and test sets.
Returns: folds : iterable
A generator yielding (train_indices, test_indices) pairs when yield_lengths is false, or tuples (train_indices, train_lengths, test_indices, test_lengths) when yield_lengths is true.
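A sketch of a typical cross-validation loop, assuming X_train, y_train and lengths_train were produced by load_conll as above, that the folds yield integer index arrays, and that predict accepts the test lengths the same way fit does; StructuredPerceptron is documented further below:

    import numpy as np
    from seqlearn.evaluation import SequenceKFold
    from seqlearn.perceptron import StructuredPerceptron

    scores = []
    # With the default yield_lengths=True, each fold provides per-sequence
    # lengths alongside the sample indices.
    for train, train_lengths, test, test_lengths in SequenceKFold(lengths_train, n_folds=5):
        clf = StructuredPerceptron(max_iter=10, random_state=0)
        clf.fit(X_train[train], y_train[train], train_lengths)
        y_pred = clf.predict(X_train[test], test_lengths)
        scores.append(np.mean(y_pred == y_train[test]))
    print("mean per-token accuracy: %.3f" % np.mean(scores))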
- seqlearn.evaluation.bio_f_score(y_true, y_pred)¶
F-score for BIO-tagging scheme, as used by CoNLL.
This F-score variant is used for evaluating named-entity recognition and related problems, where the goal is to predict segments of interest within sequences and mark these as a “B” (begin) tag followed by zero or more “I” (inside) tags. A true positive is then defined as a BI* segment in both y_true and y_pred, with false positives and false negatives defined similarly.
Support for tag schemes with classes (e.g. “B-NP”) is limited: reported scores may be too high for inconsistent labelings.
Parameters: y_true : array-like of strings, shape (n_samples,)
Ground truth labeling.
y_pred : array-like of strings, shape (n_samples,)
Sequence classifier’s predictions.
Returns: f : float
F-score.
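A toy example (labels chosen purely for illustration): the gold standard below contains two segments, of which exactly one is predicted, so precision is 1, recall is 1/2, and the F-score is about 0.67:

    from seqlearn.evaluation import bio_f_score

    y_true = ["O", "B", "I", "O", "B", "O"]
    y_pred = ["O", "B", "I", "O", "O", "O"]
    print(bio_f_score(y_true, y_pred))  # approximately 0.667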
- seqlearn.evaluation.whole_sequence_accuracy(y_true, y_pred, lengths)¶
Average accuracy measured on whole sequences.
Returns the fraction of sequences in y_true that occur in y_pred without a single error.
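A toy example (values chosen for illustration): with two sequences, one predicted perfectly and one containing an error, the whole-sequence accuracy is 0.5:

    from seqlearn.evaluation import whole_sequence_accuracy

    y_true = ["O", "B", "I", "O", "B"]
    y_pred = ["O", "B", "I", "O", "O"]
    lengths = [3, 2]  # first sequence = samples 0-2, second = samples 3-4
    print(whole_sequence_accuracy(y_true, y_pred, lengths))  # 0.5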
Sequence classifiers¶
Hidden Markov models (HMMs) with supervised training.
- class seqlearn.hmm.MultinomialHMM(decode='viterbi', alpha=0.01)¶
First-order hidden Markov model with multinomial event model.
Parameters: decode : string, optional
Decoding algorithm, either “bestfirst” or “viterbi” (default). Best-first decoding is also called posterior decoding in the HMM literature.
alpha : float
Lidstone (additive) smoothing parameter.
Methods
- fit(X, y, lengths)¶
Fit HMM model to data.
Parameters: X : {array-like, sparse matrix}, shape (n_samples, n_features)
Feature matrix of individual samples.
y : array-like, shape (n_samples,)
Target labels.
lengths : array-like of integers, shape (n_sequences,)
Lengths of the individual sequences in X, y. The sum of these should be n_samples.
Returns: self : MultinomialHMM
Notes
Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities will be multiplied.
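A minimal training and decoding sketch, assuming X_train/X_test are one-hot (binary) feature matrices such as those produced by load_conll, and that prediction takes the per-sequence lengths in the same way fit does:

    from seqlearn.hmm import MultinomialHMM

    clf = MultinomialHMM(decode="viterbi", alpha=0.01)
    clf.fit(X_train, y_train, lengths_train)
    # Decode the most likely label sequence for each test sequence.
    y_pred = clf.predict(X_test, lengths_test)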
- class seqlearn.perceptron.StructuredPerceptron(decode='viterbi', lr_exponent=0.1, max_iter=10, random_state=None, trans_features=False, verbose=0)¶
Structured perceptron for sequence classification.
This implements the averaged structured perceptron algorithm of Collins and Daumé, with the addition of an adaptive learning rate.
Parameters: decode : string, optional
Decoding algorithm, either “bestfirst” or “viterbi” (default).
lr_exponent : float, optional
Exponent for inverse scaling learning rate. The effective learning rate is 1. / (t ** lr_exponent), where t is the iteration number.
max_iter : integer, optional
Number of iterations (i.e., epochs). Each sequence is visited once in each iteration.
random_state : {integer, np.random.RandomState}, optional
Random state or seed used for shuffling sequences within each iteration.
trans_features : boolean, optional
Whether to attach features to transitions between labels as well as individual labels. This requires more time, more memory and more samples to train properly.
verbose : integer, optional
Verbosity level. Defaults to zero (quiet mode).
References
M. Collins (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.
Hal Daumé III (2006). Practical Structured Learning Techniques for Natural Language Processing. Ph.D. thesis, U. Southern California.
Methods
- fit(X, y, lengths)¶
Fit to a set of sequences.
Parameters: X : {array-like, sparse matrix}, shape (n_samples, n_features)
Feature matrix of individual samples.
y : array-like, shape (n_samples,)
Target labels.
lengths : array-like of integers, shape (n_sequences,)
Lengths of the individual sequences in X, y. The sum of these should be n_samples.
Returns: self : StructuredPerceptron
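A minimal training sketch, reusing X_train, y_train and lengths_train from the load_conll example above and evaluating with bio_f_score; the hyperparameter values and the held-out X_test/y_test/lengths_test arrays are illustrative assumptions:

    from seqlearn.evaluation import bio_f_score
    from seqlearn.perceptron import StructuredPerceptron

    clf = StructuredPerceptron(max_iter=20, lr_exponent=0.1, random_state=0)
    clf.fit(X_train, y_train, lengths_train)
    print(bio_f_score(y_test, clf.predict(X_test, lengths_test)))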