API reference¶
Dataset loading utilities¶

seqlearn.datasets.
load_conll
(f, features, n_features=65536, split=False)¶ Load CoNLL file, extract features on the tokens and vectorize them.
The CoNLL file format is a line-oriented text format that describes sequences in a space-separated format, separating the sequences with blank lines. Typically, the last space-separated part is a label.
Since the tab-separated parts are usually tokens (and maybe things like part-of-speech tags) rather than feature vectors, a function must be supplied that does the actual feature extraction. This function has access to the entire sequence, so that it can extract context features.
A
sklearn.feature_extraction.FeatureHasher
(the “hashing trick”) is used to map symbolic input feature names to columns, so this function does not remember the actual input feature names.
Parameters: f : {string, file-like}
Input file.
features : callable
Feature extraction function. Must take a list of tokens l that represent a single sequence and an index i into this list, and must return an iterator over strings that represent the features of l[i].
n_features : integer, optional
Number of columns in the output.
split : boolean, default=False
Whether to split lines on whitespace beyond what is needed to parse out the labels. This is useful for CoNLL files that have extra columns containing information like part-of-speech tags.
Returns: X : scipy.sparse matrix, shape (n_samples, n_features)
Samples (feature vectors), as a single sparse matrix.
y : np.ndarray, dtype np.string, shape n_samples
Per-sample labels.
lengths : np.ndarray, dtype np.int32, shape n_sequences
Lengths of sequences within (X, y). The sum of these is equal to n_samples.
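The features callable described above receives the whole token sequence plus an index, and yields string feature names. A minimal sketch of such a function (the feature names and helpers here are illustrative, not part of seqlearn) might look like:

```python
def features(sequence, i):
    """Yield string feature names for token sequence[i].

    Hypothetical example of the callable that load_conll expects:
    it sees the entire sequence, so it can emit context features.
    """
    token = sequence[i]
    yield "word=" + token.lower()
    yield "suffix3=" + token[-3:].lower()
    if token[0].isupper():
        yield "capitalized"
    # Context features: the previous and next token, with
    # begin/end-of-sequence markers at the boundaries.
    if i > 0:
        yield "prev=" + sequence[i - 1].lower()
    else:
        yield "BOS"
    if i + 1 < len(sequence):
        yield "next=" + sequence[i + 1].lower()
    else:
        yield "EOS"
```

Each yielded string is hashed by the FeatureHasher into one of n_features columns.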
Evaluation and model selection¶

class
seqlearn.evaluation.
SequenceKFold
(lengths, n_folds=3, n_iter=1, shuffle=False, random_state=None, yield_lengths=True)¶ Sequence-aware (repeated) k-fold CV splitter.
Uses a greedy heuristic to partition input sequences into sets with roughly equal numbers of samples, while keeping the sequences intact.
Parameters: lengths : array-like of integers, shape (n_sequences,)
Lengths of sequences, in the order in which they appear in the dataset.
n_folds : int, optional
Number of folds.
n_iter : int, optional
Number of iterations of repeated k-fold splitting. The default value is 1, meaning a single k-fold split; values >1 give repeated k-fold with shuffling (see below).
shuffle : boolean, optional
Whether to shuffle sequences.
random_state : {np.random.RandomState, integer}, optional
Random state/random seed for shuffling.
yield_lengths : boolean, optional
Whether to yield lengths in addition to indices/masks for both training and test sets.
Returns: folds : iterable
A generator yielding (train_indices, test_indices) pairs when yield_lengths is false, or tuples (train_indices, train_lengths, test_indices, test_lengths) when yield_lengths is true.
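The greedy heuristic mentioned above can be sketched in a few lines: assign sequences (longest first) to whichever fold currently holds the fewest samples. This is an illustrative reimplementation, not seqlearn's actual code:

```python
def greedy_folds(lengths, n_folds=3):
    """Partition sequence indices into n_folds sets with roughly
    equal numbers of samples, keeping each sequence intact.

    Longest-first assignment to the smallest fold improves balance.
    """
    folds = [[] for _ in range(n_folds)]   # sequence indices per fold
    sizes = [0] * n_folds                  # sample counts per fold
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for i in order:
        k = min(range(n_folds), key=sizes.__getitem__)
        folds[k].append(i)
        sizes[k] += lengths[i]
    return folds, sizes
```

Each fold in turn then serves as the test set, with the remaining folds forming the training set.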

seqlearn.evaluation.
bio_f_score
(y_true, y_pred)¶ F-score for the BIO tagging scheme, as used by CoNLL.
This F-score variant is used for evaluating named-entity recognition and related problems, where the goal is to predict segments of interest within sequences and mark these as a “B” (begin) tag followed by zero or more “I” (inside) tags. A true positive is then defined as a BI* segment occurring in both y_true and y_pred, with false positives and false negatives defined similarly.
Support for tag schemes with classes (e.g. “B-NP”) is limited: reported scores may be too high for inconsistent labelings.
Parameters: y_true : array-like of strings, shape (n_samples,)
Ground truth labeling.
y_pred : array-like of strings, shape (n_samples,)
Sequence classifier’s predictions.
Returns: f : float
F-score.
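The segment-level definition above can be made concrete with a simplified sketch that ignores class labels (plain "B"/"I"/"O" tags only); seqlearn's own implementation handles more cases:

```python
def bio_segments(tags):
    """Return the set of (start, end) spans of B I* segments."""
    segs, start = set(), None
    for i, t in enumerate(tags):
        if t == "B":                  # a new segment begins
            if start is not None:
                segs.add((start, i))
            start = i
        elif t == "O":                # any open segment ends
            if start is not None:
                segs.add((start, i))
            start = None
        # "I" extends the current segment
    if start is not None:
        segs.add((start, len(tags)))
    return segs

def bio_f(y_true, y_pred):
    """F-score over exact segment matches: a true positive is a
    segment with identical (start, end) in both labelings."""
    t, p = bio_segments(y_true), bio_segments(y_pred)
    tp = len(t & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(t)
    return 2 * prec * rec / (prec + rec)
```

For example, if the prediction truncates one of two true segments, precision and recall are both 1/2 and the F-score is 0.5.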

seqlearn.evaluation.
whole_sequence_accuracy
(y_true, y_pred, lengths)¶ Average accuracy measured on whole sequences.
Returns the fraction of sequences in y_true that occur in y_pred without a single error.
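The metric reduces to a short loop over the per-sequence slices; a pure-Python sketch (using lists rather than arrays) under that reading:

```python
def whole_sequence_accuracy(y_true, y_pred, lengths):
    """Fraction of sequences predicted without a single error.

    lengths gives the length of each sequence; consecutive slices
    of y_true/y_pred of those lengths are compared for equality.
    """
    correct, start = 0, 0
    for n in lengths:
        if y_true[start:start + n] == y_pred[start:start + n]:
            correct += 1
        start += n
    return correct / len(lengths)
```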
Sequence classifiers¶
Hidden Markov models (HMMs) with supervised training.

class
seqlearn.hmm.
MultinomialHMM
(decode='viterbi', alpha=0.01)¶ First-order hidden Markov model with multinomial event model.
Parameters: decode : string, optional
Decoding algorithm, either “bestfirst” or “viterbi” (default). Best-first decoding is also called posterior decoding in the HMM literature.
alpha : float
Lidstone (additive) smoothing parameter.
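Viterbi decoding finds the single highest-scoring state path under the learned log-probabilities. A minimal pure-Python sketch of the algorithm (seqlearn's own implementation is vectorized and differs in detail):

```python
def viterbi(emit, trans, init):
    """Return the highest-scoring state path.

    emit[t][s]  : log-score of state s at position t
    trans[p][s] : log-score of transitioning from state p to s
    init[s]     : log-score of starting in state s
    """
    n_states = len(init)
    # delta[s] = best score of any path ending in state s so far
    delta = [init[s] + emit[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(emit)):
        new_delta, ptr = [], []
        for s in range(n_states):
            p = max(range(n_states), key=lambda q: delta[q] + trans[q][s])
            ptr.append(p)
            new_delta.append(delta[p] + trans[p][s] + emit[t][s])
        delta = new_delta
        back.append(ptr)
    # Backtrack from the best final state.
    path = [max(range(n_states), key=delta.__getitem__)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

Best-first (posterior) decoding instead picks the highest-scoring state at each position independently, which may yield a path with low joint probability.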
Methods

fit
(X, y, lengths)¶ Fit HMM model to data.
Parameters: X : {array-like, sparse matrix}, shape (n_samples, n_features)
Feature matrix of individual samples.
y : array-like, shape (n_samples,)
Target labels.
lengths : array-like of integers, shape (n_sequences,)
Lengths of the individual sequences in X, y. The sum of these should be n_samples.
Returns: self : MultinomialHMM
Notes
Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities will be multiplied.
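One-hot encoding means each row of X has exactly one active feature. An illustrative sketch (the vocabulary and helper below are hypothetical, not part of seqlearn):

```python
# Hypothetical three-symbol vocabulary mapping tokens to column indices.
vocab = {"the": 0, "cat": 1, "sat": 2}

def one_hot(tokens, vocab):
    """Encode each token as a row with exactly one feature 'on'."""
    X = []
    for tok in tokens:
        row = [0] * len(vocab)
        row[vocab[tok]] = 1   # single active feature per sample
        X.append(row)
    return X
```

In practice a sparse matrix would be used for X rather than dense lists.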


class
seqlearn.perceptron.
StructuredPerceptron
(decode='viterbi', lr_exponent=0.1, max_iter=10, random_state=None, trans_features=False, verbose=0)¶ Structured perceptron for sequence classification.
This implements the averaged structured perceptron algorithm of Collins and Daumé, with the addition of an adaptive learning rate.
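To illustrate the two ingredients named above, averaging and the adaptive learning rate, here is a deliberately simplified *unstructured* binary averaged perceptron; in the structured case the per-sample argmax is replaced by Viterbi decoding over label sequences. This is a sketch, not seqlearn's implementation:

```python
def averaged_perceptron(X, y, n_iter=10, lr_exponent=0.1):
    """Toy averaged perceptron on +1/-1 labels with dense features.

    On each mistake the weights move by lr * y * x, where
    lr = 1 / (t ** lr_exponent) decays with the update count t,
    and the returned weights are the average over all steps.
    """
    n_features = len(X[0])
    w = [0.0] * n_features        # current weights
    w_sum = [0.0] * n_features    # running sum for averaging
    t = 0
    for _ in range(n_iter):
        for xi, yi in zip(X, y):
            t += 1
            score = sum(wj * xj for wj, xj in zip(w, xi))
            pred = 1 if score >= 0 else -1
            if pred != yi:
                lr = 1.0 / (t ** lr_exponent)   # adaptive learning rate
                for j, xj in enumerate(xi):
                    w[j] += lr * yi * xj
            for j in range(n_features):
                w_sum[j] += w[j]
    return [wj / t for wj in w_sum]
```

Averaging the weight vectors over all updates, rather than keeping only the final one, is what makes the predictions stable.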
Parameters: decode : string, optional
Decoding algorithm, either “bestfirst” or “viterbi” (default).
lr_exponent : float, optional
Exponent for inverse scaling learning rate. The effective learning rate is 1. / (t ** lr_exponent), where t is the iteration number.
max_iter : integer, optional
Number of iterations (i.e., epochs). Each sequence is visited once in each iteration.
random_state : {integer, np.random.RandomState}, optional
Random state or seed used for shuffling sequences within each iteration.
trans_features : boolean, optional
Whether to attach features to transitions between labels as well as individual labels. This requires more time, more memory and more samples to train properly.
verbose : integer, optional
Verbosity level. Defaults to zero (quiet mode).
References
M. Collins (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.
Hal Daumé III (2006). Practical Structured Learning Techniques for Natural Language Processing. Ph.D. thesis, U. Southern California.
Methods

fit
(X, y, lengths)¶ Fit to a set of sequences.
Parameters: X : {array-like, sparse matrix}, shape (n_samples, n_features)
Feature matrix of individual samples.
y : array-like, shape (n_samples,)
Target labels.
lengths : array-like of integers, shape (n_sequences,)
Lengths of the individual sequences in X, y. The sum of these should be n_samples.
Returns: self : StructuredPerceptron
