API reference

Dataset loading utilities

seqlearn.datasets.load_conll(f, features, n_features=65536, split=False)

Load CoNLL file, extract features on the tokens and vectorize them.

The CoNLL file format is a line-oriented text format in which each line holds the whitespace-separated parts of a single token, and sequences are separated by blank lines. Typically, the last whitespace-separated part on a line is the label.

Since the whitespace-separated parts are usually tokens (and maybe things like part-of-speech tags) rather than feature vectors, a function must be supplied that does the actual feature extraction. This function has access to the entire sequence, so that it can extract context features.

A sklearn.feature_extraction.FeatureHasher (the “hashing trick”) is used to map symbolic input feature names to columns, so this function does not remember the actual input feature names.

Parameters:

f : {string, file-like}

Input file.

features : callable

Feature extraction function. Must take a list of tokens l representing a single sequence and an index i into this list, and must return an iterator over strings representing the features of l[i] (see the usage sketch below).

n_features : integer, optional

Number of columns in the output.

split : boolean, default=False

Whether to split lines on whitespace beyond what is needed to parse out the labels. This is useful for CoNLL files that have extra columns containing information like part of speech tags.

Returns:

X : scipy.sparse matrix, shape (n_samples, n_features)

Samples (feature vectors), as a single sparse matrix.

y : np.ndarray, dtype np.string, shape n_samples

Per-sample labels.

lengths : np.ndarray, dtype np.int32, shape n_sequences

Lengths of sequences within (X, y). The sum of these is equal to n_samples.
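
As a usage sketch (the feature function and the file name "train.txt" are illustrative, not part of the library; it assumes the default split=False, in which case each element of the sequence passed to the feature function is the text of one line with the label removed):

    from seqlearn.datasets import load_conll

    # Hypothetical feature extraction function: receives the whole token
    # sequence and a position, and yields feature strings for that position.
    def features(sequence, i):
        token = sequence[i]
        yield "word:" + token.lower()
        yield "suffix3:" + token[-3:]
        if i > 0:
            yield "prev:" + sequence[i - 1].lower()    # context feature
        if i + 1 < len(sequence):
            yield "next:" + sequence[i + 1].lower()    # context feature

    # "train.txt" is a placeholder for a CoNLL-formatted file:
    # one "token label" pair per line, blank lines between sequences.
    X, y, lengths = load_conll("train.txt", features)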

Evaluation and model selection

class seqlearn.evaluation.SequenceKFold(lengths, n_folds=3, n_iter=1, shuffle=False, random_state=None, yield_lengths=True)

Sequence-aware (repeated) k-fold CV splitter.

Uses a greedy heuristic to partition input sequences into sets with roughly equal numbers of samples, while keeping the sequences intact.

Parameters:

lengths : array-like of integers, shape (n_samples,)

Lengths of sequences, in the order in which they appear in the dataset.

n_folds : int, optional

Number of folds.

n_iter : int, optional

Number of iterations of repeated k-fold splitting. The default value is 1, meaning a single k-fold split; values >1 give repeated k-fold with shuffling (see below).

shuffle : boolean, optional

Whether to shuffle sequences.

random_state : {np.random.RandomState, integer}, optional

Random state/random seed for shuffling.

yield_lengths : boolean, optional

Whether to yield lengths in addition to indices/masks for both training and test sets.

Returns:

folds : iterable

A generator yielding (train_indices, test_indices) pairs when yield_lengths is False, or (train_indices, train_lengths, test_indices, test_lengths) tuples when yield_lengths is True.
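
A minimal sketch of iterating over the folds with the default yield_lengths=True (the lengths below are toy values; whether index arrays or boolean masks are yielded may depend on the seqlearn version):

    from seqlearn.evaluation import SequenceKFold

    # Four toy sequences with 3, 5, 2 and 4 samples respectively.
    lengths = [3, 5, 2, 4]

    for train_idx, train_len, test_idx, test_len in SequenceKFold(lengths, n_folds=2):
        # train_idx/test_idx select rows of X and y; the *_len arrays can be
        # passed on to a sequence classifier's fit() and predict().
        print(train_idx, train_len, test_idx, test_len)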

seqlearn.evaluation.bio_f_score(y_true, y_pred)

F-score for BIO-tagging scheme, as used by CoNLL.

This F-score variant is used for evaluating named-entity recognition and related problems, where the goal is to predict segments of interest within sequences and mark these as a “B” (begin) tag followed by zero or more “I” (inside) tags. A true positive is then defined as a BI* segment in both y_true and y_pred, with false positives and false negatives defined similarly.

Support for tag schemes with classes (e.g. “B-NP”) is limited: reported scores may be too high for inconsistent labelings.

Parameters:

y_true : array-like of strings, shape (n_samples,)

Ground truth labeling.

y_pred : array-like of strings, shape (n_samples,)

Sequence classifier’s predictions.

Returns:

f : float

F-score.
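
A small worked example (values made up; the comment assumes segments must match exactly, as in the definition above):

    from seqlearn.evaluation import bio_f_score

    # Two gold segments (positions 0-1 and 3-5); the prediction recovers only
    # the first one exactly, so precision = 1.0, recall = 0.5, F is about 0.67.
    y_true = ["B", "I", "O", "B", "I", "I", "O"]
    y_pred = ["B", "I", "O", "O", "O", "O", "O"]

    print(bio_f_score(y_true, y_pred))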

seqlearn.evaluation.whole_sequence_accuracy(y_true, y_pred, lengths)

Average accuracy measured on whole sequences.

Returns the fraction of sequences that are predicted without a single error, i.e. for which the labels in y_pred match those in y_true at every position.
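
For example (toy values):

    from seqlearn.evaluation import whole_sequence_accuracy

    # Two sequences of lengths 3 and 2; only the first is predicted without
    # a single error, so the expected score is 0.5.
    y_true = ["B", "I", "O", "B", "O"]
    y_pred = ["B", "I", "O", "O", "O"]

    print(whole_sequence_accuracy(y_true, y_pred, lengths=[3, 2]))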

Sequence classifiers

Hidden Markov models (HMMs) with supervised training.

class seqlearn.hmm.MultinomialHMM(decode='viterbi', alpha=0.01)

First-order hidden Markov model with multinomial event model.

Parameters:

decode : string, optional

Decoding algorithm, either “bestfirst” or “viterbi” (default). Best-first decoding is also called posterior decoding in the HMM literature.

alpha : float

Lidstone (additive) smoothing parameter.

Methods

fit(X, y, lengths)

Fit HMM model to data.

Parameters:

X : {array-like, sparse matrix}, shape (n_samples, n_features)

Feature matrix of individual samples.

y : array-like, shape (n_samples,)

Target labels.

lengths : array-like of integers, shape (n_sequences,)

Lengths of the individual sequences in X, y. The sum of these should be n_samples.

Returns:

self : MultinomialHMM

Notes

Make sure the training set (X) is one-hot encoded; if more than one feature in a sample is on, the corresponding emission probabilities are multiplied together.
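
A minimal sketch of fitting on a toy one-hot matrix (data made up; the predict call assumes the estimator also accepts per-sequence lengths, mirroring fit):

    import numpy as np
    from seqlearn.hmm import MultinomialHMM

    # Toy one-hot feature matrix: 5 samples, 3 binary features,
    # exactly one feature "on" per row, as the note above requires.
    X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]])
    y = ["A", "B", "B", "A", "A"]
    lengths = [3, 2]          # two sequences: samples 0-2 and 3-4

    clf = MultinomialHMM(decode="viterbi", alpha=0.01)
    clf.fit(X, y, lengths)
    print(clf.predict(X, lengths))   # assumed predict(X, lengths) signature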

class seqlearn.perceptron.StructuredPerceptron(decode='viterbi', lr_exponent=0.1, max_iter=10, random_state=None, trans_features=False, verbose=0)

Structured perceptron for sequence classification.

This implements the averaged structured perceptron algorithm of Collins and Daumé, with the addition of an adaptive learning rate.

Parameters:

decode : string, optional

Decoding algorithm, either “bestfirst” or “viterbi” (default).

lr_exponent : float, optional

Exponent for inverse scaling learning rate. The effective learning rate is 1. / (t ** lr_exponent), where t is the iteration number.

max_iter : integer, optional

Number of iterations (also known as epochs). Each sequence is visited once in each iteration.

random_state : {integer, np.random.RandomState}, optional

Random state or seed used for shuffling sequences within each iteration.

trans_features : boolean, optional

Whether to attach features to transitions between labels as well as individual labels. This requires more time, more memory and more samples to train properly.

verbose : integer, optional

Verbosity level. Defaults to zero (quiet mode).

References

M. Collins (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.

Hal Daumé III (2006). Practical Structured Learning Techniques for Natural Language Processing. Ph.D. thesis, U. Southern California.

Methods

fit(X, y, lengths)

Fit to a set of sequences.

Parameters:

X : {array-like, sparse matrix}, shape (n_samples, n_features)

Feature matrix of individual samples.

y : array-like, shape (n_samples,)

Target labels.

lengths : array-like of integers, shape (n_sequences,)

Lengths of the individual sequences in X, y. The sum of these should be n_samples.

Returns:

self : StructuredPerceptron
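
As an end-to-end sketch (the file names and feature function are placeholders, and predict is assumed to take the same lengths argument as fit):

    from seqlearn.datasets import load_conll
    from seqlearn.perceptron import StructuredPerceptron

    # Hypothetical feature extractor, as in the load_conll example above.
    def features(sequence, i):
        yield "word:" + sequence[i].lower()

    # "train.txt" and "test.txt" are placeholder CoNLL-formatted files.
    X_train, y_train, lengths_train = load_conll("train.txt", features)
    X_test, y_test, lengths_test = load_conll("test.txt", features)

    clf = StructuredPerceptron(lr_exponent=0.1, max_iter=10, verbose=1)
    clf.fit(X_train, y_train, lengths_train)

    y_pred = clf.predict(X_test, lengths_test)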