// Copyright (C) 2013 Davis E. King (davis@dlib.net) // License: Boost Software License See LICENSE.txt for the full license. #undef DLIB_SEQUENCE_SeGMENTER_ABSTRACT_H_h_ #ifdef DLIB_SEQUENCE_SeGMENTER_ABSTRACT_H_h_ #include "../matrix.h" #include <vector> #include "sequence_labeler_abstract.h" namespace dlib { // ---------------------------------------------------------------------------------------- class example_feature_extractor { /*! WHAT THIS OBJECT REPRESENTS This object defines the interface a feature extractor must implement if it is to be used with the sequence_segmenter defined at the bottom of this file. The model used by sequence_segmenter objects is the following. Given an input sequence x, predict an output label sequence y such that: y == argmax_Y dot(w, PSI(x,Y)) Where w is a parameter vector and the label sequence defines a segmentation of x. Recall that a sequence_segmenter uses the BIO or BILOU tagging model and is also an instantiation of the dlib::sequence_labeler. Selecting to use the BIO model means that each element of the label sequence y takes on one of three possible values (B, I, or O) and together these labels define a segmentation of the sequence. For example, to represent a segmentation of the sequence of words "The dog ran to Bob Jones" where only "Bob Jones" was segmented out we would use the label sequence OOOOBI. The BILOU model is similar except that it uses five different labels and each segment is labeled as U, BL, BIL, BIIL, BIIIL, and so on depending on its length. Therefore, the BILOU model is able to more explicitly model the ends of the segments than the BIO model, but has more parameters to estimate. Keeping all this in mind, the purpose of a sequence_segmenter is to take care of the bookkeeping associated with creating BIO/BILOU tagging models for segmentation tasks. In particular, it presents the user with a simplified version of the interface used by the dlib::sequence_labeler. It does this by completely hiding the BIO/BILOU tags from the user and instead exposes an explicit sub-segment based labeling representation. It also simplifies the construction of the PSI() feature vector. Like in the dlib::sequence_labeler, PSI() is a sum of feature vectors, each derived from the entire input sequence x but only part of the label sequence y. In the case of the sequence_segmenter, we use an order one Markov model. This means that PSI(x,y) == sum_i XI(x, y_{i-1}, y_{i}, i) where the sum is taken over all the elements in the sequence. At each element we extract a feature vector, XI(), that is expected to encode important details describing what the i-th position of the sequence looks like in the context of the current and previous labels. To do this, XI() is allowed to look at any part of the input sequence x, the current and previous labels, and of course it must also know the position in question, i. The sequence_segmenter simplifies this further by decomposing XI() into components which model the current window around each position as well as the conjunction of the current window around each position and the previous label. In particular, the sequence_segmenter only asks a user to provide a single feature vector which characterizes a position of the sequence independent of any labeling. We denote this feature vector by ZI(x,i), where x is the sequence and i is the position in question. For example, suppose we use a window size of 3 and BIO tags, then we can put all this together and define XI() in terms of ZI(). To do this, we can think of XI() as containing 12*3 slots which contain either a zero vector or a ZI() vector. Each combination of window position and labeling has a different slot. To explain further, consider the following examples where we have annotated which parts of XI() correspond to each slot. If the previous and current label are both B and we use a window size of 3 then XI() would be instantiated as: XI(x, B, B, i) = [ZI(x,i-1) \ ZI(x,i) > If current label is B ZI(x,i+1) / 0 \ 0 > If current label is I 0 / 0 \ 0 > If current label is O 0 / ZI(x,i-1) \ ZI(x,i) > If previous label is B and current label is B ZI(x,i+1) / 0 \ 0 > If previous label is B and current label is I 0 / 0 \ 0 > If previous label is B and current label is O 0 / 0 \ 0 > If previous label is I and current label is B 0 / 0 \ 0 > If previous label is I and current label is I 0 / 0 \ 0 > If previous label is I and current label is O 0 / 0 \ 0 > If previous label is O and current label is B 0 / 0 \ 0 > If previous label is O and current label is I 0 / 0 \ 0 > If previous label is O and current label is O 0] / If the previous label is I and the current label is O and we use a window size of 3 then XI() would be instantiated as: XI(x, I, O, i) = [0 \ 0 > If current label is B 0 / 0 \ 0 > If current label is I 0 / ZI(x,i-1) \ ZI(x,i) > If current label is O ZI(x,i+1) / 0 \ 0 > If previous label is B and current label is B 0 / 0 \ 0 > If previous label is B and current label is I 0 / 0 \ 0 > If previous label is B and current label is O 0 / 0 \ 0 > If previous label is I and current label is B 0 / 0 \ 0 > If previous label is I and current label is I 0 / ZI(x,i-1) \ ZI(x,i) > If previous label is I and current label is O ZI(x,i+1) / 0 \ 0 > If previous label is O and current label is B 0 / 0 \ 0 > If previous label is O and current label is I 0 / 0 \ 0 > If previous label is O and current label is O 0] / If we had instead used the BILOU tagging model the XI() vector would have been similarly defined except that there would be 30*3 slots for the various label combination instead of 12*3. Finally, while not shown here, we also include indicator features in XI() to model label transitions and individual label biases. These are 12 extra features in the case of the BIO tagging model and 30 extra in the case of the BILOU tagging model. THREAD SAFETY Instances of this object are required to be threadsafe, that is, it should be safe for multiple threads to make concurrent calls to the member functions of this object. !*/ public: // This should be the type used to represent an input sequence. It can be // anything so long as it has a .size() which returns the length of the sequence. typedef the_type_used_to_represent_a_sequence sequence_type; // If you want to use the BIO tagging model then set this bool to true. Set it to // false to use the BILOU tagging model. const static bool use_BIO_model = true; // In the WHAT THIS OBJECT REPRESENTS section above we discussed how we model the // conjunction of the previous label and the window around each position. Doing // this greatly expands the size of the parameter vector w. You can optionally // disable these higher order features by setting the use_high_order_features bool // to false. This will cause XI() to include only slots which are independent of // the previous label. const static bool use_high_order_features = true; // You use a tool like the structural_sequence_segmentation_trainer to learn the weight // vector needed by a sequence_segmenter. You can tell the trainer to force all the // elements of the weight vector corresponding to ZI() to be non-negative. This is all // the elements of w except for the elements corresponding to the label transition and // bias indicator features. To do this, just set allow_negative_weights to false. const static bool allow_negative_weights = true; example_feature_extractor ( ); /*! ensures - this object is properly initialized !*/ unsigned long num_features( ) const; /*! ensures - returns the dimensionality of the ZI() feature vector. This number is always >= 1 !*/ unsigned long window_size( ) const; /*! ensures - returns the size of the window ZI() vectors are extracted from. This number is always >= 1. !*/ template <typename feature_setter> void get_features ( feature_setter& set_feature, const sequence_type& x, unsigned long position ) const; /*! requires - position < x.size() - set_feature is a function object which allows expressions of the form: - set_features((unsigned long)feature_index, (double)feature_value); - set_features((unsigned long)feature_index); ensures - This function computes the ZI(x,position) feature vector. This is a feature vector which should capture the properties of x[position] that are informative relative to the sequence segmentation task you are trying to perform. - ZI(x,position) is returned as a sparse vector by invoking set_feature(). For example, to set the feature with an index of 55 to the value of 1 this method would call: set_feature(55); Or equivalently: set_feature(55,1); Therefore, the first argument to set_feature is the index of the feature to be set while the second argument is the value the feature should take. Additionally, note that calling set_feature() multiple times with the same feature index does NOT overwrite the old value, it adds to the previous value. For example, if you call set_feature(55) 3 times then it will result in feature 55 having a value of 3. - This function only calls set_feature() with feature_index values < num_features() !*/ }; // ---------------------------------------------------------------------------------------- void serialize( const example_feature_extractor& item, std::ostream& out ); /*! provides serialization support !*/ void deserialize( example_feature_extractor& item, std::istream& in ); /*! provides deserialization support !*/ // ---------------------------------------------------------------------------------------- // ---------------------------------------------------------------------------------------- template < typename feature_extractor > unsigned long total_feature_vector_size ( const feature_extractor& fe ); /*! requires - fe must be an object that implements an interface compatible with the example_feature_extractor discussed above. ensures - returns the dimensionality of the PSI() vector defined by the given feature extractor. !*/ // ---------------------------------------------------------------------------------------- // ---------------------------------------------------------------------------------------- template < typename feature_extractor > class sequence_segmenter { /*! REQUIREMENTS ON feature_extractor It must be an object that implements an interface compatible with the example_feature_extractor discussed above. WHAT THIS OBJECT REPRESENTS This object is a tool for segmenting a sequence of objects into a set of non-overlapping chunks. An example sequence segmentation task is to take English sentences and identify all the named entities. In this example, you would be using a sequence_segmenter to find all the chunks of contiguous words which refer to proper names. Internally, the sequence_segmenter uses the BIO (Begin, Inside, Outside) or BILOU (Begin, Inside, Last, Outside, Unit) sequence tagging model. Moreover, it is implemented using a dlib::sequence_labeler object and therefore sequence_segmenter objects are examples of chain structured conditional random field style sequence taggers. THREAD SAFETY It is always safe to use distinct instances of this object in different threads. However, when a single instance is shared between threads then the following rules apply: It is safe to call the const members of this object from multiple threads so long as the feature_extractor is also threadsafe. This is because the const members are purely read-only operations. However, any operation that modifies a sequence_segmenter is not threadsafe. !*/ public: typedef typename feature_extractor::sequence_type sample_sequence_type; typedef std::vector<std::pair<unsigned long, unsigned long> > segmented_sequence_type; sequence_segmenter( ); /*! ensures - #get_feature_extractor() == feature_extractor() (i.e. it will have its default value) - #get_weights().size() == total_feature_vector_size(#get_feature_extractor()) - #get_weights() == 0 !*/ explicit sequence_segmenter( const matrix<double,0,1>& weights ); /*! requires - total_feature_vector_size(feature_extractor()) == weights.size() ensures - #get_feature_extractor() == feature_extractor() (i.e. it will have its default value) - #get_weights() == weights !*/ sequence_segmenter( const matrix<double,0,1>& weights, const feature_extractor& fe ); /*! requires - total_feature_vector_size(fe) == weights.size() ensures - #get_feature_extractor() == fe - #get_weights() == weights !*/ const feature_extractor& get_feature_extractor ( ) const; /*! ensures - returns the feature extractor used by this object. !*/ const matrix<double,0,1>& get_weights ( ) const; /*! ensures - returns the parameter vector associated with this sequence segmenter. The length of the vector is total_feature_vector_size(get_feature_extractor()). !*/ segmented_sequence_type operator() ( const sample_sequence_type& x ) const; /*! ensures - Takes an input sequence and returns a list of detected segments within that sequence. - None of the returned segments will overlap. - The returned segments are listed in the order they appeared in the input sequence. - To be precise, this function returns a std::vector Y of segments such that: - Y.size() == the number of segments detected in the input sequence x. - for all valid i: - Y[i].first == the start of the i-th segment. - Y[i].second == one past the end of the i-th segment. - Therefore, the i-th detected segment in x is composed of the elements x[Y[i].first], x[Y[i].first+1], ..., x[Y[i].second-1] - Y[i].first < x.size() - Y[i].second <= x.size() - Y[i].first < Y[i].second (i.e. This function never outputs empty segments) - Y[i].second <= Y[i+1].first (i.e. the segments are listed in order of appearance and do not overlap) !*/ void segment_sequence ( const sample_sequence_type& x, segmented_sequence_type& y ) const; /*! ensures - #y == (*this)(x) (i.e. This is just another interface to the operator() routine above. This one avoids returning the results by value and therefore might be a little faster in some cases) !*/ }; // ---------------------------------------------------------------------------------------- template < typename feature_extractor > void serialize ( const sequence_segmenter<feature_extractor>& item, std::ostream& out ); /*! provides serialization support !*/ // ---------------------------------------------------------------------------------------- template < typename feature_extractor > void deserialize ( sequence_segmenter<feature_extractor>& item, std::istream& in ); /*! provides deserialization support !*/ // ---------------------------------------------------------------------------------------- } #endif // DLIB_SEQUENCE_SeGMENTER_ABSTRACT_H_h_