Gloda

This content covers features introduced in Thunderbird 3

Thunderbird includes a new message indexing and search system (gloda) that improves search performance, provides sophisticated full-text search capabilities and categorized search results. "Gloda" is short for the Thunderbird "global database".

This page provides an overview of gloda concepts and implementation. See Creating a gloda message query for information about using gloda in extensions. See Gloda examples for some gloda use-cases. See Gloda indexing for the big picture on indexing; see the implementation source and its comments for the nitty-gritty. See Gloda debugging for gloda debugging capabilities.

Core concepts

Gloda is an index, not a data store

If the gloda database gets deleted, no information is lost; the data merely needs to be re-indexed from its source. From gloda's perspective, the authoritative store for messages is the .msf files and the actual message bodies. Gloda can recompute everything it needs from these two things. For the address book (which is only indexed in a limited fashion by gloda at this time), the cards are the authoritative source of data.

Gloda objects are not directly mutable

Given that gloda is an index of authoritative data, it follows that directly manipulating the gloda objects is not meaningful. You need to manipulate the underlying authoritative object and have gloda recognize that manipulation. The gloda objects may provide helpers to perform this manipulation, but remember that unless the documentation for a class says that you can set an attribute and have the desired changes made, you probably can't.

Gloda is asynchronous

All interaction with the database happens on another thread in order to avoid locking up the main (UI) thread. This means that whenever you issue a query, the results are not immediately available; you need to wait for your callback to get notified.
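
For example, issuing a message query looks roughly like the following sketch. The module path has varied across Thunderbird versions, and the subjectMatches constraint and search string here are purely illustrative; see Creating a gloda message query for the details:

    Components.utils.import("resource:///modules/gloda/public.js");

    // Build an asynchronous query for message nouns.
    var query = Gloda.newQuery(Gloda.NOUN_MESSAGE);
    query.subjectMatches("meeting notes");

    // getCollection() returns immediately; the results arrive later
    // through the listener you provide.
    var collection = query.getCollection({
      onItemsAdded: function (aItems, aCollection) {},
      onItemsModified: function (aItems, aCollection) {},
      onItemsRemoved: function (aItems, aCollection) {},
      onQueryCompleted: function (aCollection) {
        // Only at this point are the results available.
        aCollection.items.forEach(function (message) {
          // do something with each gloda message
        });
      }
    });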

Gloda maintains object identity consistency, but you need to follow the rules

Gloda (and its indexer helpers) are responsible for creating instances of objects, serializing them to the database, and deserializing them from the database. Although this makes it easy for gloda to know when objects come into existence, it needs your help to know when objects go out of existence. Without your help, gloda would not know when retrieving database results whether it needs to create a new instance or give you an existing instance. This would lead to potentially having two instances representing the same conceptual object, which is highly undesirable.

The 'rules' are that every live gloda object needs to be held in at least one gloda collection object and that you need to be holding onto that collection object so it doesn't get garbage collected. The gloda collection manager maintains a weak reference to every gloda collection in memory, allowing it to know the set of live objects. The weak reference does not stop the garbage collector from collecting the collection if your code is not maintaining a strong reference. This avoids memory leaks, but can lead to confusing behavior if you forget to hold onto the collection but are holding onto some of the gloda objects in that collection.
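
Continuing the query sketch above, the simplest way to follow the rules is to stash the collection on a long-lived object; the names here are illustrative:

    var MyGlodaConsumer = {
      _collection: null, // strong reference; keeps the gloda objects live

      startQuery: function (query) {
        // Pass |this| as the listener and hold the collection on |this|;
        // without the strong reference the collection can be garbage
        // collected and identity consistency is lost.
        this._collection = query.getCollection(this);
      },

      onItemsAdded: function (aItems, aCollection) {},
      onItemsModified: function (aItems, aCollection) {},
      onItemsRemoved: function (aItems, aCollection) {},
      onQueryCompleted: function (aCollection) {}
    };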

Gloda data types are 'nouns'

Gloda is extensible. The data types are referred to as nouns. For example, there is a 'message' noun that represents messages, a 'contact' noun that represents people, and an 'identity' noun that represents each of the various identities (each email address, each Twitter account, each Facebook account, etc.) each contact has.

Pieces of data on 'nouns' are 'attributes' and are contributed by 'attribute providers'

Whatever code defines a new noun type is likely to define a set of attributes on that noun that are fundamental. For example, all messages have an author, which is codified in the 'from' attribute. But other code is also allowed to define attributes on the noun. For example, a Bugzilla plugin may define a 'bugsReferenced' attribute on message nouns which it populates by analyzing the message body. Both the code that provides the fundamental attributes such as 'from' and the Bugzilla plugin providing 'bugsReferenced' are known as attribute providers.
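
A rough sketch of what the Bugzilla example could look like, following the attribute-provider pattern shown on the Gloda examples page; treat the exact dictionary keys, noun constants, and process() signature as assumptions to verify against the current source:

    var BugzillaAttrProvider = {
      init: function () {
        this._bugsAttr = Gloda.defineAttribute({
          provider: this,
          extensionName: "bugzilla-example",   // hypothetical name
          attributeType: Gloda.kAttrExplicit,  // contributed, not fundamental
          attributeName: "bugsReferenced",
          singular: false,                     // a message may cite many bugs
          subjectNouns: [Gloda.NOUN_MESSAGE],
          objectNoun: Gloda.NOUN_NUMBER
        });
      },

      // Called during indexing; in this era providers are generators that
      // yield work-state constants (constant name assumed).
      process: function (aGlodaMessage, aRawReps, aIsNew, aCallbackHandle) {
        var body = aRawReps.bodyLines ? aRawReps.bodyLines.join("\n") : "";
        var matches = body.match(/\bbug (\d+)/gi) || [];
        aGlodaMessage.bugsReferenced = matches.map(function (m) {
          return parseInt(m.replace(/\D/g, ""), 10);
        });
        yield Gloda.kWorkDone;
      }
    };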

Broad implementation details

Data storage

The gloda database is a SQLite database named "global-messages-db.sqlite" and can currently be found in the user's profile directory. The database file may be moved to another location in the future so that it does not complicate backup procedures: since gloda is just an index, it is not important to back it up, and the file can be a problem for backup mechanisms that operate at the granularity of entire files.

All of the attributes on a noun instance are saved in a single database row for lookup purposes. Fundamental attributes may be stored in database columns; the rest are stored in a JSON blob in a designated column on the row. The JSON blob is obviously not something that SQLite can query on, but it does not need to: we have a separate table where we emit rows intended for use by queries. For example, in the case of messages, the "messages" table stores the row representation and "messageAttributes" stores the attribute representations for query purposes.
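
As a purely conceptual illustration of this split (the field and column names below are hypothetical, not the actual schema):

    // Hypothetical illustration only; see the datastore source for the
    // real schema. The lookup row keeps fundamental attributes in columns
    // and everything else in a JSON blob:
    var lookupRow = {
      id: 42,
      folderID: 7,
      date: 1318200000000,
      jsonAttributes: JSON.stringify({ bugsReferenced: [12345] })
    };

    // For query purposes, each attribute value is also emitted as its own
    // row in the attribute table so SQLite can filter on it:
    var queryRows = [
      { messageID: 42, attributeID: 17, value: 12345 }
    ];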

Full-text search support is provided by the SQLite FTS3 engine and is stored in the same file as the rest of the database. A custom "mozporter" tokenizer is used to tokenize the indexed fulltext. The tokenizer has the following behaviors:

  • All characters are accent-folded and case-folded and then subjected to further processing. We use a lookup table built using the Unicode character database to perform this step. The primary limitation of this step is that each Unicode character in our mapping can reduce to only a single output character, which can result in inconsistencies with foldings that would properly expand to two characters (for example, 'ß' to 'ss').
  • CJK characters are bi-gram tokenized: pretending the string "ABC" was made of CJK characters, the tokens "AB" and "BC" would be emitted (see the sketch after this list).
  • Tokens are broken on whitespace and punctuation. High-unicode punctuation that does not get folded may be misinterpreted as part of a token.
  • Tokens that are made up entirely of ASCII characters (after folding) are run through the Porter stemmer. Tokens must be at least 2 characters long to be emitted. We are considering raising this limit, since it is dangerously low and the change would not require much work.
  • Tokens that contain non-ASCII characters (after folding) are emitted as-is.
  • It has never heard of stop-words.  This means exciting and frequent words like "the" get emitted.  It's all a bit embarrassing, really.  Patches totally, totally encouraged and accepted.
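
To make the bi-gram rule above concrete, here is a minimal JavaScript sketch of it; the real tokenizer is the C code referenced below, so this is illustration only:

    // Given a run of CJK characters, emit overlapping two-character tokens.
    function cjkBigrams(run) {
      var tokens = [];
      for (var i = 0; i + 1 < run.length; i++)
        tokens.push(run.substring(i, i + 2));
      return tokens;
    }

    cjkBigrams("日本語"); // => ["日本", "本語"]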

The source for our tokenizer is found in mailnews/extensions/fts3/src/fts3_porter.c.

FTS3 requires that it store the message text that it is indexing. This is in contrast to other solutions such as Lucene, which can index text without being able to (efficiently) reconstruct it. At least in the case of messages, we leverage the fact that we have to do this, retrieving the text from the database rather than loading it from disk when the message text is needed again.

Content whittlers

Email message bodies are frequently full of text that is not part of the message proper. Quoted messages, boilerplate footers, automatically created envelopes for web services effectively tunneling through email, etc., all abound. Content whittlers are a system for whittling the message body down to the actual message.

This is important for full-text indexing as well as for presenting the user with a snippet that actually reflects the message content. It is also useful in cases where gloda is not directly involved, which is why connotent.js also serves as an API that is used by the (non-gloda-aware) summary display that shows up when you select a collapsed thread or multiple messages/threads in the Thunderbird UI thread pane.

See connotent.js for more information.

Indexing

Indexing occurs on the main thread in an asynchronous fashion. It happens on the main thread because the MailNews folder/message implementations are not intended to be accessed off the main thread except in a very few controlled cases. We could probably do more things on their own thread than we currently do, especially in cases where we are not dealing with traditional MailNews folders/messages.

Although we hope that indexing generally keeps up with the real-time inflow of new messages and user changes, this is neither a guarantee nor a requirement. Indexing could have been made synchronous with changes, but that would obviously create a lot of problems given our desire for extensibility and the whole 'all database operations are async' thing.

GlodaIndexer provides the core indexing logic. GlodaMsgIndexer has the message-specific stuff, although the actual attribute-providers are found in GlodaExplicitAttr and GlodaFundAttr. GlodaABIndexer has the limited address book support.

Detailed implementation details

See the source code documentation for most things, but here are a few topics we feel are important to make more accessible / easier to find:

SQLite page/cache sizes

Page Size

From Thunderbird 3.0 through Thunderbird 8.0, gloda uses a SQLite page size of 1 KiB. This value was chosen because bug 401985 suggested it was the optimal page size on Linux and OS X and that 4 KiB was only slightly better on Windows. Additionally, the smaller page size was found beneficial in minimizing total data churn when updating the fulltext indexes; this matters because churn affects the rollback journal size and the potential number of fsyncs required during a database transaction, especially when the cache size is exceeded. By keeping the page size down, we were able to keep the cache size small as well, improving indexing performance.

From Thunderbird 9.0 onwards (probably), gloda will use the SQLite page size of 32 KiB, which is now the Mozilla platform default. The change is being made because:

  1. Extensive research by Taras Glek for the Places database determined that the I/O benefits of the larger page size were significant.  (This is not particularly surprising given the characteristics of seeks on hard disks and SQLite's limited ability to perform defragmentation of its page tables, but Taras was the first person to extensively quantify the ramifications while also accounting for operating system caching behaviour across multiple platforms, etc.)
  2. Investigation of the internal structure of the gloda database after an extended period of usage reveals non-trivial numbers of full-text index segments spilled over multiple pages located far apart page-number-wise, meaning that any query against the segment would result in many seeks, greatly degrading performance.
  3. An effort is being made to reduce the size of the database by various means, offsetting any increase in wasted space owing to the larger page sizes.

It is worth noting that while spindle-based hard disks are the primary beneficiaries of these changes, solid-state disks (SSDs) should also see an improvement in write performance.
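
If you want to verify which page size a given profile's gloda database is actually using, a sketch along these lines (standard mozStorage API, run from extension or test code; treat it as illustrative) should work:

    Components.utils.import("resource://gre/modules/Services.jsm");

    // Locate global-messages-db.sqlite in the current profile.
    var dbFile = Services.dirsvc.get("ProfD", Components.interfaces.nsIFile);
    dbFile.append("global-messages-db.sqlite");

    var conn = Services.storage.openUnsharedDatabase(dbFile);
    var stmt = conn.createStatement("PRAGMA page_size;");
    try {
      if (stmt.executeStep())
        dump("gloda page size: " + stmt.getInt32(0) + " bytes\n");
    } finally {
      stmt.finalize();
      conn.close();
    }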

Cache size

The primary benefits of the SQLite page cache that differentiate it from the operating system cache, for us, are:

  • Avoiding disk I/O during database transactions.  Overflows of the cache can result in additional expensive fsync operations.
  • Message indexing and querying are reasonably correlated but happen over longer time periods. Because SQLite database access is very random (exhibiting little locality), it is extremely hard for operating system caching layers to predict successfully. Obviously, if they cache all reads (or all of the file), it's not a problem. However, on systems without excessive amounts of spare memory, the SQLite cache is able to do a much better job.

From Thunderbird 3.0 through Thunderbird 8.0, gloda uses a *maximum* cache size of 8 MiB.  SQLite does not allocate all the cache up front, so if the global database is empty or is otherwise unused, the cache size will remain extremely small.

From Thunderbird 9.0 onwards (probably), gloda will select a *maximum* cache size somewhere between 8 MiB and 64 MiB. The preference mailnews.database.global.datastore.cache_to_memory_permillage controls the rate at which we grow between the minimum "maximum cache size" of 8 MiB and the maximum "maximum cache size" of 64 MiB. The default value is "10" (for 1.0%), which means that our maximum cache size grows from 8 MiB on systems with ~800 MiB of RAM up to a ceiling of 64 MiB on systems with ~6400 MiB. A system with 2 GiB will have a maximum cache size of ~20 MiB, a system with 4 GiB will have a maximum cache size of ~40 MiB, a system with 8 GiB will have a maximum cache size of 64 MiB, and a system with 16 GiB will still have a maximum cache size of 64 MiB. In the event we are unable to determine how much memory the system has, we will assume it has 256 MiB and choose a cache size of 8 MiB.
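
In other words, the selection logic amounts to clamping a fixed fraction of system memory, as in this sketch (function and variable names are illustrative):

    var MIN_CACHE_MIB = 8, MAX_CACHE_MIB = 64;

    function chooseMaxCacheSizeMiB(systemMemoryMiB, permillage) {
      // permillage defaults to 10, i.e. 1.0% of system memory.
      var proposed = systemMemoryMiB * (permillage / 1000);
      return Math.max(MIN_CACHE_MIB, Math.min(MAX_CACHE_MIB, proposed));
    }

    chooseMaxCacheSizeMiB(2048, 10); // ~20 MiB on a 2 GiB system
    chooseMaxCacheSizeMiB(8192, 10); // clamped to 64 MiB on an 8 GiB system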

The SQLite page cache is fairly clever: it is aware of memory pressure and will shrink itself by discarding unneeded pages.
