HTML Parser Threading
The HTML parser parses data received from the network off the main thread. (There's currently one parser thread serving all parser instances.) Data received from document.write()
is parsed on the main thread.
Main Objects
nsHtml5Parser
contains the code for dealing with data from document.write()
. (For historical reasons, it also contains unrelated fragment parsing code that should be refactored into a separate class in due course.) nsHtml5StreamParser
contains the code for dealing with data from the network. nsHtml5Parser
owns nsHtml5StreamParser
. Script-created parsers (i.e. parsers created using document.open()
) don't have an nsHtml5StreamParser
instance.
nsHtml5Parser
has one tokenizer/tree builder pair for parsing data from document.write()
. It also has another lazily initialized tokenizer/tree builder for speculatively scanning the tail of document.write()
data when the parser blocks without parsing the document.write()
data to completion.
nsHtml5StreamParser
has one tokenizer/tree builder pair for parsing data from the network.
Additionally, each nsHtml5Parser
has an associated nsHtml5TreeOpExecutor
that turns the output (tree operations; discussed later) of the portable parser core into actions performed on the Gecko DOM.
Initialization
An nsHtml5Parser
is constructed without an nsHtml5StreamParser
due to legacy code structure around parser creation. If the parser is not script-created, nsHtml5Parser::MarkAsNotScriptCreated()
is called to create an nsHtml5StreamParser
for the nsHtml5Parser.
Due to legacy interface design oddities, an nsHtml5Parser
is initialized by calling nsHtml5Parser::Parse(nsIURI*, nsIRequestObserver*, void*, nsDTDMode)
. That method call doesn't yet cause anything to be parsed, though. Actual network data is passed to an nsIStreamListener
. nsHtml5StreamParser
is the nsIStreamListener
implementation obtained by calling nsIStreamListener*
nsHtml5Parser::GetStreamListener().
The nsIStreamListener
methods (OnStartRequest
, OnDataAvailable
and OnStopRequest
) are called on the main thread. OnStartRequest
does all sorts of initialization on the main thread. At that point, nothing has happened on the parser thread yet and the nsHtml5StreamParser
lives fully on the main thread.
Passing Data to the Parser Thread
nsHtml5StreamParser
gets some of its methods called on the main thread and some called on the parser thread. The method implementations assert which thread they are supposed to be called on.
OnDataAvailable
and OnStopRequest
on the main thread post runnables on to the parser thread. There runnables obtain mTokenizerMutex
and call DoDataAvailable
and DoStopRequest
, repectively, on the parser thread. These two methods along with TimerFlush
(discussed later) are the entry points from the parser thread event loop into nsHtml5StreamParser.
The runnables hold mTokenizerMutex
for the entire duration of the DoDataAvailable
or DoStopRequest
call. mTokenizerMutex
protects most data structures on nsHtml5StreamParser
from concurrent access.
The runnables need to keep the nsHtml5StreamParser
instance alive, so they have to AddRef
it. However, if they Release
d on the parser thread, nsHtml5StreamParser
could be delete
d from the parser thread, which would lead to all sorts of badness, because nsHtml5StreamParser
has fields that hold objects that aren't safe to Release
except from the main thread. The runnables use nsHtml5RefPtr<nsHtml5StreamParser>
. nsHtml5RefPtr
sends runnables back to the main thread to call Release on nsHtml5StreamParser
on the main thread.
Normal (Non-Speculative) Parsing
Initially, DoDataAvailable
performs character encoding sniffing on the data if an encoding wasn't declared on the enclosing protocol (HTTP) level. Once a Unicode decoder has been set up, DoDataAvailable
passes the byte data to the decoder whose output is accumulated into a linked list of nsHtml5OwningUTF16Buffer
objects.
nsHtml5OwningUTF16Buffer
objects are then passed to the tokenizer in sequence by the ParseAvailableData()
method. The tokenizer calls into the tree builder which outputs nsHtml5TreeOperation
objects (tree ops) into a queue. Each tree op represents a small operation, such as element creation or appending a node into another, that can later be performed on the main thread.
Note that mTokenizerMutex
is being held by the parser thread for the entire time that the tokenizer runs. ParseAvailableData()
checks for parser termination or interruption from time to time. The termination and interruption indicators are guarded by a more narrowly scoped mTerminatedMutex
, so ParseAvailableData()
has the opportunity to receive an order to terminate or interrupt even while the parser thread is holding mTerminatedMutex
.
The tree op queue is flushed to the main thread from time to time. There's a timer that calls TimerFlush()
(which takes mTokenizerMutex
). Tree ops from the queue in the tree builder are moved to a tree op stage (nsHtml5TreeOpStage)
. The staging queue is protected by a mutex. This way, the queues into which tree ops are produced and the queues from which they are consumed don't need to involve a mutex on each modification. It's more efficient to obtain a mutex only every once in a while when a whole bunch of tree ops is moved to or from the staging queue.
After some tree ops have been moved to the staging queue, the nsIRunnable
held in mExecutorFlusher
is dispatched to the main thread. (The same runnable is used repeatedly in order to avoid cross-thread refcounting issues.)
Memory Management When Crossing the Thread Boundary
The tree ops hold various heap-allocated objects that end up crossing the thread boundary. These objects are memory managed as follows:
Attribute holders (nsHtml5HtmlAttributes
objects) are allocated using new
by the tokenizer. The tree builder either transfers an attribute holder to a tree op whose destructor explicitly delete
s the attribute holder or if the tree builder ends up ignoring an attribute holder, the tree builder delete
s the attribute holder explicitly.
Attribute values, public identifiers (in doctype) and system identifiers (in doctype) are heap-allocated nsString
objects (i.e. pointed to by nsString*
!). Attribute values nsStrings
are deleted by the attribute holder when it gets deleted. Public identifiers and system identifiers are deleted by the tree op or by the portable parser core if they don't make it as far a tree op. The backing nsStringBuffer
s are thread-safely refcounted appropriately when the nsString
goes away. (Aside: It would make more sense to make the parser deal with nsStringBuffer
s directly without having heap-allocated nsString
s involved.)
Element names, attribute names and the doctype name are represented as nsIAtom
s. Pre-interned attribute and element names hold atoms that are actually app-wide nsStaticAtom
s. Since nsStaticAtom
s are valid app-wide, these atoms work right app-wide on the main thread. For other atoms, the parser uses nsHtml5Atom
objects that are atomic only within the scope of an nsHtml5AtomTable
. There is one nsHtml5AtomTable
per each nsHtml5Parser
and one per each nsHtml5StreamParser
. Thus, each tokenizer/tree builder pair sees atoms that are atomic (pointer-comparable) within the tokenizer/tree builder pair. However, when tree ops are executed, every atom has to be tested for being a static atom (using IsStaticAtom()
). If the atom isn't static, a corresponding normal main-thread dynamically-allocated atom has to be obtained for the same string and used instead of the nsHtml5Atom
. nsHtml5Atom
objects are guaranteed to be immutable and to stay alive for the lifetime of the parser instance, so it's safe for the main thread to call their methods (as is necessary to find the corresponding normal atom).
References to DOM nodes in tree ops are of type nsIContent**
and are called content handles. A handle points to memory allocated by the tree builder and guaranteed to stick around for the life time of the parser. The tree builder never dereferences a content handle. In fact, a content handle won't have an actual node behind it initially. Only when the tree op executor executes a node creation tree op, the nsIContent*
that points to the node gets written to the memory location pointed to by the nsIContent**
. For any given content handle, the node creation tree op is always the first tree op in queue that uses that handle, so subsequent non-creation tree ops can safely dereference the handle to obtain a real node. Nodes created by tree ops are owned by the parser for the lifetime of the parser, so for refcounting purposes, the parser owns each nsIContent
object it creates by one reference no matter how many copies of the content handle for the node exist inside the parser. (This indeed means that removing parser-created nodes from the DOM during parsing doesn't release memory until the parser stops parsing, which is, in theory, a problem for long polling using HTML in an iframe e.g. for a chat app.)
Some tree ops hold char
or PRUnichar
arrays. Those arrays are allocated when creating the tree op and delete[]
d by the destructor of nsHtml5TreeOperation
. Hence, data gets copied to and from these arrays.
Executing Tree Ops
When the executor flushing runnable fires on the main thread, it calls nsHtml5TreeOpExecutor::RunFlushLoop()
. This method returns immediately if another invocation of it is already on the call stack (nested event loop case). The method then enters a loop that has a bunch of return conditions (including parser termination).
The loop first flushes speculative loads (creation discussed later). If the executor is reading from a stage (i.e. the ops are coming from the parser thread), the executor moves the tree ops from the staging queue into its own queue. (Again, here we get to move a bunch of tree ops by obtaining a mutex once instead of having to synchronize thread on a per-tree op basis.) If the executor isn't reading from a stage, it calls nsHtml5Parser::ParseUntilBlocked()
to parse potential document.write()
-generated data into tree ops.
The outer loop then has an inner loop that iterates over the tree ops in the executor's queue and calls Perform on each one. Parser termination is checked before each tree op for an early return, because, unfortunately, Gecko expects parsers to be able to terminate immediately. After each tree op except the last one in the queue, the clock time is checked to see if RunFlushLoop()
has spent too much time without returning to the event loop. If too much time has been spent, a runnable for calling RunFlushLoop()
again is dispatched and an early return made before all tree ops have been executed. (Only the already executed ops are destructed and the rest are left in the queue.)
The last op in the queue may be an op for attempting to execute a script that may block the parser (ops for attempting to execute async and defer scripts are normal tree ops and don't need to be the last op in a queue). The parts of the parser that drive the tokenizer/tree builder pairs guarantee to flush tree ops immediately if a script execution tree op is generated. (The tree builder makes the tokenizer return immediately when such an op is generated.) Thus, a tree op that may cause the parser to block can only occur as the last op in a queue that RunFlushLoop
sees.
Preparing the Parser Thread for Script Execution
When the parser thread sees that it has generated a tree op that attempts to execute a (non-async, non-defer) script, it starts speculating. It acquires mSpeculationMutex
and while holding it, it creates a new speculation and adds it to the queue of speculations, flushes the tree ops to the main thread, marks the nsHtml5StreamParser
as being in a speculating state and sets the newly created speculation object as the tree op sink (instead of the tree op stage described earlier) of the tree builder. The speculation object has a queue of tree ops (into which the tree builder will now flush ops to instead of the tree op stage), an owning reference to the nsHtml5OwningUTF16Buffer
that contains the starting point of the speculation, an index into the nsHtml5OwningUTF16Buffer
defining the exact starting point within the buffer, the line number of the tokenizer at that point and a snapshot of the tree op state. The line number and the snapshot are also added to the tree op that attempts to execute scripts (before flushing it).
The tree builder snapshot contains a copy of the tree builder stack, a copy of the list of open formatting elements and copies of various fields that define the tree builder state. It contains all the information that is necessary to load back into the tree builder to restore it into a behaviorally equivalent state.
The flush timer is disengaged, because it's not useful to make the tree builder flush ops into the queue of the speculation incrementally.
After the speculation has been started and mSpeculationMutex
released, the parser thread will parse incoming data and the tree builder will produce tree ops so that they get flushed into the queue in the speculation object.
If another tree op that attempts to execute a script is generated while speculating, another speculation object is initialized as described above and placed in the queue of speculation objects after the previous one.
It is possible that the parser thread ends up parsing all remaining data into speculations before the main thread is ready to receive more tree ops.
Attempting to Execute a Script
On the main thread, when the tree op executor executes a tree op that attempts to execute a script, the tree op is first checked for a tree builder state snapshot. There will be a snapshot if the tree op came from the parser thread. There won't be one if the tree op came from document.write()
on the main thread (discussed later). If there is a snapshot, the tokenizer used for parsing document.write()
-originating data on the main thread is reset to the data state. There's no need to snapshot the tokenizer state, because the state of the tokenizer is always the same immediately after the tokenizer has emitted a script end tag. Then the tree builder that is used for parsing document.write()
-originating data is initialized from the state snapshot. Now the tokenizer/tree builder pair used for parsing document.write()
data is in the same state that the off-the-main-thread tokenizer/tree builder pair was immediately after processing the script end tag.
The tree op executor then attempts to execute the script. This operation either executes an inline script right away or tells the parser to block (for external scripts). If the parser is told to block, the script loader will unblock the parser before the external script executes. If the script was executed without blocking, the executor dispatches an event to reflush itself. (Remember that reflushing ends up calling nsHtml5Parser::ParseUntilBlocked()
. That method is responsible for moving back to consuming tree ops from the parser thread via the staging area.)
document.write()
document.write()
Calls to document.write()
push data to the tokenizer/tree builder pair that exists on the main thread solely for parsing data from document.write()
. If the parser isn't blocked, document.write()
parses input into tree ops synchronously and flushes the tree ops synchronously by calling nsHtml5TreeOpExecutor::FlushDocumentWrite()
. Unlike nsHtml5TreeOpExecutor::RunFlushLoop()
, nsHtml5TreeOpExecutor::FlushDocumentWrite()
does not sample the clock to return early.
Inline scripts that arrive from document.write()
are executed synchronously (if the parser isn't already blocked). External scripts arriving from document.write()
block the parser.
If there is data in document.write()
input after a an external script (or any data at all if a previous document.write()
call has contained an external script and blocked the parser), the data is left in a linked list of nsHtml5OwningUTF16Buffer
objects. (How exactly data is inserted into the buffer list depends on parser keys when document.write()
is invoked re-entrantly, but that's outside the scope of this document, since this document is about parser threading.)
Whenever the argument of document.write()
isn't tokenized to completion synchronously, the part left unprocessed by the main document.write()
tokenizer/tree builder is processed by the third tokenizer/tree builder pair. The tree ops from that processing are ignored and only speculative loads are used. (Speculative loads are discussed later.) The third tokenizer/tree builder pair is initialized from the state of the main document.write()
tokenizer/tree builder pair but can get into an inconsistent state if document.write()
is invoked re-entrantly. This is OK, because the tree ops aren't used and it's mostly harmless if the speculative loads are wrong.
After an external script finishes executing, it dispatches an event to reflush the tree op executor. The reflush causes a call to nsHtml5Parser::ParseUntilBlocked()
. ParseUntilBlocked()
parses data left by document.write()
in the linked list of buffers, because the parser was blocked. This might cause more script execution tree ops to be generated and executed and document.write()
to be called more times.
When ParseUntilBlocked()
exhausts the data available to it, it calls nsHtml5StreamParser::ContinueAfterScripts
to resume the consumption of data from the network.
Continuing to Use Network-Originating Data After Scripts
nsHtml5StreamParser::ContinueAfterScripts
runs on the main thread. It acquires mSpeculationMutex
in order to be able to touch the queue of speculations. The mutex acquisition only blocks if the parser thread happens to be setting up a new speculation object at the same moment.
The first speculation in the speculation queue is examined. If the last character seen by the document.write()
tokenizer was not a carriage return, if the document.write()
tokenizer is in the "data" state and if the tree builder state snapshot on the speculation matches the state of the document.write()
tree builder, the speculation has succeeded. Otherwise it has failed.
If there is more than one speculation in the queue and if the speculation is successful, we can dequeue the first speculation while holding mSpeculationMutex
without bothering the parser thread. In this case, the tree ops from the speculation are flushed into the tree op executor and ContinueAfterScripts()
returns early. (nsHtml5TreeOpExecutor::RunFlushLoop
is on the call stack already in this case and will loop around to start flushing the ops right away.)
If there's only one speculation or the speculation failed, it's necessary to bother the parser thread. In that case, the main thread raises an interruption flag on the nsHtml5StreamParser
object to make the parser thread release mTokenizerMutex
earlier. The main thread then attempts to acquire mTokenizerMutex
.
If the single speculation is successful, the tree ops from the parser-thread tree builder are flushed straight to the tree op executor and the tree op stage is set as the tree op sink for the tree builder. The executor is told that it's reading from the stage again.
If the speculation failed, the first buffer corresponding to the starting point of the speculation gets its start index restored to the index stored on the speculation object. Subsequent buffers get their start index reset back to zero. The tokenizer gets its line number restored from the speculation object. This is equivalent to rewinding the input to where it was when speculation started.
All the speculations in the speculation queue are destroyed. Pending tree ops in the tree builder are cleared.
The tree op stage is set as the tree op sink for the tree builder. The executor is told that it's reading from the stage again.
The state of the main-thread tokenizer is copied to the parser-thread tokenizer. Likewise for the tree builders. The "last was carriage return" flag is copied over also.
A runnable is dispatched to the parser thread to call to mark the parser uninterrupted and to call nsHtml5StreamParser::ParseAvailableData()
on the parser thread.
mTokenizerMutex
is released.
Speculative Loads
When the tree builder on the parser thread encounters HTML script
, stylesheet link
, video
(with a poster frame), base
or img
or SVG script
, style
or image
elements, preload operations are appended to a speculative load queue. There speculative load operations are flushed to the main thread (via the staging area) whenever tree ops are flushed and also whenever nsHtml5StreamParser::ParseAvailableData()
is about to stop working. That is, when the parser thread is parsing speculatively, speculative loads continue to be flushed to the main thread even though tree ops accumulate into the speculations.
On the main thread, the speculative tokenizer/tree builder pair used for parsing document.write()
input that didn't get parsed by the main document.write()
tokenizer/tree builder synchronously also sends speculative loads to the tree op executor.
The executor acts on the speculative load queue when the runnable dispatched by nsHtml5StreamParser::ParseAvailableData()
fires and right before executing tree ops whenever tree ops are executed.
When the executor acts on speculative loads, it starts speculative HTTP fetches for images (including video poster frames), style sheets and scripts.
Additionally, the speculative load queue is used for transferring information from the parser thread to the main thread when the information needs to arrive before starting any speculative loads and when the information is known not to be speculative. There are two such pieces of information: the manifest URL for an App Cache manifest and the character encoding for the document.