Closed Bug 551344 Opened 14 years ago Closed 14 years ago

HTML5 parser review

Categories

(Core :: DOM: HTML Parser, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: sicking, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(9 files, 7 obsolete files)

21.75 KB, patch
sicking
: review+
Details | Diff | Splinter Review
3.22 KB, patch
sicking
: review+
Details | Diff | Splinter Review
7.23 KB, patch
sicking
: review+
Details | Diff | Splinter Review
5.39 KB, patch
sicking
: review+
Details | Diff | Splinter Review
96.57 KB, patch
sicking
: review+
Details | Diff | Splinter Review
26.27 KB, patch
sicking
: review+
Details | Diff | Splinter Review
775 bytes, patch
Details | Diff | Splinter Review
539.33 KB, patch
sicking
: review+
Details | Diff | Splinter Review
758 bytes, patch
sicking
: review+
Details | Diff | Splinter Review
Will put my review comments of the HTML5 parser in this bug.
nsHtml5Parser.cpp:


nsHtml5Parser::ContinueInterruptedParsing()

184   if (mExecutor->IsFlushing()) {
185     // A nested event loop dequeued the continue event and there aren't
186     // scripts executing. What's currently causing the flush is running to 
187     // completion or there will be a script later and the script will cause
188     // another continue event.
189     return NS_OK;
190   }

Could this not also happen if for example a timeout causes flushing?


nsHtml5Parser::Parse

265   // Return early if the parser has processed EOF
266   if (!mExecutor->HasStarted()) {
267     NS_ASSERTION(!mStreamParser,
268                  "Had stream parser but document.write started life cycle.");
269     mExecutor->SetParser(this);
270     mTreeBuilder->setScriptingEnabled(mExecutor->IsScriptEnabled());
271     mTokenizer->start();
272     mExecutor->Start();
273     /*
274      * If you move the following line, be very careful not to cause 
275      * WillBuildModel to be called before the document has had its 
276      * script global object set.
277      */
278     mExecutor->WillBuildModel(eDTDMode_unknown);
279   }
The initial comment doesn't seem to be related to what happens inside the if-statment.

294   NS_PRECONDITION(IsInsertionPointDefined(), 
295                   "Document.write called when insertion point not defined.");
296 
297   NS_PRECONDITION(!(mStreamParser && !aKey), "Got a null key in a non-script-created parser");

Don't use NS_PRECONDITION for tests that happen in the middle of the function. NS_ASSERTION is more appropriate here. Also, shouldn't the first assertion say that nsHtml5Parser::Parse was called. document.write being called is not assertion-worthy since web pages can cause that.

There's quite a few lines longer than 80 chars in this file.

303   nsRefPtr<nsHtml5UTF16Buffer> buffer = new nsHtml5UTF16Buffer(aSourceBuffer.Length());
304   memcpy(buffer->getBuffer(), aSourceBuffer.BeginReading(), aSourceBuffer.Length() * sizeof(PRUnichar));
305   buffer->setEnd(aSourceBuffer.Length());

It's very unfortunate to allocate a new buffer and copy the memory every time document.write is called. We should really only do that when needed, which is only when a <script> is written, right? Since the old parser does that maybe leave this as is for now, but please do file a bug on it.

317   if (aKey) { // after document.open, the first level of document.write has null key
318     while (searchBuf != mLastBuffer) {
319       if (searchBuf->key == aKey) {
320         // found a key holder
321         // now insert the new buffer between the previous buffer
322         // and the key holder.
323         buffer->next = searchBuf;
324         if (prevSearchBuf) {
325           prevSearchBuf->next = buffer;
326         } else {
327           mFirstBuffer = buffer;
328         }
329         break;
330       }
331       prevSearchBuf = searchBuf;
332       searchBuf = searchBuf->next;
333     }
334     if (searchBuf == mLastBuffer) {
335       // key was not found
336       nsHtml5UTF16Buffer* keyHolder = new nsHtml5UTF16Buffer(aKey);
337       keyHolder->next = mFirstBuffer;
338       buffer->next = keyHolder;
339       mFirstBuffer = buffer;
340     }

Ugh, the fact that this list is a mix of a magic mLastBuffer buffer, special key-holding buffers, and normal text-holding buffers is pretty ugly. It took me a while to figure out what's going on. Why not make the list just hold normal buffers that all point to a key, and make the list terminate with nsnull?

Also, is there a reason these buffers are refcounted? It seems like they are always owned by the list.

356   if (!mBlocked) {
357     // mExecutor->WillResume();
358     while (buffer->hasMore()) {
359       buffer->adjust(mLastWasCR);
360       mLastWasCR = PR_FALSE;
361       if (buffer->hasMore()) {

You can rewrite the loop to |while (!mBlocked && buffer->hasMore()) {| right? Can mLastWasCR ever be true other than before the first iteration? If not just move the adjust call to outside the loop.

And please add a comment stating why WillResume/WillInterrupt calls aren't needed, and remove the commented out calls.

368           // we aren't the root context, so save the line number on the 
369           // *stack* so that we can restore it.
370           lineNumberSave = mTokenizer->getLineNumber();

Don't we need to put the tokenizer in a mode where it doesn't increase the line number at all? Otherwise you'll end up with a weird line number for document.written stuff?

395   if (!mBlocked) { // buffer was tokenized to completion
396     NS_ASSERTION(!buffer->hasMore(), "Buffer wasn't tokenized to completion?");  
397     // Scripting semantics require a forced tree builder flush here
398     mTreeBuilder->flushCharacters(); // Flush trailing characters
399     mTreeBuilder->Flush(); // Move ops to the executor

Please make Flush() call flushCharacters(). Even if might result in a couple of extra calls to flushCharacters it's unlikely to matter perf wise. And it'll reduce the risk of people forgetting to call flushCharacters causing bugs.


416 nsHtml5Parser::Terminate(void)

Please remove the 'void'. We don't use that convention anywhere else.


nsHtml5Parser::ParseFragment:

469   mExecutor->SetNodeInfoManager(target->GetOwnerDoc()->NodeInfoManager());

Just use doc->NodeInfoManager()

461   nsCOMPtr<nsISupports> container = doc->GetContainer();
462   // Can be null if owner document is synthetic

I have a feeling that we want to just use a null container for fragment parsing.

476   mTreeBuilder->setScriptingEnabled(mExecutor->IsScriptEnabled());
477   mTokenizer->start();
478   mExecutor->Start(); // Don't call WillBuildModel in fragment cas

Please consistently use FirstCharUpperCase convention when naming functions.

479   if (!aSourceBuffer.IsEmpty()) {
480     PRBool lastWasCR = PR_FALSE;
481     nsHtml5UTF16Buffer buffer(aSourceBuffer.Length());
482     memcpy(buffer.getBuffer(), aSourceBuffer.BeginReading(), aSourceBuffer.Length() * sizeof(PRUnichar));
483     buffer.setEnd(aSourceBuffer.Length());
484     while (buffer.hasMore()) {
485       buffer.adjust(lastWasCR);
486       lastWasCR = PR_FALSE;
487       if (buffer.hasMore()) {
488         lastWasCR = mTokenizer->tokenizeBuffer(&buffer);
489       }
490     }
491   }

When can we ever return from the tokenizer without consuming the whole buffer? Scripts getting parsed shouldn't stop the parser here, no?


535 nsHtml5Parser::CanInterrupt()
536 {
537   return !mExecutor->IsFragmentMode();
538 }

Shouldn't this also return false during document.write?


nsHtml5Parser::ParseUntilBlocked

580   if (mBlocked) {
581     return;
582   }
583 
584   if (mExecutor->IsComplete()) {
585     return;
586   }

Combine these.

597         if (mDocumentClosed) {
598           NS_ASSERTION(!mStreamParser,
599                        "This should only happen with script-created parser.");
600           mTokenizer->eof();
601           mTreeBuilder->StreamEnded();
602           mTreeBuilder->Flush();
603           mExecutor->Flush(PR_TRUE);
604           mTokenizer->end();
605           return;            
606         } else {

else-after-return

607           // never release the last buffer.
608           NS_ASSERTION(!mLastBuffer->getStart(), 
609             "Sentinel buffer had its indeces changed.");
610           NS_ASSERTION(!mLastBuffer->getEnd(), 
611             "Sentinel buffer had its indeces changed.");

Combine these

621           }
622           return; // no more data for now but expecting more
623         }
624       } else {

Once you fix the above else-after-return, then this will be an else-after-return too.

634     // now we have a non-empty buffer
635     mFirstBuffer->adjust(mLastWasCR);
636     mLastWasCR = PR_FALSE;
637     if (mFirstBuffer->hasMore()) {
638       PRBool inRootContext = (!mStreamParser && (mFirstBuffer->key == mRootContextKey));
639       if (inRootContext) {
640         mTokenizer->setLineNumber(mRootContextLineNumber);
641       }

Don't you need the 'else' part for the line number handling that we have in nsHtml5Parser::Parse?
(line numbers based on revision 6f8f9de6efb7, i think)
(In reply to comment #1)
> nsHtml5Parser.cpp:
> 
> 
> nsHtml5Parser::ContinueInterruptedParsing()
> 
> 184   if (mExecutor->IsFlushing()) {
> 185     // A nested event loop dequeued the continue event and there aren't
> 186     // scripts executing. What's currently causing the flush is running to 
> 187     // completion or there will be a script later and the script will cause
> 188     // another continue event.
> 189     return NS_OK;
> 190   }
> 
> Could this not also happen if for example a timeout causes flushing?

As far as I can tell, a timeout can't cause flushing of the HTML5 parser, because document.write() from a timeout implies document.open() and there are no content-initiated flushes in the HTML5 parser.

(Note that this code is getting rearranged in bug 543458.)

> nsHtml5Parser::Parse
> 
> 265   // Return early if the parser has processed EOF
> 266   if (!mExecutor->HasStarted()) {
> 267     NS_ASSERTION(!mStreamParser,
> 268                  "Had stream parser but document.write started life
> cycle.");
> 269     mExecutor->SetParser(this);
> 270     mTreeBuilder->setScriptingEnabled(mExecutor->IsScriptEnabled());
> 271     mTokenizer->start();
> 272     mExecutor->Start();
> 273     /*
> 274      * If you move the following line, be very careful not to cause 
> 275      * WillBuildModel to be called before the document has had its 
> 276      * script global object set.
> 277      */
> 278     mExecutor->WillBuildModel(eDTDMode_unknown);
> 279   }
> The initial comment doesn't seem to be related to what happens inside the
> if-statment.

Moved the comment to the right place.

> 294   NS_PRECONDITION(IsInsertionPointDefined(), 
> 295                   "Document.write called when insertion point not
> defined.");
> 296 
> 297   NS_PRECONDITION(!(mStreamParser && !aKey), "Got a null key in a
> non-script-created parser");
> 
> Don't use NS_PRECONDITION for tests that happen in the middle of the function.
> NS_ASSERTION is more appropriate here. 

Changed to NS_ASSERTION.

> Also, shouldn't the first assertion say
> that nsHtml5Parser::Parse was called. document.write being called is not
> assertion-worthy since web pages can cause that.

The point is that it's assertion-worthy if document.write() reaches all the way to the backing Parse() method in that case. Changed the wording.

> There's quite a few lines longer than 80 chars in this file.

Fixed in this file. Also filed bug 551939.

> 303   nsRefPtr<nsHtml5UTF16Buffer> buffer = new
> nsHtml5UTF16Buffer(aSourceBuffer.Length());
> 304   memcpy(buffer->getBuffer(), aSourceBuffer.BeginReading(),
> aSourceBuffer.Length() * sizeof(PRUnichar));
> 305   buffer->setEnd(aSourceBuffer.Length());
> 
> It's very unfortunate to allocate a new buffer and copy the memory every time
> document.write is called. We should really only do that when needed, which is
> only when a <script> is written, right? 

Only when an external script is written, but yeah.

> Since the old parser does that maybe
> leave this as is for now, but please do file a bug on it.

Filed bug 551942.

> 317   if (aKey) { // after document.open, the first level of document.write has
> null key
> 318     while (searchBuf != mLastBuffer) {
> 319       if (searchBuf->key == aKey) {
> 320         // found a key holder
> 321         // now insert the new buffer between the previous buffer
> 322         // and the key holder.
> 323         buffer->next = searchBuf;
> 324         if (prevSearchBuf) {
> 325           prevSearchBuf->next = buffer;
> 326         } else {
> 327           mFirstBuffer = buffer;
> 328         }
> 329         break;
> 330       }
> 331       prevSearchBuf = searchBuf;
> 332       searchBuf = searchBuf->next;
> 333     }
> 334     if (searchBuf == mLastBuffer) {
> 335       // key was not found
> 336       nsHtml5UTF16Buffer* keyHolder = new nsHtml5UTF16Buffer(aKey);
> 337       keyHolder->next = mFirstBuffer;
> 338       buffer->next = keyHolder;
> 339       mFirstBuffer = buffer;
> 340     }
> 
> Ugh, the fact that this list is a mix of a magic mLastBuffer buffer, special
> key-holding buffers, and normal text-holding buffers is pretty ugly. It took me
> a while to figure out what's going on.

Yeah, document.write() order management isn't pretty. :-(

> Why not make the list just hold normal
> buffers that all point to a key, 

My recollection is that I had that design first. Then to fix a problem with multiple document.write() calls and nested document.write()s, I decided I needed keyholders to mark points between the actual buffers. Unfortunately, I can't recall what problem exactly I was solving, so I can't say if a simpler solution exists. I used the same buffer class for the key holders, because the code that iterates forward in the list conveniently eats up the key holders when they appear as empty buffers.

> and make the list terminate with nsnull?

With the sentinel buffer, there's a place for the root context key should it ever be non-null. Having a sentinel object also keeps the queue management code more similar to the queue management code in nsHtml5StreamParser, which puts data in the last buffer but recycles the buffer without releasing it.

> Also, is there a reason these buffers are refcounted? It seems like they are
> always owned by the list.

They used to be non-refcounted. The same buffer class is used by nsStreamParser and there they do need to be refcounted to keep data around as long as there's a pending speculation that points to the buffer list in case the speculation fails and the stream has to be rewound.

> 356   if (!mBlocked) {
> 357     // mExecutor->WillResume();
> 358     while (buffer->hasMore()) {
> 359       buffer->adjust(mLastWasCR);
> 360       mLastWasCR = PR_FALSE;
> 361       if (buffer->hasMore()) {
> 
> You can rewrite the loop to |while (!mBlocked && buffer->hasMore()) {| right?

Yes. Fixed.

> Can mLastWasCR ever be true other than before the first iteration? If not just
> move the adjust call to outside the loop.

Yes, mLastWasCR changing during the loop is a crucial aspect of the loop.

The line
mLastWasCR = mTokenizer->tokenizeBuffer(buffer);
sets it inside the loop when the tokenizer finds a CR and needs the outer code to skip over an LF immediately after. The LF could be in different buffer, so it makes sense to do the LF skipping outside the tokenizer. (CRLF is such a pain.)

> And please add a comment stating why WillResume/WillInterrupt calls aren't
> needed, and remove the commented out calls.

These are going away in bug 543458.

> 368           // we aren't the root context, so save the line number on the 
> 369           // *stack* so that we can restore it.
> 370           lineNumberSave = mTokenizer->getLineNumber();
> 
> Don't we need to put the tokenizer in a mode where it doesn't increase the line
> number at all? Otherwise you'll end up with a weird line number for
> document.written stuff?

The old parser ends up with weird line numbers, too, so this isn't any worse than the status quo. The correct fix would be making the JS engine pass the line number of the document.write() call to the parser and make the parser associate document.written scripts and styles with that line number (and file). I haven't introduced a new tokenizer mode, because the correct fix wouldn't need it.
 
> 395   if (!mBlocked) { // buffer was tokenized to completion
> 396     NS_ASSERTION(!buffer->hasMore(), "Buffer wasn't tokenized to
> completion?");  
> 397     // Scripting semantics require a forced tree builder flush here
> 398     mTreeBuilder->flushCharacters(); // Flush trailing characters
> 399     mTreeBuilder->Flush(); // Move ops to the executor
> 
> Please make Flush() call flushCharacters(). Even if might result in a couple of
> extra calls to flushCharacters it's unlikely to matter perf wise. And it'll
> reduce the risk of people forgetting to call flushCharacters causing bugs.

Fixed.

> 416 nsHtml5Parser::Terminate(void)
> 
> Please remove the 'void'. We don't use that convention anywhere else.

Fixed. (Copied from nsParser, FWIW.)

> nsHtml5Parser::ParseFragment:
> 
> 469   mExecutor->SetNodeInfoManager(target->GetOwnerDoc()->NodeInfoManager());
> 
> Just use doc->NodeInfoManager()

Fixed.

> 461   nsCOMPtr<nsISupports> container = doc->GetContainer();
> 462   // Can be null if owner document is synthetic
> 
> I have a feeling that we want to just use a null container for fragment
> parsing.

Changed.

> 476   mTreeBuilder->setScriptingEnabled(mExecutor->IsScriptEnabled());
> 477   mTokenizer->start();
> 478   mExecutor->Start(); // Don't call WillBuildModel in fragment cas
> 
> Please consistently use FirstCharUpperCase convention when naming functions.

We decided long ago it wasn't worthwhile to make the translator munge Java naming conventions to Mozilla C++ conventions. start() on the tokenizer is translated from Java.

> 479   if (!aSourceBuffer.IsEmpty()) {
> 480     PRBool lastWasCR = PR_FALSE;
> 481     nsHtml5UTF16Buffer buffer(aSourceBuffer.Length());
> 482     memcpy(buffer.getBuffer(), aSourceBuffer.BeginReading(),
> aSourceBuffer.Length() * sizeof(PRUnichar));
> 483     buffer.setEnd(aSourceBuffer.Length());
> 484     while (buffer.hasMore()) {
> 485       buffer.adjust(lastWasCR);
> 486       lastWasCR = PR_FALSE;
> 487       if (buffer.hasMore()) {
> 488         lastWasCR = mTokenizer->tokenizeBuffer(&buffer);
> 489       }
> 490     }
> 491   }
> 
> When can we ever return from the tokenizer without consuming the whole buffer?
> Scripts getting parsed shouldn't stop the parser here, no?

The tokenizer returns early if it saw a CR or if the tree builder saw a </script>.

Returning from the tokenizer on CR and making it the responsibility of the outer code to call adjust() on the buffer makes CRLF coalescing work right across buffer boundaries including crazy document.write boundaries and makes the tokenizer internals much nicer that they'd be if the tokenizer had to coalesce CRLF on its own.

For consistency and simplicity, the tree builder makes the tokenizer return on </script> even in the fragment case. This is harmless, since the while loop just calls the tokenizer again.

> 535 nsHtml5Parser::CanInterrupt()
> 536 {
> 537   return !mExecutor->IsFragmentMode();
> 538 }
> 
> Shouldn't this also return false during document.write?

In the revision you read, it doesn't matter what this returns.

With the patch for bug 543458, this could always return PR_TRUE, since the value affects the behavior of nsContentSink::DidProcessATokenImpl() which isn't called by the HTML5 parser in the uninterruptible cases.

Changed to return PR_TRUE always.

> nsHtml5Parser::ParseUntilBlocked
> 
> 580   if (mBlocked) {
> 581     return;
> 582   }
> 583 
> 584   if (mExecutor->IsComplete()) {
> 585     return;
> 586   }
> 
> Combine these.

Can't combine these any longer after the patch for bug 543458, because the "IsComplete" case now does more stuff than the blocked case.

In bug 543458, there are three return conditions that could be combined, but keeping them separate makes it nicer to have distinct comments for each condition. (I'm expecting the compiler to produce the same opt code either way.)

> 597         if (mDocumentClosed) {
> 598           NS_ASSERTION(!mStreamParser,
> 599                        "This should only happen with script-created
> parser.");
> 600           mTokenizer->eof();
> 601           mTreeBuilder->StreamEnded();
> 602           mTreeBuilder->Flush();
> 603           mExecutor->Flush(PR_TRUE);
> 604           mTokenizer->end();
> 605           return;            
> 606         } else {
> 
> else-after-return

Fixed.

> 607           // never release the last buffer.
> 608           NS_ASSERTION(!mLastBuffer->getStart(), 
> 609             "Sentinel buffer had its indeces changed.");
> 610           NS_ASSERTION(!mLastBuffer->getEnd(), 
> 611             "Sentinel buffer had its indeces changed.");
> 
> Combine these

Fixed.

> 621           }
> 622           return; // no more data for now but expecting more
> 623         }
> 624       } else {
> 
> Once you fix the above else-after-return, then this will be an
> else-after-return too.

Fixed.

> 634     // now we have a non-empty buffer
> 635     mFirstBuffer->adjust(mLastWasCR);
> 636     mLastWasCR = PR_FALSE;
> 637     if (mFirstBuffer->hasMore()) {
> 638       PRBool inRootContext = (!mStreamParser && (mFirstBuffer->key ==
> mRootContextKey));
> 639       if (inRootContext) {
> 640         mTokenizer->setLineNumber(mRootContextLineNumber);
> 641       }
> 
> Don't you need the 'else' part for the line number handling that we have in
> nsHtml5Parser::Parse?

The line numbers are so bogus in that case anyway that it's not worthwhile to make them bogus but in a different way.
Depends on: 539887, 543458
Comment on attachment 432131 [details] [diff] [review]
Address review comments in nsHtml5Parser.cpp

Oops. I need to tweak the patch a bit.
Attachment #432131 - Attachment is obsolete: true
Attachment #432131 - Flags: review?(jonas)
nsHtml5TreeOpExecutor.cpp

Again there are lots of lines over 80 chars wide. I'd say for now lets leave that as it's harder to review patches that contain whitespace cleanup together with actual changes. But we should at some point go through and fix up this. And please make sure to not use over long lines in new code.

nsHtml5TreeOpExecutor::InitializeStatics

101   if (sTreeOpQueueMinLength <= 0) {
102     sTreeOpQueueMinLength = 200;
103   }
104   if (sTreeOpQueueLengthLimit < sTreeOpQueueMinLength) {
105     sTreeOpQueueLengthLimit = sTreeOpQueueMinLength;
106   }
107   if (sTreeOpQueueMaxLength < sTreeOpQueueMinLength) {
108     sTreeOpQueueMaxLength = sTreeOpQueueMinLength;
109   }
110   if (sTreeOpQueueMaxTime <= 0) {
111     sTreeOpQueueMaxTime = 200;
112   }

These adjustments won't help very much given that if the pref is changed they won't be rerun, right? Maybe you should install a observer for "html5.opqueue" and always update the prefs and run these adjustments if anything under that changes.


nsHtml5TreeOpExecutor::DidBuildModel

139   if (!aTerminated) {
140     // Break out of update batch if we are in one 
141     // and aren't forcibly terminating
142     EndDocUpdate();

For my own understanding, why should we not exit the current patch if we are forcably terminated? When can we get terminated during a bach, aren't we preventing scripts from running during a batch?

153   // This is comes from nsXMLContentSink

Remove the "is".


180 nsHtml5TreeOpExecutor::WillInterrupt()
181 {
182   return WillInterruptImpl();
183 }

So it sounds like all the WillInterrupt stuff is going away from html5 parsing, is this correct?


nsHtml5TreeOpExecutor::SetDocumentCharset

207   if (mDocShell) {
208     // the following logic to get muCV is copied from
209     // nsHTMLDocument::StartDocumentLoad
210     // We need to call muCV->SetPrevDocCharacterSet here in case
211     // the charset is detected by parser DetectMetaTag
212     nsCOMPtr<nsIMarkupDocumentViewer> muCV;

Please call this variable 'mucv' or some such. As it's named now it looks confusingly similar to an mFoo member variable.


nsHtml5TreeOpExecutor::UpdateStyleSheet

This function should probably be renamed to HandleLinkElement or some such.

277   // Break out of the doc update created by Flush() to zap a runnable 
278   // waiting to call UpdateStyleSheet without the right observer

This comment doesn't make sense to me. Sounds like someone is intentionally waiting to "call UpdateStyleSheet without the right observer".


nsHtml5TreeOpExecutor::Flush

380   if (aForceWholeQueue) {
381     if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) {
382       flushStart = PR_IntervalNow(); // compute averages only if enough ops
383     }    
384   } else {
385     if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) {
386       flushStart = PR_IntervalNow(); // compute averages only if enough ops
387       if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueLengthLimit) {
388         numberOfOpsToFlush = (PRUint32)sTreeOpQueueLengthLimit;
389         reflushNeeded = PR_TRUE;
390       }
391     }
392   }

You can simplify this to:

if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) {
  flushStart = PR_IntervalNow(); // compute averages only if enough ops
  if (!aForceWholeQueue &&
      numberOfOpsToFlush > (PRUint32)sTreeOpQueueLengthLimit) {
    numberOfOpsToFlush = (PRUint32)sTreeOpQueueLengthLimit;
    reflushNeeded = PR_TRUE;
  }
}

Though see below.

Editorial: There is whitespace at the end of line 383 here.

398   for (nsHtml5TreeOperation* iter = (nsHtml5TreeOperation*)start; iter < end; ++iter) {
399     if (NS_UNLIKELY(!mParser)) {
400       // The previous tree op caused a call to nsIParser::Terminate();
401       break;
402     }

How could we end up terminating here? Isn't the idea that we won't be running script while inside the doc-update? Is this due to charset switching or something else?

413   if (flushStart) {
414     PRUint32 delta = PR_IntervalToMilliseconds(PR_IntervalNow() - flushStart);
415     sTreeOpQueueLengthLimit = delta ?
416       (PRUint32)(((PRUint64)sTreeOpQueueMaxTime * (PRUint64)numberOfOpsToFlush)
417                  / delta) :
418       sTreeOpQueueMaxLength; // if the delta is less than one ms, use max
419     if (sTreeOpQueueLengthLimit < sTreeOpQueueMinLength) {
420       // both are signed and sTreeOpQueueMinLength is always positive, so this
421       // also takes care of the theoretical overflow of sTreeOpQueueLengthLimit
422       sTreeOpQueueLengthLimit = sTreeOpQueueMinLength;
423     }
424     if (sTreeOpQueueLengthLimit > sTreeOpQueueMaxLength) {
425       sTreeOpQueueLengthLimit = sTreeOpQueueMaxLength;
426     }
427 #ifdef DEBUG_NS_HTML5_TREE_OP_EXECUTOR_FLUSH
428     printf("FLUSH DURATION (millis): %d\n", delta);
429     printf("QUEUE NEW MAX LENGTH: %d\n", sTreeOpQueueLengthLimit);      
430 #endif
431   }

I'm definitely nervous about changing the way we back off to the main event loop. Specifically things like checking if there are pending user events seems like a good idea but are no longer happening.

While we can't duplicate exactly what the old parser does, since we can't have callbacks for each parsed token, I don't see why we couldn't keep approximately the same strategy. I.e. start measuring time when entering nsHtml5TreeOpExecutor::Flush or some such, then treat iteration through the "perform" loop do what we currently do in nsContentSink::DidProcessATokenImpl. I need to review flushing more in detail to see how well this matches.

I definitely agree that we likely will want to change things there, but I'd rather do that orthogonally to changing the parser so that we don't risk improvements in one patch masking regressions in another. We also don't have a good way to measure responsiveness automatically, so messing around with this stuff means that we need to put in a lot more manual QAing.

441   if (scriptElement) {
442     NS_ASSERTION(!reflushNeeded, "Got scriptElement when queue not fully flushed.");
443     RunScript(scriptElement); // must be tail call when mFlushState is eNotFlushing

What's preventing that assertion from firing? I.e. what is preventing finding a script to execute after we've set reflushNeeded to true.


nsHtml5TreeOpExecutor::ProcessBASETag(nsIContent* aContent)

459   if (mHasProcessedBase) {
460     return NS_OK;
461   }
462   mHasProcessedBase = PR_TRUE;

Hmm.. won't this mean that:

<base target="...">
<base href="...">

won't work? I.e. we'll ignore the second base element here, despite that being the first one with a href. Steps 3 and 4 of http://www.whatwg.org/specs/web-apps/current-work/#document-base-url seems to indicate otherwise. Same if the elements are reversed but for the target attribute.


nsHtml5TreeOpExecutor::DocumentMode

Maybe call this SetDocumentMode or UpdateDocumentMode or some such.
(In reply to comment #7)
> nsHtml5TreeOpExecutor.cpp
> 
> Again there are lots of lines over 80 chars wide. I'd say for now lets leave
> that as it's harder to review patches that contain whitespace cleanup together
> with actual changes. But we should at some point go through and fix up this.
> And please make sure to not use over long lines in new code.

OK. (FWIW, there's also bug 523334.)

> nsHtml5TreeOpExecutor::InitializeStatics
> 
> 101   if (sTreeOpQueueMinLength <= 0) {
> 102     sTreeOpQueueMinLength = 200;
> 103   }
> 104   if (sTreeOpQueueLengthLimit < sTreeOpQueueMinLength) {
> 105     sTreeOpQueueLengthLimit = sTreeOpQueueMinLength;
> 106   }
> 107   if (sTreeOpQueueMaxLength < sTreeOpQueueMinLength) {
> 108     sTreeOpQueueMaxLength = sTreeOpQueueMinLength;
> 109   }
> 110   if (sTreeOpQueueMaxTime <= 0) {
> 111     sTreeOpQueueMaxTime = 200;
> 112   }
> 
> These adjustments won't help very much given that if the pref is changed they
> won't be rerun, right? Maybe you should install a observer for "html5.opqueue"
> and always update the prefs and run these adjustments if anything under that
> changes.

This code goes away in bug 543458.

> nsHtml5TreeOpExecutor::DidBuildModel
> 
> 139   if (!aTerminated) {
> 140     // Break out of update batch if we are in one 
> 141     // and aren't forcibly terminating
> 142     EndDocUpdate();
> 
> For my own understanding, why should we not exit the current patch if we are
> forcably terminated?

This is needed to avoid unblocking loads too many times on one hand and on the other hand to avoid destroying the frame constructor from within an update batch. See bug 537683.

Rewrote the comment to say so.

> When can we get terminated during a bach, aren't we
> preventing scripts from running during a batch?

In the normal EOF case, we have aTerminate set to false and we are in a batch here.

<rant>As for aTerminate set to true, the immediate termination semantics of nsIParser::Terminate() have been a huge pain. At least the HTML5 app cache stuff can decide to terminate the parser mid-flight (I don't remember in what situation exactly). Bug 531106 and bug 537683 have interensting re-entrancy scenarios.</rant>

> 153   // This is comes from nsXMLContentSink
> 
> Remove the "is".

Fixed.

> 180 nsHtml5TreeOpExecutor::WillInterrupt()
> 181 {
> 182   return WillInterruptImpl();
> 183 }
> 
> So it sounds like all the WillInterrupt stuff is going away from html5 parsing,
> is this correct?

Yes. This has become
NS_IMETHODIMP
nsHtml5TreeOpExecutor::WillInterrupt()
{
  NS_NOTREACHED("Don't call. For interface compat only.");
  return NS_ERROR_NOT_IMPLEMENTED;
}
in bug 543458.

> nsHtml5TreeOpExecutor::SetDocumentCharset
> 
> 207   if (mDocShell) {
> 208     // the following logic to get muCV is copied from
> 209     // nsHTMLDocument::StartDocumentLoad
> 210     // We need to call muCV->SetPrevDocCharacterSet here in case
> 211     // the charset is detected by parser DetectMetaTag
> 212     nsCOMPtr<nsIMarkupDocumentViewer> muCV;
> 
> Please call this variable 'mucv' or some such. As it's named now it looks
> confusingly similar to an mFoo member variable.

This comes from nsHTMLDocument::StartDocumentLoad, but OK. Changed.

> nsHtml5TreeOpExecutor::UpdateStyleSheet
> 
> This function should probably be renamed to HandleLinkElement or some such.

UpdateStyleSheet is consistent with the old sinks and the XSLT code.

> 277   // Break out of the doc update created by Flush() to zap a runnable 
> 278   // waiting to call UpdateStyleSheet without the right observer
> 
> This comment doesn't make sense to me. Sounds like someone is intentionally
> waiting to "call UpdateStyleSheet without the right observer".

Indeed, code on the content tree side has questionable behavior that relies on the parser extinguishing a useless runnable here. I'd like to fix the content tree side eventually but with the content tree side the way it is, this hack is needed to avoid parsing <style> contents twice.

> nsHtml5TreeOpExecutor::Flush
> 
> 380   if (aForceWholeQueue) {
> 381     if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) {
> 382       flushStart = PR_IntervalNow(); // compute averages only if enough ops
> 383     }    
> 384   } else {
> 385     if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) {
> 386       flushStart = PR_IntervalNow(); // compute averages only if enough ops
> 387       if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueLengthLimit) {
> 388         numberOfOpsToFlush = (PRUint32)sTreeOpQueueLengthLimit;
> 389         reflushNeeded = PR_TRUE;
> 390       }
> 391     }
> 392   }
> 
> You can simplify this to:
> 
> if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) {
>   flushStart = PR_IntervalNow(); // compute averages only if enough ops
>   if (!aForceWholeQueue &&
>       numberOfOpsToFlush > (PRUint32)sTreeOpQueueLengthLimit) {
>     numberOfOpsToFlush = (PRUint32)sTreeOpQueueLengthLimit;
>     reflushNeeded = PR_TRUE;
>   }
> }
> 
> Though see below.
> 
> Editorial: There is whitespace at the end of line 383 here.

This code went away in bug 543458.

> 398   for (nsHtml5TreeOperation* iter = (nsHtml5TreeOperation*)start; iter <
> end; ++iter) {
> 399     if (NS_UNLIKELY(!mParser)) {
> 400       // The previous tree op caused a call to nsIParser::Terminate();
> 401       break;
> 402     }
> 
> How could we end up terminating here? Isn't the idea that we won't be running
> script while inside the doc-update? Is this due to charset switching or
> something else?

At least the HTML5 app cache can terminate. That said, nsIParser::Terminate() has been a source of bugs that felt endless, so I can't remember what else could call nsIParser::Terminate() here. (I failed to record all the cases I had seen.)

> 413   if (flushStart) {
> 414     PRUint32 delta = PR_IntervalToMilliseconds(PR_IntervalNow() -
> flushStart);
> 415     sTreeOpQueueLengthLimit = delta ?
> 416       (PRUint32)(((PRUint64)sTreeOpQueueMaxTime *
> (PRUint64)numberOfOpsToFlush)
> 417                  / delta) :
> 418       sTreeOpQueueMaxLength; // if the delta is less than one ms, use max
> 419     if (sTreeOpQueueLengthLimit < sTreeOpQueueMinLength) {
> 420       // both are signed and sTreeOpQueueMinLength is always positive, so
> this
> 421       // also takes care of the theoretical overflow of
> sTreeOpQueueLengthLimit
> 422       sTreeOpQueueLengthLimit = sTreeOpQueueMinLength;
> 423     }
> 424     if (sTreeOpQueueLengthLimit > sTreeOpQueueMaxLength) {
> 425       sTreeOpQueueLengthLimit = sTreeOpQueueMaxLength;
> 426     }
> 427 #ifdef DEBUG_NS_HTML5_TREE_OP_EXECUTOR_FLUSH
> 428     printf("FLUSH DURATION (millis): %d\n", delta);
> 429     printf("QUEUE NEW MAX LENGTH: %d\n", sTreeOpQueueLengthLimit);      
> 430 #endif
> 431   }
> 
> I'm definitely nervous about changing the way we back off to the main event
> loop. Specifically things like checking if there are pending user events seems
> like a good idea but are no longer happening.
> 
> While we can't duplicate exactly what the old parser does, since we can't have
> callbacks for each parsed token, I don't see why we couldn't keep approximately
> the same strategy. I.e. start measuring time when entering
> nsHtml5TreeOpExecutor::Flush or some such, then treat iteration through the
> "perform" loop do what we currently do in nsContentSink::DidProcessATokenImpl.
> I need to review flushing more in detail to see how well this matches.
> 
> I definitely agree that we likely will want to change things there, but I'd
> rather do that orthogonally to changing the parser so that we don't risk
> improvements in one patch masking regressions in another. We also don't have a
> good way to measure responsiveness automatically, so messing around with this
> stuff means that we need to put in a lot more manual QAing.

Bug 543458 makes all the above code go away and uses nsContentSink::DidProcessATokenImpl instead.

> 441   if (scriptElement) {
> 442     NS_ASSERTION(!reflushNeeded, "Got scriptElement when queue not fully
> flushed.");
> 443     RunScript(scriptElement); // must be tail call when mFlushState is
> eNotFlushing
> 
> What's preventing that assertion from firing? I.e. what is preventing finding a
> script to execute after we've set reflushNeeded to true.

reflushNeeded goes away in bug 543458.

Sorry about not advertising bug 543458 to you in advance. I thought you'd review the unreviewed Java-based parser core first instead of re-reviewing the hand-written C++ code.

As for reviewing the Java-based code, I'd like to highlight bug 489820 and bug 483209 as having a pending patch (that I'm provisionally treating as rs=sicking). Other than that, P3, P4 and P5 "[HTML5]" bugs document a lot of known things that could benefit from improvement eventually.

> nsHtml5TreeOpExecutor::ProcessBASETag(nsIContent* aContent)
> 
> 459   if (mHasProcessedBase) {
> 460     return NS_OK;
> 461   }
> 462   mHasProcessedBase = PR_TRUE;
> 
> Hmm.. won't this mean that:
> 
> <base target="...">
> <base href="...">
> 
> won't work? I.e. we'll ignore the second base element here, despite that being
> the first one with a href. Steps 3 and 4 of
> http://www.whatwg.org/specs/web-apps/current-work/#document-base-url seems to
> indicate otherwise. Same if the elements are reversed but for the target
> attribute.

In the Java code, I have attempted to conform to HTML5. In the hand-written Gecko integration C++ code, I have been trying to retain compatibility with existing Gecko behavior until the HTML5 parser is on by default. In this case, I have copied the behavior of the XML content sink without trying to change it to be HTML5-compliant.

> nsHtml5TreeOpExecutor::DocumentMode
> 
> Maybe call this SetDocumentMode or UpdateDocumentMode or some such.

Changed to SetDocumentMode.
Attachment #432525 - Flags: review?(jonas)
I realized that it would be nice to put all this code into a mozilla::html5 parser namespace. That should should reduce concerns about long identifiers a bit. Definitely not a requirement though.

Why do we need nsHtml5DocumentMode? nsCompatibility.h has an equivalent enum.

nsHtml5ElementName and nsHtml5AttributeName are very hard to read given the big tables they contain. We really should have those broken out into #include files similarly to how we do atom lists.


nsHtml5MetaScanner.cpp

Have you looked at if this code shows up in profiles at all? In general there are a fair number of optimizations that you could do here. Such as not feeding the scanner asynchronously (which there really is no reason to do as we don't display anything until we're done scanning, right?), and instead implementing this in a much less "switchy" way.

Another thing is to get rid of the the reconsume stuff, it should be faster to rewind the reading iterator one char.

This various members need descriptions. I couldn't find any in the java source either.
681 nsHtml5MetaScanner::tryCharset()
682 {
683   if (metaState != NS_HTML5META_SCANNER_A || !(contentIndex == 6 || charsetIndex == 6)) {
684     return PR_FALSE;
685   }
686   nsString* attVal = nsHtml5Portability::newStringFromBuffer(strBuf, 0, strBufLen);

Ugh, this is really ugly, and scary from a leaks point of view. We need to find a solution that allows us to generate nice C++ code here. I.e. with a normal string object living on the stack.

Don't expect there is anything we can do about it in this bug (probably a bigger project), but wanted to note it so that we don't loose track of it.
(In reply to comment #9)
> I realized that it would be nice to put all this code into a mozilla::html5
> parser namespace. That should should reduce concerns about long identifiers a
> bit. Definitely not a requirement though.

Yeah, the mozilla::* stuff started being common and wanted after the nsHtml5* naming was already in place. This is bug 497820.

> Why do we need nsHtml5DocumentMode? nsCompatibility.h has an equivalent enum.

The Java code has a DocumentMode enum independently of Gecko. I could add more special casing to the translator and make the DocumentMode enum translate directly to nsCompatibility.h if adding complexity to the translator is preferred. (In the big picture, though, this is nothing new: the old parser also had its own enum distinct from the layout enum for this purpose.)

> nsHtml5ElementName and nsHtml5AttributeName are very hard to read given the big
> tables they contain. We really should have those broken out into #include files
> similarly to how we do atom lists.

The main difference is that those big tables are translated from the Java code (and Java doesn't have include files) while the atom list is generated by the C++ translator. I could special-case nsHtml5ElementName and nsHtml5AttributeName to be split across multiple files. Do you consider this a blocker for turning the parser on by default?

> nsHtml5MetaScanner.cpp
> 
> Have you looked at if this code shows up in profiles at all? 

I've noticed meta prescan-related stuff on a profile, but everything that happens on the parser thread on tp4 is below 1% of the total time, so any optimizations to the parser core would be squeezing below 1% time to even more below 1%.

> In general there
> are a fair number of optimizations that you could do here. Such as not feeding
> the scanner asynchronously (which there really is no reason to do as we don't
> display anything until we're done scanning, right?), and instead implementing
> this in a much less "switchy" way.
> 
> Another thing is to get rid of the the reconsume stuff, it should be faster to
> rewind the reading iterator one char.

The meta scanner uses an old approach to implementing the states. I want to change the meta scanner to use the same patterns as the main tokenizer. However, I've prioritized mochitest failures and user-visible bugs. Once the meta scanner uses the same patterns as the tokenizer, fixing bug 482920 would speed up both.

I filed this as bug 552900.

> This various members need descriptions. I couldn't find any in the java source
> either.

OK.

(In reply to comment #10)
> 681 nsHtml5MetaScanner::tryCharset()
> 682 {
> 683   if (metaState != NS_HTML5META_SCANNER_A || !(contentIndex == 6 ||
> charsetIndex == 6)) {
> 684     return PR_FALSE;
> 685   }
> 686   nsString* attVal = nsHtml5Portability::newStringFromBuffer(strBuf, 0,
> strBufLen);
> 
> Ugh, this is really ugly, and scary from a leaks point of view. We need to find
> a solution that allows us to generate nice C++ code here. I.e. with a normal
> string object living on the stack.
> 
> Don't expect there is anything we can do about it in this bug (probably a
> bigger project), but wanted to note it so that we don't loose track of it.

The cases where the HTML5 parser could benefit from on-stack strings are so few that it's probably not worth it. Most strings are attribute values and those need to travel across threads on the heap anyway.

However, it's kinda pointless to have nsString* when nsStringBuffer* would have the desired semantics of java.lang.String. But nsStringBuffer* is supposed to be a hidden impl. detail. :-(
So having looked at the entity code also, I think we shoud switch to having specialized code in the translator that output "macro files", and then hand write cpp files that use those to populate whatever data structure we think is appropriate. This is especially nice for the entity map as we can get rid of the high-lo stuff.

Not necessarily a requirement for turning on by default, I'd say most things in this bug aren't, but something that needs to be done for sure.
(In reply to comment #11)
> (In reply to comment #9)
> > This various members need descriptions. I couldn't find any in the java source
> > either.
> 
> OK.

I landed some more JavaDocs, though in most cases I couldn't think of much to say.

(In reply to comment #9)
> nsHtml5ElementName and nsHtml5AttributeName are very hard to read given the big
> tables they contain. We really should have those broken out into #include files
> similarly to how we do atom lists.

I realized we're repeating bug 487949 comment 42 and bug 487949 comment 49.

(In reply to comment #12)
> So having looked at the entity code also, I think we shoud switch to having
> specialized code in the translator that output "macro files", and then hand
> write cpp files that use those to populate whatever data structure we think is
> appropriate.

I don't really see the benefit of having the generator generate macros and includes that at compile time generate the desired structure. It seems simpler to let the generator generate the desired structure. (Except for the enum trick from bug 501082 that makes compile times faster.)

> This is especially nice for the entity map as we can get rid of
> the high-lo stuff.

I filed bug 555941 about splitting the HILO_ACCEL table into a separate class, but I don't see positive value in having the C preprocessor compute the contents of HILO_ACCEL. Seems like an unnecessary complication to me.
> (In reply to comment #12)
> > So having looked at the entity code also, I think we shoud switch to having
> > specialized code in the translator that output "macro files", and then hand
> > write cpp files that use those to populate whatever data structure we think
> > is appropriate.
> 
> I don't really see the benefit of having the generator generate macros and
> includes that at compile time generate the desired structure. It seems simpler
> to let the generator generate the desired structure. (Except for the enum
> trick from bug 501082 that makes compile times faster.)

The idea is that we can hand write C++ code to use and implement the datastructures that work the best for our code. So for example in C++ it's much more efficient and clean to get rid of the high-lo stuff entirely and simply use an nsTHashtable mapping the full entity string to an entity value. Or, if we want more efficiency, to build a tree structure which allows us to walk one level down the tree for each char consumed, and which at the end gives the entity value that maps to.

I completely agree that simply using preprocessor macros to generate the exact same code we have now doesn't buy us much. The goal here is simpler and more maintainable C++ code.
(In reply to comment #14)
> The idea is that we can hand write C++ code to use and implement the
> datastructures that work the best for our code. So for example in C++ it's much
> more efficient and clean to get rid of the high-lo stuff entirely and simply
> use an nsTHashtable mapping the full entity string to an entity value.

A hashtable alone wouldn't do the right thing, because named characters need a longest prefix match.

> Or, if
> we want more efficiency, to build a tree structure which allows us to walk one
> level down the tree for each char consumed, and which at the end gives the
> entity value that maps to.

The hilo table encodes the first two levels of trie-equivalent lookup more efficiently than a tree structure would. Using linear search for the tail of the names makes the data structure have a smaller overall footprint than a prefix tree would and makes it easier to arrange the entire data structure ahead of run time.

The only part of the hilo solution that is less nice in C++ because of Java's restrictions is that two 16-bit integers are manually shifted into a 32-bit space instead of using a struct of two shorts as a syntactically nicer way to use the same data layout in C++.

I didn't choose the hilo solution only to work around cross-language limitations even though the solution has the nice property of working in both Java and C++ without language-specific classes. I chose it to balance the the performance characteristics of a trie with the compactness and data segment-friendliness of arrays.
I still think we need to change how we're doing element/attribute/entity names, however I'll take that to a separate bug.

More review comments:

What is the purpose of nsHtml5RefPtr? Why not just use nsRefPtr?

We need to get rid of nsHtml5AtomTable.cpp and the "fake" nsIAtoms. It's very confusing that we in part of the code have atom like things that aren't really atoms. If you don't need these things to be unique (which my understanding is that you don't?), then simply using pointers to nsStrings or even nsStringBuffers seems like a cleaner solution. We could use a class similar to nsAttrName that wraps either an nsStringBuffer or an nsIAtom depending on if the low-bit is set. Or some such. This can be done after turning on by default though.

nsHtml5StateSnapshot:
Might be cleaner to just make all those members public and avoid all the getter methods.

Would be nice if getListLength had something about formatting elements in its name (unless you remove it entirely, see previous point).

157     if (!!listOfActiveFormattingElements[i]) {

No need for the !!


nsHtml5StreamParser:
197 nsHtml5StreamParser::~nsHtml5StreamParser()
198 {
199   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
200   mTokenizer->end();
201   mRequest = nsnull;
202   mObserver = nsnull;
203   mUnicodeDecoder = nsnull;
204   mSniffingBuffer = nsnull;
205   mMetaScanner = nsnull;
206   mFirstBuffer = nsnull;
207   mExecutor = nsnull;
208   mTreeBuilder = nsnull;
209   mTokenizer = nsnull;
210   mOwner = nsnull;
211   if (mFlushTimer) {
212     mFlushTimer->Cancel();
213     mFlushTimer = nsnull;
214   }
215 }

Is there a reason to null all these things out?

238 nsHtml5StreamParser::SetupDecodingAndWriteSniffingBufferAndCurrentSegment(const PRUint8* aFromSegment, // can be null
239                                                                           PRUint32 aCount,
240                                                                           PRUint32* aWriteCount)
241 {
242   NS_ASSERTION(IsParserThread(), "Wrong thread!");
243   nsresult rv = NS_OK;
244   nsCOMPtr<nsICharsetConverterManager> convManager = do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv);
245   NS_ENSURE_SUCCESS(rv, rv);
246   rv = convManager->GetUnicodeDecoder(mCharset.get(), getter_AddRefs(mUnicodeDecoder));

I don't think that the charset converter manager is threadsafe yet. Nor are some of the decoders I looked at. I suspect the simplest fix is to make them so, it shouldn't be too hard.

This we need to fix before turning it on by default as to avoid random crashes.

248     mCharset.Assign("windows-1252"); // lower case is the raw form

Use AssignLiteral (This might apply elsewhere too)

979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch()
980 {
981   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
982   mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
983   Uninterrupt();
984   nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this);
985   if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
986     NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation");
987   }
988 }

Is there a reason to uninterrupt from the main thread? Why not do it from the parser thread as to avoid blocking the main thread.

nsHtml5StreamParser::ContinueAfterScripts

864     if (mSpeculations.IsEmpty()) {
865       // Not quite sure how exactly this happens...
866       // Maybe an artifact of defer scripts?
867       return;
868     }

Sounds like an assertion would be good.

902     mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
903     #ifdef DEBUG
904       nsCOMPtr<nsIThread> mainThread;
905       NS_GetMainThread(getter_AddRefs(mainThread));
906       mAtomTable.SetPermittedLookupThread(mainThread);
907     #endif

Please don't indent unless there is an actual block. Might actually be good to add a block to avoid mainThread existing only in debug builds.
(In reply to comment #16)
> What is the purpose of nsHtml5RefPtr? Why not just use nsRefPtr?

nsHtml5RefPtr calls Release on the main thread.  The only way to achieve the same thing without modifying nsRefPtr would be to make sure the refcounted type overrides Release and does the main-thread dispatching itself, which is inconvenient.

What really needs to happen here is to modify nsRefPtr to accept a template parameter that controls the Release behavior.  I have a patch to do just that, and also get rid of the duplicated code in parser/html/nsHtml5RefPtr.h.
You're also using the charset detector off the main thread, which I suspect isn't safe. In fact, since you are creating it on the main thread, and then calling it from the parser thread, I'm surprised this doesn't assert as it doesn't seem to be implemented using threadsafe refcounting macros.

This is also something we need to fix before we can turn on by default. Do you want me to file a separate bug on this charset threadsafety stuff?
nsHtml5StreamParser::SniffStreamBytes

340   for (PRUint32 i = 0; i < aCount; i++) {
341     switch (mBomState) {
...
390       case SEEN_UTF_8_SECOND_BYTE:
391         if (aFromSegment[i] == 0xBF) {
392           rv = SetupDecodingFromBom("UTF-8", "UTF-8"); // upper case is the raw form
393           NS_ENSURE_SUCCESS(rv, rv);
394           PRUint32 count = aCount - (i + 1);
395           rv = WriteStreamBytes(aFromSegment + (i + 1), count, &writeCount);
396           NS_ENSURE_SUCCESS(rv, rv);
397           *aWriteCount = writeCount + (i + 1);
398           return rv;
399         }
400         mBomState = BOM_SNIFFING_OVER;
401         break;
402       default:
403         goto bom_loop_end;
404     }
405   }
406   // if we get here, there either was no BOM or the BOM sniffing isn't complete yet
407   bom_loop_end:

Change the loop-condition to be |i < aCount && mBomState != BOM_SNIFFING_OVER| and change the default condition to set |mBomState = BOM_SNIFFING_OVER| and get rid of the goto? http://xkcd.com/292/ :)

418     if (mUnicodeDecoder) {
419       mUnicodeDecoder->SetInputErrorBehavior(nsIUnicodeDecoder::kOnError_Recover);

It seems like the responsibility for this call varies with where we instantiate mUnicodeDecoder. Would be nice to make it consistent and always make it the responsibility of whoever created the decoder. Or do we need to set it to different ™

nsHtml5StreamParser::WriteStreamBytes

475     if (NS_FAILED(convResult)) {

Could you add a comment describing what we do on failure. Took a while for me to figure out.

476       if (totalByteCount < aCount) { // mimicking nsScanner even though this seems wrong
477         ++totalByteCount;
478         ++aFromSegment;

Could you assert here? I agree this seems wrong.

479       }
480       mLastBuffer->getBuffer()[end] = 0xFFFD;
481       ++end;
482       mLastBuffer->setEnd(end);
483       if (end == NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE) {
484           mLastBuffer = (mLastBuffer->next = new nsHtml5UTF16Buffer(NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE));

No need for the parenthesis. a = b = c; is ok.

485       }
486       mUnicodeDecoder->Reset();
487       if (totalByteCount == aCount) {
488         *aWriteCount = totalByteCount;
489         return NS_OK;
490       }
491     } else if (convResult == NS_PARTIAL_MORE_OUTPUT) {
492       mLastBuffer = (mLastBuffer->next = new nsHtml5UTF16Buffer(NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE));
493       NS_ASSERTION(totalByteCount < aCount, "The Unicode decoder has consumed too many bytes.");

Same here


nsHtml5StreamParser::internalEncodingDeclaration

The name of this function seems very strange to me. Also, it's not documented anywhere. Is this called any time we find a meta-charset? If so, would calling it something like MetaCharsetFound make sense?

700   NS_ASSERTION(IsParserThread(), "Wrong thread!");
701   if (mCharsetSource >= kCharsetFromMetaTag) { // this threshold corresponds to "confident" in the HTML5 spec
702     return;
703   }
704 
705   // The encodings are different.

This comment threw me off for a while. It doesn't appear to be correct, we haven't done any equality checks at that point, have we. Shouldn't it be moved until after the calias->Equals(newEncoding, mCharset, &eq); call?

739   // we still want to reparse
740   mTreeBuilder->NeedsCharsetSwitchTo(preferred);
741   mTreeBuilder->Flush();

The 'still' here threw me off a little bit too. Isn't this the first point where we've determined that we want to reparse?

nsHtml5StreamParser::ParseAvailableData

762     if (!mFirstBuffer->hasMore()) {
763       if (mFirstBuffer == mLastBuffer) {
764         switch (mStreamState) {
765           case STREAM_BEING_READ:
766             // never release the last buffer.
767             if (!mSpeculating) {
768               // reuse buffer space if not speculating
769               mFirstBuffer->setStart(0);
770               mFirstBuffer->setEnd(0);
771             }
772             mTreeBuilder->FlushLoads();
773             // Dispatch this runnable unconditionally, because the loads
774             // that need flushing may have been flushed earlier even if the
775             // flush right above here did nothing.
776             if (NS_FAILED(NS_DispatchToMainThread(mLoadFlusher))) {
777               NS_WARNING("failed to dispatch load flush event");
778             }
779             return; // no more data for now but expecting more
780           case STREAM_ENDED:
781             if (mAtEOF) {
782                 return;
783             }
784             mAtEOF = PR_TRUE;
785             mTokenizer->eof();
786             mTreeBuilder->StreamEnded();
787             mTreeBuilder->Flush();
788             if (NS_FAILED(NS_DispatchToMainThread(mExecutorFlusher))) {
789               NS_WARNING("failed to dispatch executor flush event");
790             }
791             return; // no more data and not expecting more
792           default:
793             NS_NOTREACHED("It should be impossible to reach this.");
794             return;
795         }
796       } else {
797         mFirstBuffer = mFirstBuffer->next;
798         continue;
799       }
800     }

The else on line 796 is actually an else-after-return. Would probably be clearer to change those 'return's into 'break's and put a return after the switch statement and remove the else. The optimizer should turn it into the same flow anyway.


The timer stuff is very strange. It seems like it keeps refiring on the main event loop, weather we've parsed anything or not, until we happen to be speculating at the time when the timer fires, or we finish parsing.

(This means that the "just in case" comments at the various points when we cancel and restart the timer isn't really true, as we have no idea if the timer is currently pending or not.)

It seems to me much better to have the timer running on the parser thread first of all. Also, to save batteries, only start the timer if we have pending things that need flushing. I.e. when we reach the end of a network buffer without getting blocked by a script (and maybe when we get unblocked with a successful speculation, or do we always flush what we have parsed in that case?)

I think I'd like to have this timer situation solved before turning the parser on by default.
(In reply to comment #9)
> Another thing is to get rid of the the reconsume stuff, it should be faster to
> rewind the reading iterator one char.

I refreshed my memory. The reason why the reconsume pattern exists in the tokenizer is that in the error reporting mode (that the validator uses but Gecko doesn't), reading a character isn't idompotent, because reading a character can have the side effect of emitting an error. In the case of the meta scanner, the data source abstraction currently in use doesn't provide a way to rewind.

In both cases, the reconsume pattern will go away in C++ on bug 482920 is fixed. I hope this isn't considered a blocker to turning the parser on by default.

(In reply to comment #16)
> I still think we need to change how we're doing element/attribute/entity names,
> however I'll take that to a separate bug.

FWIW, I wanted to put element and attribute names in the data segment of the shared lib but couldn't because static atoms aren't in the data segment and, therefore, don't have memory addresses at link time.

> More review comments:
> 
> What is the purpose of nsHtml5RefPtr? Why not just use nsRefPtr?

I see that bnewman answered this question already.

> We need to get rid of nsHtml5AtomTable.cpp and the "fake" nsIAtoms.

Do you mean that we just need to make the fake nsIAtoms not implement nsIAtom or do you want to get rid of the nsHtml5AtomTable mechanism more generally? (I'm OK with the former but want to avoid the latter.)

> It's very
> confusing that we in part of the code have atom like things that aren't really
> atoms. If you don't need these things to be unique (which my understanding is
> that you don't?), 

Even dynamically allocated local names need to be "unique" as far as an nsHtml5Tokenizer instance or an nsHtml5TreeBuilder instance can tell: to deal with duplicate attribute names and to make end tag names match start tag names.

> then simply using pointers to nsStrings or even
> nsStringBuffers seems like a cleaner solution. We could use a class similar to
> nsAttrName that wraps either an nsStringBuffer or an nsIAtom depending on if
> the low-bit is set. Or some such. This can be done after turning on by default
> though.

This is bug 539433.

> nsHtml5StateSnapshot:
> Might be cleaner to just make all those members public and avoid all the getter
> methods.

The getters are there so that the getters can be part of a Java interface / nsAHtml5TreeBuilderState that both the state snapshot and the tree builder implement.

> Would be nice if getListLength had something about formatting elements in its
> name (unless you remove it entirely, see previous point).

I'll address this in a follow-up patch.

> 157     if (!!listOfActiveFormattingElements[i]) {
> 
> No need for the !!

That's generated code. The translator makes the C++ code avoid !== and == expressions where one of the operands is zero. However, it doesn't check if the resulting expression is used in a context where the potentially resulting !! is useless. The translator uses the visitor pattern, so the visits for the !== and == expressions don't know about the context of the expression.

In order to avoid spending excessive time on making the translator more elegant than it needs to be to work, I haven't tried to get rid of useless but harmless !!. (There's even |if (!!this)| in one generated file...)

> nsHtml5StreamParser:
> 197 nsHtml5StreamParser::~nsHtml5StreamParser()
> 198 {
> 199   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
> 200   mTokenizer->end();
> 201   mRequest = nsnull;
> 202   mObserver = nsnull;
> 203   mUnicodeDecoder = nsnull;
> 204   mSniffingBuffer = nsnull;
> 205   mMetaScanner = nsnull;
> 206   mFirstBuffer = nsnull;
> 207   mExecutor = nsnull;
> 208   mTreeBuilder = nsnull;
> 209   mTokenizer = nsnull;
> 210   mOwner = nsnull;
> 211   if (mFlushTimer) {
> 212     mFlushTimer->Cancel();
> 213     mFlushTimer = nsnull;
> 214   }
> 215 }
> 
> Is there a reason to null all these things out?

Mainly to distinguish an already-deleted object more easily in a debugger. Shall I make these #ifdef DEBUG?

> 238
> nsHtml5StreamParser::SetupDecodingAndWriteSniffingBufferAndCurrentSegment(const
> PRUint8* aFromSegment, // can be null
> 239                                                                          
> PRUint32 aCount,
> 240                                                                          
> PRUint32* aWriteCount)
> 241 {
> 242   NS_ASSERTION(IsParserThread(), "Wrong thread!");
> 243   nsresult rv = NS_OK;
> 244   nsCOMPtr<nsICharsetConverterManager> convManager =
> do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv);
> 245   NS_ENSURE_SUCCESS(rv, rv);
> 246   rv = convManager->GetUnicodeDecoder(mCharset.get(),
> getter_AddRefs(mUnicodeDecoder));
> 
> I don't think that the charset converter manager is threadsafe yet. 

I believe I made the part of the charset converter manager that the HTML5 parser uses thread-safe in the off-the-main-thread parser landing.
https://hg.mozilla.org/mozilla-central/rev/7cda86954b4c

> Nor are
> some of the decoders I looked at. I suspect the simplest fix is to make them
> so, it shouldn't be too hard.

In the changeset referenced above, I made the instantiation of nsOneByteDecoderSupport objects protected by a mutex. Once a given one-byte converter has loaded its tables, it has no mutable state and is, therefore, thread-safe. When I looked at the multibyte decoders, it seemed to me that each instance was safe for using from a single thread and there was no issue with loading data tables because the data tables were baked into code ahead of compile time.

> This we need to fix before turning it on by default as to avoid random crashes.

Are there particular decoders that I have missed?

> 248     mCharset.Assign("windows-1252"); // lower case is the raw form
> 
> Use AssignLiteral (This might apply elsewhere too)

Fixed. (Both occurrences.)

> 979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch()
> 980 {
> 981   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
> 982   mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
> 983   Uninterrupt();
> 984   nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this);
> 985   if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
> 986     NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation");
> 987   }
> 988 }
> 
> Is there a reason to uninterrupt from the main thread? Why not do it from the
> parser thread as to avoid blocking the main thread.

When calling Uninterrupt(), the main thread already owns mTokenizerMutex and Uninterrupt() doesn't acquire an additional mutex, so no additional blocking occurs. (See the comment in the implementation of Uninterrupt().)

> nsHtml5StreamParser::ContinueAfterScripts
> 
> 864     if (mSpeculations.IsEmpty()) {
> 865       // Not quite sure how exactly this happens...
> 866       // Maybe an artifact of defer scripts?
> 867       return;
> 868     }
> 
> Sounds like an assertion would be good.

I put an NS_WARNING here.

> 902     mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
> 903     #ifdef DEBUG
> 904       nsCOMPtr<nsIThread> mainThread;
> 905       NS_GetMainThread(getter_AddRefs(mainThread));
> 906       mAtomTable.SetPermittedLookupThread(mainThread);
> 907     #endif
> 
> Please don't indent unless there is an actual block. Might actually be good to
> add a block to avoid mainThread existing only in debug builds.

Added a block.

(In reply to comment #18)
> You're also using the charset detector off the main thread, which I suspect
> isn't safe. In fact, since you are creating it on the main thread, and then
> calling it from the parser thread, I'm surprised this doesn't assert as it
> doesn't seem to be implemented using threadsafe refcounting macros.
>
> This is also something we need to fix before we can turn on by default. Do you
> want me to file a separate bug on this charset threadsafety stuff?

Yes, a separate bug would be good to have if there really is a thread-safety problem left. As far as I can tell, there is no problem: After bug , the detector is refcounted only from the main thread. The sniffing is run only from the parser thread. The sniffing seems to affect only the internal state of the object. The data tables shared by the instances are baked into code ahead of compile time.

(In reply to comment #19)
> nsHtml5StreamParser::SniffStreamBytes
> 
> 340   for (PRUint32 i = 0; i < aCount; i++) {
> 341     switch (mBomState) {
> ...
> 390       case SEEN_UTF_8_SECOND_BYTE:
> 391         if (aFromSegment[i] == 0xBF) {
> 392           rv = SetupDecodingFromBom("UTF-8", "UTF-8"); // upper case is the
> raw form
> 393           NS_ENSURE_SUCCESS(rv, rv);
> 394           PRUint32 count = aCount - (i + 1);
> 395           rv = WriteStreamBytes(aFromSegment + (i + 1), count,
> &writeCount);
> 396           NS_ENSURE_SUCCESS(rv, rv);
> 397           *aWriteCount = writeCount + (i + 1);
> 398           return rv;
> 399         }
> 400         mBomState = BOM_SNIFFING_OVER;
> 401         break;
> 402       default:
> 403         goto bom_loop_end;
> 404     }
> 405   }
> 406   // if we get here, there either was no BOM or the BOM sniffing isn't
> complete yet
> 407   bom_loop_end:
> 
> Change the loop-condition to be |i < aCount && mBomState != BOM_SNIFFING_OVER|
> and change the default condition to set |mBomState = BOM_SNIFFING_OVER| and get
> rid of the goto? http://xkcd.com/292/ :)

Changed. I wish C++ had break with label.

> 418     if (mUnicodeDecoder) {
> 419      
> mUnicodeDecoder->SetInputErrorBehavior(nsIUnicodeDecoder::kOnError_Recover);
> 
> It seems like the responsibility for this call varies with where we instantiate
> mUnicodeDecoder. Would be nice to make it consistent and always make it the
> responsibility of whoever created the decoder. Or do we need to set it to
> different ™

Good catch. Fixed.

> nsHtml5StreamParser::WriteStreamBytes
> 
> 475     if (NS_FAILED(convResult)) {
> 
> Could you add a comment describing what we do on failure. Took a while for me
> to figure out.

Added comment.

> 476       if (totalByteCount < aCount) { // mimicking nsScanner even though
> this seems wrong
> 477         ++totalByteCount;
> 478         ++aFromSegment;
> 
> Could you assert here? I agree this seems wrong.

I added an NS_NOTREACHED in an else branch instead of merely asserting to avoid a buffer overrun if a decoder misbehaves.

FWIW, the docs at http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/intl/uconv/public/nsIUnicodeDecoder.h#123 are not good enough.

> 479       }
> 480       mLastBuffer->getBuffer()[end] = 0xFFFD;
> 481       ++end;
> 482       mLastBuffer->setEnd(end);
> 483       if (end == NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE) {
> 484           mLastBuffer = (mLastBuffer->next = new
> nsHtml5UTF16Buffer(NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE));
> 
> No need for the parenthesis. a = b = c; is ok.

Changed.

> 485       }
> 486       mUnicodeDecoder->Reset();
> 487       if (totalByteCount == aCount) {
> 488         *aWriteCount = totalByteCount;
> 489         return NS_OK;
> 490       }
> 491     } else if (convResult == NS_PARTIAL_MORE_OUTPUT) {
> 492       mLastBuffer = (mLastBuffer->next = new
> nsHtml5UTF16Buffer(NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE));
> 493       NS_ASSERTION(totalByteCount < aCount, "The Unicode decoder has
> consumed too many bytes.");
> 
> Same here

Changed.

> nsHtml5StreamParser::internalEncodingDeclaration
> 
> The name of this function seems very strange to me. Also, it's not documented
> anywhere. 

The documentation is over here:
http://hg.mozilla.org/projects/htmlparser/file/7df2f945c5b7/src/nu/validator/htmlparser/common/EncodingDeclarationHandler.java#l36

> Is this called any time we find a meta-charset?

Yes, it is.

> If so, would calling it something like MetaCharsetFound make sense?

"Meta charset" is not HTML5 spec terminology. "[character] encoding declaration" is spec terminology. That's why the method is named the way it is named. In general, I have tried to stick to spec terminology when naming things even when the spec terminology deviates from colloquial usage. (Consider "entities" vs. "named character refernces".)
 
> 700   NS_ASSERTION(IsParserThread(), "Wrong thread!");
> 701   if (mCharsetSource >= kCharsetFromMetaTag) { // this threshold
> corresponds to "confident" in the HTML5 spec
> 702     return;
> 703   }
> 704 
> 705   // The encodings are different.
> 
> This comment threw me off for a while. It doesn't appear to be correct, we
> haven't done any equality checks at that point, have we. Shouldn't it be moved
> until after the calias->Equals(newEncoding, mCharset, &eq); call?

Removed the comment.

> 739   // we still want to reparse
> 740   mTreeBuilder->NeedsCharsetSwitchTo(preferred);
> 741   mTreeBuilder->Flush();
> 
> The 'still' here threw me off a little bit too. Isn't this the first point
> where we've determined that we want to reparse?

My thinking here was that we presumptively want to reparse but decide not to if the new encoding is an alias for the current encoding. Removed the comment.

> nsHtml5StreamParser::ParseAvailableData
> 
> 762     if (!mFirstBuffer->hasMore()) {
> 763       if (mFirstBuffer == mLastBuffer) {
> 764         switch (mStreamState) {
> 765           case STREAM_BEING_READ:
> 766             // never release the last buffer.
> 767             if (!mSpeculating) {
> 768               // reuse buffer space if not speculating
> 769               mFirstBuffer->setStart(0);
> 770               mFirstBuffer->setEnd(0);
> 771             }
> 772             mTreeBuilder->FlushLoads();
> 773             // Dispatch this runnable unconditionally, because the loads
> 774             // that need flushing may have been flushed earlier even if the
> 775             // flush right above here did nothing.
> 776             if (NS_FAILED(NS_DispatchToMainThread(mLoadFlusher))) {
> 777               NS_WARNING("failed to dispatch load flush event");
> 778             }
> 779             return; // no more data for now but expecting more
> 780           case STREAM_ENDED:
> 781             if (mAtEOF) {
> 782                 return;
> 783             }
> 784             mAtEOF = PR_TRUE;
> 785             mTokenizer->eof();
> 786             mTreeBuilder->StreamEnded();
> 787             mTreeBuilder->Flush();
> 788             if (NS_FAILED(NS_DispatchToMainThread(mExecutorFlusher))) {
> 789               NS_WARNING("failed to dispatch executor flush event");
> 790             }
> 791             return; // no more data and not expecting more
> 792           default:
> 793             NS_NOTREACHED("It should be impossible to reach this.");
> 794             return;
> 795         }
> 796       } else {
> 797         mFirstBuffer = mFirstBuffer->next;
> 798         continue;
> 799       }
> 800     }
> 
> The else on line 796 is actually an else-after-return. Would probably be
> clearer to change those 'return's into 'break's and put a return after the
> switch statement and remove the else. The optimizer should turn it into the
> same flow anyway.

Removed the else. I tried changing the returns into breaks, but when I looked at the result, it looked less obvious than just returning instead of breaking first.

> The timer stuff is very strange. It seems like it keeps refiring on the main
> event loop, weather we've parsed anything or not, until we happen to be
> speculating at the time when the timer fires, or we finish parsing.

The timer is supposed to keep refiring whenever the parser is not speculating. When the parser is not speculating, it's useful to move data to the main thread every now and then even when there's no </script> in sight.

> (This means that the "just in case" comments at the various points when we
> cancel and restart the timer isn't really true, as we have no idea if the timer
> is currently pending or not.)

Removed the comments.

> It seems to me much better to have the timer running on the parser thread first
> of all.

I put the timer on the main thread, because I couldn't find a timer implementation that'd fire on a non-main thread. Also, initially I wanted also to check if the document is in the frontmost tab when the timer fires, but I gave up on that idea.

> Also, to save batteries, only start the timer if we have pending things
> that need flushing. I.e. when we reach the end of a network buffer without
> getting blocked by a script 

For practical purposes, this is approximated by sTimerStartDelay being longer than the subsequent delays.

> (and maybe when we get unblocked with a successful
> speculation, or do we always flush what we have parsed in that case?)

The parser always flushes when a speculation succeeds. The timer is strictly about non-speculative parsing.

> I think I'd like to have this timer situation solved before turning the parser
> on by default.

That seems problematic if we don't already have off-the-main-thread timer infrastructure in place.
Attachment #437849 - Flags: review?(jonas)
> After bug , the

I forgot to paste the bug number. I meant bug 537557.
Depends on: 558775
(In reply to comment #20)
> Created an attachment (id=437849) [details]
> Address review comments in nsHtml5StreamParser.cpp
> 
> (In reply to comment #9)
> > Another thing is to get rid of the the reconsume stuff, it should be faster to
> > rewind the reading iterator one char.
> 
> I refreshed my memory. The reason why the reconsume pattern exists in the
> tokenizer is that in the error reporting mode (that the validator uses but
> Gecko doesn't), reading a character isn't idompotent, because reading a
> character can have the side effect of emitting an error. In the case of the
> meta scanner, the data source abstraction currently in use doesn't provide a
> way to rewind.
> 
> In both cases, the reconsume pattern will go away in C++ on bug 482920 is
> fixed. I hope this isn't considered a blocker to turning the parser on by
> default.

Definitely not a blocker for turning on.

> (In reply to comment #16)
> > I still think we need to change how we're doing element/attribute/entity names,
> > however I'll take that to a separate bug.
> 
> FWIW, I wanted to put element and attribute names in the data segment of the
> shared lib but couldn't because static atoms aren't in the data segment and,
> therefore, don't have memory addresses at link time.

What we've done in the past is to create static pointers to pointers to atoms. I.e. things like &nsGkAtoms::foo. Dunno if that will work here. In any case, I'll file a separate bug on this stuff.

> > We need to get rid of nsHtml5AtomTable.cpp and the "fake" nsIAtoms.
> 
> Do you mean that we just need to make the fake nsIAtoms not implement nsIAtom
> or do you want to get rid of the nsHtml5AtomTable mechanism more generally?
> (I'm OK with the former but want to avoid the latter.)

The former.

> > It's very
> > confusing that we in part of the code have atom like things that aren't really
> > atoms. If you don't need these things to be unique (which my understanding is
> > that you don't?), 
> 
> Even dynamically allocated local names need to be "unique" as far as an
> nsHtml5Tokenizer instance or an nsHtml5TreeBuilder instance can tell: to deal
> with duplicate attribute names and to make end tag names match start tag names.

Ah, didn't realize this. Good to know.

> > then simply using pointers to nsStrings or even
> > nsStringBuffers seems like a cleaner solution. We could use a class similar to
> > nsAttrName that wraps either an nsStringBuffer or an nsIAtom depending on if
> > the low-bit is set. Or some such. This can be done after turning on by default
> > though.
> 
> This is bug 539433.

Awesome, thanks.

> > nsHtml5StreamParser:
> > 197 nsHtml5StreamParser::~nsHtml5StreamParser()
> > 198 {
> > 199   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
> > 200   mTokenizer->end();
> > 201   mRequest = nsnull;
> > 202   mObserver = nsnull;
> > 203   mUnicodeDecoder = nsnull;
> > 204   mSniffingBuffer = nsnull;
> > 205   mMetaScanner = nsnull;
> > 206   mFirstBuffer = nsnull;
> > 207   mExecutor = nsnull;
> > 208   mTreeBuilder = nsnull;
> > 209   mTokenizer = nsnull;
> > 210   mOwner = nsnull;
> > 211   if (mFlushTimer) {
> > 212     mFlushTimer->Cancel();
> > 213     mFlushTimer = nsnull;
> > 214   }
> > 215 }
> > 
> > Is there a reason to null all these things out?
> 
> Mainly to distinguish an already-deleted object more easily in a debugger.
> Shall I make these #ifdef DEBUG?

Please do.

> > 238
> > nsHtml5StreamParser::SetupDecodingAndWriteSniffingBufferAndCurrentSegment(const
> > PRUint8* aFromSegment, // can be null
> > 239                                                                          
> > PRUint32 aCount,
> > 240                                                                          
> > PRUint32* aWriteCount)
> > 241 {
> > 242   NS_ASSERTION(IsParserThread(), "Wrong thread!");
> > 243   nsresult rv = NS_OK;
> > 244   nsCOMPtr<nsICharsetConverterManager> convManager =
> > do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv);
> > 245   NS_ENSURE_SUCCESS(rv, rv);
> > 246   rv = convManager->GetUnicodeDecoder(mCharset.get(),
> > getter_AddRefs(mUnicodeDecoder));
> > 
> > I don't think that the charset converter manager is threadsafe yet. 
> 
> I believe I made the part of the charset converter manager that the HTML5
> parser uses thread-safe in the off-the-main-thread parser landing.
> https://hg.mozilla.org/mozilla-central/rev/7cda86954b4c

Yay!! Awesome. I would have made the one-byte table just initialize in the ctor to avoid the mutex, but that's a separate issue.

Didn't know you had done this, so things look good on the charset converter side!


> > 979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch()
> > 980 {
> > 981   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
> > 982   mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
> > 983   Uninterrupt();
> > 984   nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this);
> > 985   if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
> > 986     NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation");
> > 987   }
> > 988 }
> > 
> > Is there a reason to uninterrupt from the main thread? Why not do it from the
> > parser thread as to avoid blocking the main thread.
> 
> When calling Uninterrupt(), the main thread already owns mTokenizerMutex and
> Uninterrupt() doesn't acquire an additional mutex, so no additional blocking
> occurs.

This doesn't appear to be the case. You're here explicitly grabbing the lock in order to be able to safely call Uninterrupt. If you instead do that while on the parser thread, there is no need for the main thread to block.

> > nsHtml5StreamParser::ContinueAfterScripts
> > 
> > 864     if (mSpeculations.IsEmpty()) {
> > 865       // Not quite sure how exactly this happens...
> > 866       // Maybe an artifact of defer scripts?
> > 867       return;
> > 868     }
> > 
> > Sounds like an assertion would be good.
> 
> I put an NS_WARNING here.

Why not NS_ASSERTION?

> (In reply to comment #18)
> > You're also using the charset detector off the main thread, which I suspect
> > isn't safe. In fact, since you are creating it on the main thread, and then
> > calling it from the parser thread, I'm surprised this doesn't assert as it
> > doesn't seem to be implemented using threadsafe refcounting macros.
> >
> > This is also something we need to fix before we can turn on by default. Do you
> > want me to file a separate bug on this charset threadsafety stuff?
> 
> Yes, a separate bug would be good to have if there really is a thread-safety
> problem left. As far as I can tell, there is no problem: After bug , the
> detector is refcounted only from the main thread. The sniffing is run only from
> the parser thread. The sniffing seems to affect only the internal state of the
> object. The data tables shared by the instances are baked into code ahead of
> compile time.

I'll file a separate bug. If we're lucky (which it sounds like based on your description) all we need to do is to move the creation of the detector to the thread where its used.


> > 476       if (totalByteCount < aCount) { // mimicking nsScanner even though
> > this seems wrong
> > 477         ++totalByteCount;
> > 478         ++aFromSegment;
> > 
> > Could you assert here? I agree this seems wrong.
> 
> I added an NS_NOTREACHED in an else branch instead of merely asserting to avoid
> a buffer overrun if a decoder misbehaves.
> 
> FWIW, the docs at
> http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/intl/uconv/public/nsIUnicodeDecoder.h#123
> are not good enough.

Would be great if you could fix them.

> > If so, would calling it something like MetaCharsetFound make sense?
> 
> "Meta charset" is not HTML5 spec terminology. "[character] encoding
> declaration" is spec terminology. That's why the method is named the way it is
> named. In general, I have tried to stick to spec terminology when naming things
> even when the spec terminology deviates from colloquial usage. (Consider
> "entities" vs. "named character refernces".)

How about EncodingDeclarationFound? Function names that are simple nouns sound like getters to me.


> > It seems to me much better to have the timer running on the parser thread first
> > of all.
> 
> I put the timer on the main thread, because I couldn't find a timer
> implementation that'd fire on a non-main thread. Also, initially I wanted also
> to check if the document is in the frontmost tab when the timer fires, but I
> gave up on that idea.

Timers by default fire on the thread that created them. So if you simply use the parser thread to create the timer, then the timer will fire using the parser thread. So you use the same API on any thread, as you are on the main thread.

Worst case you can simply set the 'target' property on the nsITimer if for some reason you really can't create the timer using the parser thread.

> > Also, to save batteries, only start the timer if we have pending things
> > that need flushing. I.e. when we reach the end of a network buffer without
> > getting blocked by a script 
> 
> For practical purposes, this is approximated by sTimerStartDelay being longer
> than the subsequent delays.

I don't see how this approximates what we want. The point is that it's unnecessary to have the timer repeatedly fire if no network data is coming in. Once the timer has fired and we flush, there is no reason to restart the timer until we again have pending unflushed tree ops.
Tokenizer comments

nsHtml5Tokenizer::setContentModelFlag

111 nsHtml5Tokenizer::setContentModelFlag(PRInt32 contentModelFlag, nsIAtom* contentModelElement)

Please use an enum for the 'contentModelFlag' argument, that way you can also scope it to the tokenizer and given it shorter names. Also, is "tokenizerState" or "stateFlag" a better name? I take it this what the spec refers to as the 'state' of the tokenizer?


nsHtml5Tokenizer::contentModelElementToArray

175     default: {
176 
177       return;
178     }

Shouldn't you clear contentModelElementNameAsArray here? Or if it should never reach the default, please assert.


193 
194 void 
195 nsHtml5Tokenizer::clearStrBufAndAppendCurrentC(PRUnichar c)
196 {
197   strBuf[0] = c;
198   strBufLen = 1;
199 }
200 
201 void 
202 nsHtml5Tokenizer::clearStrBufAndAppendForceWrite(PRUnichar c)
203 {
204   strBuf[0] = c;
205   strBufLen = 1;
206 }

Why two functions tht do the same thing?

208 void 
209 nsHtml5Tokenizer::clearStrBufForNextState()
210 {
211   strBufLen = 0;
212 }

Why the 'ForNextState' part?


nsHtml5Tokenizer::emitStrBuf
241   if (strBufLen > 0) {
242     tokenHandler->characters(strBuf, 0, strBufLen);
243   }

Is the tokenHandler ever anything other than the treebuilder? If not, we should rename the member to 'treeBuilder'.


246 void 
247 nsHtml5Tokenizer::clearLongStrBufForNextState()
248 {
249   longStrBufLen = 0;
250 }
251 
252 void 
253 nsHtml5Tokenizer::clearLongStrBuf()
254 {
255   longStrBufLen = 0;
256 }
257 
258 void 
259 nsHtml5Tokenizer::clearLongStrBufAndAppendCurrentC(PRUnichar c)
260 {
261   longStrBuf[0] = c;
262   longStrBufLen = 1;
263 }
264 
265 void 
266 nsHtml5Tokenizer::clearLongStrBufAndAppendToComment(PRUnichar c)
267 {
268   longStrBuf[0] = c;
269   longStrBufLen = 1;
270 }

Lots of duplication here, is there really a need for that?

strBuf and longStrBuf is somewhat poor naming. It's unclear when to use which. Maybe identifierBuf and charBuf is better? Though why do we need both? Is the idea that we'll pass out the long buffer instead of copying it out? That doesn't appear to be happening currently.


285 nsHtml5Tokenizer::appendSecondHyphenToBogusComment()
286 {
287   appendLongStrBuf('-');
288 }
289 
290 void 
291 nsHtml5Tokenizer::adjustDoubleHyphenAndAppendToLongStrBufAndErr(PRUnichar c)
292 {
293 
294   appendLongStrBuf(c);
295 }

These functions seem excessive.

nsHtml5Tokenizer::flushChars

342   cstart = 0x7fffffff;

PR_INT32_MAX?


nsHtml5Tokenizer::emitCurrentTagToken

363   nsHtml5HtmlAttributes* attrs = (!attributes ? nsHtml5HtmlAttributes::EMPTY_ATTRIBUTES : attributes);
364   if (endTag) {
365 
366     tokenHandler->endTag(tagName);
367     delete attributes;
368   } else {
369     tokenHandler->startTag(tagName, attrs, selfClosing);
370   }

Move 'attrs' into the 'else'. You could possibly even get rid of the temporary.

362   stateSave = NS_HTML5TOKENIZER_DATA;
...
374   return stateSave;

It seems like this function always returns the same value. If so, just get rid of the return value and use the value inline where needed. If not, add a comment explaining what happens.


391 void 
392 nsHtml5Tokenizer::addAttributeWithoutValue()
393 {
394 
395   if (!!attributeName) {
396     attributes->addAttribute(attributeName, nsHtml5Portability::newEmptyString());
397     attributeName = nsnull;
398   }
399 }
400 
401 void 
402 nsHtml5Tokenizer::addAttributeWithValue()
403 {
404   if (!!attributeName) {
405     nsString* val = longStrBufToString();
406     attributes->addAttribute(attributeName, val);
407     attributeName = nsnull;
408   }
409 }

No need for the '!!'. Also, can these functions really be called without an attribute name? If that shouldn't happen, just assert and assume it's there.

411 void 
412 nsHtml5Tokenizer::startErrorReporting()
413 {
414 }

Remove this function?


nsHtml5Tokenizer::tokenizeBuffer

460   if (pos == buffer->getEnd()) {
461     buffer->setStart(pos);
462   } else {
463     buffer->setStart(pos + 1);
464   }

Why is this needed? Shouldn't we treat buffers as empty if the start == end?


3427 void 
3428 nsHtml5Tokenizer::rememberAmpersandLocation(PRUnichar add)
3429 {
3430   additional = add;
3431 }

This doesn't seem to remember a location at all, but rather a character. I guess this makes more sense in the validator code :(

3447 void 
3448 nsHtml5Tokenizer::emitOrAppendStrBuf(PRInt32 returnState)
3449 {
3450   if ((returnState & (~1))) {
3451     appendStrBufToLongStrBuf();
3452   } else {
3453     emitStrBuf();
3454   }
3455 }

Please create a #define for the '~1' and put it together with the definitions for the values that can be passed in here. That should reduce the risk that someone changes the flag values without updating this value.


3867 void 
3868 nsHtml5Tokenizer::requestSuspension()
3869 {
3870   shouldSuspend = PR_TRUE;
3871 }
3872 
3873 void 
3874 nsHtml5Tokenizer::becomeConfident()
3875 {
3876   confident = PR_TRUE;
3877 }
3878 
3879 PRBool 
3880 nsHtml5Tokenizer::isNextCharOnNewLine()
3881 {
3882   return PR_FALSE;
3883 }
3884 
3885 PRBool 
3886 nsHtml5Tokenizer::isPrevCR()
3887 {
3888   return lastCR;
3889 }
3890 
3891 PRInt32 
3892 nsHtml5Tokenizer::getLine()
3893 {
3894   return -1;
3895 }
3896 
3897 PRInt32 
3898 nsHtml5Tokenizer::getCol()
3899 {
3900   return -1;
3901 }

These all seem worthy of removal. More effects of validator code I guess :(
nsHtml5TreeBuilder.cpp

The big thing here is that I'm worried that all the switch statements with loads of fallthroughs and gotos are going to make it incredibly hard to work with this code in the future. Is there any evidence that the fallthroughs are saving any cycles? If not I think I'd prefer to keep the code sane so that we can modify it in the future when needed.

The gotos are harder due to the deep nesting of switch statements and loops making 'break' largely useless.

I think there is a general over use of switch statements though. In many cases switch statements could be converted to normal if-else statements. In some cases the 'else' part isn't even needed. I've commented on a few specific instances below, but it's a more general problem.

I do like the function names in this class though, in general much easier to understand what's going on without reading all the callers and/or function implementations.

nsHtml5TreeBuilder::doctype

124   needToDropLF = PR_FALSE;
125   for (; ; ) {
126     switch(foreignFlag) {
127       case NS_HTML5TREE_BUILDER_IN_FOREIGN: {
128         goto doctypeloop_end;
129       }
130       default: {
131         switch(mode) {
132           case NS_HTML5TREE_BUILDER_INITIAL: {
133             nsString* emptyString = nsHtml5Portability::newEmptyString();
134             appendDoctypeToDocument(!name ? nsHtml5Atoms::emptystring : name, !publicIdentifier ? emptyString : publicIdentifier, !systemIdentifier ? emptyString : systemIdentifier);
135             nsHtml5Portability::releaseString(emptyString);
136             if (isQuirky(name, publicIdentifier, systemIdentifier, forceQuirks)) {
137 
138               documentModeInternal(QUIRKS_MODE, publicIdentifier, systemIdentifier, PR_FALSE);
139             } else if (isAlmostStandards(publicIdentifier, systemIdentifier)) {
140 
141               documentModeInternal(ALMOST_STANDARDS_MODE, publicIdentifier, systemIdentifier, PR_FALSE);
142             } else {
143               documentModeInternal(STANDARDS_MODE, publicIdentifier, systemIdentifier, PR_FALSE);
144             }
145             mode = NS_HTML5TREE_BUILDER_BEFORE_HTML;
146             return;
147           }
148           default: {
149             goto doctypeloop_end;
150           }
151         }
152       }
153     }
154   }
155   doctypeloop_end: ;
156 
157   return;

What is the purpose of the loop here? It doesn't seem like we actually ever loop. Also, why the 'goto's rather than simply returning?


nsHtml5TreeBuilder::comment

164   for (; ; ) {
165     switch(foreignFlag) {
166       case NS_HTML5TREE_BUILDER_IN_FOREIGN: {
167         goto commentloop_end;
168       }
169       default: {
170         switch(mode) {
171           case NS_HTML5TREE_BUILDER_INITIAL:
172           case NS_HTML5TREE_BUILDER_BEFORE_HTML:
173           case NS_HTML5TREE_BUILDER_AFTER_AFTER_BODY:
174           case NS_HTML5TREE_BUILDER_AFTER_AFTER_FRAMESET: {
175             appendCommentToDocument(buf, start, length);
176             return;
177           }
178           case NS_HTML5TREE_BUILDER_AFTER_BODY: {
179             flushCharacters();
180             appendComment(stack[0]->node, buf, start, length);
181             return;
182           }
183           default: {
184             goto commentloop_end;
185           }
186         }
187       }
188     }
189   }
190   commentloop_end: ;

Same here. It seems much cleaner written as

   if (foreignFlag != NS_HTML5TREE_BUILDER_IN_FOREIGN) {
     switch(mode) {
       case NS_HTML5TREE_BUILDER_INITIAL:
       case NS_HTML5TREE_BUILDER_BEFORE_HTML:
       case NS_HTML5TREE_BUILDER_AFTER_AFTER_BODY:
       case NS_HTML5TREE_BUILDER_AFTER_AFTER_FRAMESET: {
         appendCommentToDocument(buf, start, length);
         return;
       }
       case NS_HTML5TREE_BUILDER_AFTER_BODY: {
         flushCharacters();
         appendComment(stack[0]->node, buf, start, length);
         return;
       }
       default: {
       }
     }
   }


nsHtml5TreeBuilder::eof

436   switch(foreignFlag) {
437     case NS_HTML5TREE_BUILDER_IN_FOREIGN: {
438 
439       while (stack[currentPtr]->ns != kNameSpaceID_XHTML) {
440         popOnEof();
441       }
442       foreignFlag = NS_HTML5TREE_BUILDER_NOT_IN_FOREIGN;
443     }
444     default:
445       ; // fall through
446   }

Doesn't seem to be any point in using a switch here. An if-statement would be much more readable.

464       case NS_HTML5TREE_BUILDER_IN_HEAD: {
465         if (currentPtr > 1) {
466 
467         }
468         while (currentPtr > 0) {
469           popOnEof();
470         }

Remove the if-statement.

nsHtml5TreeBuilder::startTag
764                   if (!isCurrent(nsHtml5Atoms::table)) {
765 
766                   }

Same here

831                 if (currentPtr != eltPos) {
832 
833                 }

There seems to be many more of these, I haven't commented on any more of them.



This finishes the review of the bulk of the code. There are a few specific items I've got left to review, such as the flushing policy and notification batches.

There's also a bunch more places that need more comments, even in the java code. In general every class function and member needs a description IMHO, though in some cases groups of functions/members can be described together. This is especially true in the tokenizer and tree builder which by nature implements very complicated algorithms, which makes it very hard to figure out what a variable is used for by looking at the code.

We also need the comments in the java code to be carried over to the Cpp code, though I think there's a separate bug on that.

I haven't found any real blockers for turning the parser on by default though, other than the ones I've explicitly noted already.
(In reply to comment #20)
> > Would be nice if getListLength had something about formatting elements in its
> > name (unless you remove it entirely, see previous point).
> 
> I'll address this in a follow-up patch.

Addressed by this patch.
Attachment #438967 - Flags: review?(jonas)
(In reply to comment #22)
> What we've done in the past is to create static pointers to pointers to atoms.
> I.e. things like &nsGkAtoms::foo. Dunno if that will work here. In any case,
> I'll file a separate bug on this stuff.

In that case, the type of the field would be nsIAtom** instead of nsIAtom*, which would mess up types in the code produced by the translator. Worse, simple pointer compares would be lost and be turned into a pointer dereference plus compare.

> > > nsHtml5StreamParser:
> > > 197 nsHtml5StreamParser::~nsHtml5StreamParser()
> > > 198 {
> > > 199   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
> > > 200   mTokenizer->end();
> > > 201   mRequest = nsnull;
> > > 202   mObserver = nsnull;
> > > 203   mUnicodeDecoder = nsnull;
> > > 204   mSniffingBuffer = nsnull;
> > > 205   mMetaScanner = nsnull;
> > > 206   mFirstBuffer = nsnull;
> > > 207   mExecutor = nsnull;
> > > 208   mTreeBuilder = nsnull;
> > > 209   mTokenizer = nsnull;
> > > 210   mOwner = nsnull;
> > > 211   if (mFlushTimer) {
> > > 212     mFlushTimer->Cancel();
> > > 213     mFlushTimer = nsnull;
> > > 214   }
> > > 215 }
> > > 
> > > Is there a reason to null all these things out?
> > 
> > Mainly to distinguish an already-deleted object more easily in a debugger.
> > Shall I make these #ifdef DEBUG?
> 
> Please do.

Done.

> > > 979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch()
> > > 980 {
> > > 981   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
> > > 982   mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
> > > 983   Uninterrupt();
> > > 984   nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this);
> > > 985   if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
> > > 986     NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation");
> > > 987   }
> > > 988 }
> > > 
> > > Is there a reason to uninterrupt from the main thread? Why not do it from the
> > > parser thread as to avoid blocking the main thread.
> > 
> > When calling Uninterrupt(), the main thread already owns mTokenizerMutex and
> > Uninterrupt() doesn't acquire an additional mutex, so no additional blocking
> > occurs.
> 
> This doesn't appear to be the case. You're here explicitly grabbing the lock in
> order to be able to safely call Uninterrupt. If you instead do that while on
> the parser thread, there is no need for the main thread to block.

Oh, right. In this case, that's true.

I changed it the way you suggested. However, I'm nervous that this optimization for a rare case (the case where the shell is unable to fulfill a request to switch charsets fails) comes back to bite us later, because previously Interrupt() and Uninterrupt() were both called from the same thread so Uninterrupt() was guaranteed to cancel one Interrupt(). Now, Interrupt() is, as far as I can tell, called only in places that can't start rerunning until the parser thread has uninterrupted itself and posted more data to the main thread. However, if we later change how Interrupt() is called, it would be easy to introduce a situation where the main thread first posts a runnable to the parser thread, then interrupts and then the runnable cancels that new interrupt when it gets a chance to run on the parser thread.

> > > nsHtml5StreamParser::ContinueAfterScripts
> > > 
> > > 864     if (mSpeculations.IsEmpty()) {
> > > 865       // Not quite sure how exactly this happens...
> > > 866       // Maybe an artifact of defer scripts?
> > > 867       return;
> > > 868     }
> > > 
> > > Sounds like an assertion would be good.
> > 
> > I put an NS_WARNING here.
> 
> Why not NS_ASSERTION?

Because I'm not sure if there's a legitimate situation where that condition holds true. (I realize that not being sure about that isn't exactly good.)

> > (In reply to comment #18)
> > > You're also using the charset detector off the main thread, which I suspect
> > > isn't safe. In fact, since you are creating it on the main thread, and then
> > > calling it from the parser thread, I'm surprised this doesn't assert as it
> > > doesn't seem to be implemented using threadsafe refcounting macros.
> > >
> > > This is also something we need to fix before we can turn on by default. Do you
> > > want me to file a separate bug on this charset threadsafety stuff?
> > 
> > Yes, a separate bug would be good to have if there really is a thread-safety
> > problem left. As far as I can tell, there is no problem: After bug , the
> > detector is refcounted only from the main thread. The sniffing is run only from
> > the parser thread. The sniffing seems to affect only the internal state of the
> > object. The data tables shared by the instances are baked into code ahead of
> > compile time.
> 
> I'll file a separate bug. If we're lucky (which it sounds like based on your
> description) all we need to do is to move the creation of the detector to the
> thread where its used.

What would be the benefit except being able to avoid the detector creation in some cases and being able to delete it earlier in others? It seems to me that it's not a thread/ problem that it's allocated on one thread, used on another and destroyed on the creator thread but only after the other thread isn't going to use the object anyway. Moving the destruction of the detector off-the-main-thread would be annoying in the nsIParser::Terminate and cycle collector cases. Wouldn't those then require creating a runnable solely to clean up the detector?

> > > 476       if (totalByteCount < aCount) { // mimicking nsScanner even though
> > > this seems wrong
> > > 477         ++totalByteCount;
> > > 478         ++aFromSegment;
> > > 
> > > Could you assert here? I agree this seems wrong.
> > 
> > I added an NS_NOTREACHED in an else branch instead of merely asserting to avoid
> > a buffer overrun if a decoder misbehaves.
> > 
> > FWIW, the docs at
> > http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/intl/uconv/public/nsIUnicodeDecoder.h#123
> > are not good enough.
> 
> Would be great if you could fix them.

I tried to fix the docs. I hope I wrote the right things!

> > > If so, would calling it something like MetaCharsetFound make sense?
> > 
> > "Meta charset" is not HTML5 spec terminology. "[character] encoding
> > declaration" is spec terminology. That's why the method is named the way it is
> > named. In general, I have tried to stick to spec terminology when naming things
> > even when the spec terminology deviates from colloquial usage. (Consider
> > "entities" vs. "named character refernces".)
> 
> How about EncodingDeclarationFound? Function names that are simple nouns sound
> like getters to me.

The other callback names, for better or worse, use SAX-style naming and SAX uses spec-wise correct nouns, so having this one not be a noun would be inconsistent with the naming of the rest of the callbacks.

> > I put the timer on the main thread, because I couldn't find a timer
> > implementation that'd fire on a non-main thread. Also, initially I wanted also
> > to check if the document is in the frontmost tab when the timer fires, but I
> > gave up on that idea.
> 
> Timers by default fire on the thread that created them. So if you simply use
> the parser thread to create the timer, then the timer will fire using the
> parser thread. So you use the same API on any thread, as you are on the main
> thread.
> 
> Worst case you can simply set the 'target' property on the nsITimer if for some
> reason you really can't create the timer using the parser thread.

I'm amazed that I've managed to overlook the target property. :-(

I moved the timer to the parser thread.

I have a question about the thread-safety of nsITimer::Cancel on multicores, though: 
nsTimerImpl::Cancel assigns to mCanceled without acquiring a mutex and nsTimerImpl::Fire() reads mCanceled without acquiring a mutex. However, before nsTimerImpl::Cancel returns, it calls TimerThread::RemoveTimer which acquires a mutex. Is that mutex acquisition enough to invalidate caches on other cores so that when nsTimerImpl::Cancel returns on one thread, another thread sees mCanceled as true in nsTimerImpl::Fire()?

Or is nsTimerImpl even supposed to be thread-safe in the sense that you are allowed to Cancel from non-target thread? If Cancel isn't actually thread-safe, I need to do something more complicated with a runnable for canceling the timer when the parser is released early due to nsIParser::Terminate and the timer is armed.

(In reply to comment #23)
> Tokenizer comments
> 
> nsHtml5Tokenizer::setContentModelFlag
> 
> 111 nsHtml5Tokenizer::setContentModelFlag(PRInt32 contentModelFlag, nsIAtom*
> contentModelElement)
> 
> Please use an enum for the 'contentModelFlag' argument, that way you can also
> scope it to the tokenizer and given it shorter names. 

I realize that it would be better code style to use enums both in Java and in C++. I actually used to have an enum there before the C++ porting effort even started. However, the switching on the tokenizer state is very performance-critical to the parser and the performance characteristics of enums in Java are surprisingly bad (well, at least of the PowerPC port of the HotSpot client VM that I benchmarked on back then).
http://krijnhoetmer.nl/irc-logs/whatwg/20080427#l-256

So for perf reasons, I don't want to go back to enums on the Java side. As for the C++ side, I've viewed the generated code as a way of talking to the C++ compiler--not as much a way of expressing things to human readers--and when talking to the compiler, it shouldn't matter either way. Therefore, in the interest of not pouring effort excessively into the translator, I haven't tried to make the translator turn groups of ints into enums in the translation.

> Also, is "tokenizerState"
> or "stateFlag" a better name? I take it this what the spec refers to as the
> 'state' of the tokenizer?

When I wrote that piece of code, the spec had a concept of "content model flag". The concept has since been removed from the spec.

Renamed various things in this area.

> nsHtml5Tokenizer::contentModelElementToArray
> 
> 175     default: {
> 176 
> 177       return;
> 178     }
> 
> Shouldn't you clear contentModelElementNameAsArray here? Or if it should never
> reach the default, please assert.

The Java code already asserts there. (That's why you see the empty line there.) Carrying the assertions over to C++ is bug 503190. I added a message to the assertion, though.

> 193 
> 194 void 
> 195 nsHtml5Tokenizer::clearStrBufAndAppendCurrentC(PRUnichar c)
> 196 {
> 197   strBuf[0] = c;
> 198   strBufLen = 1;
> 199 }
> 200 
> 201 void 
> 202 nsHtml5Tokenizer::clearStrBufAndAppendForceWrite(PRUnichar c)
> 203 {
> 204   strBuf[0] = c;
> 205   strBufLen = 1;
> 206 }
> 
> Why two functions tht do the same thing?

These are artifacts of an earlier attempt to boost performance by using the input buffer as the backing buffer for things that now go into strBuf and copying data into strBuf lazily if an input buffer boundary fell inside a name. It turned out that checking if it's necessary to copy data was slower than just unconditionally copying it, so the overall design didn't stay. However, I didn't clean up the redundant methods immediately in order to make it easier to experiment more later.

I guess the experimentation isn't ever going to happen. Flattened these into one.

> 208 void 
> 209 nsHtml5Tokenizer::clearStrBufForNextState()
> 210 {
> 211   strBufLen = 0;
> 212 }
> 
> Why the 'ForNextState' part?

If the buffer were used in the current state, the variant that puts a single PRUnichar into the buffer would be appropriate. Renamed anyway. Also inlined methods like this.

> nsHtml5Tokenizer::emitStrBuf
> 241   if (strBufLen > 0) {
> 242     tokenHandler->characters(strBuf, 0, strBufLen);
> 243   }
> 
> Is the tokenHandler ever anything other than the treebuilder? If not, we should
> rename the member to 'treeBuilder'.

In the Java test harness for the tokenizer, it's not the tree builder. I expect it not to be always be the tree builder even in C++ once I've fixed bug 482909. When the Java version has interfaces but the C++ doesn't, I've made the translator flatten the types but not rename the members.

> 246 void 
> 247 nsHtml5Tokenizer::clearLongStrBufForNextState()
> 248 {
> 249   longStrBufLen = 0;
> 250 }
> 251 
> 252 void 
> 253 nsHtml5Tokenizer::clearLongStrBuf()
> 254 {
> 255   longStrBufLen = 0;
> 256 }
> 257 
> 258 void 
> 259 nsHtml5Tokenizer::clearLongStrBufAndAppendCurrentC(PRUnichar c)
> 260 {
> 261   longStrBuf[0] = c;
> 262   longStrBufLen = 1;
> 263 }
> 264 
> 265 void 
> 266 nsHtml5Tokenizer::clearLongStrBufAndAppendToComment(PRUnichar c)
> 267 {
> 268   longStrBuf[0] = c;
> 269   longStrBufLen = 1;
> 270 }
> 
> Lots of duplication here, is there really a need for that?

Not any longer as mentioned above. Removed redundancy and marked as inline.

> strBuf and longStrBuf is somewhat poor naming. It's unclear when to use which.
> Maybe identifierBuf and charBuf is better? Though why do we need both? Is the
> idea that we'll pass out the long buffer instead of copying it out? That
> doesn't appear to be happening currently.

We don't need two buffers that are capable of growing without bound. (A second buffer whose length is capped at the length of the longest named name (plus 1?) would be practical to have around, though, now that the named character name tables no longer contain the full names.)

I thought I had filed a bug about this long ago, but now I can't find it, so I filed bug 559303.

> 285 nsHtml5Tokenizer::appendSecondHyphenToBogusComment()
> 286 {
> 287   appendLongStrBuf('-');
> 288 }
> 289 
> 290 void 
> 291 nsHtml5Tokenizer::adjustDoubleHyphenAndAppendToLongStrBufAndErr(PRUnichar
> c)
> 292 {
> 293 
> 294   appendLongStrBuf(c);
> 295 }
> 
> These functions seem excessive.

These methods are part of the XML Infoset coercion feature of the Java version of the parser. The latter is also part of error reporting. I marked them inline in C++ to make their invocation overhead compile away for sure.

> nsHtml5Tokenizer::flushChars
> 
> 342   cstart = 0x7fffffff;
> 
> PR_INT32_MAX?

Curious. I though I was already using that there. I had translator support and everything already in place.

Fixed.

> nsHtml5Tokenizer::emitCurrentTagToken
> 
> 363   nsHtml5HtmlAttributes* attrs = (!attributes ?
> nsHtml5HtmlAttributes::EMPTY_ATTRIBUTES : attributes);
> 364   if (endTag) {
> 365 
> 366     tokenHandler->endTag(tagName);
> 367     delete attributes;
> 368   } else {
> 369     tokenHandler->startTag(tagName, attrs, selfClosing);
> 370   }
> 
> Move 'attrs' into the 'else'. You could possibly even get rid of the temporary.

'attrs' is used on the blank line of the 'if' branch but only when error reporting is enabled.

> 362   stateSave = NS_HTML5TOKENIZER_DATA;
> ...
> 374   return stateSave;
> 
> It seems like this function always returns the same value. If so, just get rid
> of the return value and use the value inline where needed. If not, add a
> comment explaining what happens.

The method calls into the token handler, and the token handler can call back into the tokenizer and change stateSave. Added comment (though it doesn't show up in C++ because comments aren't copied over yet).

> 391 void 
> 392 nsHtml5Tokenizer::addAttributeWithoutValue()
> 393 {
> 394 
> 395   if (!!attributeName) {
> 396     attributes->addAttribute(attributeName,
> nsHtml5Portability::newEmptyString());
> 397     attributeName = nsnull;
> 398   }
> 399 }
> 400 
> 401 void 
> 402 nsHtml5Tokenizer::addAttributeWithValue()
> 403 {
> 404   if (!!attributeName) {
> 405     nsString* val = longStrBufToString();
> 406     attributes->addAttribute(attributeName, val);
> 407     attributeName = nsnull;
> 408   }
> 409 }
> 
> No need for the '!!'. 

See comment 20.

> Also, can these functions really be called without an
> attribute name? If that shouldn't happen, just assert and assume it's there.

They can in the end tag case.

> 411 void 
> 412 nsHtml5Tokenizer::startErrorReporting()
> 413 {
> 414 }
> 
> Remove this function?

Removed.

> nsHtml5Tokenizer::tokenizeBuffer
> 
> 460   if (pos == buffer->getEnd()) {
> 461     buffer->setStart(pos);
> 462   } else {
> 463     buffer->setStart(pos + 1);
> 464   }
> 
> Why is this needed? Shouldn't we treat buffers as empty if the start == end?

The first is about getting ready to return because the tokenizer loop hit the end of the buffer. The else branch is about two things: Either the parser is suspending after a </script> or a charset <meta> and pos is the index of '>' or the tokenizer is burping at a CR and pos is the index of the CR. In either of those cases, when the buffer is examined the next time, the position should point to the character after the '>' or the CR (which may actually mean that the buffer gets marked as empty if pos pointed to the last character).

> 3427 void 
> 3428 nsHtml5Tokenizer::rememberAmpersandLocation(PRUnichar add)
> 3429 {
> 3430   additional = add;
> 3431 }
> 
> This doesn't seem to remember a location at all, but rather a character. I
> guess this makes more sense in the validator code :(

I remembers the ampersand location in the Java code. The argument 'add' was introduced to align with a later spec change and I failed to make the name of the method more elaborate at that point.

Renamed.

> 3447 void 
> 3448 nsHtml5Tokenizer::emitOrAppendStrBuf(PRInt32 returnState)
> 3449 {
> 3450   if ((returnState & (~1))) {
> 3451     appendStrBufToLongStrBuf();
> 3452   } else {
> 3453     emitStrBuf();
> 3454   }
> 3455 }
> 
> Please create a #define for the '~1' and put it together with the definitions
> for the values that can be passed in here. That should reduce the risk that
> someone changes the flag values without updating this value.

Fixed.

> 3867 void 
> 3868 nsHtml5Tokenizer::requestSuspension()
> 3869 {
> 3870   shouldSuspend = PR_TRUE;
> 3871 }
> 3872 
> 3873 void 
> 3874 nsHtml5Tokenizer::becomeConfident()
> 3875 {
> 3876   confident = PR_TRUE;
> 3877 }
> 3878 
> 3879 PRBool 
> 3880 nsHtml5Tokenizer::isNextCharOnNewLine()
> 3881 {
> 3882   return PR_FALSE;
> 3883 }
> 3884 
> 3885 PRBool 
> 3886 nsHtml5Tokenizer::isPrevCR()
> 3887 {
> 3888   return lastCR;
> 3889 }
> 3890 
> 3891 PRInt32 
> 3892 nsHtml5Tokenizer::getLine()
> 3893 {
> 3894   return -1;
> 3895 }
> 3896 
> 3897 PRInt32 
> 3898 nsHtml5Tokenizer::getCol()
> 3899 {
> 3900   return -1;
> 3901 }
> 
> These all seem worthy of removal. More effects of validator code I guess :(

requestSuspension() is used in the </script> and charset <meta> cases in Gecko.

The others are Java-only, so I removed them from C++.

(In reply to comment #24)
> nsHtml5TreeBuilder.cpp
> 
> The big thing here is that I'm worried that all the switch statements with
> loads of fallthroughs and gotos are going to make it incredibly hard to work
> with this code in the future. Is there any evidence that the fallthroughs are
> saving any cycles?

In the tokenizer the fallthroughs are an attempt at saving cycles in the Java case (saving cycles in the C++ case is bug 482920). 

In the tree builder the fallthroughs are there mainly to avoid code duplication while keeping the structure of the spec recognizable in the code structure. The spec has a lot of fall throughs to other modes, so they are implemented as fall throughs in the code. Given the structure of the spec, I think the fallthroughs actually make things maintainable in the first place if switch is used at all. 

Whether it's a good idea to use switch instead of having a class per mode inheriting from a pure-virtual class is another question, but I'd be very, very reluctant to change the overall structure of the tree builder that radically at this point in time. I do realize that fixing bug 552908 in the current architecture is going to lead to ugly code, but that spec feature didn't exist when I designed the architecture. Finding out how a bunch of virtual calls would compare to huge switches from the perf POV would involve refactoring the whole thing to use the other way and then benchmarking.

As for the gotos, they are all breaks and continues in the Java code. They aren't used in random ways--they are used in 3 conceptually simple ways, so they should be fine for maintenance. Unfortunately, C++ doesn't have the feature of break and continue by label, so I had to make the translator emulate those with goto. 

The three concepts are:
 1) Done processing this token: break starttagloop;
 2) Reprocess the token: continue starttagloop;
 3) Process the token according to the rules of another mode (later in the order of modes): break statenameloop;

> If not I think I'd prefer to keep the code sane so that we
> can modify it in the future when needed.

I believe the code is quite modifiable by modifying the Java source and rerunning the translator.

> The gotos are harder due to the deep nesting of switch statements and loops
> making 'break' largely useless.

I think the inability to break the a from a switch inside the loop is a defect in C++. Anyway, the place to make modifications is the TreeBuilder.java and nsHtml5TreeBuilder.cpp is just an intermediate file for feeding it into the C++ compiler.

> I think there is a general over use of switch statements though. In many cases
> switch statements could be converted to normal if-else statements. In some
> cases the 'else' part isn't even needed. I've commented on a few specific
> instances below, but it's a more general problem.

For consistency, when certain variables are switched on in the startTag method, I've also switched on them when handling other tokens. Maybe that wasn't a great idea in retrospect. It does make the different token handler methods more consistent among themselves, though.

> nsHtml5TreeBuilder::doctype
> 
> 124   needToDropLF = PR_FALSE;
> 125   for (; ; ) {
> 126     switch(foreignFlag) {
> 127       case NS_HTML5TREE_BUILDER_IN_FOREIGN: {
> 128         goto doctypeloop_end;
> 129       }
> 130       default: {
> 131         switch(mode) {
> 132           case NS_HTML5TREE_BUILDER_INITIAL: {
> 133             nsString* emptyString = nsHtml5Portability::newEmptyString();
> 134             appendDoctypeToDocument(!name ? nsHtml5Atoms::emptystring :
> name, !publicIdentifier ? emptyString : publicIdentifier, !systemIdentifier ?
> emptyString : systemIdentifier);
> 135             nsHtml5Portability::releaseString(emptyString);
> 136             if (isQuirky(name, publicIdentifier, systemIdentifier,
> forceQuirks)) {
> 137 
> 138               documentModeInternal(QUIRKS_MODE, publicIdentifier,
> systemIdentifier, PR_FALSE);
> 139             } else if (isAlmostStandards(publicIdentifier,
> systemIdentifier)) {
> 140 
> 141               documentModeInternal(ALMOST_STANDARDS_MODE, publicIdentifier,
> systemIdentifier, PR_FALSE);
> 142             } else {
> 143               documentModeInternal(STANDARDS_MODE, publicIdentifier,
> systemIdentifier, PR_FALSE);
> 144             }
> 145             mode = NS_HTML5TREE_BUILDER_BEFORE_HTML;
> 146             return;
> 147           }
> 148           default: {
> 149             goto doctypeloop_end;
> 150           }
> 151         }
> 152       }
> 153     }
> 154   }
> 155   doctypeloop_end: ;
> 156 
> 157   return;
> 
> What is the purpose of the loop here? It doesn't seem like we actually ever
> loop. 

The purpose is to be able to be break of it to jump forward.

> Also, why the 'goto's rather than simply returning?

In the error reporting cases, there's the line
err("Stray doctype.");
before the return where you see a blank line in C++. To use return directly in all the cases that now break out of the loop, the error emission line would have to be duplicated everywhere.

> nsHtml5TreeBuilder::comment
> 
> 164   for (; ; ) {
> 165     switch(foreignFlag) {
> 166       case NS_HTML5TREE_BUILDER_IN_FOREIGN: {
> 167         goto commentloop_end;
> 168       }
> 169       default: {
> 170         switch(mode) {
> 171           case NS_HTML5TREE_BUILDER_INITIAL:
> 172           case NS_HTML5TREE_BUILDER_BEFORE_HTML:
> 173           case NS_HTML5TREE_BUILDER_AFTER_AFTER_BODY:
> 174           case NS_HTML5TREE_BUILDER_AFTER_AFTER_FRAMESET: {
> 175             appendCommentToDocument(buf, start, length);
> 176             return;
> 177           }
> 178           case NS_HTML5TREE_BUILDER_AFTER_BODY: {
> 179             flushCharacters();
> 180             appendComment(stack[0]->node, buf, start, length);
> 181             return;
> 182           }
> 183           default: {
> 184             goto commentloop_end;
> 185           }
> 186         }
> 187       }
> 188     }
> 189   }
> 190   commentloop_end: ;
> 
> Same here. It seems much cleaner written as
> 
>    if (foreignFlag != NS_HTML5TREE_BUILDER_IN_FOREIGN) {
>      switch(mode) {
>        case NS_HTML5TREE_BUILDER_INITIAL:
>        case NS_HTML5TREE_BUILDER_BEFORE_HTML:
>        case NS_HTML5TREE_BUILDER_AFTER_AFTER_BODY:
>        case NS_HTML5TREE_BUILDER_AFTER_AFTER_FRAMESET: {
>          appendCommentToDocument(buf, start, length);
>          return;
>        }
>        case NS_HTML5TREE_BUILDER_AFTER_BODY: {
>          flushCharacters();
>          appendComment(stack[0]->node, buf, start, length);
>          return;
>        }
>        default: {
>        }
>      }
>    }

Yeah, that switch on foreignFlag is mainly about being pointlessly consistent with startTag.

Fixed.

> nsHtml5TreeBuilder::eof
> 
> 436   switch(foreignFlag) {
> 437     case NS_HTML5TREE_BUILDER_IN_FOREIGN: {
> 438 
> 439       while (stack[currentPtr]->ns != kNameSpaceID_XHTML) {
> 440         popOnEof();
> 441       }
> 442       foreignFlag = NS_HTML5TREE_BUILDER_NOT_IN_FOREIGN;
> 443     }
> 444     default:
> 445       ; // fall through
> 446   }
> 
> Doesn't seem to be any point in using a switch here. An if-statement would be
> much more readable.

Same here. Fixed.

> 464       case NS_HTML5TREE_BUILDER_IN_HEAD: {
> 465         if (currentPtr > 1) {
> 466 
> 467         }
> 468         while (currentPtr > 0) {
> 469           popOnEof();
> 470         }
> 
> Remove the if-statement.
> 
> nsHtml5TreeBuilder::startTag
> 764                   if (!isCurrent(nsHtml5Atoms::table)) {
> 765 
> 766                   }
> 
> Same here
> 
> 831                 if (currentPtr != eltPos) {
> 832 
> 833                 }
> 
> There seems to be many more of these, I haven't commented on any more of them.

In all these cases, there's a method call for reporting an error inside the if. With the exception of the one that calls isCurrent(), it should be safe to assume that the C++ compiler optimizes these away. Apart from relying on the C++ compiler's optimizer, I haven't tried to put effort into making the translator produce more elegant output here, because I have been assumming we might actually want to report tree builder errors to the error console eventually (bug 512229).

> This finishes the review of the bulk of the code.

Thank you! I'd appreciate it if you could r+ the patches on this bug, too, to get them landed in order to avoid merge conflicts.

> There's also a bunch more places that need more comments, even in the java
> code. In general every class function and member needs a description IMHO,

OK. I'll write JavaDocs for every member.

> We also need the comments in the java code to be carried over to the Cpp code,
> though I think there's a separate bug on that.

There was only bug 539213. I filed bug 559547.

> I haven't found any real blockers for turning the parser on by default though,
> other than the ones I've explicitly noted already.

That is, the flush timer and chardet creation?
Attachment #439222 - Flags: review?(jonas)
> > > > 979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch()
> > > > 980 {
> > > > 981   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
> > > > 982   mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
> > > > 983   Uninterrupt();
> > > > 984   nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this);
> > > > 985   if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
> > > > 986     NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation");
> > > > 987   }
> > > > 988 }
> > > > 
> > > > Is there a reason to uninterrupt from the main thread? Why not do it from the
> > > > parser thread as to avoid blocking the main thread.
> > > 
> > > When calling Uninterrupt(), the main thread already owns mTokenizerMutex and
> > > Uninterrupt() doesn't acquire an additional mutex, so no additional blocking
> > > occurs.
> > 
> > This doesn't appear to be the case. You're here explicitly grabbing the lock in
> > order to be able to safely call Uninterrupt. If you instead do that while on
> > the parser thread, there is no need for the main thread to block.
> 
> Oh, right. In this case, that's true.
> 
> I changed it the way you suggested. However, I'm nervous that this optimization
> for a rare case (the case where the shell is unable to fulfill a request to
> switch charsets fails) comes back to bite us later, because previously
> Interrupt() and Uninterrupt() were both called from the same thread so
> Uninterrupt() was guaranteed to cancel one Interrupt(). Now, Interrupt() is, as
> far as I can tell, called only in places that can't start rerunning until the
> parser thread has uninterrupted itself and posted more data to the main thread.
> However, if we later change how Interrupt() is called, it would be easy to
> introduce a situation where the main thread first posts a runnable to the
> parser thread, then interrupts and then the runnable cancels that new interrupt
> when it gets a chance to run on the parser thread.

I think i'm following you, but not completely sure :)

I agree that if things get too complicated and we start getting concerned that a call to Uninterrupt might start Uninterrupting an different Interript than it was scheduled for, then we might need to roll back this change. Ideally we won't get there though, but if we do, lets cross that bridge then.

> > > > nsHtml5StreamParser::ContinueAfterScripts
> > > > 
> > > > 864     if (mSpeculations.IsEmpty()) {
> > > > 865       // Not quite sure how exactly this happens...
> > > > 866       // Maybe an artifact of defer scripts?
> > > > 867       return;
> > > > 868     }
> > > > 
> > > > Sounds like an assertion would be good.
> > > 
> > > I put an NS_WARNING here.
> > 
> > Why not NS_ASSERTION?
> 
> Because I'm not sure if there's a legitimate situation where that condition
> holds true. (I realize that not being sure about that isn't exactly good.)

That's fine. If the assertion starts firing in legitimate situations we'll see what that situation is and can replace the current comment with a description of why this is needed.

Assertions aren't in release builds so no risk that millions of users will be affected.

> > > (In reply to comment #18)
> > > > You're also using the charset detector off the main thread, which I suspect
> > > > isn't safe. In fact, since you are creating it on the main thread, and then
> > > > calling it from the parser thread, I'm surprised this doesn't assert as it
> > > > doesn't seem to be implemented using threadsafe refcounting macros.
> > > >
> > > > This is also something we need to fix before we can turn on by default. Do you
> > > > want me to file a separate bug on this charset threadsafety stuff?
> > > 
> > > Yes, a separate bug would be good to have if there really is a thread-safety
> > > problem left. As far as I can tell, there is no problem: After bug , the
> > > detector is refcounted only from the main thread. The sniffing is run only from
> > > the parser thread. The sniffing seems to affect only the internal state of the
> > > object. The data tables shared by the instances are baked into code ahead of
> > > compile time.
> > 
> > I'll file a separate bug. If we're lucky (which it sounds like based on your
> > description) all we need to do is to move the creation of the detector to the
> > thread where its used.
> 
> What would be the benefit except being able to avoid the detector creation in
> some cases and being able to delete it earlier in others? It seems to me that
> it's not a thread/ problem that it's allocated on one thread, used on another
> and destroyed on the creator thread but only after the other thread isn't going
> to use the object anyway.

The problem with the current situation is that if someone changes the implementation of the detector to do an extra addref/release inside one of the functions that you are calling on the separate thread, then that person will start hitting a bunch of threadsafty warnings. Despite that person not doing anything wrong, or there really being a problem anywhere. However that person won't be able to check in his/her changes until they've figured out how the HTML5 parser works and how to change it to not fire the assertions.

> Moving the destruction of the detector
> off-the-main-thread would be annoying in the nsIParser::Terminate and cycle
> collector cases. Wouldn't those then require creating a runnable solely to
> clean up the detector?

One solution would be to only do this in debug builds, as that is the only place where we'll hit the refcounting assertions.

> > > > If so, would calling it something like MetaCharsetFound make sense?
> > > 
> > > "Meta charset" is not HTML5 spec terminology. "[character] encoding
> > > declaration" is spec terminology. That's why the method is named the way it is
> > > named. In general, I have tried to stick to spec terminology when naming things
> > > even when the spec terminology deviates from colloquial usage. (Consider
> > > "entities" vs. "named character refernces".)
> > 
> > How about EncodingDeclarationFound? Function names that are simple nouns sound
> > like getters to me.
> 
> The other callback names, for better or worse, use SAX-style naming and SAX
> uses spec-wise correct nouns, so having this one not be a noun would be
> inconsistent with the naming of the rest of the callbacks.

My impression is that right now it's sort of a mix, which is what makes things hard to understand. Yes, some "interfaces" use SAX-style naming, but they are implemented on classes which implemented both sax-style naming as well as other interfaces, thus creating the mix.

The main problem was that it was very hard to find docs. One solution is to group functions together based on which interface they implement and then create clear headers showing which interface is being implemented. That makes it easier to find your way to the interface where the function should be documented. For example like:

http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsGenericElement.h#351

> I have a question about the thread-safety of nsITimer::Cancel on multicores,
> though: 
> nsTimerImpl::Cancel assigns to mCanceled without acquiring a mutex and
> nsTimerImpl::Fire() reads mCanceled without acquiring a mutex. However, before
> nsTimerImpl::Cancel returns, it calls TimerThread::RemoveTimer which acquires a
> mutex. Is that mutex acquisition enough to invalidate caches on other cores so
> that when nsTimerImpl::Cancel returns on one thread, another thread sees
> mCanceled as true in nsTimerImpl::Fire()?
> 
> Or is nsTimerImpl even supposed to be thread-safe in the sense that you are
> allowed to Cancel from non-target thread? If Cancel isn't actually thread-safe,
> I need to do something more complicated with a runnable for canceling the timer
> when the parser is released early due to nsIParser::Terminate and the timer is
> armed.

I'm not sure if aquiring a mutex will flush all cores or just the current one.

However in any case, when calling Cancel() you're always running the risk that the other thread is already passed the are-we-cancelled check and thus will fire. In fact, it might already be in the midst of firing already.

So if you just call cancel, you need to be aware that the parser might already be currently flushing. Not sure if that's a problem yet. Haven't yet looked at the patch.

> So for perf reasons, I don't want to go back to enums on the Java side. As for
> the C++ side, I've viewed the generated code as a way of talking to the C++
> compiler--not as much a way of expressing things to human readers--and when
> talking to the compiler, it shouldn't matter either way. Therefore, in the
> interest of not pouring effort excessively into the translator, I haven't tried
> to make the translator turn groups of ints into enums in the translation.

I think this is where we differ in view :)

I think the C++ code needs to be readable to humans too, for a few reasons. The most obvious reason is that during profiling and debugging, it's the C++ code gecko developers will be looking at.

Also, I wouldn't be surprised if we'll eventually end up forking the java and C++ code. Though by no means I want that to happen anytime soon. If that happens we'll be working with the C++ code directly.

I am however fine with using #defines until we deem forking necessary, if that ever happens.

> > nsHtml5Tokenizer::tokenizeBuffer
> > 
> > 460   if (pos == buffer->getEnd()) {
> > 461     buffer->setStart(pos);
> > 462   } else {
> > 463     buffer->setStart(pos + 1);
> > 464   }
> > 
> > Why is this needed? Shouldn't we treat buffers as empty if the start == end?
> 
> The first is about getting ready to return because the tokenizer loop hit the
> end of the buffer. The else branch is about two things: Either the parser is
> suspending after a </script> or a charset <meta> and pos is the index of '>' or
> the tokenizer is burping at a CR and pos is the index of the CR. In either of
> those cases, when the buffer is examined the next time, the position should
> point to the character after the '>' or the CR (which may actually mean that
> the buffer gets marked as empty if pos pointed to the last character).

Ah, so pos always points to the last character consumed?

> (In reply to comment #24)
> > nsHtml5TreeBuilder.cpp
> > 
> > The big thing here is that I'm worried that all the switch statements with
> > loads of fallthroughs and gotos are going to make it incredibly hard to work
> > with this code in the future. Is there any evidence that the fallthroughs are
> > saving any cycles?
> 
> In the tokenizer the fallthroughs are an attempt at saving cycles in the Java
> case (saving cycles in the C++ case is bug 482920). 
> 
> In the tree builder the fallthroughs are there mainly to avoid code duplication
> while keeping the structure of the spec recognizable in the code structure. The
> spec has a lot of fall throughs to other modes, so they are implemented as fall
> throughs in the code. Given the structure of the spec, I think the fallthroughs
> actually make things maintainable in the first place if switch is used at all. 

Ok, if the fallthroughs follow the spec then that makes things better I agree.

> Whether it's a good idea to use switch instead of having a class per mode
> inheriting from a pure-virtual class is another question, but I'd be very, very
> reluctant to change the overall structure of the tree builder that radically at
> this point in time. I do realize that fixing bug 552908 in the current
> architecture is going to lead to ugly code, but that spec feature didn't exist
> when I designed the architecture. Finding out how a bunch of virtual calls
> would compare to huge switches from the perf POV would involve refactoring the
> whole thing to use the other way and then benchmarking.

100% agreed. We definitely shouldn't do any such major refactorings right now.

> As for the gotos, they are all breaks and continues in the Java code. They
> aren't used in random ways--they are used in 3 conceptually simple ways, so
> they should be fine for maintenance. Unfortunately, C++ doesn't have the
> feature of break and continue by label, so I had to make the translator emulate
> those with goto. 
> 
> The three concepts are:
>  1) Done processing this token: break starttagloop;
>  2) Reprocess the token: continue starttagloop;
>  3) Process the token according to the rules of another mode (later in the
> order of modes): break statenameloop;

I agree, the lack of labeled breaks in C++ sucks. But the limitation is there and we have to live with it. Like I said above, I think the C++ code needs to be readable to a human.

I don't have a suggestion for how to improve the current situation, other than trying to use switch-statements less and if-statements more. There are quite a few switch statements with only one or two cases other than the default. In these situations if-statements would likely be cleaner and might make it possible to use plain breaks in a few more places.

I think this is something that we can deal with separately though.

> > nsHtml5TreeBuilder::doctype
> > 
> > 124   needToDropLF = PR_FALSE;
> > 125   for (; ; ) {
> > 126     switch(foreignFlag) {
> > 127       case NS_HTML5TREE_BUILDER_IN_FOREIGN: {
> > 128         goto doctypeloop_end;
> > 129       }
> > 130       default: {
> > 131         switch(mode) {
> > 132           case NS_HTML5TREE_BUILDER_INITIAL: {
> > 133             nsString* emptyString = nsHtml5Portability::newEmptyString();
> > 134             appendDoctypeToDocument(!name ? nsHtml5Atoms::emptystring :
> > name, !publicIdentifier ? emptyString : publicIdentifier, !systemIdentifier ?
> > emptyString : systemIdentifier);
> > 135             nsHtml5Portability::releaseString(emptyString);
> > 136             if (isQuirky(name, publicIdentifier, systemIdentifier,
> > forceQuirks)) {
> > 137 
> > 138               documentModeInternal(QUIRKS_MODE, publicIdentifier,
> > systemIdentifier, PR_FALSE);
> > 139             } else if (isAlmostStandards(publicIdentifier,
> > systemIdentifier)) {
> > 140 
> > 141               documentModeInternal(ALMOST_STANDARDS_MODE, publicIdentifier,
> > systemIdentifier, PR_FALSE);
> > 142             } else {
> > 143               documentModeInternal(STANDARDS_MODE, publicIdentifier,
> > systemIdentifier, PR_FALSE);
> > 144             }
> > 145             mode = NS_HTML5TREE_BUILDER_BEFORE_HTML;
> > 146             return;
> > 147           }
> > 148           default: {
> > 149             goto doctypeloop_end;
> > 150           }
> > 151         }
> > 152       }
> > 153     }
> > 154   }
> > 155   doctypeloop_end: ;
> > 156 
> > 157   return;
> > 
> > What is the purpose of the loop here? It doesn't seem like we actually ever
> > loop. 
> 
> The purpose is to be able to be break of it to jump forward.

But the switch statement is right inside it, so any break will break out of that. Also, there aren't actually any break statements in the loop/switch.


> Yeah, that switch on foreignFlag is mainly about being pointlessly consistent
> with startTag.
> 
> Fixed.

For what it's worth, the switch in startTag seems unneeded too and better written as an if-statement.

> > 464       case NS_HTML5TREE_BUILDER_IN_HEAD: {
> > 465         if (currentPtr > 1) {
> > 466 
> > 467         }
> > 468         while (currentPtr > 0) {
> > 469           popOnEof();
> > 470         }
> > 
> > Remove the if-statement.
> > 
> > nsHtml5TreeBuilder::startTag
> > 764                   if (!isCurrent(nsHtml5Atoms::table)) {
> > 765 
> > 766                   }
> > 
> > Same here
> > 
> > 831                 if (currentPtr != eltPos) {
> > 832 
> > 833                 }
> > 
> > There seems to be many more of these, I haven't commented on any more of them.
> 
> In all these cases, there's a method call for reporting an error inside the if.
> With the exception of the one that calls isCurrent(), it should be safe to
> assume that the C++ compiler optimizes these away. Apart from relying on the
> C++ compiler's optimizer, I haven't tried to put effort into making the
> translator produce more elegant output here, because I have been assumming we
> might actually want to report tree builder errors to the error console
> eventually (bug 512229).

I really don't think we'll want to do that anytime soon. Netscape tried this back in the days and the result was a ton of added complexity to the parser which we're still paying for a decade later and yet have never taken advantage of. I wouldn't be surprised if some of the quirky parsing behaviors in gecko which have restricted HTML5 was also a result of this.

Code is changeable, if we'll ever want to do error reporting for parse errors we can modify the code to deal with that then.

> > I haven't found any real blockers for turning the parser on by default though,
> > other than the ones I've explicitly noted already.
> 
> That is, the flush timer and chardet creation?

Yeah.

Reviewing the patches here now.
Attachment #432525 - Flags: review?(jonas) → review+
Comment on attachment 437849 [details] [diff] [review]
Address review comments in nsHtml5StreamParser.cpp

>-      if (totalByteCount < aCount) { // mimicking nsScanner even though this seems wrong
>+      // There's an illegal byte in the input. It's now the responsibility
>+      // of this calling code to output a U+FFFD REPLACEMENT CHARACTER and
>+      // reset the decoder.
>+
>+      if (totalByteCount < aCount) {
>+        // advance over the bad byte
>         ++totalByteCount;
>         ++aFromSegment;
>       }
>+#ifdef DEBUG
>+      else {
>+        NS_NOTREACHED("The decoder signaled an error but consumed all input.");
>+      }
>+#endif

Simply putting NS_ASSERTION(totalByteCount < aCount, "...") before the if statement would avoid the #ifdef

r=me either way.
Attachment #437849 - Flags: review?(jonas) → review+
Comment on attachment 439220 [details] [diff] [review]
Address more review comments in C++ only parts

> However in any case, when calling Cancel() you're always running the risk that
> the other thread is already passed the are-we-cancelled check and thus will
> fire. In fact, it might already be in the midst of firing already.

OK. Not just me thinking Cancel/Fire don't look right to be thread-safe, then. (I heard nsITimer was thread-safe, tried to verify it, it didn't look right, so I started doubting myself and thinking it was thread-safe in some non-obvious way.)
Attachment #439220 - Flags: review?(jonas)
(In reply to comment #29)
> Simply putting NS_ASSERTION(totalByteCount < aCount, "...") before the if
> statement would avoid the #ifdef
> 
> r=me either way.

Thanks. I made that change.
http://hg.mozilla.org/mozilla-central/rev/47327da4d134
http://hg.mozilla.org/mozilla-central/rev/a8327985f6e4
This patch works around the thread-unsafety of nsITimer::Cancel.
Attachment #439220 - Attachment is obsolete: true
Attachment #439510 - Flags: review?(jonas)
Comment on attachment 432403 [details] [diff] [review]
Address review comments in nsHtml5Parser.cpp

Looks good. Though please include a diff created with -w as well as the full patch when you're doing indentation changes.
Attachment #432403 - Flags: review?(jonas) → review+
Attachment #438967 - Flags: review?(jonas) → review+
Attachment #439222 - Flags: review?(jonas) → review+
Comment on attachment 439222 [details] [diff] [review]
Address more review comments in Java parts

>@@ -793,39 +793,33 @@
>+		if (foreignFlag != IN_FOREIGN) {
>+			switch (mode) {
>+			case INITIAL:
>+			case BEFORE_HTML:
>+			case AFTER_AFTER_BODY:
>+			case AFTER_AFTER_FRAMESET:

Indentation here is a bit strange, elsewhere you've indented the cases. Also, there's tabs here, which isn't used elsewhere.

>+				/*
>+				 * A comment token Append a Comment node to the Document object
>+				 * with the data attribute set to the data given in the comment
>+				 * token.
>+				 */
>+				appendCommentToDocument(buf, start, length);
>+				return;
>+			case AFTER_BODY:
>+				/*
>+				 * A comment token Append a Comment node to the first element in
>+				 * the stack of open elements (the html element), with the data
>+				 * attribute set to the data given in the comment token.
>+				 */
>+				flushCharacters();
>+				appendComment(stack[0].node, buf, start, length);
>+				return;
>+			default:
>+				break;

This used to break out differently. Was that change intentional or am I reading it wrong?

Looks good otherwise.
Comment on attachment 439510 [details] [diff] [review]
Address more review comments in C++ only parts

>+class nsHtml5TimerKungFu : public nsRunnable
>+{
>+private:
>+  nsHtml5RefPtr<nsHtml5StreamParser> mStreamParser;
>+public:
>+  nsHtml5TimerKungFu(nsHtml5StreamParser* aStreamParser)
>+    : mStreamParser(aStreamParser)
>+  {}
>+  NS_IMETHODIMP Run()
>+  {
>+    if (mStreamParser->mFlushTimer) {
>+      mStreamParser->mFlushTimer->Cancel();
>+      mStreamParser->mFlushTimer = nsnull;
>+    }
>+    return NS_OK;
>+  }
>+};
>+
>+void
>+nsHtml5StreamParser::DropTimer()
>+{
>+  if (mFlushTimer) {
>+    nsCOMPtr<nsIRunnable> event = new nsHtml5TimerKungFu(this);
>+    if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
>+      NS_WARNING("Failed to dispatch TimerKungFu event");
>+    }
>+  }
>+}

I'm not sure I understand the purpose of canceling the timer on the parser thread. The problem that was brought up earlier was that calling cancel from the main thread might not prevent the timer from just having started firing.

However this solution has the same problem. The timer might fire between the call to nsHtml5StreamParser::DropTimer and the call to nsHtml5TimerKungFu::Run.

For that matter, the problem exists in the code that is in the tree already. By the time we cancel the main-thread-only timer, nsHtml5StreamParserTimerFlusher::Run might have just started running on the parser thread.

Also, it would be great if you could diff patches with -P. Not a huge deal though.
(In reply to comment #35)
> I'm not sure I understand the purpose of canceling the timer on the parser
> thread. The problem that was brought up earlier was that calling cancel from
> the main thread might not prevent the timer from just having started firing.
> 
> However this solution has the same problem. The timer might fire between the
> call to nsHtml5StreamParser::DropTimer and the call to nsHtml5TimerKungFu::Run.

The problem being addressed is the timer firing after the nsHtml5StreamParser instance it is associated with has been deleted. The timer doesn't hold a strong reference back to the nsHtml5StreamParser for two reasons: 1) chardet already takes nsHtml5StreamParser::Notify and 2) nsHtml5StreamParser is a CC participant and can't be refcounted off the main thread.

nsHtml5TimerKungFu addrefs nsHtml5StreamParser (hence kungFu) so it stays alive until ::Run cancels the timer on the parser thread. This way, the timer may fire after the stream parser has been terminated but not after the stream parser has been deleted.
(In reply to comment #34)
> (From update of attachment 439222 [details] [diff] [review])
> >@@ -793,39 +793,33 @@
> >+		if (foreignFlag != IN_FOREIGN) {
> >+			switch (mode) {
> >+			case INITIAL:
> >+			case BEFORE_HTML:
> >+			case AFTER_AFTER_BODY:
> >+			case AFTER_AFTER_FRAMESET:
> 
> Indentation here is a bit strange, elsewhere you've indented the cases. Also,
> there's tabs here, which isn't used elsewhere.

Fixed. I had failed to import my old Eclipse code formatter settings when I migrated from Mac to Ubuntu.

> >+				/*
> >+				 * A comment token Append a Comment node to the Document object
> >+				 * with the data attribute set to the data given in the comment
> >+				 * token.
> >+				 */
> >+				appendCommentToDocument(buf, start, length);
> >+				return;
> >+			case AFTER_BODY:
> >+				/*
> >+				 * A comment token Append a Comment node to the first element in
> >+				 * the stack of open elements (the html element), with the data
> >+				 * attribute set to the data given in the comment token.
> >+				 */
> >+				flushCharacters();
> >+				appendComment(stack[0].node, buf, start, length);
> >+				return;
> >+			default:
> >+				break;
> 
> This used to break out differently. Was that change intentional or am I reading
> it wrong?

There's no intentional change. After re-reading this several times, I fail to see the problem. The non-default cases return in both old and new version, so no change there. The default case in the old case jumps out of the pointless loop. In the new version, there's no loop, so the default case has a no-op break just to silence compiler warnings.

> Looks good otherwise.

Thanks.

http://hg.mozilla.org/mozilla-central/rev/2ba94ee301bf
http://hg.mozilla.org/mozilla-central/rev/715e63b786d4
http://hg.mozilla.org/mozilla-central/rev/46be92d24873
(In reply to comment #35)
> Also, it would be great if you could diff patches with -P. Not a huge deal
> though.

Did you mean -p? Here's the same patch with -p from hg diff. The previous patch was from hg export. (It would be so nice if hg export formatted the patches according to the hg diff config in .hgrc...)
Attachment #439510 - Attachment is obsolete: true
Attachment #439894 - Flags: review?(jonas)
Attachment #439510 - Flags: review?(jonas)
This patch is diffed with -p. Unfortunately, startTag() became mostly unreviewable due to formatting changes. I tried diffing with various whitespace-related flags, but the result is just as bad. I'll attach the changed .java file, too.

(In reply to comment #28)
> > Moving the destruction of the detector
> > off-the-main-thread would be annoying in the nsIParser::Terminate and cycle
> > collector cases. Wouldn't those then require creating a runnable solely to
> > clean up the detector?
> 
> One solution would be to only do this in debug builds, as that is the only
> place where we'll hit the refcounting assertions.

Do you actually want to hit refcounting assertions, though? What about making chardet implement addref/release using the thread-safe variants?

> > > > I put an NS_WARNING here.
> > > 
> > > Why not NS_ASSERTION?
> > 
> > Because I'm not sure if there's a legitimate situation where that condition
> > holds true. (I realize that not being sure about that isn't exactly good.)
> 
> That's fine. If the assertion starts firing in legitimate situations we'll see
> what that situation is and can replace the current comment with a description
> of why this is needed.
> 
> Assertions aren't in release builds so no risk that millions of users will be
> affected.

Changed to NS_NOTREACHED.

> > > > > If so, would calling it something like MetaCharsetFound make sense?
> > > > 
> > > > "Meta charset" is not HTML5 spec terminology. "[character] encoding
> > > > declaration" is spec terminology. That's why the method is named the way it is
> > > > named. In general, I have tried to stick to spec terminology when naming things
> > > > even when the spec terminology deviates from colloquial usage. (Consider
> > > > "entities" vs. "named character refernces".)
> > > 
> > > How about EncodingDeclarationFound? Function names that are simple nouns sound
> > > like getters to me.
> > 
> > The other callback names, for better or worse, use SAX-style naming and SAX
> > uses spec-wise correct nouns, so having this one not be a noun would be
> > inconsistent with the naming of the rest of the callbacks.
> 
> My impression is that right now it's sort of a mix, which is what makes things
> hard to understand. Yes, some "interfaces" use SAX-style naming, but they are
> implemented on classes which implemented both sax-style naming as well as other
> interfaces, thus creating the mix.
> 
> The main problem was that it was very hard to find docs. One solution is to
> group functions together based on which interface they implement and then
> create clear headers showing which interface is being implemented. That makes
> it easier to find your way to the interface where the function should be
> documented. For example like:
> 
> http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsGenericElement.h#351

I've actually used this pattern in this particular case, but in this case, it didn't help locate the documentation in the .java files.

Anyway, I'd rather add more comments pointing to the docs than rename stuff inconsistently.

> > So for perf reasons, I don't want to go back to enums on the Java side. As for
> > the C++ side, I've viewed the generated code as a way of talking to the C++
> > compiler--not as much a way of expressing things to human readers--and when
> > talking to the compiler, it shouldn't matter either way. Therefore, in the
> > interest of not pouring effort excessively into the translator, I haven't tried
> > to make the translator turn groups of ints into enums in the translation.
> 
> I think this is where we differ in view :)
> 
> I think the C++ code needs to be readable to humans too, for a few reasons. The
> most obvious reason is that during profiling and debugging, it's the C++ code
> gecko developers will be looking at.
> 
> Also, I wouldn't be surprised if we'll eventually end up forking the java and
> C++ code. Though by no means I want that to happen anytime soon. If that
> happens we'll be working with the C++ code directly.
> 
> I am however fine with using #defines until we deem forking necessary, if that
> ever happens.

Added #defines.

> > > nsHtml5Tokenizer::tokenizeBuffer
> > > 
> > > 460   if (pos == buffer->getEnd()) {
> > > 461     buffer->setStart(pos);
> > > 462   } else {
> > > 463     buffer->setStart(pos + 1);
> > > 464   }
> > > 
> > > Why is this needed? Shouldn't we treat buffers as empty if the start == end?
> > 
> > The first is about getting ready to return because the tokenizer loop hit the
> > end of the buffer. The else branch is about two things: Either the parser is
> > suspending after a </script> or a charset <meta> and pos is the index of '>' or
> > the tokenizer is burping at a CR and pos is the index of the CR. In either of
> > those cases, when the buffer is examined the next time, the position should
> > point to the character after the '>' or the CR (which may actually mean that
> > the buffer gets marked as empty if pos pointed to the last character).
> 
> Ah, so pos always points to the last character consumed?

Yes.

> > > nsHtml5TreeBuilder::doctype
...
> > > What is the purpose of the loop here? It doesn't seem like we actually ever
> > > loop. 
> > 
> > The purpose is to be able to be break of it to jump forward.
> 
> But the switch statement is right inside it, so any break will break out of
> that. Also, there aren't actually any break statements in the loop/switch.

Simplified.

> > Yeah, that switch on foreignFlag is mainly about being pointlessly consistent
> > with startTag.
> > 
> > Fixed.
> 
> For what it's worth, the switch in startTag seems unneeded too and better
> written as an if-statement.

Whoa. I wonder if this code looked different when I put the switch in there. Simplified.

> > In all these cases, there's a method call for reporting an error inside the if.
> > With the exception of the one that calls isCurrent(), it should be safe to
> > assume that the C++ compiler optimizes these away. Apart from relying on the
> > C++ compiler's optimizer, I haven't tried to put effort into making the
> > translator produce more elegant output here, because I have been assumming we
> > might actually want to report tree builder errors to the error console
> > eventually (bug 512229).
> 
> I really don't think we'll want to do that anytime soon. Netscape tried this
> back in the days and the result was a ton of added complexity to the parser
> which we're still paying for a decade later and yet have never taken advantage
> of. I wouldn't be surprised if some of the quirky parsing behaviors in gecko
> which have restricted HTML5 was also a result of this.
> 
> Code is changeable, if we'll ever want to do error reporting for parse errors
> we can modify the code to deal with that then.

I've changed the translator to output nicer code in this case. TANSTAAFL, though: The translator is now uglier.
Attachment #440199 - Flags: review?(jonas)
(In reply to comment #28)
> > > I'll file a separate bug. If we're lucky (which it sounds like based on your
> > > description) all we need to do is to move the creation of the detector to the
> > > thread where its used.
> > 
> > What would be the benefit except being able to avoid the detector creation in
> > some cases and being able to delete it earlier in others? It seems to me that
> > it's not a thread/ problem that it's allocated on one thread, used on another
> > and destroyed on the creator thread but only after the other thread isn't going
> > to use the object anyway.
> 
> The problem with the current situation is that if someone changes the
> implementation of the detector to do an extra addref/release inside one of the
> functions that you are calling on the separate thread, then that person will
> start hitting a bunch of threadsafty warnings. Despite that person not doing
> anything wrong, or there really being a problem anywhere. However that person
> won't be able to check in his/her changes until they've figured out how the
> HTML5 parser works and how to change it to not fire the assertions.

I recalled why I didn't just make the chardet addref/release thread-safe: The detector addrefs/releases its observer. The observer is the stream parser, which is a cycle collection participant and, therefore, can't be addrefed or released off the main thread.

Since your problem scenario is someone changing the detector, since there isn't a current thread-safety problem and since fixing the anticipated problem ahead of time is complex due to cycle collection, can we consider this a bridge we cross when someone actually wants to change the detector?

The detector is about as stable as code in Gecko gets.

> > In all these cases, there's a method call for reporting an error inside the if.
> > With the exception of the one that calls isCurrent(), it should be safe to
> > assume that the C++ compiler optimizes these away. Apart from relying on the
> > C++ compiler's optimizer, I haven't tried to put effort into making the
> > translator produce more elegant output here, because I have been assumming we
> > might actually want to report tree builder errors to the error console
> > eventually (bug 512229).
> 
> I really don't think we'll want to do that anytime soon. Netscape tried this
> back in the days and the result was a ton of added complexity to the parser
> which we're still paying for a decade later and yet have never taken advantage
> of. I wouldn't be surprised if some of the quirky parsing behaviors in gecko
> which have restricted HTML5 was also a result of this.

Adding error reporting to the parser wouldn't add a ton of complexity. Since the Java version already supports error reporting, the required complexity is already known. I'd expect the perf cost in the tree builder to be tiny. In the tokenizer, I'd expect the perf cost to be notable but smaller than the cost of supporting view source highlighting. My current plan is to generate a second copy of the main tokenizer loop for view source. One option would be putting tokenizer error reporting only in that copy and reporting errors only if the user views source (possibly by baking it into the source highlighting and not reporting to the console at all).
Attachment #439894 - Attachment is obsolete: true
Attachment #440481 - Flags: review?(jonas)
Attachment #439894 - Flags: review?(jonas)
Comment on attachment 440481 [details] [diff] [review]
Address more review comments in C++ only parts, with an added null check

>+nsHtml5StreamParser::FlushTreeOpsAndDisarmTimer()
>+{

Please add an assertion to check that this is called on the right (main?) thread. Those assertions were awesome to have when reviewing and trying to grokk the code.




> nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch()
> {
>   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
>   mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
>-  Uninterrupt();
>   nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this);
>   if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
>     NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation");
>   }
> }

Is grabbing the mutex still needed?

>+class nsHtml5TimerKungFu : public nsRunnable
>+{
>+private:
>+  nsHtml5RefPtr<nsHtml5StreamParser> mStreamParser;
>+public:
>+  nsHtml5TimerKungFu(nsHtml5StreamParser* aStreamParser)
>+    : mStreamParser(aStreamParser)
>+  {}
>+  NS_IMETHODIMP Run()
>+  {
>+    if (mStreamParser->mFlushTimer) {
>+      mStreamParser->mFlushTimer->Cancel();
>+      mStreamParser->mFlushTimer = nsnull;
>+    }
>+    return NS_OK;
>+  }
>+};

Ok, now I see what is happening. The important part here isn't that the cancel call is happening on any particular thread. But that you addref the stream parser on the main thread, and then post a message to the parser thread, and then post a message back again to the main thread and only then release the stream parser.

This ensures that the stream parser is being kept alive until after we're done processing the timer, or cancelling it. And the nsHtml5RefPtr is what's making it all happen.

Is this correct?

And it seems like the same setup is being used to keep the stream parser alive when sending necko data to it?

Please document this thoroughly. Also document the fact that while we're calling various functions on the stream parser on the parser thread, it's critical that it's never addreffed or released there.

r=me with that.
Attachment #440481 - Flags: review?(jonas) → review+
Comment on attachment 440199 [details] [diff] [review]
Address even more review comments (vol6)

Ugh, yeah, this patch unreadable :(

I sort of skimmed the resulting java file, but if there was anything in particular you'd like me to look at let me know.

I still see a few overuses of switch statements, for example when looking for whitespace and in generateImpliedEndTagsExceptFor and generateImpliedEndTags. But we can always deal with that over time.
Attachment #440199 - Flags: review?(jonas) → review+
(In reply to comment #45)
> (From update of attachment 440481 [details] [diff] [review])
> >+nsHtml5StreamParser::FlushTreeOpsAndDisarmTimer()
> >+{
> 
> Please add an assertion to check that this is called on the right (main?)
> thread. Those assertions were awesome to have when reviewing and trying to
> grokk the code.

Added NS_ASSERTION(IsParserThread(), "Wrong thread!");

> > nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch()
> > {
> >   NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
> >   mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
> >-  Uninterrupt();
> >   nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this);
> >   if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
> >     NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation");
> >   }
> > }
> 
> Is grabbing the mutex still needed?

No. Removed.

> >+class nsHtml5TimerKungFu : public nsRunnable
> >+{
> >+private:
> >+  nsHtml5RefPtr<nsHtml5StreamParser> mStreamParser;
> >+public:
> >+  nsHtml5TimerKungFu(nsHtml5StreamParser* aStreamParser)
> >+    : mStreamParser(aStreamParser)
> >+  {}
> >+  NS_IMETHODIMP Run()
> >+  {
> >+    if (mStreamParser->mFlushTimer) {
> >+      mStreamParser->mFlushTimer->Cancel();
> >+      mStreamParser->mFlushTimer = nsnull;
> >+    }
> >+    return NS_OK;
> >+  }
> >+};
> 
> Ok, now I see what is happening. The important part here isn't that the cancel
> call is happening on any particular thread. But that you addref the stream
> parser on the main thread, and then post a message to the parser thread, and
> then post a message back again to the main thread and only then release the
> stream parser.
> 
> This ensures that the stream parser is being kept alive until after we're done
> processing the timer, or cancelling it. And the nsHtml5RefPtr is what's making
> it all happen.
> 
> Is this correct?

Correct.

> And it seems like the same setup is being used to keep the stream parser alive
> when sending necko data to it?

Yes.

> Please document this thoroughly. Also document the fact that while we're
> calling various functions on the stream parser on the parser thread, it's
> critical that it's never addreffed or released there.
> 
> r=me with that.

Thanks.

I'll land the patch with these changes tomorrow so that I can watch the tinderbox.
(In reply to comment #46)
> (From update of attachment 440199 [details] [diff] [review])
> Ugh, yeah, this patch unreadable :(
> 
> I sort of skimmed the resulting java file, but if there was anything in
> particular you'd like me to look at let me know.

There was a bug in the unreviewable part. I've uploaded a corrected patch here, but it's unreviewable, too. And the interdiff is unreviewable, so I just say what changed since the previous patch:

The big switch in startTag was in the |else| branch of |if (inForeign)|. In this patch, there's no |else| branch and the big switch is just after the |if|.

Sorry about bothering you a second time with this patch. This didn't get caught earlier, because we don't have full html5lib tree builder test mochitest coverage (bug 559023) and because when I ran the full test suite with the Java version of the parser, I failed to notice that the test harness got stuck in an infinite loop instead of stopping running.
Attachment #440199 - Attachment is obsolete: true
Attachment #440200 - Attachment is obsolete: true
Attachment #440201 - Attachment is obsolete: true
Attachment #440769 - Flags: review?(jonas)
This patch (on top of attachment 440481 [details] [diff] [review]) should make the case where the cycle collector unlinks nsHtml5Parser before nsHtml5StreamParser safer.
Attachment #440978 - Flags: review?(jonas)
Attachment #440769 - Flags: review?(jonas) → review+
Attachment #440978 - Flags: review?(jonas) → review+
Thanks for the reviews. Pushed two patches:
http://hg.mozilla.org/mozilla-central/rev/9732d2b16662
http://hg.mozilla.org/mozilla-central/rev/a1c53e4881ca

The "vol 6" is the only one unlanded patch on this bug. I didn't land it in the hopes that I might get r+ on bug 548232 soon, which would avoid merge conflicts from reordering the "vol 6" patch and the patch from bug 548232.
Pushed:
http://hg.mozilla.org/mozilla-central/rev/cdacbfe66545
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: