Closed Bug 1410848 Opened 7 years ago Closed 7 years ago

Upgrade all nsHtml5TreeOpExecutor::mFlushState checks to actually enforce correct state in all types of builds

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

(Keywords: sec-audit, Whiteboard: [adv-main58-][post-critsmash-triage])

Attachments

(1 file, 2 obsolete files)

Use-after-free avoidance on nsHtml5TreeOpExecutor::mOpQueue is based on the checks of nsHtml5TreeOpExecutor::mFlushState doing the right thing.

The problem is that some of those checks are mere NS_ASSERTION or NS_PRECONDITION. Those don't actually enforce anything.

We should upgrade NS_ASSERTION/NS_PRECONDITION checks of mFlushState to MOZ_RELEASE_ASSERT. In some cases, we might be able to avoid MOZ_CRASHing by making them non-crashing early returns instead.

Bug 1364399 is the closest we have to steps to reproduce a violation of these constraints.
But MOZ_RELEASE_ASSERT just makes us crash safely. We should ensure the assertion never ever fires.
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
(In reply to Olli Pettay [:smaug] from comment #1)
> But MOZ_RELEASE_ASSERT just makes us crash safely. We should ensure the
> assertion never ever fires.

Here's how it's supposed to work:
 * RunFlushLoop() returns right way if another copy is already on the stack, thanks to mRunFlushLoopOnStack.
 * FlushDocumentWrite() returns after flushing speculative loads if there is already another flush loop executing ("if (mFlushState != eNotFlushing)").
 * Ops can be added to the queue via:
   * When moving speculations over from the parser thread only after the last <script> has finished after all document.write layers have finished. (This is indeed rather hand-wavy.)
   * From within RunFlushLoop() before entering the actual flush loop.
   * From document.write(), in a manner which
     - Happens right before FlushDocumentWrite() is called.
     - Is not supposed to happen during RunFlushLoop() except as a tail call (after the mOpQueue processing) when executing a <script>.

The interaction of doc update pauses with the last point is the most questionable thing.
The screenshot orange is likely a known problem from Tuesday. Trying again (this time browser-chrome is known orange):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=21849249e91435ae97da1f38e8ddd57c2b90791e
Attached patch Use RAII and MOZ_RELEASE_ASSERT (obsolete) — Splinter Review
The assertions are upgraded. RAII is used to increase confidence in mFlushState transitions not getting forgotten. C++11 loops are used to enhance code readability.
Attachment #8922258 - Flags: review?(bugs)
ranged-for loop are less safe than old style loops.
Though, it shouldn't cause anymore security issues, since we'll just crash if the array is modified
https://bugzilla.mozilla.org/show_bug.cgi?id=1299489
Comment on attachment 8922258 [details] [diff] [review]
Use RAII and MOZ_RELEASE_ASSERT

>+class MOZ_RAII nsHtml5AutoFlush final
>+{
>+private:
>+  RefPtr<nsHtml5TreeOpExecutor> mExecutor;
>+  size_t mOpsToRemove;
>+
>+public:
>+  explicit nsHtml5AutoFlush(nsHtml5TreeOpExecutor* aExecutor)
>+    : mExecutor(aExecutor)
>+    , mOpsToRemove(aExecutor->OpQueueLength())
I don't understand this mOpsToRemove part.
This looks like as if we were expecting the op queue to get more entries while nsHtml5AutoFlush is on stack.
But modifying the opqueue array while iterating is not safe.

I think I'm missing something here. What guarantees the op queue isn't modified while it is being iterated?
Please explain and ask review again.
Attachment #8922258 - Flags: review?(bugs)
Keywords: sec-audit
Comment on attachment 8922258 [details] [diff] [review]
Use RAII and MOZ_RELEASE_ASSERT

(In reply to Olli Pettay [:smaug] from comment #9)
> Comment on attachment 8922258 [details] [diff] [review]
> Use RAII and MOZ_RELEASE_ASSERT
> 
> >+class MOZ_RAII nsHtml5AutoFlush final
> >+{
> >+private:
> >+  RefPtr<nsHtml5TreeOpExecutor> mExecutor;
> >+  size_t mOpsToRemove;
> >+
> >+public:
> >+  explicit nsHtml5AutoFlush(nsHtml5TreeOpExecutor* aExecutor)
> >+    : mExecutor(aExecutor)
> >+    , mOpsToRemove(aExecutor->OpQueueLength())
> I don't understand this mOpsToRemove part.

By default, the destructor of nsHtml5AutoFlush removes all ops in mOpQueue, but when we exit the flush loop early due to the having taken too much wall-clock time in the loop, we need to remove only the ops that we had time to loop over. That's why before exiting the loop in that case, mOpsToRemove is set to a smaller number.

The number of ops in the queue can't change between the constructor and destructor of nsHtml5AutoFlush, so the default case is handled by initializing mOpsToRemove to the number of ops. mOpsToRemove exists so that the time-based interrupt case can be handled with RAII. (If the time-based interrupt didn't exist, mOpsToRemove wouldn't exist and the destructor would call ClearOpQueue().)

> This looks like as if we were expecting the op queue to get more entries
> while nsHtml5AutoFlush is on stack.

No, there's an assertion checking that if mOpsToRemove is modified after the construction of nsHtml5AutoFlush, it is set to a smaller number than it was initialized to.

> But modifying the opqueue array while iterating is not safe.

Correct.

> I think I'm missing something here. What guarantees the op queue isn't
> modified while it is being iterated?

All modifications MOZ_RELEASE_ASSERT on mFlushState == eNotFlushing.

mFlushState is always set to != eNotFlushing while iterating over the queue (using nsHtml5AutoFlush around iteration).
Attachment #8922258 - Flags: review?(bugs)
MOZ_*_ASSERT don't guarantee anything. I mean, sure we crash safely, but they should never fire.
If one would remove the assertions, is there any way mOpQueue might change while it is being iterated?
Group: core-security → dom-core-security
(In reply to Olli Pettay [:smaug] from comment #11)
> MOZ_*_ASSERT don't guarantee anything.

MOZ_RELEASE_ASSERT before each modification guarantees that if there's a way to try to cause a bad modification, it won't be a security bug.

> I mean, sure we crash safely, but they should never fire.
> If one would remove the assertions, is there any way mOpQueue might change
> while it is being iterated?

Per comment 3, there is not supposed to be, and there isn't anything in our test suite that triggers the MOZ_RELEASE_ASSERTs added here.

As for the last remark in comment 3, here's how the document.write() part works:
The nsHtml5AutoFlush in this patch sets mFlushState to not eNotFlushing. nsHTMLDocument, upon document.write(), calls nsHtml5Parser::IsInsertionPointDefined(), which return false when mFlushState is not eNotFlushing. If the insertion point isn't defined, document.write() blows away the document instead of adding ops to the op queue. (There are kungfuDeathGrips to prevent the nsHtml5TreeOpExecutor getting deleted immediately during the mOpQueue iteration when the document is blown away.)

However, bug 1364399 indicates that some edge case existed when that bug was filed despite the reasoning in comment 3 and the paragraph above. I suggest we land this patch and fix the edge case if the MOZ_RELEASE_ASSERTs ever fire and give us steps to reproduce.
Comment on attachment 8922258 [details] [diff] [review]
Use RAII and MOZ_RELEASE_ASSERT

Going to rebase now that bug 1378079 has landed.
Attachment #8922258 - Attachment is obsolete: true
Attachment #8922258 - Flags: review?(bugs)
Attached patch Use MOZ_RELEASE_ASSERT and RAII (obsolete) — Splinter Review
Attachment #8922691 - Flags: review?(bugs)
FWIW, I'm not worried about document.write, but the case when data comes from network while we spin event loop inside some parser operation (because of running scripts there)
I think http://searchfox.org/mozilla-central/rev/40e8eb46609dcb8780764774ec550afff1eed3a5/parser/html/nsHtml5TreeOpExecutor.cpp#829 is the code I'm worried.
What guarantees that never runs when we're processing op queue?
And saying that there is an assertion isn't the answer ;)
Flags: needinfo?(hsivonen)
The worry with just adding release asserts I have is that the possible security critical issue there is happens only if the current array allocation isn't enough for taking more elements to the array, so that the array capacity needs to be increased. If that that happens very very rarely, but we do
still append to the array somewhat often when we shouldn't, we end up causing quite a few crashes (technically even in cases which aren't unsafe).

So I think we really should try to find the reason for bug 1364399, and also
figure out if the parser thread can push more ops to the queue while main thread is spinning the event loop nestedly.
(In reply to Olli Pettay [:smaug] from comment #17)
> I think
> http://searchfox.org/mozilla-central/rev/
> 40e8eb46609dcb8780764774ec550afff1eed3a5/parser/html/nsHtml5TreeOpExecutor.
> cpp#829 is the code I'm worried.
> What guarantees that never runs when we're processing op queue?

This runs in two cases, AFAICT,
1) document.write().
2) When scripts are done and the main thread checks if there are pending speculations waiting in nsHtml5StreamParser.

The document.write() case is covered in comment 12.

Case 2 happens from within nsHtml5TreeOpExecutor::RunFlushLoop() after it has checked that there isn't another copy of itself on the stack and before it has entered the loop that reads the op queue (and before it has instantiated nsHtml5AutoFlush).

The case of FlushDocumentWrite() being called in response to document.write from within an inline script is OK, because in that case nsHtml5TreeOpExecutor::RunFlushLoop() is on stack and can't be re-entered per the above paragraph.

However, when FlushDocumentWrite() is being called in response to document.write() form within an external script and there's a nested event loop, we may have a bug! I guess I'll need to pay more attention to that case.

(In reply to Olli Pettay [:smaug] from comment #19)
> So I think we really should try to find the reason for bug 1364399,

I agree, but can we still land this on its own? Bug 1364399 seems to be related to an async script blowing away the document using document.write(), so even if the bug persists, sites shouldn't be triggering it legitimately, so we shouldn't expect this patch to cause a lot of MOZ_CRASHes in the wild for legitimate sites.

> and also
> figure out if the parser thread can push more ops to the queue while main
> thread is spinning the event loop nestedly.

The parser thread can't push to mOpQueue directly. It pushes to mStage and nsHtml5TreeOpExecutor::RunFlushLoop() moves the ops from there to mOpQueue when nsHtml5TreeOpExecutor::RunFlushLoop() evidently isn't iterating over mOpQueue.
(In reply to Henri Sivonen (:hsivonen) from comment #20)
> However, when FlushDocumentWrite() is being called in response to
> document.write() form within an external script and there's a nested event
> loop, we may have a bug! I guess I'll need to pay more attention to that
> case.

This covered by
    if (mFlushState != eNotFlushing) {
      // XXX Can this happen? In case it can, let's avoid crashing.
      return;
    }
in nsHtml5TreeOpExecutor::RunFlushLoop().
Flags: needinfo?(hsivonen)
ok, thanks for clarifying. I'm still not quite sure if MOZ_RELEASE_ASSERT is the right thing to do, or should we just bail out and stop parsing or something if we're in a weird state.
But I'll review tomorrow.
(In reply to Henri Sivonen (:hsivonen) from comment #23)
> I'm guessing bug 1364399 relates to

data: URLs loading synchronously seems like a never-ending source of bugs.
Might be worth to land this stuff only once bug 1364399 is fixed.
Comment on attachment 8922691 [details] [diff] [review]
Use MOZ_RELEASE_ASSERT and RAII

>+      // autoFlush clears mOpQueue in its destructor unless
>+      // SetNumberOfOpsToRemove is called first, in which case only
>+      // some ops from the start of the queue are cleared.
>+      nsHtml5AutoFlush autoFlush(this);
>+
>+      auto first = mOpQueue.begin();
>+      auto last = mOpQueue.end() - 1;
>+      for (auto iter = first;; ++iter) {
(huh, I hate auto usage. Makes code so much harder to read when one needs to explicitly
 go and check what kind of type the auto actually is.
 But I guess we've agreed that in case of iteration and static_cast and such use of auto is ok)
Attachment #8922691 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] from comment #25)
> Might be worth to land this stuff only once bug 1364399 is fixed.

OK.

(In reply to Olli Pettay [:smaug] from comment #26)
> (huh, I hate auto usage. Makes code so much harder to read when one needs to
> explicitly
>  go and check what kind of type the auto actually is.
>  But I guess we've agreed that in case of iteration and static_cast and such
> use of auto is ok)

I guess I should revert the "auto" bits. Logically they don't really belong in this patch anyway.
Attachment #8922691 - Attachment is obsolete: true
Attachment #8923499 - Flags: review?(bugs)
Attachment #8923499 - Flags: review?(bugs) → review+
(In reply to Henri Sivonen (:hsivonen) from comment #23)
> I'm guessing bug 1364399 relates to step 5. at
> https://html.spec.whatwg.org/#dom-document-close

Step 5 is OK. (It's in nsHtml5Parser.cpp instead of being in nsHTMLDocument.cpp.)

The actual root cause is in bug 1364399 comment 11.
https://hg.mozilla.org/mozilla-central/rev/3660e2cf9fdb

I assume this is fine riding the 58 train and doesn't require backport anywhere.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Group: dom-core-security → core-security-release
Whiteboard: [adv-main58-]
Flags: qe-verify-
Whiteboard: [adv-main58-] → [adv-main58-][post-critsmash-triage]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: