Closed Bug 1366812 Opened 2 years ago Closed 2 years ago

Crash in mozilla::dom::TimeoutManager::MaybeStartThrottleTrackingTimout

Categories

(Core :: DOM: Core & HTML, defect, critical)

55 Branch
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr45 --- unaffected
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed

People

(Reporter: marcia, Assigned: ehsan)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-07eb7eb6-fdff-4f47-935b-fbcd80170522.
=============================================================

Seen while looking at crash stats - these crashes started using 20170522030207: http://bit.ly/2qNcbVX

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=5b74bbf20e803e299790d266fc6ebf5d53b7a1b7&tochange=9851fcb0bf4d855c36729d7de19f0fa5c9f69776

Maybe Bug 1355311? ni on :farre
Flags: needinfo?(afarre)
It is. I'm not at a computer at the moment, so I can't back this out myself. Hello appreciated.
Flags: needinfo?(afarre)
[Tracking Requested - why for this release]: Regression

ni per bug #1366817 comment #0
Flags: needinfo?(bkelly)
OS: Windows 10 → All
Hardware: x86 → All
Andreas has asked for a backout of bug 1355311.
Depends on: 1355311
Flags: needinfo?(bkelly)
Fixed by the backout of bug 1355311.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Crash Signature: [@ mozilla::dom::TimeoutManager::MaybeStartThrottleTrackingTimout] → [@ mozilla::dom::TimeoutManager::MaybeStartThrottleTrackingTimout] [@ libxul.so@0x11f1cbf | libxul.so@0x2d9f446 | libxul.so@0x2d9f11e | libxul.so@0x2d9df67 | libxul.so@0x369a924 | libxul.so@0x373a44c | libxul.so@0x303eb71 | libxul.so@0x2c21b49 | libxul.s…
Crash Signature: libxul.so@0x2c1fea2 | libxul.so@0x2c20f24 | libxul.so@0x2b68b7f | libxul.so@...] → libxul.so@0x2c1fea2 | libxul.so@0x2c20f24 | libxul.so@0x2b68b7f | libxul.so@...] [@ libxul.so@0x11f1cbf | nsGlobalWindow::PostHandleEvent ]
I got the crash again with 0524 nightly(with reland of bug 1355311).
report: bp-8fe4ef9c-a2e2-4d7a-8acc-7ce7d0170524
Flags: needinfo?(ehsan)
I think I know what causes this.  This isn't a real bug, just a bad diagnostic assert misfiring.

TimeoutManager's constructor runs from nsPIDOMWindow's constructor here: <https://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/dom/base/nsGlobalWindow.cpp#1015> which runs before nsGlobalWindow's constructor.  mThrottleTrackingTimeouts is initialized here <https://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/dom/base/TimeoutManager.cpp#309>.  For the first window object, gTrackingTimeoutThrottlingDelay is zero initialized, and then initialized in TimeoutManager::Initialize() here <https://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/dom/base/TimeoutManager.cpp#346>.  But that only runs from nsGlobalWindows ctor <https://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/dom/base/nsGlobalWindow.cpp#1648> which is after our mThrottleTrackingTimeouts has been initialized.  This way, by the time we get to the diagnostic assert that is firing here, mThrottleTrackingTimeouts will be true and we will crash.
Assignee: nobody → ehsan
Flags: needinfo?(ehsan)
This way we won't rely on code which may have potentially not run yet.
Attachment #8870917 - Flags: review?(bugs)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #8870917 - Flags: review?(bugs) → review+
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8547e166d6d0
Initialize TimeoutManager::mThrottleTrackingTimeouts to false; r=smaug
https://hg.mozilla.org/mozilla-central/rev/8547e166d6d0
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Hmm...

While I think my fix may have fixed some crashes, I *just* reproduced this crash bp-b6a6c7c0-f27f-494e-9a1e-212bf0170525 which happened with the following rough STR:

1. Go to https://blog.mozilla.org/javascript/feed/ (which loads about:feeds)
2. Click on one of the links to go to a different page.
3. Wait for a long time...  Switch to other tabs, do stuff.
4. Switch back to this tab, and then press back.  Boom!  Crash happens.

And then as I was typing these comments:

5. (I had restored the tab after the crash so I am looking at the feed page right now)
6. Press forward.  Crash happens again! bp-e57c549b-6d54-4863-ba7a-1095a0170525

I also reproduced by creating a long chain of history and going back and forward in it.  I think what needs to happen is for the page to get evicted from bf-cache and get reloaded when we navigate to it again...  And perhaps about:feeds is special somehow!

Andreas, do you mind seeing if you can reproduce with this STR please?
Flags: needinfo?(afarre)
Assignee: ehsan → nobody
There are 38 crashes with signature "libxul.so@0x11f7c69 | nsGlobalWindow::PostHandleEvent", all in nightly 55 with buildid 20170524100215. The moz crash reason is always "MOZ_RELEASE_ASSERT(!mThrottleTrackingTimeouts)".
Status: RESOLVED → REOPENED
Crash Signature: libxul.so@0x2c1fea2 | libxul.so@0x2c20f24 | libxul.so@0x2b68b7f | libxul.so@...] [@ libxul.so@0x11f1cbf | nsGlobalWindow::PostHandleEvent ] → libxul.so@0x2c1fea2 | libxul.so@0x2c20f24 | libxul.so@0x2b68b7f | libxul.so@...] [@ libxul.so@0x11f1cbf | nsGlobalWindow::PostHandleEvent ] [@ libxul.so@0x11f7c69 | nsGlobalWindow::PostHandleEvent ]
Resolution: FIXED → ---
Crash Signature: libxul.so@0x2c1fea2 | libxul.so@0x2c20f24 | libxul.so@0x2b68b7f | libxul.so@...] [@ libxul.so@0x11f1cbf | nsGlobalWindow::PostHandleEvent ] [@ libxul.so@0x11f7c69 | nsGlobalWindow::PostHandleEvent ] → libxul.so@0x2c1fea2 | libxul.so@0x2c20f24 | libxul.so@0x2b68b7f | libxul.so@...] [@ libxul.so@0x11f1cbf | nsGlobalWindow::PostHandleEvent ] [@ libxul.so@0x11f7c69 | nsGlobalWindow::PostHandleEvent ] [@ libxul.so@0xc1d36f | nsGlobalWindow::PostHandleEv…
Crash Signature: nsGlobalWindow::PostHandleEvent ] → nsGlobalWindow::PostHandleEvent ] [@ libxul.so@0x11d6ee9 | nsGlobalWindow::PostHandleEvent ]
The signature "mozilla::dom::TimeoutManager::MaybeStartThrottleTrackingTimout" is #2 in content topcrashes and #15 in browser topcrashes.
Keywords: topcrash
OK, this is pretty easy to reproduce actually, and I have it recorded under rr and starting to debug it now.  I suspect that this is a different bug biting us because of this:

[Child 29989] WARNING: NS_ENSURE_TRUE(mDocViewer) failed: file /home/ehsan/moz/src.1347035/layout/base/nsDocumentViewer.cpp, line 3807
[Child 29989] ###!!! ASSERTION: Bad readystate: 'mDocument->IsXULDocument() || mDocument->GetReadyStateEnum() == nsIDocument::READYSTATE_INTERACTIVE || (mDocument->GetReadyStateEnum() == nsIDocument::READYSTATE_UNINITIALIZED && NS_IsAboutBlank(mDocument->GetDocumentURI()))', file /home/ehsan/moz/src.1347035/layout/base/nsDocumentViewer.cpp, line 1045
Assertion failure: !mThrottleTrackingTimeouts, at /home/ehsan/moz/src.1347035/dom/base/TimeoutManager.cpp:1586

It seems that by the time we get to the TimeoutManager code the state of things is severely messed up, but the problem is elsewhere.
Assignee: nobody → ehsan
Flags: needinfo?(afarre)
The issue is quite simple here.  In order to explain, let me first explain a simple STR:

1. Go to https://blog.mozilla.org/javascript/feed/.
2. Click on the first link.
3. Wait for a few minutes.
4. Click Back.
5. Click Forward.  The assertion fires.

What happens is we first run nsGlobalWindow::Resume(), and mThrottleTrackingTimeouts is true, and due to this check <https://searchfox.org/mozilla-central/rev/a14524a72de6f4ff738a5e784970f0730cea03d8/dom/base/TimeoutManager.cpp#1405> MaybeStartThrottleTrackingTimout() isn't called and things are good.  But the load event will still fire even though mIsDocumentLoaded is true, and the TimeoutManager code isn't accounting for this possibility.  So we end up calling MaybeStartThrottleTrackingTimout() when we shouldn't have, and we assert fatally as a result.  :-(

(And after this happens rr gets into a super confused state when nothing including the assertion stack even look sane any more which wasted a few hours for me when investigating this otherwise simple bug!)
Requesting review from cpearce in order to try to get this into tomorrow's Nightly with a post-landing review from Andreas with the pledge of addressing his review comments after I land this fix.
Attachment #8871587 - Flags: review?(cpearce)
Attachment #8871587 - Flags: review?(afarre)
Comment on attachment 8871587 [details] [diff] [review]
Account for the fact that the load event may fire even though mIsDocumentLoaded is true in TimeoutManager

Review of attachment 8871587 [details] [diff] [review]:
-----------------------------------------------------------------

Uh, yeah, that'll stop the diagnostic assert... I trust you'll follow up and fix the underlying issue!
Attachment #8871587 - Flags: review?(cpearce) → review+
I think this is actually all there is to it as far as the fix goes, FWIW!  This was just all a diagnostic assert misfiring all along.

We should be really careful adding these asserts.  I think we have started to use them way too liberally and have forgotten the possibility of causing top crashers like this in case of simple benign mistakes, etc.
I couldn't find an active sheriff on IRC, but this fix is trivial and I think it doesn't make sense to land this on inbound and potentially miss tomorrow's Nightly.  I'm going to land this directly on central.
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/5d6fe59a9a5d
Account for the fact that the load event may fire even though mIsDocumentLoaded is true in TimeoutManager; r=cpearce a=topcrasher
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Ehsan, is the load event firing twice here?  Or because we are suspended/resumed before the load event happened in the first place?

I don't think we are using diagnostic asserts too liberally.  Clearly there was a misunderstanding of the invariants involved.  This helped us understand the code better and so we can avoid other bugs hitting release.
Flags: needinfo?(ehsan)
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #21)
> Ehsan, is the load event firing twice here?  Or because we are
> suspended/resumed before the load event happened in the first place?

The latter.

> I don't think we are using diagnostic asserts too liberally.  Clearly there
> was a misunderstanding of the invariants involved.  This helped us
> understand the code better and so we can avoid other bugs hitting release.

The point is, there was no real bug here, besides a bogus assertion crashing the user's browser.  Had we not had this diagnostic assert, the code would successfully schedule a new nsITimer, which would reset the mThrottleTrackingTimeouts member to true.  That's needless work, for sure, but it's not a bug, and has no adverse side effects beyond wasting some CPU cycles.

I should have resisted when you made me make this a diagnostic assert in bug 1338691 comment 2.  In the future I'm going to ask for a clear explanation as to why a debug only assert firing would be severe enough to justify crashing the user's browser, causing top crash issues, and affecting the Nightly quality especially as we're moving to the model where Nightly stability is being treated as more of a goal by release management.
Flags: needinfo?(ehsan)
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #22)
> The point is, there was no real bug here, besides a bogus assertion crashing
> the user's browser.  Had we not had this diagnostic assert, the code would
> successfully schedule a new nsITimer, which would reset the
> mThrottleTrackingTimeouts member to true.  That's needless work, for sure,
> but it's not a bug, and has no adverse side effects beyond wasting some CPU
> cycles.
> 
> I should have resisted when you made me make this a diagnostic assert in bug
> 1338691 comment 2.  In the future I'm going to ask for a clear explanation
> as to why a debug only assert firing would be severe enough to justify
> crashing the user's browser, causing top crash issues, and affecting the
> Nightly quality especially as we're moving to the model where Nightly
> stability is being treated as more of a goal by release management.

But the MOZ_ASSERT() still would have been *wrong*.  Our understanding was *wrong*.  Without the diagnostic assert we would not have discovered this.

This is important because someone reading the code infers information from the assertion.

And IMO nightly is exactly the place to discover these things.  We did not ship this to beta or release.

If you want to argue for depending less on invariants like this and just being more flexible about state, that is fine.  But to say we should just hear-no-evil-see-no-evil about assertions doesn't make sense to me.
Comment on attachment 8871587 [details] [diff] [review]
Account for the fact that the load event may fire even though mIsDocumentLoaded is true in TimeoutManager

r+ing this even tough it has already landed.
Attachment #8871587 - Flags: review?(afarre) → review+
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #23)
> But the MOZ_ASSERT() still would have been *wrong*.  Our understanding was
> *wrong*.  Without the diagnostic assert we would not have discovered this.
> 
> This is important because someone reading the code infers information from
> the assertion.
> 
> And IMO nightly is exactly the place to discover these things.  We did not
> ship this to beta or release.
> 
> If you want to argue for depending less on invariants like this and just
> being more flexible about state, that is fine.  But to say we should just
> hear-no-evil-see-no-evil about assertions doesn't make sense to me.

Just to clarify, it is the former thing I was arguing for, certainly not the latter!  IOW, if we have an assertion about a state, promoting it to be a diagnostic assert only makes sense if we think that the state that it is checking is so vital that its check not passing is worth crashing the user's browser and ruining all of the other testing that we would like to achieve out of the user's Nightly session, since each one of these assertions isn't the only testing experiment that lives in Nightly in isolation, we should balance the potential risk of one of them fatally terminating the chances of running the rest against the reward of such a fatal termination uncovering a bug.  In this case, the state being asserted wasn't really all that important to begin with...  (Also, to be fair, I am talking now with hindsight on my side, these things are sometimes not super obvious during code development/review.  But we should learn from experience.  I've seen this in a few diagnostic asserts that I have put in or r+ed at least!)
If you're not confident an assertion is true, you probably shouldn't put it in at all.  You should make it a dynamic check instead.

If you think an assertion should be true, then you should make it a diagnostic so that it actually gets coverage.  People just don't run with DEBUG builds.

Anything else and we're just fooling ourselves and confusing people who read the code later.  Soft assertions are a net negative IMO.

And you act like we shipped this on release.  We did not.  If we can't take some risks on nightly then we are substantially limited in our development.

Anyway, I suspect we won't agree here.
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.