Closed Bug 1272942 Opened 4 years ago Closed 3 years ago

Intermittent browser_aboutCertError.js | Uncaught exception - TypeError: learnMoreLink is null, exceptionButton is null, advancedButton is null, Argument 1 of Window.getComputedStyle is not an object

Categories

(Firefox :: General, defect, P3)

Tracking

()

RESOLVED FIXED
Iteration:
52.1 - Oct 3
Tracking Status
firefox49 --- wontfix
firefox50 --- affected
firefox51 --- affected

People

(Reporter: philor, Assigned: johannh)

References

Details

(Keywords: intermittent-failure, Whiteboard: [fxprivacy])

Summary: Intermittent browser_aboutCertError.js | Uncaught exception - TypeError: learnMoreLink is null → Intermittent browser_aboutCertError.js | Uncaught exception - TypeError: learnMoreLink is null, exceptionButton is null, advancedButton is null
Duplicate of this bug: 1293829
Duplicate of this bug: 1291489
I've been trying to narrow this one down. From my latest attempts on try:

https://treeherder.mozilla.org/#/jobs?repo=try&author=rwood@mozilla.com&fromchange=9de8f271e18b35c8d1de9b635a44e2c82947ba9c&tochange=28a8dac4ce1de786646df2802e963f115fe0fd44

2ea3d51ba1bb from Friday July 29th ==> intermittent not seen, at least in those 50 retriggers
e5859dfe0bcb from Saturday July 30th ==> reproduced the failure

Going to do 50 more retriggers on 2ea3d51ba1bb to see if it is consistent or not.
> Going to do 50 more retriggers on 2ea3d51ba1bb to see if it is consistent or
> not.

That looks consistent, so *maybe* this intermittent first occurred after one of these two merges/pushes:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&fromchange=c3565c8b1cdb575db1c80c7791984a6490598b84&tochange=e5859dfe0bcbd40f4e33f4a633f73ea3473a7849

However, I cannot be certain, and after reading the latest comments in the duplicate Bug 1291489 I'm not sure it is worth trying to pinpoint a specific change (in case it is a race condition).
I vote for disabling at this point.
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Brian, I see you wrote this test originally. It's plagued with a variety of issues at the moment [1], can you please take a look or suggest someone who can as it's on the path to being disabled otherwise.

FWIW, the timing of when this test got really flaky seems to correspond decently well to when bug 712612 landed.

[1] https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure%2C%20&keywords_type=allwords&list_id=13197180&short_desc=browser_aboutCertError.js&resolution=---&query_format=advanced&short_desc_type=allwordssubstr
Flags: needinfo?(bgrinstead)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)
> Brian, I see you wrote this test originally. It's plagued with a variety of
> issues at the moment [1], can you please take a look or suggest someone who
> can as it's on the path to being disabled otherwise.
> 
> FWIW, the timing of when this test got really flaky seems to correspond
> decently well to when bug 712612 landed.
> 
> [1]
> https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure%2C%20&keywords_type=allwords&list_id=13197180&short_desc=browser_aboutCertError.js&resolution=---&query_format=advanced&short_desc_type=allwordssubstr

I haven't worked on this in quite some time and don't know why it started failing. But I created two try pushes to help track it down:

1) try push with extra logging to see state of DOM before failure: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea1247a09f3f.
2) try push that switches from DOMContentLoaded to the custom event "AboutNetErrorLoad".  Since this seems to fail on different buttons at different times it makes me think that waitForCertErrorLoad is resolving too soon sometimes.  From what I remember, load events are hard to detect on about pages and that's why it's trying to wait for DOMContentLoaded.  https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0d1ccdc5042
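
The idea behind the second push can be sketched roughly as follows. This is a minimal illustration and not the actual test code: `waitForEvent` and `fakePage` are hypothetical names, and a plain `EventTarget` stands in for the about:certerror document. Only the event names "DOMContentLoaded" and "AboutNetErrorLoad" come from the comment above.

```javascript
// Hypothetical helper, not the real waitForCertErrorLoad from the test.
// Resolves the first time `eventName` fires on `target`.
function waitForEvent(target, eventName) {
  return new Promise(resolve => {
    target.addEventListener(eventName, resolve, { once: true });
  });
}

// Stand-in for the error page document (assumption for illustration).
const fakePage = new EventTarget();
const loaded = waitForEvent(fakePage, "AboutNetErrorLoad");

// DOMContentLoaded fires first; it must not resolve the promise,
// because the page may not have finished setting up its buttons yet.
fakePage.dispatchEvent(new Event("DOMContentLoaded"));

// The page signals that its own setup has finished.
fakePage.dispatchEvent(new Event("AboutNetErrorLoad"));

loaded.then(event => console.log("resolved by", event.type));
```

If the hypothesis is right, waiting on the page's own signal instead of DOMContentLoaded would mean the buttons exist by the time the test looks them up.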

Let's see if anything comes out of those pushes.  If not, I don't think I'm the right person to decide about disabling all or part of it - we should ask Panos or Johann about that.
Flags: needinfo?(bgrinstead)
Whiteboard: [fxprivacy][triage]
See Also: → 1271202
Whiteboard: [fxprivacy][triage] → [fxprivacy]
Looking into this I just wanted to remark that the failures are happening because Firefox is simply intermittently crashing when loading CertError pages. It'd be interesting to know if this is happening in production or just in our tests.

###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x2000A,name=???) Route error: message sent to unknown actor ID
Assertion failure: aCode == MsgDropped (Processing error in CompositorBridgeChild), at /builds/slave/m-cen-m64-00000000000000000000/build/src/gfx/layers/ipc/CompositorBridgeChild.cpp:1087

Also, why is it crashing only on about:certerror?

So afaict it doesn't really make sense to inspect DOM state or anything, the DOM isn't loaded at all.
Duplicate of this bug: 1269012
Maybe Kan-Ru can provide some insight?
Flags: needinfo?(kchen)
(In reply to Johann Hofmann [:johannh] from comment #24)
> Looking into this I just wanted to remark that the failures are happening
> because Firefox is simply intermittently crashing when loading CertError
> pages. It'd be interesting to know if this is happening in production or
> just in our tests.
> 
> ###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x2000A,name=???) Route error: message sent to unknown actor ID
> Assertion failure: aCode == MsgDropped (Processing error in CompositorBridgeChild), at /builds/slave/m-cen-m64-00000000000000000000/build/src/gfx/layers/ipc/CompositorBridgeChild.cpp:1087
> 
> Also, why is it crashing only on about:certerror?
> 
> So afaict it doesn't really make sense to inspect DOM state or anything, the
> DOM isn't loaded at all.

Where did you find this error message? I can't find this particular msgtype=0x2000A in the test logs I sampled. Instead I find some other very different crashes in the logs but none of them seem to be directly related to the test failure.
Flags: needinfo?(kchen)
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #27)
> Where did you find this error message? I can't find this particular
> msgtype=0x2000A in the test logs I sampled. Instead I find some other very
> different crashes in the logs but none of them seem to be directly related
> to the test failure.

Mmh, I didn't notice that the msgtype is 0x2000A on my machine (I ran an artifact build on OSX) instead of 0x2000C, which appears in all the logs I looked at, e.g. here:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-aurora&job_id=3450307#L2937

The test failures are happening because we're trying to use DOM elements that don't exist on the crashed page, as far as I see it. The screenshots also look like it didn't load successfully.
Assignee: nobody → jhofmann
Status: NEW → ASSIGNED
Iteration: --- → 51.3 - Sep 12
Kan-Ru, sorry, I'm not really certain what your resolution was here. Are you sure that these crashes (https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=11533424#L1824) are not related to this problem? They happen in every log I looked at and I can reproduce them locally. Could you explain what's happening there if it's not causing the failures?

Thanks!
Flags: needinfo?(kchen)
msgtype 0x2000A is mozilla::layers::PAPZ::Msg_NotifyAPZStateChange so I think this is another instance of the PAPZ shutdown error. Kats?
Flags: needinfo?(kchen) → needinfo?(bugmail)
I reproduced in rr and it looks like mCanSend should be getting set to false in RemoteContentController::Destroy(). When that function calls SendDestroy, it sends a destroy message to APZChild, which promptly deletes itself while processing that message. So the parent side shouldn't be sending any more messages to the child after it has called SendDestroy(). I'll write a patch and test it, but I'll put it on a new bug because I'm not sure if there are other issues that are contributing to this intermittent failure.
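
The guard described above can be modelled in a few lines. This is a JavaScript sketch of the pattern only, not the C++ patch: `ParentController`, `send`, and the array used as a channel are hypothetical, and only `canSend` (for mCanSend) and the "Destroy" message (for SendDestroy) mirror names from the comment.

```javascript
// Hypothetical JavaScript model of the parent-side controller.
// Assumption: every outgoing message is routed through send().
class ParentController {
  constructor(channel) {
    this.channel = channel; // plain array standing in for the IPC channel
    this.canSend = true;    // mirrors the mCanSend flag
  }

  // Drop the message if the child may already have deleted itself.
  send(name) {
    if (!this.canSend) {
      return false;
    }
    this.channel.push(name);
    return true;
  }

  // Send the final Destroy message, then refuse all later sends.
  destroy() {
    if (!this.canSend) {
      return;
    }
    this.send("Destroy");
    this.canSend = false;
  }
}

const channel = [];
const controller = new ParentController(channel);
controller.send("NotifyAPZStateChange");
controller.destroy();

// This late notification is dropped instead of reaching a dead actor.
const sentAfterDestroy = controller.send("NotifyAPZStateChange");
```

Without the flag flip in `destroy()`, the last call would still push a message for an actor that no longer exists, which is the "message sent to unknown actor ID" situation from the logs.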
Flags: needinfo?(bugmail)
Iteration: 51.3 - Sep 19 → 52.1 - Oct 3
Bug 1304457 (now merged to central) should fix the 0x2000A crashes. However this intermittent failure is still showing up (without the process crash) so whoever owns this test should continue investigating.
Thanks for fixing these! Too bad that it didn't seem to fix it, I'll try to investigate further...
Whiteboard: [fxprivacy] → [fxprivacy][triage]
So the only recent occurrences of this outside of Beta [0] (where the patch hasn't landed yet) are due to bug 569229, which seems to conveniently have gotten a patch ready after 6 years.

[0] https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1272942&tree=all&startday=2016-09-25&endday=2016-10-01

I'd say let's wait for bug 569229 to be fixed and keep an eye on the results then, we might be able to resolve this. One thing that seems certain is that fixing bug 1304457 drastically reduced the number of intermittents here.
Depends on: 569229
Duplicate of this bug: 1285681
Summary: Intermittent browser_aboutCertError.js | Uncaught exception - TypeError: learnMoreLink is null, exceptionButton is null, advancedButton is null → Intermittent browser_aboutCertError.js | Uncaught exception - TypeError: learnMoreLink is null, exceptionButton is null, advancedButton is null, Argument 1 of Window.getComputedStyle is not an object
Bug 1304457 is marked as not affecting Beta, so we're thinking that the failures there are all bug 569229?
(In reply to Ryan VanderMeulen [:RyanVM] from comment #39)
> Bug 1304457 is marked as not affecting Beta, so we're thinking that the
> failures there are all bug 569229?

Nope, that's bug 1304457. It's exactly the same error, so they must've identified a wrong bug as regressor.
(In reply to Johann Hofmann [:johannh] - partially unresponsive until 11/14 from comment #40)
> Nope, that's bug 1304457. It's exactly the same error, so they must've
> identified a wrong bug as regressor.

That's not necessarily the case. We need to map back from the msgtype to the protocol/message using the steps at [1]. On beta the set of protocols is different from central, so it might map back to something else. And the PAPZ RemoteContentController code in particular already has the fix applied on beta [2], so I'd be quite surprised if it turned up as the culprit.

[1] https://wiki.mozilla.org/Electrolysis/Debugging#Working_backwards_from_a_C.2B.2B_Message_to_its_IPDL_message
[2] http://hg.mozilla.org/releases/mozilla-beta/file/e8610794c397/gfx/layers/ipc/RemoteContentController.cpp#l287
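
As a rough illustration of the first step of that mapping, here is a small sketch that splits a msgtype into its two halves. It assumes the usual IPDL layout where the upper 16 bits identify the protocol and the lower 16 bits are the offset within that protocol's message enum; `decodeMsgType` is a hypothetical helper, and turning the numbers into names still requires the generated headers for the branch in question.

```javascript
// Hypothetical helper: split an IPDL msgtype into protocol id and
// per-protocol message offset (assumes a 16/16 bit layout).
function decodeMsgType(msgtype) {
  return {
    protocolId: msgtype >>> 16,
    messageOffset: msgtype & 0xffff,
  };
}

// The two values seen in this bug.
const local = decodeMsgType(0x2000a);
const fromLogs = decodeMsgType(0x2000c);

console.log(local, fromLogs);
```

Both values share a protocol id but differ in message offset. Since protocol numbering depends on the set of protocols in the build, the same number on beta and on central need not name the same message.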
Whiteboard: [fxprivacy][triage] → [fxprivacy]
Closing this as successful due to robot inactivity \o/
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED