Crash in IPCError-browser | PURLClassifierParent::StartClassify

RESOLVED FIXED in Firefox 53

Status

()

defect
--
critical
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: ting, Assigned: Ehsan)

Tracking

({crash, regression, topcrash})

Trunk
mozilla53
Unspecified
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox50 unaffected, firefox51 unaffected, firefox52 unaffected, firefox53+ fixed)

Details

(crash signature)

Attachments

(2 attachments)

This bug was filed from the Socorro interface and is 
report bp-73286bd8-5009-4df8-a805-f50152161206.
=============================================================
#3 top crash of Nightly 20161204030210 on Windows, 53 crashes from 10 installations. It was first observed at build 20161203030204.
Crash Signature: [@ IPCError-browser | PURLClassifierParent::StartClassify ] → [@ IPCError-browser | PURLClassifierParent::StartClassify ] [@ mozilla::ipc::FatalError | mozilla::dom::indexedDB::PBackgroundIDBDatabaseChild::Write ]
Crash Signature: [@ IPCError-browser | PURLClassifierParent::StartClassify ] [@ mozilla::ipc::FatalError | mozilla::dom::indexedDB::PBackgroundIDBDatabaseChild::Write ] → [@ IPCError-browser | PURLClassifierParent::StartClassify ] [@ mozilla::ipc::FatalError | mozilla::dom::indexedDB::PBackgroundIDBDatabaseChild::Write ] [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::IProtocol::Fa…
75% of the crashes with the PURLClassifierParent::StartClassify signature also contain mozilla::ipc::MessageChannel::WaitForSyncNotifyWithA11yReentry(). Out of the remaining 50 crashes I spot checked, the stacks all looked junky. Does this mean we're waiting for some sync IPC a11y messages unrelated to the URL classfier? It would be good to improve the crash signature for this case if that's true. The other two signatures don't seem to involve a11y at all, so maybe that's a red herring.

The URL classifier IPC stuff seems to have been added recently, in bug 1318768, but that hit Nightly around 11/22 or 11/23, which doesn't explain this regression.
Maybe this is a regression from bug 1321868? That landed in the regression range, and it sounds like it could add uses of the URL classfier in the content process. Ehsan, could you look into that? Thanks.
Flags: needinfo?(ehsan)
Keywords: regression, topcrash
(In reply to Andrew McCreight [:mccr8] from comment #1)
> 75% of the crashes with the PURLClassifierParent::StartClassify signature
> also contain
> mozilla::ipc::MessageChannel::WaitForSyncNotifyWithA11yReentry(). Out of the
> remaining 50 crashes I spot checked, the stacks all looked junky. Does this
> mean we're waiting for some sync IPC a11y messages unrelated to the URL
> classfier? It would be good to improve the crash signature for this case if
> that's true. The other two signatures don't seem to involve a11y at all, so
> maybe that's a red herring.

I think the a11y thing appears in the stack because the ipc message is a sync message so we are waiting replys from the parent. I didn't load the mindumps but I think the parent probably killed the content process here:

http://searchfox.org/mozilla-central/rev/a8b5f53e7df90df655a0982e94087ee83290c22e/dom/ipc/URLClassifierParent.cpp#24
Duplicate of this bug: 1322399
Crash Signature: mozilla::ipc::IProtocol::FatalError | mozilla::dom::PContentBridgeChild::Write ] → mozilla::ipc::IProtocol::FatalError | mozilla::dom::PContentBridgeChild::Write ] [@ mozilla::ipc::FatalError | mozilla::dom::PContentChild::Write]
Definitely looks like a regression from bug 1321868.
Assignee: nobody → ehsan
Blocks: 1321868
Flags: needinfo?(ehsan)
I think this happens in safe mode, because of <http://searchfox.org/mozilla-central/rev/b9c9d257366379ac984622dbef38432a7fecd2e9/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#1275>.  All of the reports I looked at were in safe mode, and they all had low runtimes, because this crash will happen the first time Web content loads any scripts.

Sadly Socorro was unable to detect the 100% correlation with safe mode.  Nick, any idea why?  Or do you know who is the right person to ask?  Thanks!
Flags: needinfo?(n.nethercote)
Attachment #8817458 - Flags: review?(bugs) → review+
> Sadly Socorro was unable to detect the 100% correlation with safe mode. 
> Nick, any idea why?  Or do you know who is the right person to ask?  Thanks!

Marco, is the SafeMode field looked at by the correlation detector? If not, perhaps it should be.
Flags: needinfo?(n.nethercote) → needinfo?(mcastelluccio)
Duplicate of this bug: 1322038
Crash Signature: mozilla::ipc::IProtocol::FatalError | mozilla::dom::PContentBridgeChild::Write ] [@ mozilla::ipc::FatalError | mozilla::dom::PContentChild::Write] → mozilla::ipc::IProtocol::FatalError | mozilla::dom::PContentBridgeChild::Write ] [@ mozilla::ipc::FatalError | mozilla::dom::PContentChild::Write] [@ mozilla::ipc::FatalError | mozilla::dom::PBrowserChild::Write]
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7849f81b44de
Fail gracefully when the URL classifier service isn't available because we're in safe mode; r=smaug
https://hg.mozilla.org/mozilla-central/rev/7849f81b44de
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
The signatures "mozilla::ipc::FatalError | mozilla::dom::indexedDB::PBackgroundIDBDatabaseChild::Write", "mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::IProtocol::FatalError | mozilla::dom::PContentBridgeChild::Write", "mozilla::ipc::FatalError | mozilla::dom::PContentChild::Write", "mozilla::ipc::FatalError | mozilla::dom::PBrowserChild::Write" are not correlated with safe mode (100% false in all of them).

The signature "IPCError-browser | PURLClassifierParent::StartClassify" is correlated to safe mode, but there was a bug in the correlation tool (or rather, in the version of Spark we have on analysis.telemetry.mozilla.org: https://issues.apache.org/jira/browse/SPARK-16664) that caused it to only look for correlations in addons and modules.
I've fixed it now with a workaround, and the correlation to safe mode is showing up: "(100.0% in signature vs 01.87% overall) safe_mode = 1".
Flags: needinfo?(mcastelluccio)
(In reply to Marco Castelluccio [:marco] from comment #12)
> The signatures "mozilla::ipc::FatalError |
> mozilla::dom::indexedDB::PBackgroundIDBDatabaseChild::Write",
> "mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError |
> mozilla::ipc::IProtocol::FatalError |
> mozilla::dom::PContentBridgeChild::Write", "mozilla::ipc::FatalError |
> mozilla::dom::PContentChild::Write", "mozilla::ipc::FatalError |
> mozilla::dom::PBrowserChild::Write" are not correlated with safe mode (100%
> false in all of them).
> 
> The signature "IPCError-browser | PURLClassifierParent::StartClassify" is
> correlated to safe mode

Does this mean we have combined distinct bugs into this bug report? Should those other signatures be put into a new bug? (I'm not sure who is the best person to answer this question.)

> but there was a bug in the correlation tool (or
> rather, in the version of Spark we have on analysis.telemetry.mozilla.org:
> https://issues.apache.org/jira/browse/SPARK-16664) that caused it to only
> look for correlations in addons and modules.
> I've fixed it now with a workaround, and the correlation to safe mode is
> showing up: "(100.0% in signature vs 01.87% overall) safe_mode = 1".

Good to hear it's been fixed. Are there tests for the correlation tool? Could a new test be added to catch problems like this in the future?
Flags: needinfo?(mcastelluccio)
Duplicate of this bug: 1323082
(In reply to Nicholas Nethercote [:njn] from comment #13)
> Does this mean we have combined distinct bugs into this bug report? Should
> those other signatures be put into a new bug? (I'm not sure who is the best
> person to answer this question.)

I looked over the signatures, and they are all gone from recent builds, except for mozilla::ipc::FatalError | mozilla::dom::PContentChild::Write. I'll file a new bug for that, and remove it from here.
See Also: → 1323220
Crash Signature: mozilla::ipc::IProtocol::FatalError | mozilla::dom::PContentBridgeChild::Write ] [@ mozilla::ipc::FatalError | mozilla::dom::PContentChild::Write] [@ mozilla::ipc::FatalError | mozilla::dom::PBrowserChild::Write] → mozilla::ipc::IProtocol::FatalError | mozilla::dom::PContentBridgeChild::Write ] [@ mozilla::ipc::FatalError | mozilla::dom::PBrowserChild::Write]
(In reply to Andrew McCreight [:mccr8] from comment #15)
> (In reply to Nicholas Nethercote [:njn] from comment #13)
> > Does this mean we have combined distinct bugs into this bug report? Should
> > those other signatures be put into a new bug? (I'm not sure who is the best
> > person to answer this question.)
> 
> I looked over the signatures, and they are all gone from recent builds,
> except for mozilla::ipc::FatalError | mozilla::dom::PContentChild::Write.
> I'll file a new bug for that, and remove it from here.

The original mozilla::ipc::FatalError | mozilla::dom::PContentChild::Write actually should be constrained by |proto signature contains PURLClassifer|

The last build id with this signature is 20161213030215

https://crash-stats.mozilla.com/search/?signature=~mozilla%3A%3Aipc%3A%3AFatalError%20%7C%20mozilla%3A%3Adom%3A%3APContentChild%3A%3AWrite&proto_signature=~PURLClassifier&date=%3E%3D2016-09-14T08%3A38%3A36.000Z&date=%3C2016-12-14T08%3A38%3A36.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports
(In reply to Nicholas Nethercote [:njn] from comment #13)
> > but there was a bug in the correlation tool (or
> > rather, in the version of Spark we have on analysis.telemetry.mozilla.org:
> > https://issues.apache.org/jira/browse/SPARK-16664) that caused it to only
> > look for correlations in addons and modules.
> > I've fixed it now with a workaround, and the correlation to safe mode is
> > showing up: "(100.0% in signature vs 01.87% overall) safe_mode = 1".
> 
> Good to hear it's been fixed. Are there tests for the correlation tool?
> Could a new test be added to catch problems like this in the future?

I have some basic tests for the tool, but I wouldn't have caught this anyway as it's a pretty unexpected bug (Spark 2.0.0 silently fails if a dataset contains more than _exactly_ 200 columns) and I was using the version which happened to have the fix (2.0.1).
BTW, I've now added a test specifically for this particular failure case and I'm locally using 2.0.0, the same version we have on analysis.telemetry.mozilla.org.

I've also implemented some sanity checks to make sure the results are generated every day and are reasonable.
Flags: needinfo?(mcastelluccio)
[Tracking Requested - why for this release]:
A crash regression introduced in this release.
Comment on attachment 8817458 [details] [diff] [review]
Fail gracefully when the URL classifier service isn't available because we're in safe mode

Approval Request Comment
[Feature/Bug causing the regression]: This feature is needed for the Shield study to be run on Release 52 (bug 1335232), which we'll use to study the effect of making flash click-to-play by default.
[User impact if declined]: Can't run the study as intended
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: Not for this feature independently. We'll do QE on the study as a whole to make sure all pieces work as expected
[List of other uplifts needed for the feature/fix]: Bug 1318768, Bug 1323220, Bug 1325255, Bug 1322204, Bug 1325651, Bug 1319571, Bug 1321377, Bug 1307604, Bug 1323064, Bug 1335549, Bug 1333303, Bug 1333483, Bug 1336714, Bug 1338287
[Is the change risky?]: No
[Why is the change risky/not risky?]: Fixes Bug 1318768
[String changes made/needed]: none
Attachment #8817458 - Flags: approval-mozilla-aurora?
Comment on attachment 8817458 [details] [diff] [review]
Fail gracefully when the URL classifier service isn't available because we're in safe mode

It's been in Aurora53 already. Aurora53-.
Attachment #8817458 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Comment on attachment 8837400 [details] [diff] [review]
Fail gracefully when the URL classifier service isn't available because we're in safe mode (beta uplift)

this was deemed too risky for beta
Attachment #8837400 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
You need to log in before you can comment on or make changes to this bug.