Closed Bug 1404020 Opened 7 years ago Closed 7 years ago

stylo: Crash in CheckCrossStyleBackendAdoption

Categories

(Core :: CSS Parsing and Computation, defect, P4)

58 Branch
Unspecified
Windows 10
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- unaffected
firefox56 --- unaffected
firefox57 --- wontfix
firefox58 --- fix-optional
firefox59 --- fix-optional

People

(Reporter: calixte, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [clouseau])

Crash Data

This bug was filed from the Socorro interface and is report bp-fa77c0d5-e5b1-4aa4-9527-033660170928. ============================================================= There is 1 crash in nightly 58 with buildid 20170928100123. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1403024. [1] https://hg.mozilla.org/mozilla-central/rev?node=b8b62d3ba3799f380c30e5b8a41066a48b49a94e
Flags: needinfo?(xidorn+moz)
Notes from the crash report: Servo, with style data: InlineStyle Gecko doc: system, XUL, content-type: application/vnd.mozilla.xul+xml, url: chrome://browser/content/browser.xul
Priority: -- → P3
"e10sEnabled":false :/ Someone is trying to move an element from browser.xul to content document?
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #2) > "e10sEnabled":false :/ > > Someone is trying to move an element from browser.xul to content document? Yeah, presumably somebody forcing legacy addons. Probably not worth spending time on, but we should see if any crashes appear on beta.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #3) > Yeah, presumably somebody forcing legacy addons. Probably not worth spending > time on, but we should see if any crashes appear on beta. Yeah... "extensions.legacy.enabled":true
Flags: needinfo?(xidorn+moz)
There are two more crash reports show up: https://crash-stats.mozilla.com/report/index/cfb7f8cf-ad75-4818-a322-de3300170929 Gecko -> Servo, with style data: InlineStyle Gecko doc: content-type: text/html, url: https: (omitted) https://crash-stats.mozilla.com/report/index/5b03673b-0db7-4fa1-b8d7-05e710170929 Servo -> Gecko, with style data: InlineStyle Gecko doc: content-type: text/html, url: https: (omitted) Their Gecko docs are neither system nor XUL, and are just normal https html, how can this happen...
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #3) > Yeah, presumably somebody forcing legacy addons. Probably not worth spending > time on, but we should see if any crashes appear on beta. they are on 57.0b4: https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=beta&signature=CheckCrossStyleBackendAdoption&date=%3E%3D2017-09-01#reports
The more I read the crash report, the more I find things confusing... I cannot really understand why an https page could get Gecko backend without being system or XUL... Given that we are going to proceed bug 1403077, there would be a higher risk of violating this in more situations. Probably we should avoid asserting this and simply clear style data attached? (and reparse style attribute?) bholley, what do you think?
Flags: needinfo?(bobbyholley)
Per the meeting, I'm pretty skeptical about adding code to handle this, given that it would be poorly-tested and we don't have an understanding of why this happens. a MOZ_RELEASE_ASSERT is safer, and I think the crash volume isn't a huge risk. That said, I think we should continue trying to figure out what's going on. One theory is that the pref is flipping, which might be the result of people changing experiment groups (so they wouldn't know they'd done it). To that end, we could record whether the pref ever changes over the lifetime of the process, and include that in the crash report annotation. Alternatively, we could simply cache the value of StyloEnabled() on the document alongside the value of mStyleBackendType, and report whether they're different in the notes. Another good thing to check would be to see if the two principals are EqualConsideringDomain.
Flags: needinfo?(bobbyholley)
Chris points out that we can figure out of a crash report came from somebody in the 5% Gecko control group by looking for |branch: "gecko"| in the TelemetryEnvironment. I looked at three, and all three contained that field. We should verify a few more, but that probably gives us a pretty good idea of why this is happening. If that is the answer here, I'd imagine the crash volume will drop as the experiment population stabilizes. We could also consider reading the stylo pref at startup on beta, but that does have drawbacks.
For the e10s disabled cases, I wonder if this is due to (a) chrome or extensions poking at the document, or (b) something to do with the e10s-disabling pref also being set a bit later.
(In reply to Cameron McCormack (:heycam) (away 4 Oct) from comment #11) > For the e10s disabled cases, I wonder if this is due to (a) chrome or > extensions poking at the document, or (b) something to do with the > e10s-disabling pref also being set a bit later. All e10s disabled cases are "Gecko -> Servo, url: chrome://browser/content/browser.xul", so it's (a)?
Sounds like it then.
Sounds like we probably don't need to do anything at the moment, and majority of crashes are from the pref-flip experiment. Some of the non-e10s crash reports indicate there are legacy addons, but it is not clear to me whether they are enabled. I suppose that if the legacy addon pref is not listed in telemetry enviroment, they cannot be enabled? In that case, it sounds like something in Firefox's chrome code tries to do that. But I guess the number for that case is quite low that it's probably not worth doing anything at the moment? It would be great if the pref study can switch the pref only at the beginning of the process rather than in the middle.
Deprioritize this bug given comment 14.
Priority: P3 → P4
The crash is spiking in the last days in 58: - the 2017-10-11 there were 4 single installations crashing; - the 2017-10-12 there were 5; - the 2017-10-13 there are 19.
The crash is spiking in 57 beta 7: - beta 7: 205 crashes - beta 6: 37 crashes - beta 5: 12 crashes
I first suspected that bug 1403077 can somehow involves, but it seems bug 1403077 isn't there until beta 8, so luckily that doesn't matter. Those again happens on normal content page, and I cannot think out any reason except flipping the perf, but this time there is no "branch":"gecko" anymore. Are we finishing the experiment so that those who were in the control group now start loading pages in Stylo backend again? Chris, do you have any idea about this?
Flags: needinfo?(cpeterson)
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #18) > I first suspected that bug 1403077 can somehow involves, but it seems bug > 1403077 isn't there until beta 8, so luckily that doesn't matter. > > Those again happens on normal content page, and I cannot think out any > reason except flipping the perf, but this time there is no "branch":"gecko" > anymore. Are we finishing the experiment so that those who were in the > control group now start loading pages in Stylo backend again?
Flags: needinfo?(cpeterson) → needinfo?(mgrimes)
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #18) > I first suspected that bug 1403077 can somehow involves, but it seems bug > 1403077 isn't there until beta 8, so luckily that doesn't matter. > > Those again happens on normal content page, and I cannot think out any > reason except flipping the perf, but this time there is no "branch":"gecko" > anymore. Are we finishing the experiment so that those who were in the > control group now start loading pages in Stylo backend again? Matt, is the Stylo pref-flipping experiment still running? I see the experiment population climbing in [1] but I no longer see the experiment names "pref-flip-quantum-css-style-r1-1381147" or "pref-flip-quantum-css-stylo-beta-1381147" in crash reports' Telemetry Environments or in my own about:telemetry, even though I see other experiment names such as "pref-flip-activity-stream-58-nightly-about-home-bug-1406513" and "e10sCohort". [1] https://sql.telemetry.mozilla.org/queries/6154#table
Xidorn, if we add a domain to the Stylo blocklist #ifdef NIGHTLY_BUILD in bug 1407098, we could watch if the CheckCrossStyleBackendAdoption crash volume goes up or not. That might help us know whether this bug is related to pref flips (though, in the new case, not by the experiment). This crash signature is content process top crash #8 in Beta 57 and #16 in Nightly 58. This is pretty high, so we should try to patch this crash somehow. There have been 1365 reports since this crash started about a month ago and 100% of them are on Windows. Is there something about CheckCrossStyleBackendAdoption() that would make it only happen on Windows? As for the legacy add-ons theory, I thought the "extensions.legacy.enabled" pref only worked in the Nightly channel? We see crash reports from both Nightly 58 and Beta 57.
(In reply to Chris Peterson [:cpeterson] from comment #21) > Xidorn, if we add a domain to the Stylo blocklist #ifdef NIGHTLY_BUILD in > bug 1407098, we could watch if the CheckCrossStyleBackendAdoption crash > volume goes up or not. That might help us know whether this bug is related > to pref flips (though, in the new case, not by the experiment). That doesn't help. This is expected to happen only when some pages have been opened, and then the pref gets flipped, and the pages load some new subdocuments in iframe and tries to move element between them. I cannot think of any way it can happen if the perf is not changed after page open. > This crash signature is content process top crash #8 in Beta 57 and #16 in > Nightly 58. This is pretty high, so we should try to patch this crash > somehow. This is a gatekeeping crash to avoid further security issue from this kind of element adoption. I raised this concern in stylo meeting several weeks (?) ago, and at that time Bobby thought we shouldn't try to wallpaper it because that would be some code which is poorly tested. And I think at that time we concluded that it was low volume so wasn't a big concern. It doesn't seem to me anything has changed since then. > There have been 1365 reports since this crash started about a month ago and > 100% of them are on Windows. Is there something about > CheckCrossStyleBackendAdoption() that would make it only happen on Windows? I don't think it is something Windows-specific. Maybe we just don't have enough experiment population on other platforms?
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #22) > > There have been 1365 reports since this crash started about a month ago and > > 100% of them are on Windows. Is there something about > > CheckCrossStyleBackendAdoption() that would make it only happen on Windows? > > I don't think it is something Windows-specific. Maybe we just don't have > enough experiment population on other platforms? We have reports on other platforms, e.g <bp-0e758f14-4b66-424c-b4d8-fe4890171015>. I have no idea why the symbol, CheckCrossStyleBackendAdoption, is not shown there.
(In reply to Hiroyuki Ikezoe (:hiro) from comment #23) > (In reply to Xidorn Quan [:xidorn] UTC+10 from comment #22) > > > There have been 1365 reports since this crash started about a month ago and > > > 100% of them are on Windows. Is there something about > > > CheckCrossStyleBackendAdoption() that would make it only happen on Windows? > > > > I don't think it is something Windows-specific. Maybe we just don't have > > enough experiment population on other platforms? > > We have reports on other platforms, e.g > <bp-0e758f14-4b66-424c-b4d8-fe4890171015>. I have no idea why the symbol, > CheckCrossStyleBackendAdoption, is not shown there. That may be because of inlining, then, since AdoptNode is the only callsite of that function.
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #24) > (In reply to Hiroyuki Ikezoe (:hiro) from comment #23) > > (In reply to Xidorn Quan [:xidorn] UTC+10 from comment #22) > > > > There have been 1365 reports since this crash started about a month ago and > > > > 100% of them are on Windows. Is there something about > > > > CheckCrossStyleBackendAdoption() that would make it only happen on Windows? > > > > > > I don't think it is something Windows-specific. Maybe we just don't have > > > enough experiment population on other platforms? > > > > We have reports on other platforms, e.g > > <bp-0e758f14-4b66-424c-b4d8-fe4890171015>. I have no idea why the symbol, > > CheckCrossStyleBackendAdoption, is not shown there. > > That may be because of inlining, then, since AdoptNode is the only callsite > of that function. Oh I see. It's cumbersome.
This is the #5 Windows topcrash in Nightly 20171013100112.
(In reply to Nicholas Nethercote [:njn] from comment #26) > This is the #5 Windows topcrash in Nightly 20171013100112. It doesn't seem to me this should be bound to any specific version. It seems to be more time-related, that lots of crashes happen in some particular days. This is the graph for nsIDocument::AdoptNode which is where the same crash happens before: https://crash-stats.mozilla.com/signature/?signature=nsIDocument%3A%3AAdoptNode&date=%3E%3D2017-09-16T00%3A19%3A15.000Z&date=%3C2017-10-16T00%3A19%3A15.000Z#graphs Bug 1403024 adds some diagnose information and changes the signature on Windows (other platforms seem to still be using nsIDocument::AdoptNode). And this time the crash graph looks like: https://crash-stats.mozilla.com/signature/?signature=CheckCrossStyleBackendAdoption&date=%3E%3D2017-10-09T00%3A13%3A21.000Z&date=%3C2017-10-16T00%3A13%3A21.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#graphs The pick number seems to be tripled this time, but like before the number drops very quick, so I suspect this is again because of some perf flipping study. Another indirect evidence is that majority of the crashes happen in distinct instances, and those happen in the same instance are likely either the same crash gets reported multiple times, or multiple content processes crash at the same time (they have almost identical Install Age in crash reports).
> The pick number seems to be tripled this time, but like before the number drops very quick, so I suspect this is again because of some perf flipping study. I mean, the peak number.
Ilana on the Strategy & Insights experiments team says that they accidentally turned off the Stylo experiment last Friday (October 13). That would reset the Gecko users (20% of Nightly 58 and 5% of Beta 57) back to Stylo, the default pref value. That might explain some of the CheckCrossStyleBackendAdoption crashes, except the most recent spike *before* Friday. Ilana is going to re-enable the Stylo experiments today, so we will probably see more CheckCrossStyleBackendAdoption crashes as all those Nightly and Beta users rejoin the experiment as Gecko users.
(In reply to Chris Peterson [:cpeterson] from comment #29) > That might explain some of the CheckCrossStyleBackendAdoption crashes, except the most recent spike > *before* Friday. I'm not seeing any spikes before October 13. Am I looking at the wrong graphs? https://crash-stats.mozilla.com/signature/?signature=CheckCrossStyleBackendAdoption&date=%3E%3D2017-09-18T05%3A09%3A52.000Z&date=%3C2017-10-18T05%3A09%3A52.000Z#graphs
status-firefox57=wontfix unless someone thinks this bug should block 57
This signature is either infrequent or gone in 59 release.
CheckCrossStyleBackendAdoption no longer exists
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Resolution: INCOMPLETE → WORKSFORME
Flags: needinfo?(mgrimes)
You need to log in before you can comment on or make changes to this bug.