Closed Bug 1784983 Opened 2 years ago Closed 2 years ago

Crash in [@ mozilla::net::UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel]

Categories

(Core :: Privacy: Anti-Tracking, defect, P1)

defect

Tracking

()

VERIFIED FIXED
105 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- unaffected
firefox103 --- unaffected
firefox104 blocking verified
firefox105 blocking verified

People

(Reporter: mccr8, Assigned: mccr8)

References

(Depends on 1 open bug, Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/8e8416b7-b5fe-423c-ba99-a2a750220815

Reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

Top 10 frames of crashing thread:

0 XUL mozilla::net::UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel netwerk/url-classifier/UrlClassifierFeatureEmailTrackingDataCollection.cpp:201
1 XUL mozilla::detail::RunnableFunction<mozilla::net::AsyncUrlChannelClassifier::CheckChannel xpcom/threads/nsThreadUtils.h:531
2 XUL mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:851
3 XUL NS_ProcessPendingEvents xpcom/threads/nsThreadUtils.cpp:430
4 XUL nsAppShell::ProcessGeckoEvents widget/cocoa/nsAppShell.mm:509
5 CoreFoundation __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ 
6 CoreFoundation __CFRunLoopDoSource0 
7 CoreFoundation __CFRunLoopDoSources0 
8 CoreFoundation __CFRunLoopRun 
9 CoreFoundation CFRunLoopRunSpecific 

I hit this crash when I was re-loading a Reddit page shortly after closing it. Maybe there's some missing null check.

[Tracking Requested - why for this release]: The volume looks reasonably high, and this should be straightforward to fix. I'm not sure why we didn't have a bug on file for this already. It looks like a regression in 104, which is when some of this email tracking telemetry work seems to have landed, but I'm not sure exactly which bug.

Looks like it was regressed by bug 1773701, though there was some followup work in bug 1775627, so maybe the patch will need a bit of rebasing on 104. Tim, could you take a look? Thanks.

Flags: needinfo?(tihuang)
Regressed by: 1773701

I can reliably reproduce with this STR

  • go to https://www.theverge.com/
  • Ctrl/Cmd/middle click on their (current) top story "Apple is allegedly threatening to fire an employee over a viral TikTok video"
  • switch to the tab that got opened

[@ mozilla::dom::CanonicalBrowsingContext::GetTopWindowContext ] looks like a very similar crash, except that it is happening during the calculation of topWindowParent. I don't know if it is the same issue or a similar one, but it also looks like a missing null check. This signature is actually the top crashes on beta right now, so it is a bit odd that there's no bug on file for it. I'll bundle it into this bug for now.

Crash Signature: [@ mozilla::net::UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel] → [@ mozilla::net::UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel] [@ mozilla::dom::CanonicalBrowsingContext::GetTopWindowContext ]

Combined these appear to be around 75% of all beta crashes in the last week...

It looks like it massively spiked in the 20220811191329 build, but I'm not sure why. No changes landed in that beta for the UrlClassifierFeatureEmailTrackingDataCollection.cpp file as far as I can see.

Another set of steps I was able to reproduce this crash with, based on a comment on a crash report:

  1. go to this URL: https://www.yakimaherald.com/opinion/commentary-the-search-of-trump-s-mar-a-lago-compound-as-seen-through-the-eyes/article_61287388-1b34-11ed-b265-af00f9a309c9.html
  2. close the tab
Assignee: nobody → continuation

For mozilla::net::UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel specifically, I see a lot of crashes in the last day or so, but some from older builds like build id 20220722214625 (https://crash-stats.mozilla.org/report/index/57d9709d-1fd8-4235-855f-c36190220815 for example). Can this be caused by a blocklist update or something like that?

(If not, I apologize for the noise. :))

Doing a bit of investigation, both the Verge case and the Yakima Herald case seem to be hitting a case where the browsing context is discarded. This function seems to be running off a runnable so I guess a navigation can happen while the runnable is waiting or something. I'll add checks for that and also similar checks for topWindowParent, just to be safe, as a number of callers also seem to do that.

The diff rebased for beta is here. It is basically identical, as the changed code is after the code I change, but I don't know how finicky hg is about rebasing so I've included it.

I ran into this crash today with 105 after opening https://sharedrop.io and closing the tab (I only clicked the initial message window which explains what the page is about) after a few seconds. For me this crash appears to be 100% reproducible right now. Here are the crash stats link for my attempts:
https://crash-stats.mozilla.org/report/index/562d2a45-5ffc-4b4b-bc92-6890a0220816
https://crash-stats.mozilla.org/report/index/9dad0a8d-4750-4f6d-b331-9f7740220816
https://crash-stats.mozilla.org/report/index/377c9073-7ebb-417f-b816-4f6240220816
https://crash-stats.mozilla.org/report/index/1897e817-14a9-4a35-a9e1-5a6a20220816

Note: for of them the stack trace doesn't point to mozilla::net::UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel, but I included it anyway just in case.

Forgot to mention that the same crash is for me not reproducible in 103 (although I'm using the different profile for the release channel).

I am getting this crash exclusively when I close (some) tabs. Approximately every fifth or so tab closing ends with this crash now.

https://crash-stats.mozilla.org/report/index/bp-532628ae-f67d-47bf-af8d-a3dfd0220816
https://crash-stats.mozilla.org/report/index/bp-8103de12-df51-492e-8752-841550220816
https://crash-stats.mozilla.org/report/index/bp-49d1c59c-f4ec-4798-95cb-e26640220816
https://crash-stats.mozilla.org/report/index/bp-39461282-ac0a-4126-866a-1561b0220816

After last update I've been browsing and presumably closing without crashes for a day, but have not hard data not reliable memory to swear on it (you know, typical tab hoarder, cleaning tabs only after midnight).

Confirming crash upon closing https://sharedrop.io/ on the first try and many times upon closing https://angular.io/ .
Closing unloaded tab using context menu works OK.

This suddenly crashed on me tonight, and I can no longer open DevEdition due to it (crashes on every startup attempt)

One of my crashes: https://crash-stats.mozilla.org/report/index/a264aa2e-4daf-4b9b-b814-0924a0220816#tab-details

Happy to try and narrow this down if there is more need.

(In reply to Justin Wood (:Callek) from comment #16)

Ooops, submitted too early, DevEd: 104.0b9

As a workaround, I was able to get the browser launched if I set privacy.trackingprotection.emailtracking.data_collection.enabled to false in my prefs.js manually.

Pushed by amccreight@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/038d31437d8e
Don't try to gather telemetry for a discarded bc in UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel. r=dimi

Comment on attachment 9289989 [details]
Bug 1784983 - Don't try to gather telemetry for a discarded bc in UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel.

Beta/Release Uplift Approval Request

  • User impact if declined: very high crash rates
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It just ignores a few simple cases. I think the worst possibility is that it won't fix the crash. See comment 11 for the patch rebased for beta. Patch hasn't merged to m-c yet.
  • String changes made/needed: none
  • Is Android affected?: Yes
Attachment #9289989 - Flags: approval-mozilla-beta?
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 105 Branch

Comment on attachment 9289989 [details]
Bug 1784983 - Don't try to gather telemetry for a discarded bc in UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel.

Approved for 104.0rc2

Attachment #9289989 - Flags: approval-mozilla-beta? → approval-mozilla-release+

Comment on attachment 9289989 [details]
Bug 1784983 - Don't try to gather telemetry for a discarded bc in UrlClassifierFeatureEmailTrackingDataCollection::ProcessChannel.

Approved for 104.0b10 devedition

Attachment #9289989 - Flags: approval-mozilla-beta+

This seems to have hit about 95% of all 104.0rc1 crashes in the last 24 hours.

Odd that the regression is old, but my crashing is just started today, twice with 104.0, both news.google.com

Yeah my nightly suddenly started crashing on every startup today. Reverting to a nightly from a month ago didn't help, I ended up having to switch to release until the fix landed.

Desktop Nightly builds should be ready with this fix now. Would be great to get some confirmation that the crashes are resolved. We're working on respins for all the other affected releases (desktop & mobile) as well.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #31)

Desktop Nightly builds should be ready with this fix now. Would be great to get some confirmation that the crashes are resolved. We're working on respins for all the other affected releases (desktop & mobile) as well.

My build is working fine now.

(In reply to Dave Townsend [:mossop] from comment #30)

Yeah my nightly suddenly started crashing on every startup today. Reverting to a nightly from a month ago didn't help, I ended up having to switch to release until the fix landed.

Justin Wood (:Callek) pointed out here that you can temporarily hot-fix/circumvent it with about:config pref. If your profile crashes on load, try to locate <your Firefox profile folder>\prefs.js file and ensure the is present

user_pref("privacy.trackingprotection.emailtracking.data_collection.enabled", false);

there before launch.

Thanks Justin I could use my primary ordinary Developer edition profile without juggling with release. (Thanks again!)

There's a low crash volume so far, but I don't see any crashes in the build with the fix that have either signature.

(In reply to Wayne Mery (:wsmwk) from comment #29)

Odd that the regression is old, but my crashing is just started today, twice with 104.0, both news.google.com

Mine is 104.0 beta, not nightly, on Mac. I've had some more crashes that were not news.google.com - mostly when closing tabs.

bp-b1f09e0c-63a2-4ea0-8fcd-826f00220816
bp-62237cf3-832e-4b7f-83c1-5630d0220816
bp-8fcd8729-74ed-4254-9ee9-cd2b00220816
bp-d7b0c33d-0a2a-4c0b-9749-d44e70220816
bp-ab41f48c-245c-487c-a9e1-76f150220816

DevEdition 104.0b10 and Firefox 104.0 RC2 are both now live with the fix. Fenix & Focus Nightly & Beta respins are also in review in Google Play.

Severity: S2 → S1
Priority: -- → P1

Dimi, given the volume of this crash in such short order, I strongly think we should still take the time to fully understand what happened here to see what possible hardening we can do to avoid such cases in the future.

Flags: needinfo?(dlee)

We tested the issue on several devices and it no longer occurs on RC 104.1.0 build2 and Firefox Beta 104.0b7. Tabs can be closed without Fenix crashing.

Devices used:

  • Huawei MediaPad M3 (Android 7)
  • Google Pixel 6 (Android 13)
  • Samsung Galaxy S22 Ultra (Android 12)
  • Samsung A32 5G (Android 11)
  • Oppo Find X5 (Android 12)
  • P40 Lite (Android 10)
  • Sony Xperia (Android 6.0.1)
  • Xperia 10 (Android 11)

The crash happened because the google tag manager script sends a beacon to report the region analytics data via the navigator.sendBeacon(). The beacon request is async and has the lowest loading priority. If the page was closed before the beacon request starts to load, the top-level WindowGlobalParent won't be available in this case. So, it will crash when we access it without a null check.

is there a possible way to enforce a null check when getting either a BrowsingContext or a WindowGlobalParent?

Flags: needinfo?(tihuang)
Flags: needinfo?(dlee)

(In reply to Tim Huang[:timhuang] from comment #40)

The crash happened because the google tag manager script sends a beacon to report the region analytics data via the navigator.sendBeacon(). The beacon request is async and has the lowest loading priority. If the page was closed before the beacon request starts to load, the top-level WindowGlobalParent won't be available in this case. So, it will crash when we access it without a null check.

Was this a change on Google's side? I'm wondering why old nightly builds that were previously fine suddenly started crashing.

Also a little confused why this was happening to me on startup, I wasn't closing any pages. Maybe the webpages that were loading from the session were creating and then immediately removing frames to trigger this?

The crash is verified fixed on latest Focus RC 104.1.0 build2 and Focus Beta 104.0b7, as well. No crash occurs when closing tabs from various sites, including those mentioned above.

Devices used:

  • Google Pixel 6 (Android 12)
  • Xiaomi 12 Pro (Android 12)
  • Lenovo Yoga Tab 11 (Android 11)
  • OPPO A15s (Android 10)
  • LG G7 fit (Android 8.1.0)

(In reply to Dave Townsend [:mossop] from comment #41)

(In reply to Tim Huang[:timhuang] from comment #40)

The crash happened because the google tag manager script sends a beacon to report the region analytics data via the navigator.sendBeacon(). The beacon request is async and has the lowest loading priority. If the page was closed before the beacon request starts to load, the top-level WindowGlobalParent won't be available in this case. So, it will crash when we access it without a null check.

Was this a change on Google's side? I'm wondering why old nightly builds that were previously fine suddenly started crashing.

Did it do this in Google Chrome or any other browsers? Maybe there is your answer.

Hello,
I was able to reproduce the crash on Firefox RC1(build ID: 20220815150936) on Windows 10 using the steps from Comment 3 and Comment 7. Verified as fixed on Firefox RC2(build ID: 20220816115024), DevEd.0b10(build ID: 20220816121120) and Nightly 105.0a1(build ID: 20220816190318) on Windows 10, macOS 12, Ubuntu 22.04.

Status: RESOLVED → VERIFIED

(In reply to Tim Huang[:timhuang] from comment #40)

is there a possible way to enforce a null check when getting either a BrowsingContext or a WindowGlobalParent?

This issue is pretty common, so for IPC we have MaybeDiscarded<BrowsingContext>, which requires checking that the browsing context isn't discarded (in my testing, it was actually the bc->IsDiscarded() check that caused us to avoid the crash). It would probably be possible to go through and find various APIs like LoadInfo::GetBrowsingContext where people are checking IsDiscarded on the return value and change them to return MaybeDiscarded<BrowsingContext>. Presumably it could also be done for WindowGlobalParent. I'll file a bug about that.

See Also: → 1785261

(In reply to Dave Townsend [:mossop] from comment #41)

(In reply to Tim Huang[:timhuang] from comment #40)

The crash happened because the google tag manager script sends a beacon to report the region analytics data via the navigator.sendBeacon(). The beacon request is async and has the lowest loading priority. If the page was closed before the beacon request starts to load, the top-level WindowGlobalParent won't be available in this case. So, it will crash when we access it without a null check.

Was this a change on Google's side? I'm wondering why old nightly builds that were previously fine suddenly started crashing.

Also a little confused why this was happening to me on startup, I wasn't closing any pages. Maybe the webpages that were loading from the session were creating and then immediately removing frames to trigger this?

It's not on Google's side.

It's because the email tracking data collection feature never worked until yesterday. We have published the email tracker tables on our Shavar server yesterday, so it started to collect data and crash since then.

Would it be practical/useful to un-publish them for a week or so to allow affected clients to update to a fixed version? I'm worried about how many users are currently stuck on an affected build due to crashing on startup and being unable to update.

Flags: needinfo?(tihuang)

Yeah, it's reasonable. I have requested a rolling back of the Shavar changes.

Flags: needinfo?(tihuang)
Depends on: 1785815
See Also: → 1809215
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: