Closed Bug 1745660 Opened 2 years ago Closed 2 years ago

Crash in [@ mozilla::ipc::MessageChannel::Send | mozilla::ipc::IPDLResolverInner::ResolveOrReject | IPC_Message_Name=PContent::Reply_FlushFOGData] (recent regression)

Categories

(Toolkit :: Telemetry, defect, P1)

Tracking

RESOLVED FIXED
97 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox95 --- wontfix
firefox96 --- fixed
firefox97 --- fixed

People

(Reporter: ole+mozilla, Assigned: chutten)

Details

(Whiteboard: qa-not-actionable)

Crash Data

Attachments

(1 file)

Since Firefox 94.0 I have been seeing regular, though not very frequent, crashes, often after a page has been open for a long time.

https://crash-stats.mozilla.org/signature/?product=Firefox&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ASend%20%7C%20mozilla%3A%3Aipc%3A%3AIPDLResolverInner%3A%3AResolveOrReject%20%7C%20IPC_Message_Name%3DPContent%3A%3AReply_FlushFOGData&date=%3E%3D2021-06-01T17%3A43%3A00.000Z&date=%3C2021-12-12T17%3A43%3A00.000Z shows 145 crashes with this signature, so it looks like a fairly recent regression affecting Windows 7 through Windows 11.

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/45d1f8bc-dc9a-4ff9-baac-777e90211212

MOZ_CRASH Reason: MOZ_CRASH(IPC message size is too large)

Top 10 frames of crashing thread:

0 xul.dll mozilla::ipc::MessageChannel::Send ipc/glue/MessageChannel.cpp:888
1 xul.dll mozilla::ipc::IPDLResolverInner::ResolveOrReject ipc/glue/ProtocolUtils.cpp:944
2 xul.dll std::_Func_impl_no_alloc<`lambda at /builds/worker/workspace/obj-build/ipc/ipdl/PContentChild.cpp:15893:45', void, mozilla::ipc::ByteBuf&&>::_Do_call 
3 xul.dll mozilla::glean::FlushFOGData toolkit/components/glean/ipc/FOGIPC.cpp:64
4 xul.dll mozilla::dom::PContentChild::OnMessageReceived ipc/ipdl/PContentChild.cpp:15910
5 xul.dll mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:1968
6 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:771
7 xul.dll mozilla::TaskController::ProcessPendingMTTask xpcom/threads/TaskController.cpp:391
8 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1175
9 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:107

First crash with this signature on my system:
https://crash-stats.mozilla.org/report/index/51777331-d720-42a9-bc8f-b52d70211111

Component: General → Crash Reporting
Product: Firefox → Toolkit
Whiteboard: qa-not-actionable
Component: Crash Reporting → Telemetry

The limit is like 256MB, how in the world are we hitting that. Ugh.

Assignee: nobody → chutten
Severity: -- → S2
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Priority: -- → P1
See Also: → 1641989, 1743683

(I say "how in the world", but I have a pretty good idea it's the same thing inflating our db size in bug 1743683)

Notes to self:

  1. Fixing this will worsen the db size issue by sending more data (though not much, given the frequency of these crashes)
  2. I should send out an email to FOG data consumers to warn them that there is some missing data due to crashes

Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/383986e2f5cb
Flush FOG IPC every 100k samples r=TravisLong
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch
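
The patch title, "Flush FOG IPC every 100k samples", describes capping how much data a child process buffers before sending it to the parent. The following is only a minimal plain-JavaScript model of that idea, not the FOG implementation itself: the `SampleBuffer` class, its method names, and the `sendToParent` callback are all hypothetical, and only the 100k threshold comes from the patch title.

```javascript
// Hypothetical model of "flush every N samples". Not the actual FOG code.
const FLUSH_THRESHOLD = 100000; // taken from the patch title ("every 100k samples")

class SampleBuffer {
  constructor(sendToParent, threshold = FLUSH_THRESHOLD) {
    this.sendToParent = sendToParent; // stand-in for the IPC send
    this.threshold = threshold;
    this.samples = [];
  }

  record(sample) {
    this.samples.push(sample);
    // Flush eagerly so a single payload can never grow without bound.
    if (this.samples.length >= this.threshold) {
      this.flush();
    }
  }

  flush() {
    if (this.samples.length === 0) {
      return;
    }
    const batch = this.samples;
    this.samples = [];
    this.sendToParent(batch);
  }
}

// Usage: record 250,000 samples and note the size of each batch sent.
const batchSizes = [];
const buffer = new SampleBuffer((batch) => batchSizes.push(batch.length));
for (let i = 0; i < 250000; i++) {
  buffer.record(i);
}
buffer.flush(); // send whatever is left over
console.log(batchSizes); // [ 100000, 100000, 50000 ]
```

With a cap like this, each message carries at most a bounded number of samples, so message size depends on the threshold rather than on how long the process has gone without a flush.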

Comment on attachment 9255091 [details]
Bug 1745660 - Flush FOG IPC every 100k samples r?TravisLong!

Beta/Release Uplift Approval Request

  • User impact if declined: Unlikely but unavoidable tab crash
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Medium risk because this crash has escaped us before and I have a healthy distrust for anything that touches IPC... that being said, this is tested and is very small so only has so much room for bugs.
  • String changes made/needed:
Attachment #9255091 - Flags: approval-mozilla-beta?

For manual testing, here's the STR I used to verify this on Nightly. I don't think QE needs to follow this, but in case we end up needing it:

  1. Enable remote debugging in devtools settings
  2. Load about:glean and open a devtools console (privileged parent process JS)
  3. Load a page that starts a content process (mozilla.org will do)
  4. On the content-process-having tab, open Tools > Browser Tools > Browser Content Toolbox (privileged content process JS)
  5. Open Tools > Browser Tools > Browser Console so you can see the logging
  6. In the Browser Content Toolbox, run this code:
Cu.importGlobalProperties(["Glean"]);
const { setTimeout } = ChromeUtils.import("resource://gre/modules/Timer.jsm");
const { console } = ChromeUtils.import("resource://gre/modules/Console.jsm");
var iterationCount = 0;

function iteration() {
  for (let i = 0; i < 2600; i++) {
    Glean.testOnlyIpc.anEvent.record({extra1: "A string that isn't 100 bytes but it's long enough to be annoying. Oh, okay, let's make it 96B."});
  }
  if (iterationCount < 1000) {
    iterationCount++;
    setTimeout(iteration, 0);
  } else {
    console.log("DONE!");
  }
}

This will set up everything we need to record over 256MB of data to the FOG IPC payload, yielding the main thread (so IPC can happen) every 2600 * (100 + overhead) bytes.
  7. In the same Browser Content Toolbox, run the now-set-up code with

iteration()
  8. After a short while (under 10min definitely) you should see DONE! in the Browser Console
  9. If the tab hasn't yet crashed you can use the privileged parent process JS console to run
await Services.fog.testFlushAllChildren();
Glean.testOnlyIpc.anEvent.testGetValue().length

If the tab crashed, congratulations, you've reproduced the bug. You are running a build without the fix.

If the tab doesn't crash, congratulations, you've not reproduced the bug. You should get a value of around 2602601 in the parent process JS console. You are running a build with the fix.
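
As a rough check on the numbers in these steps, the arithmetic can be worked through in plain JavaScript. This is a sketch under stated assumptions: the limit is taken to be exactly 256 MiB (the thread only says "like 256MB"), and the per-event serialization overhead is unknown, so only the 96-byte string is counted directly. All constant names are made up for illustration.

```javascript
// Back-of-the-envelope check of the STR numbers. All names are illustrative.
const EVENTS_PER_ITERATION = 2600;
// iteration() runs for iterationCount = 0, 1, ..., 1000, i.e. 1001 times.
const ITERATIONS = 1000 + 1;
const totalEvents = EVENTS_PER_ITERATION * ITERATIONS;

// ASSUMPTION: the IPC message limit is exactly 256 MiB.
const ASSUMED_LIMIT_BYTES = 256 * 1024 * 1024;

// Bytes contributed by the 96-byte extra string alone, ignoring overhead.
const stringBytesOnly = totalEvents * 96;

// Smallest whole number of bytes per event that would exceed the assumed limit.
const minBytesPerEvent = Math.ceil(ASSUMED_LIMIT_BYTES / totalEvents);

console.log(totalEvents);      // 2602600
console.log(stringBytesOnly);  // 249849600
console.log(minBytesPerEvent); // 104
```

The event count matches the "around 2602601" expected from testGetValue(). Under the 256 MiB assumption, the strings alone fall just short of the limit, so the unfixed build crashes only once per-event overhead pushes each event past roughly 104 bytes, which is consistent with the "(100 + overhead)" estimate above.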

Comment on attachment 9255091 [details]
Bug 1745660 - Flush FOG IPC every 100k samples r?TravisLong!

Approved for 96.0b6

Attachment #9255091 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Hi Chris, this patch was uplifted to beta and there is now an ESLint failure: https://treeherder.mozilla.org/logviewer?job_id=361502134&repo=mozilla-beta&lineNumber=127

Can you please take a look?

[task 2021-12-16T15:27:08.252Z] /builds/worker/checkouts/gecko/testing  
[task 2021-12-16T15:27:08.252Z] /builds/worker/checkouts/gecko/layout  
[task 2021-12-16T15:27:08.252Z] /builds/worker/checkouts/gecko/dom  
[task 2021-12-16T15:27:08.252Z] /builds/worker/checkouts/gecko/chrome  
[task 2021-12-16T15:27:08.252Z] /builds/worker/checkouts/gecko/xpfe  
[task 2021-12-16T15:27:08.252Z] /builds/worker/checkouts/gecko/remote  
[task 2021-12-16T15:27:08.253Z] /builds/worker/checkouts/gecko/config  
[task 2021-12-16T15:27:08.253Z] /builds/worker/checkouts/gecko/memory  
[task 2021-12-16T15:27:08.253Z] /builds/worker/checkouts/gecko/intl  
[task 2021-12-16T15:27:08.253Z] /builds/worker/checkouts/gecko/caps  
[task 2021-12-16T15:27:08.253Z] /builds/worker/checkouts/gecko/taskcluster
[task 2021-12-16T15:27:08.263Z] 15:27:08.262 eslint (93) | Command: /usr/local/bin/node /builds/worker/checkouts/gecko/node_modules/eslint/bin/eslint.js --ext [js,jsm,jsx,xul,html,xhtml,sjs] --format json --no-error-on-unmatched-pattern --ignore-pattern testing/mochitest/pywebsocket3 --ignore-pattern dom/media/webspeech/recognition/endpointer.cc --ignore-pattern testing/mochitest/MochiKit --ignore-pattern dom/media/platforms/ffmpeg/ffmpeg58 --ignore-pattern dom/webauthn/tests/pkijs --ignore-pattern dom/canvas/test/webgl-conf/checkout --ignore-pattern dom/media/gmp/widevine-adapter/content_decryption_module_proxy.h --ignore-pattern dom/media/gmp/widevine-adapter/content_decryption_module_ext.h --ignore-pattern testing/talos/talos/tests/kraken --ignore-pattern dom/imptests --ignore-pattern dom/media/webvtt/vtt.jsm --ignore-pattern testing/modules/ajv-6.12.6.js --ignore-pattern dom/media/webspeech/recognition/energy_endpointer.cc --ignore-pattern dom/u2f/tests/pkijs --ignore-pattern dom/tests/mochitest/ajax --ignore-pattern testing/mochitest/tests/MochiKit-1.4.2 --ignore-pattern dom/media/webspeech/recognition/energy_endpointer_params.h --ignore-pattern testing/web-platform/tests/tools/third_party --ignore-pattern intl/icu --ignore-pattern dom/media/gmp/widevine-adapter/content_decryption_module.h --ignore-pattern testing/xpcshell/dns-packet --ignore-pattern remote/test/puppeteer --ignore-pattern dom/tests/mochitest/dom-level2-html --ignore-pattern remote/cdp/test/browser/chrome-remote-interface.js --ignore-pattern dom/tests/mochitest/dom-level1-core --ignore-pattern intl/unicharutil/util/nsUnicodePropertyData.cpp --ignore-pattern testing/talos/talos/tests/dromaeo --ignore-pattern dom/media/platforms/ffmpeg/libav54 --ignore-pattern dom/media/platforms/ffmpeg/libav55 --ignore-pattern layout/docs/css-gap-decorations --ignore-pattern testing/gtest/gtest --ignore-pattern testing/xpcshell/odoh-wasm --ignore-pattern testing/xpcshell/node-http2 --ignore-pattern 
dom/media/webspeech/recognition/energy_endpointer.h --ignore-pattern testing/gtest/gmock --ignore-pattern intl/unicharutil/util/nsUnicodeScriptCodes.h --ignore-pattern dom/media/webrtc/transport/third_party --ignore-pattern dom/media/webaudio/test/blink --ignore-pattern dom/media/webspeech/recognition/energy_endpointer_params.cc --ignore-pattern dom/tests/mochitest/dom-level2-core --ignore-pattern dom/media/platforms/ffmpeg/ffmpeg57 --ignore-pattern dom/media/gmp/rlz --ignore-pattern testing/modules/sinon-7.2.7.js --ignore-pattern testing/web-platform/tests/resources/webidl2 --ignore-pattern testing/talos/talos/tests/v8_7 --ignore-pattern testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/catapult --ignore-pattern dom/media/webspeech/recognition/endpointer.h --ignore-pattern dom/webauthn/cbor-cpp --ignore-pattern dom/media/platforms/ffmpeg/libav53 --ignore-pattern testing/xpcshell/node-ip --ignore-pattern intl/unicharutil/util/nsSpecialCasingData.cpp --ignore-pattern dom/media/gmp/widevine-adapter/content_decryption_module_export.h /builds/worker/checkouts/gecko/gradle /builds/worker/checkouts/gecko/testing /builds/worker/checkouts/gecko/layout /builds/worker/checkouts/gecko/dom /builds/worker/checkouts/gecko/chrome /builds/worker/checkouts/gecko/xpfe /builds/worker/checkouts/gecko/remote /builds/worker/checkouts/gecko/config /builds/worker/checkouts/gecko/memory /builds/worker/checkouts/gecko/intl /builds/worker/checkouts/gecko/caps /builds/worker/checkouts/gecko/taskcluster
[task 2021-12-16T15:28:49.997Z] 15:28:49.997 eslint (94) | Finished in 101.93 seconds
[task 2021-12-16T15:30:44.456Z] 15:30:44.456 eslint (91) | Finished in 216.40 seconds
[task 2021-12-16T15:31:34.152Z] 15:31:34.152 eslint (93) | Finished in 266.09 seconds
[task 2021-12-16T15:33:33.730Z] 15:33:33.730 eslint (92) | Finished in 385.67 seconds
[task 2021-12-16T15:33:33.739Z] TEST-UNEXPECTED-ERROR | /builds/worker/checkouts/gecko/toolkit/components/glean/tests/xpcshell/test_FOGIPCLimit.js:19:5 | 'Services' is not defined. (no-undef)
[taskcluster 2021-12-16 15:33:34.287Z] === Task Finished ===
[taskcluster 2021-12-16 15:33:34.906Z] Unsuccessful task run with exit code: 1 completed in 476.507 seconds
Flags: needinfo?(chutten)

Huh. Could've sworn Services was in scope. How do you want this :apavel? Another patch on the same stack? Should just need a const { Services } = ChromeUtils.import("resource://gre/modules/Services.jsm");

...though mozilla-beta probably doesn't have Services.fog (it came in bug 1715542), so it requires this instead:

const FOG = Cc["@mozilla.org/toolkit/glean;1"].createInstance(Ci.nsIFOG);
FOG.initializeFOG();

instead of Services.fog.initializeFOG();
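
One hypothetical way to cover both branches is to prefer Services.fog when it exists and fall back to creating the instance by contract ID otherwise. The helper below is only a sketch: `initializeFOGCompat` is a made-up name, and `services`, `Cc`, and `Ci` are passed in as parameters (with stub stand-ins in the usage part) so the logic can run outside Firefox. In a real test you would pass the actual globals, and this is not necessarily what landed on beta.

```javascript
// Hypothetical helper. Uses only the two approaches mentioned in this comment.
function initializeFOGCompat(services, Cc, Ci) {
  const fog =
    services && services.fog
      ? services.fog // newer branches: Services.fog is available
      : Cc["@mozilla.org/toolkit/glean;1"].createInstance(Ci.nsIFOG); // older branches
  fog.initializeFOG();
  return fog;
}

// Stub stand-ins so the sketch runs outside Firefox. They are not real XPCOM objects.
const calls = [];
const makeStubFOG = (label) => ({ initializeFOG: () => calls.push(label) });
const stubCc = {
  "@mozilla.org/toolkit/glean;1": { createInstance: () => makeStubFOG("createInstance") },
};
const stubCi = { nsIFOG: "nsIFOG" };

initializeFOGCompat({ fog: makeStubFOG("Services.fog") }, stubCc, stubCi);
initializeFOGCompat({}, stubCc, stubCi);
console.log(calls); // [ 'Services.fog', 'createInstance' ]
```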

Flags: needinfo?(chutten) → needinfo?(apavel)
Flags: needinfo?(apavel)