[meta] Crash in [@ IPCError-browser | ShutDownKill]
Categories
(Core :: DOM: Content Processes, defect, P2)
Tracking
()
| Version | Tracking | Status |
|---|---|---|
| firefox47 | --- | wontfix |
| firefox48 | --- | wontfix |
| firefox49 | --- | wontfix |
| firefox-esr45 | --- | wontfix |
| firefox50 | + | wontfix |
| firefox51 | --- | wontfix |
| firefox52 | --- | wontfix |
| firefox-esr52 | --- | wontfix |
| firefox53 | --- | wontfix |
| firefox-esr115 | --- | affected |
| firefox54 | --- | wontfix |
| firefox55 | --- | wontfix |
| firefox56 | --- | wontfix |
| firefox57 | --- | wontfix |
| firefox58 | --- | wontfix |
| firefox59 | --- | wontfix |
| firefox60 | --- | wontfix |
| firefox61 | --- | wontfix |
| firefox68 | --- | wontfix |
| firefox69 | --- | wontfix |
| firefox70 | --- | wontfix |
| firefox73 | --- | wontfix |
| firefox74 | --- | wontfix |
| firefox75 | --- | wontfix |
| firefox83 | --- | wontfix |
| firefox84 | --- | wontfix |
| firefox85 | --- | wontfix |
| firefox86 | --- | wontfix |
| firefox87 | --- | wontfix |
| firefox88 | --- | wontfix |
| firefox89 | --- | wontfix |
| firefox90 | --- | wontfix |
| firefox91 | --- | wontfix |
| firefox92 | --- | wontfix |
| firefox93 | --- | wontfix |
| firefox94 | --- | wontfix |
| firefox131 | --- | wontfix |
| firefox132 | --- | wontfix |
| firefox133 | --- | wontfix |
| firefox136 | --- | wontfix |
| firefox137 | --- | wontfix |
| firefox138 | --- | wontfix |
| firefox143 | --- | affected |
| firefox144 | --- | affected |
| firefox145 | --- | affected |
People
(Reporter: marvinhk, Unassigned)
References
(Depends on 10 open bugs)
Details
(5 keywords)
Crash Data
Attachments
(2 files)
Comment 106•7 years ago
|
||
¡Hola!
Adding the flag for 68 as per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill#summary it accounts for 91.2% of the total crashes.
FWIW all of the crashes on my Nightly in the last month or so are variants of this:
Submitted Crash Reports
Report ID Date Submitted
bp-9e113f57-ef91-403e-853f-2e06b0190515 5/14/2019, 7:44 PM
bp-d56a8ef4-0853-4d07-87cf-b9f170190515 5/14/2019, 7:44 PM
bp-2252da92-b4f8-4cef-915d-0d2670190514 5/13/2019, 8:27 PM
bp-3407c931-83e4-4a6f-96b1-b894d0190514 5/13/2019, 8:27 PM
bp-7a5a8541-7d1a-4815-9ccd-3a3250190514 5/13/2019, 8:27 PM
bp-57533b9a-a227-4995-a315-725f10190514 5/13/2019, 8:27 PM
bp-74da1097-3526-4b1a-9654-e34ae0190513 5/13/2019, 9:11 AM
bp-342bf38b-c00c-4c85-8454-447b70190510 5/10/2019, 10:17 AM
bp-1bcfe01d-3bed-4ff0-8a2f-d22710190510 5/10/2019, 8:56 AM
bp-be35f4be-6f78-4e48-8e73-a49a70190509 5/9/2019, 9:09 AM
bp-d42e065c-5710-492b-a2e5-4b33b0190509 5/9/2019, 9:09 AM
bp-04d12eba-5a5f-4351-8932-efddf0190509 5/9/2019, 9:09 AM
bp-eee693e1-0831-43f3-95a6-bb0980190509 5/9/2019, 9:09 AM
bp-658039b7-241a-4d98-8c1c-9f3780190509 5/9/2019, 9:09 AM
bp-ad35eb3d-8add-4730-85f8-3dcb00190509 5/9/2019, 9:09 AM
bp-6b0d5d47-535e-4dcc-97d7-0ccc60190509 5/9/2019, 9:09 AM
bp-4b7c1b5d-bdbf-4850-a775-f5ef90190509 5/9/2019, 9:09 AM
bp-001707f9-503e-41d6-bdd2-80de60190508 5/8/2019, 11:46 AM
bp-996a13c2-3bce-4226-98b5-d918c0190508 5/8/2019, 11:46 AM
bp-98c46dd3-a812-4f23-9011-bf22b0190506 5/6/2019, 4:03 PM
bp-3ad29622-5e18-4f48-945a-514390190503 5/3/2019, 3:22 PM
bp-106b7b55-3629-47c2-abfc-3cbd10190503 5/3/2019, 9:09 AM
bp-fa2c5fdd-17c2-4519-8236-f439b0190501 4/30/2019, 8:24 PM
bp-52f83e1a-9bd9-4332-89e1-dd8640190429 4/29/2019, 11:52 AM
bp-9c0376bc-200d-45ba-bd75-1d0770190429 4/29/2019, 11:52 AM
bp-75007eeb-2cd9-45dd-a4e2-d74280190429 4/29/2019, 11:52 AM
bp-c91bfdaf-d5ff-4f0c-8ed8-7ef250190429 4/29/2019, 11:52 AM
bp-8d079313-a3d8-44cb-b11e-1ab5c0190429 4/29/2019, 11:52 AM
bp-fcbfa40c-31da-4c8d-b138-406200190429 4/29/2019, 11:52 AM
bp-7d798392-8924-4c2b-bb6d-d3af90190429 4/29/2019, 11:52 AM
bp-ea25d6e2-0ccc-4faf-a537-37b360190429 4/29/2019, 11:52 AM
bp-05304540-cfb0-42c7-95c1-d197c0190425 4/25/2019, 9:12 AM
bp-750cc9de-d914-4406-b78f-53ba80190424 4/24/2019, 10:03 AM
bp-2bba28a8-3f01-494e-b20b-982590190424 4/24/2019, 10:03 AM
bp-74065cdd-52ba-4682-92f4-7676d0190424 4/24/2019, 10:03 AM
bp-c9e2255a-bd10-4471-8e59-cb2820190424 4/24/2019, 10:03 AM
bp-57861240-6f6f-415a-8312-4f0c50190424 4/24/2019, 10:03 AM
bp-cc5d3dec-9c8e-41eb-aad5-6e84e0190423 4/23/2019, 9:04 AM
bp-df0a195b-2441-4621-b56d-705f50190421 4/21/2019, 1:25 PM
bp-9817d377-101f-4b4b-990c-0b1f90190410 4/10/2019, 4:36 PM
bp-932d2082-6f7c-4eb2-91e4-ea71f0190410 4/10/2019, 9:10 AM
Do let me know if there's anything I could provide on this bug and I'll be happy to do so.
¡Gracias!
Alex
This saw a large uptick during March-2019 driven by the nightly channel. I understand that this bug is a catch all of sorts due to how our crash signatures are handled, but do we have an explanation for the uptick?
Edit: seems like this discussion has happened over at https://bugzilla.mozilla.org/show_bug.cgi?id=1219672#c79
Updated•6 years ago
|
Comment 117•6 years ago
|
||
I see nothing actionable here. This looks more like a meta bug. Removing the qawanted flag.
Comment 122•6 years ago
|
||
Now that bug 1612569 is fixed the signature here dropped to zero. I suggest we keep using this bug both as a meta for bugs with appropriate signatures (as blocking this) and we update the signature here to include crashes where the stacks are all over the place and thus are unlikely to be actionable. I've already identified a few of them such as:
- IPCError-browser | ShutDownKill | mozilla::ipc::MessagePump::Run
- IPCError-browser | ShutDownKill | Nt.*
- IPCError-browser | ShutDownKill | PR_MD_WAIT_CV | _PR_WaitCondVar | PR_Wait | nsThreadStartupEvent::Wait
Given the volume of these crashes we should also re-evaluate the shutdown killer time which is currently set at 5 seconds. We might want to bump it up and see if it affects the crash rate.
Another point: I've found crashes where the content process was still starting up when we killed it. I wonder if we do support this scenario, i.e. sending a process a shutdown message before it has finished initializing.
Comment 123•6 years ago
|
||
There's also a worrisome number of crashes where all the threads in the process are stuck waiting for something. In many cases these involve event queues and thread pools so I fear we might be seeing actual races caused by a deadlock.
Comment 124•6 years ago
|
||
As per my previous comment I'll start adding signatures that don't seem directly actionable to this bug so we can keep track of the non-actionable rate of these hangs.
Comment 125•6 years ago
|
||
Quick update: I may have figured out the reason behind many of the shutdown hangs that look like deadlocks. You can see the nitty gritty details in bug 1614570 comment 1 but in short what's happening is that we stop the event loops at the moment we receive the content process shutdown message. If some code relies on runnables or spinning an event loop for its shutdown procedure it will deadlock.
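The failure mode described in this comment can be illustrated with a toy event loop. This is a minimal sketch and not Gecko code: `ToyEventLoop`, `post`, `stop`, `run_until` and `shut_down` are hypothetical names, and the loop is single-threaded, so the "deadlock" shows up as a bounded spin that gives up instead of a real hang.

```python
from collections import deque


class ToyEventLoop:
    """Hypothetical stand-in for a process's main event loop."""

    def __init__(self):
        self.queue = deque()
        self.accepting = True  # flipped to False once shutdown is received

    def post(self, runnable):
        # Queued unconditionally; a stopped loop simply never runs it.
        self.queue.append(runnable)

    def stop(self):
        self.accepting = False

    def run_until(self, predicate, max_spins=1000):
        """Spin until predicate() holds, giving up after max_spins.

        Returns True if the predicate became true, False if we gave up,
        which models the hang that the shutdown killer eventually ends.
        """
        for _ in range(max_spins):
            if predicate():
                return True
            if self.accepting and self.queue:
                self.queue.popleft()()
        return predicate()


def shut_down(loop, stop_loop_first):
    """Cleanup that relies on a runnable to report completion."""
    done = []
    if stop_loop_first:
        loop.stop()  # models stopping the loop when the message arrives
    loop.post(lambda: done.append(True))
    return loop.run_until(lambda: bool(done))


print(shut_down(ToyEventLoop(), stop_loop_first=False))
print(shut_down(ToyEventLoop(), stop_loop_first=True))
```

With the loop still running, the cleanup runnable executes and the wait completes. With the loop stopped first, the runnable stays queued forever and the wait never succeeds.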
Comment 126•6 years ago
|
||
Crashes with this signature are content processes being slow, not stuck. They're signaling the main process that they're done shutting down but didn't do it in time.
Updated•6 years ago
|
Updated•6 years ago
|
Comment 127•6 years ago
|
||
Adding more signatures that don't seem actionable. The crashes are in various states of shutdown so they're not stuck but probably just slow.
Updated•6 years ago
|
Updated•6 years ago
|
Comment 129•6 years ago
|
||
Added another "we're just being slow" signature.
Updated•6 years ago
|
Updated•6 years ago
|
Comment 130•6 years ago
|
||
The last signature belongs to bug 1578070, I'm moving it there.
Updated•6 years ago
|
Updated•6 years ago
|
Comment 131•6 years ago
|
||
Added more "slow" signatures.
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Comment 133•5 years ago
|
||
¡Hola!
Got
Firefox 79.0a1 Crash Report [@ IPCError-browser | ShutDownKill | PR_MD_WAIT_CV | _PR_WaitCondVar | PR_Wait | nsThreadStartupEvent::Wait ]
Submitted Crash Reports
Report ID Date Submitted
bp-92ef466b-a927-4eea-9d9f-6bc280200616 6/16/2020, 06:16
¡Gracias!
Alex
Comment 134•5 years ago
|
||
This is a meta bug; please leave status flags unset. If there are specific instances of these crashes that are reproducible or actionable, they should have their own bugs and status flags.
Comment 135•5 years ago
|
||
¡Hola Gabriele!
For reasons unbeknownst to me, Nightly has been hang-happy over the past two weeks, but the crash reporter won't fire.
This is what I manage to find in about:crashes FWIW:
Submitted Crash Reports
Report ID Date Submitted
bp-82dbed26-65af-4dbe-8f25-2519a0200710 7/10/2020, 10:26
bp-2e6fb69d-d92e-47cb-bf80-80c3c0200709 7/9/2020, 18:40
bp-9933bec5-18ff-4ff3-ac14-aed1f0200707 7/7/2020, 11:28
bp-3f3c8853-1258-4699-9d80-96fb00200707 7/7/2020, 11:28
bp-a20e6067-7ac0-40b8-b00d-38a730200627 6/26/2020, 20:16
bp-db518be7-813b-4b10-83fd-fd2910200627 6/26/2020, 20:16
bp-becff340-b12f-4c18-a79b-1ba400200627 6/26/2020, 20:05
bp-0c224c7e-8df6-46ae-8a74-b82ce0200624 6/23/2020, 19:29
bp-4b538f74-a911-4b26-80f6-d92ff0200623 6/23/2020, 08:22
bp-d21e6c07-f95b-4129-9106-842a50200622 6/22/2020, 08:50
bp-6b517307-362a-4822-8ebe-52f5c0200622 6/22/2020, 08:34
Can you please comment on what could be done to try and pin down the cause of these crashes?
¡Gracias!
Alex
Comment 136•5 years ago
|
||
These reports won't trigger the crash reporter because they aren't actual crashes; they're hangs. Basically, a content process took too long to shut down and we killed it after a certain time (currently 20s). Before killing it we grab a snapshot of its state, and that's how you got those. Do you have Fission enabled? If you do, then you'll have more content processes around and it will be more likely that one is too slow when shutting down.
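The kill-after-timeout behaviour described here can be sketched with simulated time. This is an illustration under stated assumptions, not the actual implementation: `shut_down_child` and its parameters are hypothetical, the "snapshot" is a plain dictionary standing in for the captured process state, and only the 20 s value comes from the comment above.

```python
def shut_down_child(child_work_s, timeout_s, step_s=1):
    """Simulate the parent waiting for a child to finish shutting down.

    child_work_s: simulated seconds of shutdown work the child still has.
    timeout_s: how long the parent waits before killing the child.
    Returns ("clean", None) or ("killed", snapshot).
    """
    elapsed = 0
    remaining = child_work_s
    while remaining > 0:
        if elapsed >= timeout_s:
            # Capture state before killing; this is what ends up being
            # submitted as a ShutDownKill report.
            snapshot = {"elapsed_s": elapsed, "work_left_s": remaining}
            return "killed", snapshot
        remaining -= step_s
        elapsed += step_s
    return "clean", None


print(shut_down_child(child_work_s=3, timeout_s=20))
print(shut_down_child(child_work_s=25, timeout_s=20))
```

The second call models a child that is slow rather than stuck: it was still making progress when the timer expired, which matches the "slow, not stuck" reports discussed in this bug.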
Comment 137•5 years ago
|
||
¡Hola Gabriele!
Thanks for responding here.
fission.autostart is set to false in this Nightly.
I guess I'll continue to try and crash the hung Nightly with https://ftp.mozilla.org/pub/utilities/crashfirefox-intentionally/crashfirefox64.exe for the time being and see if a usable crash report is generated.
If anyone has any ideas on how to pin those down I'd be grateful if they share some ideas.
¡Gracias!
Alex
Comment 138•5 years ago
|
||
Bug 1637048 has significantly reduced the volume here. It's time to revisit the signatures and check if we've got some actual deadlocks in there. NI? me so I don't forget.
Comment 139•5 years ago
|
||
For starters I pruned all the signatures that have no crashes for recent versions.
Updated•5 years ago
|
Comment 141•5 years ago
|
||
¡Hola Gabriele!
Sorry to bother you again.
Some more crashes from the past two weeks:
Submitted Crash Reports
Report ID Date Submitted
bp-f26ad0c1-1202-49cf-bd25-dcb8c0200902 9/2/2020, 16:34
bp-6533606c-4fb5-4ac7-b891-3f0410200828 8/28/2020, 11:18
bp-873447b7-869f-4772-8f6d-972700200825 8/25/2020, 08:41
bp-6da34ca0-f564-48e1-8aa1-9a5730200824 8/24/2020, 08:56
bp-217dec7c-f88a-4812-b2e7-e89400200823 8/23/2020, 09:53
bp-989f9c12-eb5c-46ba-ab5f-d076a0200822 8/21/2020, 22:03
bp-5f74b094-7a1b-4b3e-919c-b6cd30200821 8/21/2020, 12:11
bp-79ee51df-9d99-4721-a6cc-6f41a0200820 8/20/2020, 04:56
bp-fc0ae5be-5621-4534-a4c3-24d710200819 8/19/2020, 08:59
Please let me know here if any of these needs separate bug reports and I'd be happy to file them.
¡Gracias!
Alex
Comment 142•5 years ago
|
||
Thanks Alex. I glanced over the crashes and it seems we have everything on file already. One question however: most of your crashes appear to be content processes being slow but from the looks of it your machine is fairly powerful. Do you have an SSD or a regular HDD on your machine? I'd like to figure out what could be causing those content processes to respond slowly.
Comment 143•5 years ago
|
||
¡Hola Gabriele!
Thanks for taking a look into these.
Here are the main specs for this laptop:
Processor Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz 2.90 GHz
Installed RAM 24.0 GB (23.8 GB usable)
System type 64-bit operating system, x64-based processor
Edition Windows 10 Pro Insider Preview
Version 2004
Installed on 9/5/2020
OS build 20206.1000
Experience Windows Feature Experience Pack 120.22800.0.0
SSD SAMSUNG MZVLB1T0HALR-000L7
Hope this helps.
If you need anything else from this system, please just ni?
¡Gracias!
Alex
Comment 144•5 years ago
|
||
Thanks Alex, this is very useful. You've got a fast machine so there's no real reason why content processes might take so long to shut down. We'll have to do a proper profile of a content process shutting down to figure out what's going on. I also wonder if Windows task scheduling might be playing a part here.
Comment 145•5 years ago
|
||
Firefox just abruptly crashed for me while I was in another desktop. Switched back to the desktop, no crash reporter dialog.
bp-79d24fbe-41a3-4463-a636-8037b0201014
bp-724c3c22-e11e-4a14-a0d5-ad6870201014
Updated•5 years ago
|
Comment 146•5 years ago
|
||
¡Hola y'all!
Found this one on 85.0a1:
bp-5d378d24-1334-4e36-bc09-274930201123 11/23/2020, 17:53
Updating flags per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20mozilla%3A%3Aipc%3A%3AMessagePump%3A%3ARun
Seems like it has been happening much more on 85.0a1 in the past week BTW.
Hope this is helpful.
¡Gracias!
Alex
Comment 147•5 years ago
|
||
Change the status for beta to have the same as nightly and release.
For more information, please visit auto_nag documentation.
Comment 148•5 years ago
|
||
¡Hola y'all!
Found this one on 86.0a1:
https://crash-stats.mozilla.org/report/index/11bfbef7-8317-4f9a-80e4-66b450201223#tab-details
Updating flags per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20NtYieldExecution
Seems like it has been happening much more on 86.0a1 in the past week BTW.
Hope this is helpful.
¡Gracias!
Alex
Updated•5 years ago
|
Comment 150•5 years ago
|
||
¡Hola!
Found this one on:
Product Firefox
Release Channel nightly
Version 87.0a1
Build ID 20210212100155 (2021-02-12)
bp-f305824e-bd5b-4a68-a7ee-74e980210213 2/13/2021, 02:10
Updating flags.
¡Gracias!
Alex
Updated•5 years ago
|
Updated•5 years ago
|
Comment 151•5 years ago
|
||
¡Hola y'all!
Found these on
Product Firefox
Release Channel nightly
Version 88.0a1
Build ID 20210227094458
Report ID Date Submitted
bp-43a0d1ca-dcae-42dd-ab6b-797ad0210302 3/2/2021, 15:07
bp-1a91f848-051d-40c5-97f3-fd9e80210227 2/27/2021, 20:26
Updating flags.
¡Gracias!
Alex
Updated•5 years ago
|
Comment 152•5 years ago
|
||
¡Hola Gabriele!
Is https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20servo_arc%3A%3AArc%3CT%3E%3A%3Adrop_slow%3CT%3E%20%7C%20nsIFrame%3A%3A~nsIFrame&date=%3E%3D2020-12-12T00%3A47%3A00.000Z&date=%3C2021-03-12T00%3A47%3A00.000Z also this one or a different one?
¡Gracias!
Alex
Comment 153•5 years ago
|
||
¡Hola Gabriele!
Also these:
¡Gracias!
Alex
Comment 154•5 years ago
|
||
Thanks Alex, these signatures belong in here.
Comment 155•5 years ago
|
||
Hi,
Although there are only a few crashes here, this signature belongs here too.
Thanks :)
Comment 156•5 years ago
|
||
I don't know, the stacks are a bit of a mish-mash under this signature.
Updated•5 years ago
|
Comment 157•5 years ago
|
||
¡Hola Gabriele!
Is https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20xpc%3A%3AXrayWrapper%3CT%3E%3A%3AgetPrototype&date=%3E%3D2021-03-09T03%3A42%3A00.000Z&date=%3C2021-04-09T03%3A42%3A00.000Z also this one or a different one?
Also got bp-ad653f77-731c-4669-9b2e-407ce0210409 in
Product Firefox
Release Channel nightly
Version 89.0a1
Build ID 20210407212527
so updating flags FWIW.
¡Gracias!
Alex
Comment 158•5 years ago
|
||
This one, thanks!
Comment 159•5 years ago
|
||
The volume for this has increased in total. On nightly, [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessagePump::Run] moved from 200-250 crashes/day to 350-600. Might have started with 20210321093903 but the changes look unrelated.
Comment 160•5 years ago
|
||
This is a natural effect of having turned on Fission for more users. With more content processes active at a given time, the chance of one being slow at shutdown has increased, leading to more reports here. Given that this is only going to increase in the future, we probably must decide what to do with these reports: it's been a while since we found content processes that were genuinely stuck at shutdown in a way that was actionable.
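The scaling argument can be made concrete with a small calculation. This is a simplified model, not measured data: it assumes each content process independently misses the shutdown deadline with the same probability, and the value 0.01 is purely hypothetical.

```python
def p_any_slow(p_slow, n_processes):
    """Probability that at least one of n independent processes is slow."""
    return 1.0 - (1.0 - p_slow) ** n_processes


# Hypothetical per-process probability of missing the deadline.
P_SLOW = 0.01

for n in (1, 4, 8, 16):
    print(n, round(p_any_slow(P_SLOW, n), 4))
```

Under these assumptions the chance of at least one report per shutdown grows almost linearly with the number of content processes while the per-process probability stays small.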
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Comment 163•5 years ago
|
||
¡Hola y'all!
Got https://crash-stats.mozilla.org/report/index/57f5e1d9-e5e1-4992-b2c7-24c530210513 on 90.
Updating flags FWIW.
¡Gracias!
Alex
Comment 164•5 years ago
|
||
Found another signature. I should find time to go over all the existing ones and leave only the relevant ones.
Comment 165•5 years ago
|
||
Comment 166•4 years ago
|
||
¡Hola Gabriele!
FWIW per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20js%3A%3Ajit%3A%3AMaybeEnterJit&date=%3E%3D2021-04-27T16%3A37%3A00.000Z&date=%3C2021-05-27T16%3A37%3A00.000Z IPCError-browser | ShutDownKill | js::jit::MaybeEnterJit has spiked a bit on 90.
¡Gracias!
Alex
Comment 167•4 years ago
|
||
(In reply to alex_mayorga from comment #166)
FWIW per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20js%3A%3Ajit%3A%3AMaybeEnterJit&date=%3E%3D2021-04-27T16%3A37%3A00.000Z&date=%3C2021-05-27T16%3A37%3A00.000Z IPCError-browser | ShutDownKill | js::jit::MaybeEnterJit has spiked a bit on 90.
Yes, this might be an actual issue. None of the crashes have the IPC shutdown state annotation set, which means they didn't even begin to shut down; it's possible that they were genuinely stuck, possibly deadlocked.
Comment 168•4 years ago
|
||
I have been experiencing this crash and rarely see a crash reporter window, you have to go in and submit the crash reports. Losing the tab history with each crash just makes it hardly worth using daily.
Having the window simply disappear without anything is just not a satisfactory outcome for me. We are not really dealing with a crash on shutdown. The user interface disappears while I was using it without notice and without any of the normal crash reporting occurring and nothing in the windows event log. If my experience is anything to go by, you are not getting but a fraction of the reports submitted. When I went to look after this last disappearance I submitted a whole host of old reports and my first attempt to submit them resulted in a fail appearing in the troubleshooting list. I think the crash reporter may also be fundamentally broken here.
Comment 169•4 years ago
|
||
There are no crashes under this bug; these are a specific type of shutdown hang. If Firefox disappears without showing the crash reporter, please file another bug with a detailed description of your setup (OS, Firefox version, sites/scenarios where it tends to happen). Use the Toolkit > Crash Reporting component because the crash reporter not showing up belongs there.
Comment 170•4 years ago
|
||
Here's another signature. I'm not sure why mozglue.dll is in the signature and not symbolized.
Updated•4 years ago
|
Comment 171•4 years ago
|
||
¡Hola Gabriele!
Hope these lines find you well.
Reviewing about:crashes on this Nightly I found
https://crash-stats.mozilla.org/report/index/11f5e988-7b45-494d-b49c-4b7930210713#tab-details
which per
https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20_tailMerge_d3dcompiler_47.dll%20%7C%20mozilla%3A%3Aipc%3A%3AMessagePump%3A%3ARun
now seems to be a Fission crash.
Should that be a separate bug or is this one still the one that's relevant?
FWIW I've updated the flags per the links above.
¡Gracias!
Alex
Comment 172•4 years ago
|
||
These stacks are identical to previous ones, they look different because we had an issue with the scripts that fetched the symbols so the resulting unwinding information was missing. I've updated it and reprocessed the crashes.
As we turn on Fission for more and more users we get more and more of these reports but their usefulness is limited; I think it's time to discuss if we still want to gather these or not.
Comment 173•4 years ago
|
||
I noticed my Firefox (Nightly, macOS) was completely hung, so I took a sample and spindump, then killed the app. When I opened Firefox back up, I had an unsent crash report from a couple minutes prior, presumably from when the hang started. That report pointed me here. I do have Fission enabled. Would the process sample and spindump be helpful in diagnosing this? If so, I can provide. Thanks!
Comment 174•4 years ago
|
||
(In reply to Sam Johnson from comment #173)
I noticed my Firefox (Nightly, macOS) was completely hung, so I took a sample and spindump, then killed the app. When I opened Firefox back up, I had an unsent crash report from a couple minutes prior, presumably from when the hang started. That report pointed me here. I do have Fission enabled. Would the process sample and spindump be helpful in diagnosing this? If so, I can provide. Thanks!
Probably yes. The crash report you got was from a child process that was taking too long to shut down, we grab them if a child process does not respond for long enough. That being said that very crash report is captured in the main process, so it might be possible that the process was taking too long and was responsible for the freeze.
Comment 175•4 years ago
|
||
I've zipped up the sample and spindump, attached. FWIW, when I noticed the app had hung, it was not after any attempts I made to quit the app; it was in the background while I was working in another program.
Comment 176•4 years ago
|
||
(In reply to Sam Johnson from comment #175)
I've zipped up the sample and spindump, attached. FWIW, when I noticed the app had hung, it was not after any attempts I made to quit the app; it was in the background while I was working in another program.
Thanks, I'll have a look ASAP.
Comment 177•4 years ago
|
||
Heads up for everybody tracking this bug: we've decided to revert the change that split the crash signatures for this type of report. When the change is applied in the coming weeks, the signatures will again collapse into one. The reasoning behind this is that we only found a handful of useful signatures and far too many unactionable ones. The latter are making nightly crash triage harder while providing little value. We'll keep gathering these reports for the time being until we figure out a better way to make them actionable.
Comment 178•4 years ago
|
||
¡Hola y'all!
My Firefox Nightly crashed like
https://crash-stats.mozilla.org/report/index/e3d52b60-7184-4c46-be35-06cff0210918
Updating flags per
https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20mozilla%3A%3Aipc%3A%3AMessagePump%3A%3ARun
FWIW
¡Gracias!
Alex
Comment 179•4 years ago
|
||
We now have the possibility to aggregate in crash-stats on the "Xpcom spin event loop stack" annotations. This leads to the following results.
Outstanding are: RequestHelper::StartAndReturnResponse, browser-custom-element.js:permitUnload and various nsThread::Shutdown * flavors.
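For readers unfamiliar with this kind of aggregation, here is a minimal sketch of counting reports by annotation value. The records are made up for illustration (only the location names are taken from this comment), and `top_spin_locations` is a hypothetical helper, not a crash-stats API.

```python
from collections import Counter

# Made-up sample records; real reports would come from crash-stats.
sample_reports = [
    {"XPCOMSpinEventLoopStack": "RequestHelper::StartAndReturnResponse"},
    {"XPCOMSpinEventLoopStack": "nsThread::Shutdown"},
    {"XPCOMSpinEventLoopStack": "RequestHelper::StartAndReturnResponse"},
    {"XPCOMSpinEventLoopStack": "browser-custom-element.js:permitUnload"},
    {},  # a report without the annotation is skipped
]


def top_spin_locations(reports, key="XPCOMSpinEventLoopStack"):
    """Count reports per annotation value, most frequent first."""
    counts = Counter(r[key] for r in reports if r.get(key))
    return counts.most_common()


for location, count in top_spin_locations(sample_reports):
    print(count, location)
```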
Comment 180•4 years ago
|
||
There are more SpinEventLoopUntil locations in the data.
The question is if the nsThreadShutdown: * have some common cause/pattern.
Comment 181•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #180)
The question is if the nsThreadShutdown: * have some common cause/pattern.
I just clicked through some reports, but it seems that many of them do not even contain the thread we are waiting for. This sounds as if our book-keeping of closing threads could fail in some cases?
IIUC the flow, we will unblock only if SchedulerGroup::Dispatch(TaskCategory::Other, event.forget()); succeeds, such that we will call nsThread::ShutdownComplete on the main thread. We do not check the return value here, and there are at least some mallocs on the way to a successful dispatch that could fail. Should we check for a successful dispatch here and, if it fails, clear context->mAwaitingShutdownAck by hand off the main thread?
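The proposal can be sketched in Python to show the control flow only. Everything here is a hypothetical analogue: `ShutdownContext`, `dispatch_to_main_thread` and `finish_thread_shutdown` are illustrative names that loosely mirror the C++ identifiers quoted above, and the sketch ignores the thread-safety concerns a real change would have to address.

```python
class ShutdownContext:
    """Hypothetical analogue of the per-thread shutdown bookkeeping."""

    def __init__(self):
        self.awaiting_shutdown_ack = True


def dispatch_to_main_thread(main_queue, event, fail=False):
    """Toy dispatch that can fail, e.g. because an allocation failed."""
    if fail:
        return False
    main_queue.append(event)
    return True


def finish_thread_shutdown(main_queue, context, fail_dispatch=False):
    """Dispatch the ack; clear the flag by hand if the dispatch fails."""

    def ack():
        context.awaiting_shutdown_ack = False

    if not dispatch_to_main_thread(main_queue, ack, fail=fail_dispatch):
        # Fallback suggested in the comment: without this, the waiter
        # would keep spinning for an ack that can never arrive.
        context.awaiting_shutdown_ack = False
        return False
    return True


# Successful dispatch: the flag clears only once the main thread runs the ack.
queue, ctx = [], ShutdownContext()
finish_thread_shutdown(queue, ctx)
print(ctx.awaiting_shutdown_ack)
queue.pop(0)()
print(ctx.awaiting_shutdown_ack)

# Failed dispatch: the fallback clears the flag immediately.
queue, ctx = [], ShutdownContext()
finish_thread_shutdown(queue, ctx, fail_dispatch=True)
print(ctx.awaiting_shutdown_ack)
```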
Comment 182•4 years ago
|
||
I don't really know anything about threads. Kris Wright is a better person to ask about them.
Comment 183•4 years ago
|
||
So this keeps spinning in my mind as a nested event loop... Just a guess, but I see basically two ways how we can end up in this situation:
- Unexpected thread death without cleanup
  If a thread could die unexpectedly without cleaning up our in-process-memory book-keeping (maybe by some interrupt?) and then we arrive later in nsThread::Shutdown(), we would still be able to dispatch an event to its queue and then wait forever for the shutdown ack message to come. It is not clear to me how this could happen, but it would be nice if there was a (system) function to check if a thread handle is really alive before we even try to shut it down.
- Failed dispatch of shutdown ack to main thread
  This is basically comment 181. Just setting context->mAwaitingShutdownAck might not be enough in this situation if there are no more other messages on the main thread, though?
Any other thoughts, Kris?
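The first idea, checking liveness before shutting a thread down and bounding the wait for the ack, can be sketched with Python's standard `threading` module. This is an analogy only: `shut_down_thread` is a hypothetical helper, and Python's `Thread.is_alive()` says nothing about which facilities are available to nsThread.

```python
import threading


def shut_down_thread(thread, stop_event, ack_timeout_s=1.0):
    """Ask a worker to stop and wait a bounded time for it to exit.

    Returns "already-dead", "stopped", or "timed-out".
    """
    if not thread.is_alive():
        # Nothing to wait for; waiting for an ack here would hang forever.
        return "already-dead"
    stop_event.set()
    thread.join(timeout=ack_timeout_s)
    return "timed-out" if thread.is_alive() else "stopped"


stop = threading.Event()
worker = threading.Thread(target=stop.wait, daemon=True)
worker.start()

print(shut_down_thread(worker, stop))
print(shut_down_thread(worker, stop))
```

The first call stops the live worker; the second finds it already gone and returns immediately instead of waiting for an acknowledgement.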
Comment 184•4 years ago
|
||
Hmmm, there is a very puzzling thing: While the XPCOMSpinEventLoopStack annotation suggests we are stuck inside a nested event loop on the main thread, the stack of those crashes does not contain any SpinEventLoopUntil call but they are just sitting in the normal main event loop. I start to think this is a red herring...
Comment 185•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #184)
Hmmm, there is a very puzzling thing: While the XPCOMSpinEventLoopStack annotation suggests we are stuck inside a nested event loop on the main thread, the stack of those crashes does not contain any SpinEventLoopUntil call but they are just sitting in the normal main event loop. I start to think this is a red herring...
Just to be clear: In some cases like bug 1740889 we see a real SpinEventLoopUntil on the stack, but in some cases like bug 1740895 we see a misalignment between the main thread stack and the XPCOMSpinEventLoopStack annotation.
Kris, I do not think it is worth the time to look into the nsThreadShutdown: * instances for now; they seem to be a false alarm (at least in terms of hangs caused by SpinEventLoopUntil; the hangs are real, of course). The instances I have cracked open so far do not contain any sign of really being inside nsThread::Shutdown at all. So I'll try to understand the apparently bogus annotations first.
Comment 186•4 years ago
|
||
So in the vast majority of cases we are not stuck in the child process in any SpinEventLoopUntil and we need to look out for other common patterns. :-(
There are some cases though where we find ourselves stuck in RequestHelper::StartAndReturnResponse which might indicate some pitfall in the request/response flow of local storage.
The remaining cases with an event loop stack from the parent process are almost all happening during nsThread::Shutdown for some thread on the parent process.
Comment 187•4 years ago
|
||
Just to check: it would not be a bug if this crash was the result of a process in an infinite loop in Javascript being killed, correct?
Comment 188•4 years ago
|
||
(In reply to Justin Peter from comment #187)
Just to check: it would not be a bug if this crash was the result of a process in an infinite loop in Javascript being killed, correct?
This is not really a crash. It's just us killing the process because it's taking too long to shut down. The crash reports are snapshots of the process right before we killed it.
Comment 189•4 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #177)
Heads up for everybody tracking this bug: we've decided to revert the change that split the crash signatures for this type of reports. When the change will be applied in the coming weeks the signatures will again collapse into one. The reasoning behind this is that we only found a handful of useful signatures and far too many unactionable ones. The latter are making nightly crash triage harder while providing little value. We'll keep gathering these reports for the time being until we figure out a better way to make them actionable.
Indeed we now see only crashes for the basic signature, so it is time for a signature cleanup here.
Comment 190•4 years ago
|
||
So looking at the first reports from build 20220210213101 with the new annotations from bug 1754208, the interesting thing is that we do not see any of them:
| Rank | IPC shutdown state | # | % |
|---|---|---|---|
| 1 | SendFinishShutdown (sent) | 7 | 21.88 % |
So there is a significant 78% of cases where we do not even reach RecvShutdown at all, it seems.
IIUC it all starts here in the parent process, and there are indeed (intentional) ways to call that function without sending the shutdown message to the child. It feels a bit odd that we apparently did not give the child any chance to shut down and then just killed it, or am I overlooking something?
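As a quick check on the figures in this comment: if 7 reports make up 21.88 % of the sample, the sample size and the complementary share follow directly. The sample size is an inference from those two quoted numbers, not a figure stated in the bug.

```python
sent_count = 7      # reports annotated "SendFinishShutdown (sent)"
sent_pct = 21.88    # their share of the sample, in percent

# Implied number of reports in the sample (inferred, not quoted).
total_reports = round(sent_count / (sent_pct / 100))

# Share of reports without the annotation.
other_pct = round(100 - sent_pct, 2)

print(total_reports, other_pct)
```

The complement, 78.12 %, is the "significant 78%" mentioned above.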
Comment 191•4 years ago
|
||
Yes, I did a writeup some time ago but I can't find it anymore. In many cases the main process sends the IPC message to shut down the child, but the child is busy, so it doesn't see the parent's message right away. After a while the child gets killed and it often hasn't seen the IPC message yet. More often than not because there were other pending messages that it had to process first. There's another factor compounding this: we reduce the priority of content processes for tabs that are not visible. So not only the child process might be busy, but the OS might be in no rush to run it as it's been informed that it shouldn't prioritize it.
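The two compounding factors can be sketched as a FIFO queue drained at a reduced speed. This is a toy model under stated assumptions: `time_until_shutdown_seen`, the backlog of six 2-second messages and the `slowdown` factor are all hypothetical; only the 20 s kill timeout is a value quoted earlier in this bug.

```python
from collections import deque

KILL_TIMEOUT_S = 20  # value quoted earlier in this bug


def time_until_shutdown_seen(pending_costs_s, slowdown=1.0):
    """Seconds until the child dequeues the shutdown message.

    pending_costs_s: handling cost of each message already queued.
    slowdown: values above 1 model a deprioritized process getting less CPU.
    """
    queue = deque(pending_costs_s)
    queue.append(0.0)  # the shutdown message goes to the back of the queue
    elapsed = 0.0
    while len(queue) > 1:
        elapsed += queue.popleft() * slowdown
    return elapsed


backlog = [2.0] * 6  # hypothetical: six queued messages, 2 s each
for slowdown in (1.0, 2.0):
    seen_at = time_until_shutdown_seen(backlog, slowdown)
    print(slowdown, seen_at, seen_at > KILL_TIMEOUT_S)
```

With normal scheduling the child reaches the shutdown message after 12 simulated seconds; with the same backlog at half speed it would only see the message after 24 seconds, by which point it has already been killed.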
Comment 192•4 years ago
|
||
Hmm, could it then be that the SendFinishShutdown (sent) case is just the opposite (the child went all the way through its shutdown but the acknowledgement message never arrives at the parent)?
It might be an option to raise a child's priority before sending the shutdown message; that could help in some of the cases you described above. But am I right to assume that if the message queue is already well filled, we get queued at the end? Then a higher process priority will not really help unless we tweak our internal processing, IIUC?
Comment 193•4 years ago
|
||
Yes, that also seems to happen. I had filed bug 1619676 about exploring the priority changes but as someone pointed out in that bug it could have the opposite effect by slowing down the main process and thus ending up in the same place as we are now.
Comment 194•4 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #191)
...
There's another factor compounding this: we reduce the priority of content processes for tabs that are not visible. So not only the child process might be busy, but the OS might be in no rush to run it as it's been informed that it shouldn't prioritize it.
(In reply to Gabriele Svelto [:gsvelto] from comment #193)
Yes, that also seems to happen. I had filed bug 1619676 about exploring the priority changes but as someone pointed out in that bug it could have the opposite effect by slowing down the main process and thus ending up in the same place as we are now.
Maybe instead of reducing priority for not visible tabs, and instead of raising priority of other tabs, how about just leaving the priorities alone and see what happens? Less is more kind of thing. Maybe just slightly elevate what needs elevation, instead of dropping what you used to think could be reduced. Maybe do a build with no priority changes, and see what pans out?
Comment 195•4 years ago
|
||
Just to confirm the observation of comment 190 with more numbers:
| Rank | IPC shutdown state | # | % |
|---|---|---|---|
| 1 | SendFinishShutdown (sent) | 292 | 17.70 |
| 2 | ShutdownInternal entry | 6 | 0.36 |
| 3 | content-child-shutdown started | 2 | 0.12 |
It confirms that we have only very few cases of a real hang during processing the shutdown sequence. It seems that in most of the cases either:
- the child process is too busy to even receive and start to process the shutdown request (81%)
- the parent process is too busy to even receive and acknowledge the successful shutdown (17%)
I assume this can only be changed if we create a kind of "priority lane" for shutdown messages that bypasses the normal queue. I wonder if having an additional IPC channel only for shutdown messages could help? In particular on the child process side there is probably not much reason to keep up the normal processing order until we eventually arrive at the shutdown event in our queue (which will always be kind of unexpected and random with respect to our internal state, such that handling it out of order should not be worse, IIUC). But the ack messages could probably also be processed out of order on the parent side.
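To make the "priority lane" idea concrete, here is a minimal toy sketch in Python. It is not Gecko's IPC code: the `MessageQueue` class, the priority constants and the message names are all hypothetical, and the only point is that a shutdown message enqueued last can still be processed first while ordinary messages keep their FIFO order.

```python
import heapq
import itertools

# Hypothetical priority levels; a lower number is dequeued first.
PRIORITY_SHUTDOWN = 0
PRIORITY_NORMAL = 1


class MessageQueue:
    """Toy inbox: FIFO within a priority level, shutdown jumps the line."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per level

    def push(self, name, priority=PRIORITY_NORMAL):
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def pop(self):
        _, _, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Return message names in the order they would be processed."""
    order = []
    while len(queue):
        order.append(queue.pop())
    return order


inbox = MessageQueue()
for i in range(3):
    inbox.push(f"paint-{i}")              # backlog of ordinary messages
inbox.push("Shutdown", PRIORITY_SHUTDOWN)  # arrives last, handled first
processing_order = drain(inbox)
print(processing_order)
```

Whether handling shutdown out of order is actually safe for the child's internal state is exactly the open question raised above; the sketch only illustrates the queueing behaviour.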
Comment 196•4 years ago
|
||
You can add a Priority annotation to an IPC message to increase the priority. We use that, for instance, for things related to input.
Also, looking at the earlier discussion, we do already have some code to raise the priority of processes we're shutting down.
Comment 197•4 years ago
|
||
FWIW, from a short glance at 10 reports in a row, I see:
- 4 of them are inside ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst. Interestingly, all of these have IPCShutdownState: SendFinishShutdown (sent) set, which would suggest that we already finished shutdown, but the position in the stack seems to suggest we are not yet there?
- 3 of them seem to just idle on the main thread
- 1 seems to be waiting for a mutex related to GC held by another thread
- 1 seems to be busy with destroying a docshell as a reaction to BrowserChild::RecvDestroy() - this one has no IPCShutdownState set at all and could be a case of a long-running task that blocks the main thread.
- 1 has a stack without symbols (even if opened in VS)
Probably the ChildProfilerController and the GC mutex case could merit a second look based on those stack traces.
Comment 198•4 years ago
|
||
Starting from March 2022 we see a slight downwards trend, it seems. Of those:
- ~ half of the crashes now show NotifyImpendingShutdown received. This means the content process was alerted but too busy (or hanging) to even process the ShutdownConfirmedHP sent with high priority.
- almost a quarter of the crashes carry SendFinishShutdown (sent). This would indicate that the content process was able to finish its shutdown but the parent process never received or processed the FinishShutdown message.
- almost another quarter do not have any ipc_shutdown annotation set, which is weird.
The remainder are some rare sparse crashes with other ipc_shutdown annotations set.
Comment 199•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #198)
- almost a quarter of the crashes carry SendFinishShutdown (sent). This would indicate that the content process was able to finish its shutdown but the parent process never received or processed the FinishShutdown message.
We could filter those out and not send reports for them: after all the content processes did shut down correctly. WDYT, should I file a bug? It would be a simple fix.
Comment 200•4 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #199)
We could filter those out and not send reports for them: after all the content processes did shut down correctly. WDYT, should I file a bug? It would be a simple fix.
I do not think we should do this. It most probably means that the parent is not aware of that child shutdown and continues to block its own shutdown process until timeout?
I think we should do bug 1755376 comment 17 to give shutdown notifications the highest possible priority in both directions.
Comment 201•4 years ago
|
||
OK, I trust your assessment, thanks!
Comment 202•3 years ago
|
||
It seems that bug 1777198 did not move the needle much here. Let's wait for the improved annotations for a better understanding, though.
Comment 203•3 years ago
|
||
But: Roughly 25% of the crashes do not have any dump file (and thus stack) associated, it seems? Could it be that those processes actually ended but the parent did not understand it? Example, the upload_file_minidump is empty, the upload_file_minidump_browser contains data.
[Inline Frame] xul.dll!google_breakpad::ExceptionHandler::WriteMinidump() Line 805 C++
xul.dll!google_breakpad::ExceptionHandler::WriteMinidump(const std::wstring & dump_path, bool(*)(const wchar_t *, const wchar_t *, void *, _EXCEPTION_POINTERS *, MDRawAssertionInfo *, const mozilla::phc::AddrInfo *, bool) callback, void * callback_context, _MINIDUMP_TYPE dump_type) Line 831 C++
xul.dll!CrashReporter::CreateMinidumpsAndPair(void * aTargetHandle, unsigned long aTargetBlamedThread, const nsTSubstring<char> & aIncomingPairName, mozilla::EnumeratedArray<CrashReporter::Annotation,CrashReporter::Annotation::Count,nsTString<char>> & aTargetAnnotations, nsIFile * * aMainDumpOut) Line 3794 C++
xul.dll!mozilla::ipc::CrashReporterHost::GenerateMinidumpAndPair<mozilla::dom::ContentParent>(mozilla::dom::ContentParent * aToplevelProtocol, const nsTSubstring<char> & aPairName) Line 70 C++
> xul.dll!mozilla::dom::ContentParent::GeneratePairedMinidump(const char * aReason) Line 4291 C++
xul.dll!mozilla::dom::ContentParent::KillHard(const char * aReason) Line 4326 C++
[Inline Frame] xul.dll!nsCOMPtr<nsITimer>::get() Line 851 C++
It seems that ContentParent::KillHard does not check if we actually killed something?
Gabriele, can you confirm that the weird crashes we see could be caused by that? Is there something we should check inside ContentParent::GeneratePairedMinidump ?
Updated•3 years ago
|
Comment 204•3 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #203)
It seems that ContentParent::KillHard does not check if we actually killed something?
It does check what happened on both Linux/macOS and Windows. In both cases if the process already died KillProcess() returns false.
Gabriele, can you confirm that the weird crashes we see could be caused by that? Is there something we should check inside ContentParent::GeneratePairedMinidump ?
Minidump generation can fail in weird ways, including empty or truncated minidumps and ATM we don't have good checking for that outside of Linux. As I rewrite this stuff in the coming months we can add diagnostic information to work around this issue, for example by providing rich errors instead of malformed minidumps like we do on Linux using the new oxidized machinery.
Comment 205•3 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #204)
(In reply to Jens Stutte [:jstutte] from comment #203)
It seems that ContentParent::KillHard does not check if we actually killed something?
It does check what happened on both Linux/macOS and Windows. In both cases if the process already died KillProcess() returns false.
Hmm, at least in the linked code snippet I just see an NS_WARNING but no other consequences? I assumed that means that in either case we transmit the collected crash telemetry?
Gabriele, can you confirm that the weird crashes we see could be caused by that? Is there something we should check inside ContentParent::GeneratePairedMinidump ?
Minidump generation can fail in weird ways, including empty or truncated minidumps and ATM we don't have good checking for that outside of Linux. As I rewrite this stuff in the coming months we can add diagnostic information to work around this issue, for example by providing rich errors instead of malformed minidumps like we do on Linux using the new oxidized machinery.
Thanks for working on this, that sounds very promising !
Comment 206•3 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #205)
Hmm, at least in the linked code snippet I just see an NS_WARNING but no other consequences? I assumed that means that in either case we transmit the collected crash telemetry?
Ah yes, good point, we can still upload the minidump if the process was killed in-between the two, or right when the minidump was being written. I'll file a bug to discard those instead of submitting them.
Updated•3 years ago
|
Updated•3 years ago
|
Comment 208•3 years ago
|
||
| Rank | # | % | IPC shutdown state |
|---|---|---|---|
| 1 | 6473 | 46.44 | NotifiedImpendingShutdown - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) |
| 2 | 2043 | 14.66 | NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) |
| 3 | 1627 | 11.67 | NotifiedImpendingShutdown - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - content-child-shutdown started - StartForceKillTimer - SendFinishShutdown (sending) - SendFinishShutdown (sent) |
| 4 | 486 | 3.49 | NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - content-child-shutdown started - StartForceKillTimer - SendFinishShutdown (sending) - SendFinishShutdown (sent) |
Looking at IPC shutdown state I see 4 main cases:
- Tab is closing & Child gives no sign of life after receiving RequestContentJSInterrupt
- Parent shuts down & Child gives no sign of life after receiving RequestContentJSInterrupt
- Tab is closing & Child notified the parent about its shutdown but the parent did not process the notification
- Parent shuts down & Child notified the parent about its shutdown but the parent did not process the notification
Annotations interpretation:
Tab is closing: two times NotifiedImpendingShutdown, the first is from ContentParent::NotifyTabWillDestroy, the second from ContentParent::SignalImpendingShutdownToContentJS
Parent shuts down: the only NotifiedImpendingShutdown is from ContentParent::SignalImpendingShutdownToContentJS coming from ContentParent::BlockShutdown
Parent did not process shutdown notification in time: SendFinishShutdown (sent) is present
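The interpretation rules above can be written down as a small classifier. This is only a sketch of the reading given in this comment: the function names `parse_states` and `classify` are hypothetical, and the input is assumed to be the " - " separated annotation string as it appears in the list above.

```python
def parse_states(raw):
    """Split a ' - ' separated IPC shutdown state string into ordered steps."""
    # Prepending a space lets a leading "- " be treated like any other separator.
    return [s.strip() for s in (" " + raw).split(" - ") if s.strip()]


def classify(raw):
    """Return (trigger, outcome) following the interpretation in this comment."""
    steps = parse_states(raw)
    notified = steps.count("NotifiedImpendingShutdown")
    if notified >= 2:
        trigger = "tab closing"       # NotifyTabWillDestroy + SignalImpendingShutdownToContentJS
    elif notified == 1:
        trigger = "parent shutdown"   # only SignalImpendingShutdownToContentJS
    else:
        trigger = "unknown"
    if "SendFinishShutdown (sent)" in steps:
        outcome = "parent did not process the notification"
    else:
        outcome = "child gave no sign of life"
    return trigger, outcome


rank1 = ("- NotifiedImpendingShutdown - NotifiedImpendingShutdown"
         " - HangMonitorChild::RecvRequestContentJSInterrupt (expected)")
rank4 = ("- NotifiedImpendingShutdown"
         " - HangMonitorChild::RecvRequestContentJSInterrupt (expected)"
         " - RecvShutdownConfirmedHP entry - RecvShutdown entry"
         " - content-child-will-shutdown started - ShutdownInternal entry"
         " - content-child-shutdown started - StartForceKillTimer"
         " - SendFinishShutdown (sending) - SendFinishShutdown (sent)")

print(classify(rank1))
print(classify(rank4))
```

Hyphens inside step names such as content-child-shutdown are left intact because only " - " with surrounding spaces is treated as a separator.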
Updated•3 years ago
|
Comment 211•2 years ago
|
||
It might be early to tell but it seems that the patches from bug 1837467 had a pretty positive impact on the numbers here.
In summary, that patch has two consequences:
- We ensure that we start the ForceKillTimer only after we have actually told the child process to shut down, so that slow (unrelated) processing in the parent no longer counts against the child.
- We indirectly increased the time the content process has to do shutdown-related things by adding the second timer for the Browser actor destroy cycle.
It is unclear whether just increasing the timeout would have had a similar effect (although I'd expect the slow parent case to be improved only statistically this way, not systematically). It might be worth experimenting with different timer settings (like splitting the overall time over the two timers with some ratio) to get closer to the former timeouts again.
Updated•1 year ago
|
Comment 214•1 year ago
|
||
¡Hola y'all!
Updating flags per what's visible over at
¡Gracias!
Alex
Comment 215•1 year ago
|
||
¡Hola y'all!
Just noticed these crashes on my personal Nightly install:
bp-540f18d0-3c98-4cd2-b245-586ad0250315
bp-8137a9e1-3eed-4469-852b-8a9d30250315
bp-5c41b607-a2f7-4731-a29b-80a570250315
bp-ef0ea947-3c2e-4eb4-a2b5-493750250315
bp-687dc1a1-e69a-4726-b2e4-997fd0250315
bp-f7d0ca27-a4f3-4cd2-8e4b-64a9b0250315
bp-74b6de71-d25b-43aa-9e5d-112c40250315
Updating flags per
It might be mildly interesting that 48.5% of the crashes there are on 138.0a1.
Is there anything obvious to others causing that spike perhaps?
¡Gracias!
Alex
Comment 216•1 year ago
|
||
There seemed to be a spike only with build 20250314211155. I think this could have been bug 1954244 that has been backed out two builds after.
Comment 217•7 months ago
|
||
¡Hola, y'all!
Got these myself:
bp-a198ce0f-c79c-4960-a6d0-23ed10251001 01/10/2025, 11:08 a.m.
bp-56e65642-e8b0-4827-8fd6-bb7250251001 01/10/2025, 11:08 a.m.
bp-13381eb4-26fd-41f4-a785-6b1ec0251001 01/10/2025, 11:08 a.m.
bp-175c1191-99c4-4b89-ba88-24c2b0251001 01/10/2025, 11:08 a.m.
bp-d8822cef-834c-432e-8c8a-f9fc90251001 01/10/2025, 11:08 a.m.
bp-4bdd8c93-6832-46ff-a1da-bdfc30251001 01/10/2025, 11:08 a.m.
bp-9ca80071-ba2d-4e4d-b2f7-583190251001 01/10/2025, 11:06 a.m.
bp-2f6955ec-ed07-4181-8a6e-b0cba0251001 01/10/2025, 11:06 a.m.
bp-d0e0952f-3aef-4293-90ae-f18b50251001 01/10/2025, 11:06 a.m.
bp-629a21f6-df03-4526-b5b5-40d510251001 01/10/2025, 11:06 a.m.
bp-685fa733-f9a0-481b-964e-d78920251001 01/10/2025, 11:06 a.m.
bp-140a00e8-179d-4187-9d30-49f540251001 01/10/2025, 11:06 a.m.
bp-c77e3d2b-83cf-466d-b51d-f4efc0251001 01/10/2025, 11:06 a.m.
bp-7ce7ce6b-fc7a-4d6a-8a36-538a90251001 01/10/2025, 11:06 a.m.
bp-3558c452-5c96-49f3-ae3b-913a10251001 01/10/2025, 11:06 a.m.
bp-c142233c-875b-440f-a18d-0dcc00251001 01/10/2025, 11:06 a.m.
bp-023f9c8e-6360-4779-b12a-bd4030251001 01/10/2025, 11:06 a.m.
bp-d3954e1f-ebd0-469f-a2f2-2bd5f0251001 01/10/2025, 11:06 a.m.
bp-2b47cb50-8487-418c-8fdb-4ef400251001 01/10/2025, 11:06 a.m.
bp-0e779420-626a-4814-90cb-e38da0251001 01/10/2025, 11:04 a.m.
bp-9e728cff-436a-42af-9b68-09b000250930 30/09/2025, 01:47 p.m.
bp-e7079878-aefd-41e1-ac42-3126e0250929 29/09/2025, 10:44 a.m.
Updating flags per
https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill
FWIW.
Mildly interesting that 145.0a1 represents 57.7% on last week's reports.
¡Gracias!
Alex
I've managed to encounter a few ANRs with child windows open, of which crash-stats.mozilla.org/report/index/fc759ed1-d3fb-45d5-b82b-ec48e0260117 is an example, from when I invoked a recently-closed window.
Comment 219•18 days ago
|
||
14-day ShutDownKill triage (nightly, 2026-04-22 -> 2026-05-06)
5288 IPCError-browser | ShutDownKill content-process crashes on nightly in the window. The top-100 proto_signatures account for 1814 of those (the long tail beyond #100 is small per-signature). BHR correlation uses build 20260430 (the "current" hangs_child file was empty/mis-served on 20260501-20260505).
Of the sampled 1814:
- 821 are idle-at-shutdown (all-boilerplate stacks: event loop, message pump, condition-variable wait). Not individually actionable.
- 993 land on identifiable application code, grouped below by top non-boilerplate frame.
Top clusters (kill counts, plus BHR percentage of total content-process hang time on 2026-04-30 where the same function appears):
1. ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst - 472 kills.
   Spinning the event loop in GrabShutdownProfileAndShutdown under ContentChild::ShutdownInternal / RecvShutdown. Combined with sibling frames on the same teardown chain (nsThread::Shutdown 15 kills, BaseProfilerMutex::Lock 15 kills), this cluster is ~502 kills - by far the dominant signal. Tracked in bug 1622110.
2. Style cascade clearing - ~161 kills combined:
   - style::selector_map::MaybeCaseInsensitiveHashMap::clear (106)
   - servo_arc::Arc Drop under SelectorMap::clear (20)
   - gecko_string_cache::Atom::is_static under SelectorMap::clear (17)
   - SelectorMap::clear directly (10)
   - properties::gecko Drop -> Element::ClearServoData (8)
   All on the CascadeData::clear_cascade_data / Servo_Element_ClearData path. Being addressed under bug 2021304 and bug 2037326.
3. once_cell::imp::initialize_or_wait - 49 kills. NSS-style lazy initialization stalling shutdown. Likely covered by bug 1370516 ("NSS should be initialized off main thread").
4. MemoryTelemetry::GatherReports / jemalloc_stats - 45 kills combined (#4 + #6). Allocator-state traversal driven by MemoryTelemetry::Poke during shutdown. Not tracked as a dependency of this meta - candidate for a new bug.
5. IPC::ChannelWin::Send (PWebRenderBridgeChild::SendEnsureConnected) - 12 kills, but 1.11% of total BHR hang time on 2026-04-30 (the highest BHR signal of any entry in this triage). Implies the same hang occurs in normal browsing, not just shutdown. Candidate for a new bug under both 1279293 and the BHR meta.
6. MessageChannel::WaitForSyncNotify -> WebRenderBridge SendEnsureConnected - 5 kills, 0.46% BHR. Same WebRender connect path as (5).
Smaller clusters (5-25 kills each): GC parallel-task joins, jemalloc free, JS arena finalization, nsTimerImpl::InitHighResolution, ScriptLoader cancel, mozannotation_record_cstring, ipc::shared_memory::Platform::Unmap, etc. None individually large; many are shutdown-only with no BHR match.
Recommendations:
- Bug 1622110 remains the highest-leverage fix (~9% of all 14-day kills, ~28% of the 1814 sampled kills, about half of the 993 identifiable ones).
- The style-cascade cluster (~3% of all kills) is being driven down by 2021304 / 2037326; the residual is consistent with the still-synchronous document-teardown drain.
- Bugs worth filing as new dependencies of this meta: MemoryTelemetry::GatherReports during shutdown (#4/#6), and IPC::ChannelWin::Send / WebRenderBridge SendEnsureConnected (#5/#6 above) - the latter has a clear BHR co-signal so the fix benefits day-to-day responsiveness too.
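As a cross-check, the shares quoted in these recommendations can be recomputed from the counts given earlier in this comment. The sketch below uses only those numbers; the variable names and the `pct` helper are made up for illustration. It shows each cluster against the three possible denominators (all kills, the sampled top-100, and the identifiable subset), since the percentage changes a lot depending on which one is meant.

```python
TOTAL_KILLS = 5288    # all nightly ShutDownKill crashes in the 14-day window
SAMPLED = 1814        # covered by the top-100 proto_signatures
IDLE = 821            # idle-at-shutdown, all-boilerplate stacks
IDENTIFIABLE = 993    # land on identifiable application code

profiler_cluster = 472 + 15 + 15          # profiler shutdown + sibling frames
style_cluster = 106 + 20 + 17 + 10 + 8    # style cascade clearing


def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


shares = {
    "profiler / all kills": pct(profiler_cluster, TOTAL_KILLS),
    "profiler / sampled": pct(profiler_cluster, SAMPLED),
    "profiler / identifiable": pct(profiler_cluster, IDENTIFIABLE),
    "style / all kills": pct(style_cluster, TOTAL_KILLS),
}
for label, value in shares.items():
    print(f"{label}: {value}%")
```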
Comment 220•18 days ago
•
|
||
Follow-up on the idle-at-shutdown bucket from the previous comment (821 of 1814 sampled).
I pulled 20 idle-main-thread crash dumps from the last 14 days and walked every thread on every dump (~640 thread-instances) looking for any sign that the child had received or begun processing shutdown. The sample is small, but the pattern looks consistent:
- No frame matching RecvShutdown / ShutdownInternal / XPCOMShutdown / Quit / any shutdown-observer call appeared on any thread of any of these 20 dumps.
- ShutdownProgress, AsyncShutdownTimeout, MozCrashReason are all null on these dumps. The only shutdown-related field is the synthetic ipc_channel_error: "ShutDownKill" that the parent's KillHard path sets.
- On every sampled dump, every thread appears parked on a Windows kernel wait (ZwWaitForAlertByThreadId on the worker pool, NtRemoveIoCompletion on IPC I/O Child / AudioIPC, NtWaitForSingleObject/MultipleObjects elsewhere). On the dumps I checked the IPC I/O Child had nothing pending in its completion port.
If that pattern holds more broadly, these crashes seem unlikely to be "child stuck on something." They look more like the child being fully quiescent at the moment the parent terminates it, apparently before the shutdown message has been picked off the IO completion port - and the minidump is captured by ContentParent::KillHard immediately before TerminateProcess.
System demographics for the 20 sampled (sorted by uptime):
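| uptime (s) | avail_mb | total_mb | cpu | os |
|---|---|---|---|---|
| 31 | 202 | 3499 | 2 | Win11 |
| 34 | 569 | 8090 | 4 | Win10 |
| 138 | 1942 | 7971 | 4 | Win11 |
| 149 | 22487 | 40894 | 6 | Win11 |
| 245 | 569 | 16239 | 4 | Win10 |
| 469 | 1104 | 16070 | 12 | Win11 |
| 1048 | 4981 | 12184 | 4 | Win11 |
| 2442 | 421 | 3517 | 2 | Win10 |
| 2445 | 506 | 3517 | 2 | Win10 |
| 2729 | 602 | 3918 | 2 | Win11 |
| 2860 | 1334 | 8087 | 2 | Win10 |
| 2980 | 835 | 8087 | 2 | Win10 |
| 11577 | 536 | 3517 | 2 | Win10 |
| 12642 | 2672 | 10170 | 4 | Win11 |
| 61980 | 844 | 6020 | 2 | Win11 |
| 66390 | 779 | 5965 | 2 | Win11 |
| 73330 | 19454 | 32452 | 2 | Win10 |
| 81549 | 4157 | 32272 | 8 | Win11 |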
15/20 have <=4 CPU cores and most show under 1 GB available physical RAM. Uptime spreads from 31s to 22h; both ends are represented but low-end / memory-pressured machines seem to dominate. With n=20 this is suggestive rather than conclusive.
If that demographic skew is real, the most plausible mechanisms appear to be variants of "the OS apparently doesn't give the child enough wall-clock to react inside the parent's KillHard timeout":
- Working set trimmed: a backgrounded content process may have its private working set evicted under system pressure; when RecvShutdown arrives the kernel might need to fault code/data pages back in just to schedule the IO Child thread.
- EcoQoS / Process State Manager throttling on Win11 background processes.
- Modern Standby / system-resume races on laptops.
- Below-Normal priority combined with foreground contention on 2-core systems.
If any of these is the dominant cause, none of them would really be a hang in the child to fix - the issue would be that the OS is giving the child less wall-clock than the parent assumed it would have.
A possible practical lever, rather than chasing these as per-signature bugs, could be to proactively retire idle background content children when the parent observes system-level memory pressure, instead of letting them go cold and become KillHard candidates later. The relevant pieces already seem to exist but apparently aren't joined up:
- In-process memory-pressure observers fire inside each content process; they don't appear to drive parent-side eviction.
- nsITabUnloader unloads tabs under pressure but a content child seems to stay alive until its keep-alive grace expires. The lag between unload and process termination might be the window where pages get trimmed.
- ProcessPriorityManager adjusts priority/visibility, and the parent does lift process priority and main-thread QoS just before sending RecvShutdown (ContentParent.cpp:3659 and 1694-1698) - but neither path terminates a child to relieve system pressure, and the kill-timeout itself is not gated on observed pressure.
One candidate next step would be exploring whether a system-wide pressure signal (Windows MemoryResourceNotification, equivalents elsewhere) could feed into the parent-side decision to prefer terminating empty/idle content children earlier. If that worked, it would presumably reduce both the cold-page footprint during the session and the number of children the parent has to wake at quit - which appears to be what produces the bulk of this 821-crash bucket. But this is speculative; it would need to be validated with data once we land something like this.
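To illustrate what such a parent-side decision could look like, here is a purely hypothetical policy sketch in Python. Nothing in it corresponds to existing Gecko code: the `ContentChild` record, the `children_to_retire` function and the 60-second idle threshold are assumptions made up for the example, and the pressure flag stands in for whatever system-wide signal would be used.

```python
from dataclasses import dataclass


@dataclass
class ContentChild:
    pid: int
    live_tabs: int        # tabs still hosted by this process
    idle_seconds: float   # time since the process last did any work


def children_to_retire(children, under_memory_pressure, min_idle_seconds=60.0):
    """Pick empty, idle children to shut down early while the system is under pressure."""
    if not under_memory_pressure:
        return []
    candidates = [c for c in children
                  if c.live_tabs == 0 and c.idle_seconds >= min_idle_seconds]
    # Retire the coldest processes first; they are the likeliest to have been trimmed.
    return sorted(candidates, key=lambda c: c.idle_seconds, reverse=True)


children = [
    ContentChild(pid=101, live_tabs=2, idle_seconds=900.0),  # still hosts tabs
    ContentChild(pid=102, live_tabs=0, idle_seconds=30.0),   # empty but recently active
    ContentChild(pid=103, live_tabs=0, idle_seconds=300.0),  # empty and cold
    ContentChild(pid=104, live_tabs=0, idle_seconds=1200.0), # empty and coldest
]
retire_pids = [c.pid for c in children_to_retire(children, under_memory_pressure=True)]
print(retire_pids)
```

The design choice here is to retire children while they are still warm enough to shut down cleanly, rather than waking them at quit; whether that trade-off pays off is what the data would have to show.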
Tentative recommendation for this meta: treat the idle bucket as probably non-actionable per-signature noise.
Description
•