[meta] Crash in [@ IPCError-browser | ShutDownKill]
Categories
(Core :: DOM: Content Processes, defect, P2)
Tracking
 | Tracking | Status
---|---|---
firefox47 | --- | wontfix |
firefox48 | --- | wontfix |
firefox49 | --- | wontfix |
firefox-esr45 | --- | wontfix |
firefox50 | + | wontfix |
firefox51 | --- | wontfix |
firefox52 | --- | wontfix |
firefox-esr52 | --- | wontfix |
firefox53 | --- | wontfix |
firefox-esr115 | --- | affected |
firefox54 | --- | wontfix |
firefox55 | --- | wontfix |
firefox56 | --- | wontfix |
firefox57 | --- | wontfix |
firefox58 | --- | wontfix |
firefox59 | --- | wontfix |
firefox60 | --- | wontfix |
firefox61 | --- | wontfix |
firefox68 | --- | wontfix |
firefox69 | --- | wontfix |
firefox70 | --- | wontfix |
firefox73 | --- | wontfix |
firefox74 | --- | wontfix |
firefox75 | --- | wontfix |
firefox83 | --- | wontfix |
firefox84 | --- | wontfix |
firefox85 | --- | wontfix |
firefox86 | --- | wontfix |
firefox87 | --- | wontfix |
firefox88 | --- | wontfix |
firefox89 | --- | wontfix |
firefox90 | --- | wontfix |
firefox91 | --- | wontfix |
firefox92 | --- | wontfix |
firefox93 | --- | wontfix |
firefox94 | --- | wontfix |
firefox131 | --- | affected |
firefox132 | --- | affected |
firefox133 | --- | affected |
People
(Reporter: marvinhk, Unassigned)
References
(Depends on 11 open bugs)
Details
(5 keywords)
Crash Data
Attachments
(2 files)
Comment 106•6 years ago
Hello!
Adding the flag for 68 since, as per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill#summary, it accounts for 91.2% of the total crashes.
FWIW all of the crashes on my Nightly in the last month or so are variants of this:
Submitted Crash Reports
Report ID | Date Submitted
---|---
bp-9e113f57-ef91-403e-853f-2e06b0190515 | 5/14/2019, 7:44 PM
bp-d56a8ef4-0853-4d07-87cf-b9f170190515 | 5/14/2019, 7:44 PM
bp-2252da92-b4f8-4cef-915d-0d2670190514 | 5/13/2019, 8:27 PM
bp-3407c931-83e4-4a6f-96b1-b894d0190514 | 5/13/2019, 8:27 PM
bp-7a5a8541-7d1a-4815-9ccd-3a3250190514 | 5/13/2019, 8:27 PM
bp-57533b9a-a227-4995-a315-725f10190514 | 5/13/2019, 8:27 PM
bp-74da1097-3526-4b1a-9654-e34ae0190513 | 5/13/2019, 9:11 AM
bp-342bf38b-c00c-4c85-8454-447b70190510 | 5/10/2019, 10:17 AM
bp-1bcfe01d-3bed-4ff0-8a2f-d22710190510 | 5/10/2019, 8:56 AM
bp-be35f4be-6f78-4e48-8e73-a49a70190509 | 5/9/2019, 9:09 AM
bp-d42e065c-5710-492b-a2e5-4b33b0190509 | 5/9/2019, 9:09 AM
bp-04d12eba-5a5f-4351-8932-efddf0190509 | 5/9/2019, 9:09 AM
bp-eee693e1-0831-43f3-95a6-bb0980190509 | 5/9/2019, 9:09 AM
bp-658039b7-241a-4d98-8c1c-9f3780190509 | 5/9/2019, 9:09 AM
bp-ad35eb3d-8add-4730-85f8-3dcb00190509 | 5/9/2019, 9:09 AM
bp-6b0d5d47-535e-4dcc-97d7-0ccc60190509 | 5/9/2019, 9:09 AM
bp-4b7c1b5d-bdbf-4850-a775-f5ef90190509 | 5/9/2019, 9:09 AM
bp-001707f9-503e-41d6-bdd2-80de60190508 | 5/8/2019, 11:46 AM
bp-996a13c2-3bce-4226-98b5-d918c0190508 | 5/8/2019, 11:46 AM
bp-98c46dd3-a812-4f23-9011-bf22b0190506 | 5/6/2019, 4:03 PM
bp-3ad29622-5e18-4f48-945a-514390190503 | 5/3/2019, 3:22 PM
bp-106b7b55-3629-47c2-abfc-3cbd10190503 | 5/3/2019, 9:09 AM
bp-fa2c5fdd-17c2-4519-8236-f439b0190501 | 4/30/2019, 8:24 PM
bp-52f83e1a-9bd9-4332-89e1-dd8640190429 | 4/29/2019, 11:52 AM
bp-9c0376bc-200d-45ba-bd75-1d0770190429 | 4/29/2019, 11:52 AM
bp-75007eeb-2cd9-45dd-a4e2-d74280190429 | 4/29/2019, 11:52 AM
bp-c91bfdaf-d5ff-4f0c-8ed8-7ef250190429 | 4/29/2019, 11:52 AM
bp-8d079313-a3d8-44cb-b11e-1ab5c0190429 | 4/29/2019, 11:52 AM
bp-fcbfa40c-31da-4c8d-b138-406200190429 | 4/29/2019, 11:52 AM
bp-7d798392-8924-4c2b-bb6d-d3af90190429 | 4/29/2019, 11:52 AM
bp-ea25d6e2-0ccc-4faf-a537-37b360190429 | 4/29/2019, 11:52 AM
bp-05304540-cfb0-42c7-95c1-d197c0190425 | 4/25/2019, 9:12 AM
bp-750cc9de-d914-4406-b78f-53ba80190424 | 4/24/2019, 10:03 AM
bp-2bba28a8-3f01-494e-b20b-982590190424 | 4/24/2019, 10:03 AM
bp-74065cdd-52ba-4682-92f4-7676d0190424 | 4/24/2019, 10:03 AM
bp-c9e2255a-bd10-4471-8e59-cb2820190424 | 4/24/2019, 10:03 AM
bp-57861240-6f6f-415a-8312-4f0c50190424 | 4/24/2019, 10:03 AM
bp-cc5d3dec-9c8e-41eb-aad5-6e84e0190423 | 4/23/2019, 9:04 AM
bp-df0a195b-2441-4621-b56d-705f50190421 | 4/21/2019, 1:25 PM
bp-9817d377-101f-4b4b-990c-0b1f90190410 | 4/10/2019, 4:36 PM
bp-932d2082-6f7c-4eb2-91e4-ea71f0190410 | 4/10/2019, 9:10 AM
Do let me know if there's anything I could provide on this bug and I'll be happy to do so.
Thanks!
Alex
This saw a large uptick during March 2019, driven by the Nightly channel. I understand that this bug is a catch-all of sorts due to how our crash signatures are handled, but do we have an explanation for the uptick?
Edit: it seems this discussion has already happened over at https://bugzilla.mozilla.org/show_bug.cgi?id=1219672#c79
Comment 117•5 years ago
I see nothing actionable here. This looks more like a meta bug. Removing the qawanted flag.
Comment 122•5 years ago
Now that bug 1612569 is fixed, the signature here has dropped to zero. I suggest we keep using this bug both as a meta bug for bugs with appropriate signatures (blocking this one) and that we update the signature here to include crashes where the stacks are all over the place and thus unlikely to be actionable. I've already identified a few of them, such as:
- IPCError-browser | ShutDownKill | mozilla::ipc::MessagePump::Run
- IPCError-browser | ShutDownKill | Nt.*
- IPCError-browser | ShutDownKill | PR_MD_WAIT_CV | _PR_WaitCondVar | PR_Wait | nsThreadStartupEvent::Wait
Given the volume of these crashes we should also re-evaluate the shutdown-killer timeout, which is currently set at 5 seconds. We might want to bump it up and see whether that affects the crash rate.
Another point: I've found crashes where the content process was still starting up when we killed it. I wonder if we even support this scenario, i.e. sending a process a shutdown message before it has finished initializing.
Comment 123•5 years ago
There's also a worrisome number of crashes where all the threads in the process are stuck waiting for something. In many cases these involve event queues and thread pools, so I fear we might be seeing actual deadlocks caused by races.
Comment 124•5 years ago
As per my previous comment I'll start adding signatures that don't seem directly actionable to this bug so we can keep track of the non-actionable rate of these hangs.
Comment 125•5 years ago
Quick update: I may have figured out the reason behind many of the shutdown hangs that look like deadlocks. You can see the nitty-gritty details in bug 1614570 comment 1, but in short what's happening is that we stop the event loops the moment we receive the content-process shutdown message. If some code relies on runnables or on spinning an event loop for its shutdown procedure, it will deadlock.
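To make that failure mode concrete, here is a minimal, self-contained sketch of the pattern; the EventLoop type and the names are hypothetical, not actual Gecko code. A shutdown path that spins an already-stopped event loop waits for a runnable that can never run.

```cpp
// Hypothetical sketch (not Gecko code) of the deadlock described above:
// a component's shutdown path queues a runnable and spins until it has run,
// but the event loop has already been stopped, so the runnable never executes.
#include <deque>
#include <functional>

struct EventLoop {
  std::deque<std::function<void()>> queue;
  bool stopped = false;  // set once the content-process shutdown message arrives

  void Dispatch(std::function<void()> r) { queue.push_back(std::move(r)); }

  // Returns false once the loop is stopped; real code often ignores this.
  bool ProcessNextEvent() {
    if (stopped || queue.empty()) return false;
    auto r = std::move(queue.front());
    queue.pop_front();
    r();
    return true;
  }
};

bool ShutdownMyComponent(EventLoop& loop) {
  bool done = false;
  loop.Dispatch([&done] { done = true; });  // never runs if the loop is stopped
  while (!done) {
    if (!loop.ProcessNextEvent()) {
      // Without this bail-out the spin never ends and the process is
      // eventually killed with the ShutDownKill annotation.
      return false;
    }
  }
  return true;
}

int main() {
  EventLoop loop;
  loop.stopped = true;  // shutdown already stopped event processing
  return ShutdownMyComponent(loop) ? 0 : 1;
}
```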
Comment 126•5 years ago
Crashes with this signature are content processes being slow, not stuck. They're signaling the main process that they're done shutting down but didn't do it in time.
Comment 127•5 years ago
Adding more signatures that don't seem actionable. The crashes are in various states of shutdown so they're not stuck but probably just slow.
Comment 129•5 years ago
Added another "we're just being slow" signature.
Comment 130•5 years ago
The last signature belongs to bug 1578070; I'm moving it there.
Comment 131•5 years ago
Added more "slow" signatures.
Comment 133•5 years ago
Hello!
Got this on Firefox 79.0a1: Crash Report [@ IPCError-browser | ShutDownKill | PR_MD_WAIT_CV | _PR_WaitCondVar | PR_Wait | nsThreadStartupEvent::Wait ]
Submitted Crash Reports
Report ID | Date Submitted
---|---
bp-92ef466b-a927-4eea-9d9f-6bc280200616 | 6/16/2020, 06:16
Thanks!
Alex
Comment 134•5 years ago
This is a meta bug; please leave the status flags unset. If there are specific instances of these crashes that are reproducible or actionable, they should have their own bugs and status flags.
Comment 135•5 years ago
Hello Gabriele!
For reasons unbeknownst to me, Nightly has been hang-happy over the last two weeks, but the crash reporter won't fire.
This is what I managed to find in about:crashes, FWIW:
Submitted Crash Reports
Report ID | Date Submitted
---|---
bp-82dbed26-65af-4dbe-8f25-2519a0200710 | 7/10/2020, 10:26
bp-2e6fb69d-d92e-47cb-bf80-80c3c0200709 | 7/9/2020, 18:40
bp-9933bec5-18ff-4ff3-ac14-aed1f0200707 | 7/7/2020, 11:28
bp-3f3c8853-1258-4699-9d80-96fb00200707 | 7/7/2020, 11:28
bp-a20e6067-7ac0-40b8-b00d-38a730200627 | 6/26/2020, 20:16
bp-db518be7-813b-4b10-83fd-fd2910200627 | 6/26/2020, 20:16
bp-becff340-b12f-4c18-a79b-1ba400200627 | 6/26/2020, 20:05
bp-0c224c7e-8df6-46ae-8a74-b82ce0200624 | 6/23/2020, 19:29
bp-4b538f74-a911-4b26-80f6-d92ff0200623 | 6/23/2020, 08:22
bp-d21e6c07-f95b-4129-9106-842a50200622 | 6/22/2020, 08:50
bp-6b517307-362a-4822-8ebe-52f5c0200622 | 6/22/2020, 08:34
Can you please comment on what could be done to try to pin down the cause of these crashes?
Thanks!
Alex
Comment 136•5 years ago
These reports won't trigger the crash reporter because they aren't actual crashes; they're hangs. Basically, a content process took too long to shut down and we killed it after a certain time (currently 20 s). Before killing it we grab a snapshot of its state, and that's how you got those reports. Do you have Fission enabled? If you do, you'll have more content processes around and it will be more likely that one of them is too slow when shutting down.
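For context, here is a rough, self-contained sketch of the flow described above; the types and function names are hypothetical, and the 20-second value is taken from the comment, not from the source.

```cpp
// Hypothetical sketch (not the real ContentParent code) of the mechanism:
// ask the child to shut down, wait up to a fixed timeout, and if it is still
// alive take a snapshot (minidump) and kill it. That snapshot is the
// "crash report" carrying the ShutDownKill signature.
#include <chrono>
#include <cstdio>
#include <thread>

struct ChildProcess {
  bool exited = false;
  void SendShutdown() { /* IPC message; the child may be too busy to see it */ }
  bool HasExited() const { return exited; }
  void TakeSnapshot() { std::puts("writing minidump pair (ShutDownKill)"); }
  void KillHard() { std::puts("terminating child process"); }
};

void ShutDownChild(ChildProcess& child,
                   std::chrono::seconds timeout = std::chrono::seconds(20)) {
  child.SendShutdown();
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!child.HasExited()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      child.TakeSnapshot();  // grabbed *before* the kill, so it is not a real crash
      child.KillHard();
      return;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
  }
}

int main() {
  ChildProcess slowChild;
  ShutDownChild(slowChild, std::chrono::seconds(1));  // shortened for the demo
}
```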
Comment 137•5 years ago
Hello Gabriele!
Thanks for responding here.
fission.autostart is set to false in this Nightly.
I guess I'll continue to try to crash the hung Nightly with https://ftp.mozilla.org/pub/utilities/crashfirefox-intentionally/crashfirefox64.exe for the time being and see if a usable crash report is generated.
If anyone has ideas on how to pin these down, I'd be grateful if they'd share them.
Thanks!
Alex
Comment 138•5 years ago
Bug 1637048 has significantly reduced the volume here. It's time to revisit the signatures and check if we've got some actual deadlocks in there. NI? me so I don't forget.
Comment 139•5 years ago
For starters I pruned all the signatures that have no crashes for recent versions.
Comment 141•4 years ago
Hello Gabriele!
Sorry to bother you again.
Some more crashes from the past two weeks:
Submitted Crash Reports
Report ID | Date Submitted
---|---
bp-f26ad0c1-1202-49cf-bd25-dcb8c0200902 | 9/2/2020, 16:34
bp-6533606c-4fb5-4ac7-b891-3f0410200828 | 8/28/2020, 11:18
bp-873447b7-869f-4772-8f6d-972700200825 | 8/25/2020, 08:41
bp-6da34ca0-f564-48e1-8aa1-9a5730200824 | 8/24/2020, 08:56
bp-217dec7c-f88a-4812-b2e7-e89400200823 | 8/23/2020, 09:53
bp-989f9c12-eb5c-46ba-ab5f-d076a0200822 | 8/21/2020, 22:03
bp-5f74b094-7a1b-4b3e-919c-b6cd30200821 | 8/21/2020, 12:11
bp-79ee51df-9d99-4721-a6cc-6f41a0200820 | 8/20/2020, 04:56
bp-fc0ae5be-5621-4534-a4c3-24d710200819 | 8/19/2020, 08:59
Please let me know here if any of these needs separate bug reports and I'd be happy to file them.
Thanks!
Alex
Comment 142•4 years ago
Thanks Alex. I glanced over the crashes and it seems we have everything on file already. One question however: most of your crashes appear to be content processes being slow but from the looks of it your machine is fairly powerful. Do you have an SSD or a regular HDD on your machine? I'd like to figure out what could be causing those content processes to respond slowly.
Comment 143•4 years ago
Hello Gabriele!
Thanks for taking a look at these.
Here are the main specs for this laptop:
Processor Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz 2.90 GHz
Installed RAM 24.0 GB (23.8 GB usable)
System type 64-bit operating system, x64-based processor
Edition Windows 10 Pro Insider Preview
Version 2004
Installed on 9/5/2020
OS build 20206.1000
Experience Windows Feature Experience Pack 120.22800.0.0
SSD SAMSUNG MZVLB1T0HALR-000L7
Hope this helps.
If you need anything else from this system, please just ni?
Thanks!
Alex
Comment 144•4 years ago
Thanks Alex, this is very useful. You've got a fast machine so there's no real reason why content processes might take so long to shut down. We'll have to do a proper profile of a content process shutting down to figure out what's going on. I also wonder if Windows task scheduling might be playing a part here.
Comment 145•4 years ago
Firefox just abruptly crashed for me while I was on another desktop. When I switched back, there was no crash reporter dialog.
bp-79d24fbe-41a3-4463-a636-8037b0201014
bp-724c3c22-e11e-4a14-a0d5-ad6870201014
Comment 146•4 years ago
Hello y'all!
Found this one on 85.0a1:
bp-5d378d24-1334-4e36-bc09-274930201123 11/23/2020, 17:53
Updating flags per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20mozilla%3A%3Aipc%3A%3AMessagePump%3A%3ARun
It seems to be happening much more on 85.0a1 in the past week, BTW.
Hope this is helpful.
Thanks!
Alex
Comment 147•4 years ago
Changing the status for beta to match nightly and release.
For more information, please visit auto_nag documentation.
Comment 148•4 years ago
Hello y'all!
Found this one on 86.0a1:
https://crash-stats.mozilla.org/report/index/11bfbef7-8317-4f9a-80e4-66b450201223#tab-details
Updating flags per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20NtYieldExecution
It seems to be happening much more on 86.0a1 in the past week, BTW.
Hope this is helpful.
Thanks!
Alex
Comment 150•4 years ago
Hello!
Found this one on:
Product Firefox
Release Channel nightly
Version 87.0a1
Build ID 20210212100155 (2021-02-12)
bp-f305824e-bd5b-4a68-a7ee-74e980210213 2/13/2021, 02:10
Updating flags.
Thanks!
Alex
Comment 151•4 years ago
Hello y'all!
Found these on:
Product Firefox
Release Channel nightly
Version 88.0a1
Build ID 20210227094458
Report ID Date Submitted
bp-43a0d1ca-dcae-42dd-ab6b-797ad0210302 3/2/2021, 15:07
bp-1a91f848-051d-40c5-97f3-fd9e80210227 2/27/2021, 20:26
Updating flags.
Thanks!
Alex
Comment 152•4 years ago
Hello Gabriele!
Is https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20servo_arc%3A%3AArc%3CT%3E%3A%3Adrop_slow%3CT%3E%20%7C%20nsIFrame%3A%3A~nsIFrame&date=%3E%3D2020-12-12T00%3A47%3A00.000Z&date=%3C2021-03-12T00%3A47%3A00.000Z also this one or a different one?
Thanks!
Alex
Comment 153•4 years ago
Hello Gabriele!
Also these:
Thanks!
Alex
Comment 154•4 years ago
Thanks Alex, these signatures belong in here.
Comment 155•4 years ago
Hi,
Although there are only a few crashes here, this signature belongs here too.
Thanks :)
Comment 156•4 years ago
I don't know; the stacks are a bit of a mish-mash under this signature.
Comment 157•4 years ago
Hello Gabriele!
Is https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20xpc%3A%3AXrayWrapper%3CT%3E%3A%3AgetPrototype&date=%3E%3D2021-03-09T03%3A42%3A00.000Z&date=%3C2021-04-09T03%3A42%3A00.000Z also this one or a different one?
Also got bp-ad653f77-731c-4669-9b2e-407ce0210409 in
Product Firefox
Release Channel nightly
Version 89.0a1
Build ID 20210407212527
so updating flags FWIW.
Thanks!
Alex
Comment 158•4 years ago
This one, thanks!
Comment 159•4 years ago
The volume for this has increased overall. On Nightly, [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessagePump::Run] moved from 200-250 crashes/day to 350-600. It might have started with build 20210321093903, but those changes look unrelated.
Comment 160•4 years ago
This is a natural effect of having turned on Fission for more users. With more content processes active at a given time, the chance of one being slow at shutdown has increased, leading to more reports here. Given that this is only going to increase in the future, we probably need to decide what to do with these reports: it's been a while since we found content processes that were genuinely stuck at shutdown in a way that was actionable.
Comment 163•4 years ago
Hello y'all!
Got https://crash-stats.mozilla.org/report/index/57f5e1d9-e5e1-4992-b2c7-24c530210513 on 90.
Updating flags FWIW.
Thanks!
Alex
Comment 164•4 years ago
Found another signature. I should find time to go over all the existing ones and leave only the relevant ones.
Comment 166•4 years ago
Hello Gabriele!
FWIW per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20js%3A%3Ajit%3A%3AMaybeEnterJit&date=%3E%3D2021-04-27T16%3A37%3A00.000Z&date=%3C2021-05-27T16%3A37%3A00.000Z IPCError-browser | ShutDownKill | js::jit::MaybeEnterJit has spiked a bit on 90.
Thanks!
Alex
Comment 167•4 years ago
(In reply to alex_mayorga from comment #166)
FWIW per https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20js%3A%3Ajit%3A%3AMaybeEnterJit&date=%3E%3D2021-04-27T16%3A37%3A00.000Z&date=%3C2021-05-27T16%3A37%3A00.000Z IPCError-browser | ShutDownKill | js::jit::MaybeEnterJit has spiked a bit on 90.
Yes, this might be an actual issue. None of the crashes have the IPC shutdown state annotation set, which means they didn't even begin to shut down; it's possible that they were genuinely stuck, possibly deadlocked.
Comment 168•4 years ago
I have been experiencing this crash and rarely see a crash reporter window; you have to go in and submit the crash reports yourself. Losing the tab history with each crash makes it hardly worth using daily.
Having the window simply disappear without anything is just not a satisfactory outcome for me. We are not really dealing with a crash on shutdown: the user interface disappears while I'm using it, without notice, without any of the normal crash reporting occurring, and with nothing in the Windows event log. If my experience is anything to go by, you are only getting a fraction of the reports submitted. When I went to look after this last disappearance I submitted a whole host of old reports, and my first attempt to submit them resulted in a failure appearing in the troubleshooting list. I think the crash reporter may also be fundamentally broken here.
Comment 169•4 years ago
There are no crashes under this bug; these are a specific type of shutdown hang. If Firefox disappears without showing the crash reporter, please file another bug with a detailed description of your setup (OS, Firefox version, sites/scenarios where it used to happen). Use the Toolkit > Crash Reporting component, because the crash reporter not showing up belongs there.
Comment 170•4 years ago
Here's another signature. I'm not sure why mozglue.dll is in the signature and not symbolized.
Comment 171•4 years ago
Hello Gabriele!
Hope these lines find you well.
Reviewing about:crashes on this Nightly I found
https://crash-stats.mozilla.org/report/index/11f5e988-7b45-494d-b49c-4b7930210713#tab-details
which per
https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20_tailMerge_d3dcompiler_47.dll%20%7C%20mozilla%3A%3Aipc%3A%3AMessagePump%3A%3ARun
now seems to be a Fission crash.
Should that be a separate bug or is this one still the one that's relevant?
FWIW I've updated the flags per the links above.
Thanks!
Alex
Comment 172•4 years ago
These stacks are identical to previous ones; they only look different because we had an issue with the scripts that fetched the symbols, so the resulting unwinding information was missing. I've updated it and reprocessed the crashes.
As we turn on Fission for more and more users we get more and more of these reports, but their usefulness is limited; I think it's time to discuss whether we still want to gather them or not.
Comment 173•3 years ago
I noticed my Firefox (Nightly, macOS) was completely hung, so I took a sample and spindump, then killed the app. When I opened Firefox back up, I had an unsent crash report from a couple minutes prior, presumably from when the hang started. That report pointed me here. I do have Fission enabled. Would the process sample and spindump be helpful in diagnosing this? If so, I can provide. Thanks!
Comment 174•3 years ago
(In reply to Sam Johnson from comment #173)
I noticed my Firefox (Nightly, macOS) was completely hung, so I took a sample and spindump, then killed the app. When I opened Firefox back up, I had an unsent crash report from a couple minutes prior, presumably from when the hang started. That report pointed me here. I do have Fission enabled. Would the process sample and spindump be helpful in diagnosing this? If so, I can provide. Thanks!
Probably yes. The crash report you got was from a child process that was taking too long to shut down; we grab them when a child process does not respond for long enough. That being said, that very crash report is captured in the main process, so it might be possible that the process was taking too long and was responsible for the freeze.
Comment 175•3 years ago
I've zipped up the sample and spindump, attached. FWIW, when I noticed the app had hung, it was not after any attempts I made to quit the app; it was in the background while I was working in another program.
Comment 176•3 years ago
(In reply to Sam Johnson from comment #175)
I've zipped up the sample and spindump, attached. FWIW, when I noticed the app had hung, it was not after any attempts I made to quit the app; it was in the background while I was working in another program.
Thanks, I'll have a look ASAP.
Comment 177•3 years ago
Heads up for everybody tracking this bug: we've decided to revert the change that split the crash signatures for this type of report. When the change is applied in the coming weeks, the signatures will collapse back into one. The reasoning behind this is that we only found a handful of useful signatures and far too many unactionable ones. The latter are making nightly crash triage harder while providing little value. We'll keep gathering these reports for the time being until we figure out a better way to make them actionable.
Comment 178•3 years ago
Hello y'all!
My Firefox Nightly crashed like this:
https://crash-stats.mozilla.org/report/index/e3d52b60-7184-4c46-be35-06cff0210918
Updating flags per
https://crash-stats.mozilla.org/signature/?product=Firefox&signature=IPCError-browser%20%7C%20ShutDownKill%20%7C%20mozilla%3A%3Aipc%3A%3AMessagePump%3A%3ARun
FWIW
Thanks!
Alex
Comment 179•3 years ago
We can now aggregate in crash-stats on the "Xpcom spin event loop stack" annotation. This leads to the following results.
Outstanding are: RequestHelper::StartAndReturnResponse, browser-custom-element.js:permitUnload and various nsThread::Shutdown * flavors.
Comment 180•3 years ago
There are more SpinEventLoopUntil locations in the data.
The question is whether the nsThreadShutdown: * entries have some common cause/pattern.
Comment 181•3 years ago
(In reply to Jens Stutte [:jstutte] from comment #180)
The question is if the nsThreadShutdown: * have some common cause/pattern.

I just clicked through some reports, but it seems that many of them do not even contain the thread we are waiting for. This sounds as if our book-keeping of closing threads could fail in some cases?
IIUC the flow, we will unblock only if SchedulerGroup::Dispatch(TaskCategory::Other, event.forget()) succeeds, such that we will call nsThread::ShutdownComplete on the main thread. We do not check the return value here, and there are at least some mallocs on the way to a successful dispatch that could fail. Should we check for a successful dispatch here and, if it fails, clear context->mAwaitingShutdownAck by hand off the main thread?
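A self-contained, simplified sketch of the check proposed above; the real code lives in nsThread.cpp and uses SchedulerGroup::Dispatch and nsresult, so the stand-in types and helpers here are assumptions made purely for illustration.

```cpp
// Simplified stand-in for the proposal above: if dispatching the shutdown ack
// to the main thread fails, clear mAwaitingShutdownAck by hand so the joining
// thread does not wait forever. Types and helpers are hypothetical.
#include <atomic>
#include <functional>

struct ShutdownContext {
  std::atomic<bool> mAwaitingShutdownAck{true};  // member name taken from the comment
};

// Stand-in for dispatching a runnable to the main thread; returns false when
// the dispatch fails (e.g. out of memory), mirroring a failed nsresult.
bool DispatchToMainThread(std::function<void()> runnable) {
  (void)runnable;
  return false;  // simulate the failure case being discussed
}

void SendShutdownAck(ShutdownContext& ctx) {
  bool dispatched = DispatchToMainThread([&ctx] {
    // Would run on the main thread, corresponding to nsThread::ShutdownComplete.
    ctx.mAwaitingShutdownAck = false;
  });
  if (!dispatched) {
    // The ack can never arrive, so don't leave the flag set: otherwise the
    // shutdown path would spin its event loop until ShutDownKill hits.
    ctx.mAwaitingShutdownAck = false;
  }
}

int main() {
  ShutdownContext ctx;
  SendShutdownAck(ctx);
  return ctx.mAwaitingShutdownAck ? 1 : 0;  // 0: we won't hang waiting for the ack
}
```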
Comment 182•3 years ago
I don't really know anything about threads. Kris Wright is a better person to ask about them.
Comment 183•3 years ago
So this keeps spinning in my mind as a nested event loop... Just a guess, but I see basically two ways we can end up in this situation:
- Unexpected thread death without cleanup: If a thread could die unexpectedly without cleaning up our in-process-memory book-keeping (maybe by some interrupt?) and we then arrive later in nsThread::Shutdown(), we would still be able to dispatch an event to its queue and then wait forever for the shutdown ack message to come. It is not clear to me how this could happen, but it would be nice if there was a (system) function to check whether a thread handle is really alive before we even try to shut it down.
- Failed dispatch of shutdown ack to main thread: This is basically comment 181. Just setting context->mAwaitingShutdownAck might not be enough in this situation if there are no other messages on the main thread, though?
Any other thoughts, Kris?
Comment 184•3 years ago
Hmmm, there is a very puzzling thing: while the XPCOMSpinEventLoopStack annotation suggests we are stuck inside a nested event loop on the main thread, the stack of those crashes does not contain any SpinEventLoopUntil call; they are just sitting in the normal main event loop. I am starting to think this is a red herring...
Comment 185•3 years ago
(In reply to Jens Stutte [:jstutte] from comment #184)
Hmmm, there is a very puzzling thing: While the XPCOMSpinEventLoopStack annotation suggests we are stuck inside a nested event loop on the main thread, the stack of those crashes does not contain any SpinEventLoopUntil call but they are just sitting in the normal main event loop. I start to think this is a red herring...

Just to be clear: in some cases, like bug 1740889, we see a real SpinEventLoopUntil on the stack, but in other cases, like bug 1740895, we see a misalignment between the main thread stack and the XPCOMSpinEventLoopStack annotation.
Kris, I do not think it is worth the time to look into the nsThreadShutdown: * instances for now; they seem to be a false alarm (at least in terms of hangs caused by SpinEventLoopUntil; the hangs themselves are real, of course). The instances I cracked open so far do not contain any sign of really being inside nsThread::Shutdown at all. So I'll try to understand the apparently bogus annotations first.
Comment 186•3 years ago
So in the vast majority of cases we are not stuck in any SpinEventLoopUntil in the child process, and we need to look out for other common patterns. :-(
There are some cases, though, where we find ourselves stuck in RequestHelper::StartAndReturnResponse, which might indicate some pitfall in the request/response flow of local storage.
The remaining cases with an event loop stack from the parent process are almost all happening during nsThread::Shutdown for some thread in the parent process.
Comment 187•3 years ago
Just to check: it would not be a bug if this crash was the result of a process in an infinite loop in Javascript being killed, correct?
Comment 188•3 years ago
(In reply to Justin Peter from comment #187)
Just to check: it would not be a bug if this crash was the result of a process in an infinite loop in Javascript being killed, correct?
This is not really a crash. It's just us killing the process because it's taking too long to shut down. The crash reports are snapshots of the process right before we killed it.
Comment 189•3 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #177)
Heads up for everybody tracking this bug: we've decided to revert the change that split the crash signatures for this type of reports. When the change will be applied in the coming weeks the signatures will again collapse into one. The reasoning behind this is that we only found a handful of useful signatures and far too many unactionable ones. The latter are making nightly crash triage harder while providing little value. We'll keep gathering these reports for the time being until we figure out a better way to make them actionable.
Indeed we now see only crashes for the basic signature, so it is time for a signature cleanup here.
Comment 190•3 years ago
So looking at the first reports from build 20220210213101 with the new annotations from bug 1754208, the interesting thing is that in most of them we do not see any:

Rank | IPC shutdown state | # | %
---|---|---|---
1 | SendFinishShutdown (sent) | 7 | 21.88 %

So there is a significant 78% of cases where we do not even reach RecvShutdown at all, it seems.
IIUC it all starts here in the parent process, and there are indeed (intentional) ways to call that function without sending the shutdown message to the child. It feels a bit odd that we apparently did not give the child any chance to shut down and then just kill it, or am I overlooking something?
Comment 191•3 years ago
Yes, I did a writeup some time ago but I can't find it anymore. In many cases the main process sends the IPC message to shut down the child, but the child is busy, so it doesn't see the parent's message right away. After a while the child gets killed, and it often hasn't seen the IPC message yet, more often than not because there were other pending messages that it had to process first. There's another factor compounding this: we reduce the priority of content processes for tabs that are not visible. So not only might the child process be busy, but the OS might be in no rush to run it, as it's been informed that it shouldn't prioritize it.
Comment 192•3 years ago
Hmm, could it then be that the SendFinishShutdown (sent) case is just the opposite (the child went all the way through its shutdown but the acknowledge message never arrives at the parent)?
It might be an option to raise a child's priority before sending the shutdown message; that could help in some of the cases you described above. But am I right in assuming that if the message queue is already well filled we get queued at the end? Then a higher process priority will not really help if we do not tweak our internal processing, IIUC?
Comment 193•3 years ago
Yes, that also seems to happen. I had filed bug 1619676 about exploring the priority changes but as someone pointed out in that bug it could have the opposite effect by slowing down the main process and thus ending up in the same place as we are now.
Comment 194•3 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #191)
...
There's another factor compounding this: we reduce the priority of content processes for tabs that are not visible. So not only the child process might be busy, but the OS might be in no rush to run it as it's been informed that it shouldn't prioritize it.
(In reply to Gabriele Svelto [:gsvelto] from comment #193)
Yes, that also seems to happen. I had filed bug 1619676 about exploring the priority changes but as someone pointed out in that bug it could have the opposite effect by slowing down the main process and thus ending up in the same place as we are now.
Maybe instead of reducing the priority of tabs that are not visible, and instead of raising the priority of other tabs, how about just leaving the priorities alone and seeing what happens? A less-is-more kind of thing. Maybe just slightly elevate what needs elevation, instead of dropping what you used to think could be reduced. Maybe do a build with no priority changes and see what pans out?
Comment 195•3 years ago
Just to confirm the observation of comment 190 with more numbers:
Rank | IPC shutdown state | # | % |
---|---|---|---|
1 | SendFinishShutdown (sent) | 292 | 17.70 |
2 | ShutdownInternal entry | 6 | 0.36 |
3 | content-child-shutdown started | 2 | 0.12 |
It confirms that we have only very few cases of a real hang while processing the shutdown sequence. It seems that in most of the cases either:
- the child process is too busy to even receive and start to process the shutdown request (81%), or
- the parent process is too busy to even receive and acknowledge the successful shutdown (17%).
I assume this can only be changed if we create a kind of "priority lane" for shutdown messages that bypasses the normal queue. I wonder if having an additional IPC channel only for shutdown messages could help? In particular, on the child process side there is probably not much reason to keep up the normal processing order until we eventually arrive at the shutdown event in our queue (which will always be kind of unexpected and random wrt our internal state, such that handling it out of order should not be worse, IIUC). But the ack messages could probably also be processed out of order on the parent side.
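A minimal, self-contained sketch of that "priority lane" idea; everything here (the Message type, the queue) is hypothetical and not the real IPC layer, it only illustrates draining shutdown messages before the normal backlog.

```cpp
// Hypothetical sketch: shutdown-related messages go into a separate lane that
// is always drained before the normal message queue, so a busy process still
// sees the shutdown request promptly.
#include <deque>
#include <optional>
#include <string>

struct Message {
  std::string name;
  bool isShutdown = false;
};

class MessageQueue {
 public:
  void Enqueue(Message m) {
    if (m.isShutdown) {
      mShutdownLane.push_back(std::move(m));  // bypasses the normal backlog
    } else {
      mNormal.push_back(std::move(m));
    }
  }

  std::optional<Message> Next() {
    if (!mShutdownLane.empty()) {
      Message m = std::move(mShutdownLane.front());
      mShutdownLane.pop_front();
      return m;
    }
    if (!mNormal.empty()) {
      Message m = std::move(mNormal.front());
      mNormal.pop_front();
      return m;
    }
    return std::nullopt;
  }

 private:
  std::deque<Message> mShutdownLane;
  std::deque<Message> mNormal;
};

int main() {
  MessageQueue q;
  q.Enqueue({"ReflowPage", false});
  q.Enqueue({"Shutdown", true});  // jumps the backlog
  auto next = q.Next();           // -> "Shutdown" despite being enqueued last
  return next && next->isShutdown ? 0 : 1;
}
```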
Comment 196•3 years ago
You can add a Priority annotation to an IPC message to increase the priority. We use that, for instance, for things related to input.
Also, looking at the earlier discussion, we do already have some code to raise the priority of processes we're shutting down.
Comment 197•3 years ago
FWIW, from a short glance at 10 reports in a row, I see:
- 4 of them are inside ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst. Interestingly, all of these have IPCShutdownState: SendFinishShutdown (sent) set, which would suggest that we already finished shutdown, but the position in the stack seems to suggest we are not there yet?
- 3 of them seem to just idle on the main thread.
- 1 seems to be waiting for a GC-related mutex held by another thread.
- 1 seems to be busy destroying a docshell as a reaction to BrowserChild::RecvDestroy(); this one has no IPCShutdownState set at all and could be a case of a long-running task that blocks the main thread.
- 1 has a stack without symbols (even when opened in VS).
Probably the ChildProfilerController and the GC mutex cases merit a second look based on those stack traces.
Comment 198•3 years ago
Starting from March 2022 we see a slight downwards trend, it seems. Of those:
- About half of the crashes now show NotifyImpendingShutdown received. This means the content process was alerted but too busy (or hanging) to even process the ShutdownConfirmedHP sent with high priority.
- Almost a quarter of the crashes carry SendFinishShutdown (sent). This would indicate that the content process was able to finish its shutdown but the parent process never received or processed the FinishShutdown message.
- Almost another quarter do not have any ipc_shutdown annotation set, which is weird.
The remainder are some rare, sparse crashes with other ipc_shutdown annotations set.
Comment 199•3 years ago
(In reply to Jens Stutte [:jstutte] from comment #198)
Almost a quarter of the crashes carry SendFinishShutdown (sent). This would indicate that the content process was able to finish its shutdown but the parent process never received or processed the FinishShutdown message.

We could filter those out and not send reports for them: after all, the content processes did shut down correctly. WDYT, should I file a bug? It would be a simple fix.
Comment 200•3 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #199)
We could filter those out and not send reports for them: after all the content processes did shut down correctly. WDYT, should I file a bug? It would be a simple fix.
I do not think we should do this. It most probably means that the parent is not aware of that child's shutdown and continues to block its own shutdown process until the timeout?
I think we should do bug 1755376 comment 17 to give shutdown notifications the highest possible priority in both directions.
Comment 201•3 years ago
OK, I trust your assessment, thanks!
Comment 202•2 years ago
It seems that bug 1777198 did not move the needle much here. Let's wait for the improved annotations for a better understanding, though.
Comment 203•2 years ago
But: roughly 25% of the crashes do not seem to have any dump file (and thus stack) associated with them. Could it be that those processes actually ended but the parent did not notice? For example: the upload_file_minidump is empty while the upload_file_minidump_browser contains data.
[Inlineframe] xul.dll!google_breakpad::ExceptionHandler::WriteMinidump() Zeile 805 C++
xul.dll!google_breakpad::ExceptionHandler::WriteMinidump(const std::wstring & dump_path, bool(*)(const wchar_t *, const wchar_t *, void *, _EXCEPTION_POINTERS *, MDRawAssertionInfo *, const mozilla::phc::AddrInfo *, bool) callback, void * callback_context, _MINIDUMP_TYPE dump_type) Zeile 831 C++
xul.dll!CrashReporter::CreateMinidumpsAndPair(void * aTargetHandle, unsigned long aTargetBlamedThread, const nsTSubstring<char> & aIncomingPairName, mozilla::EnumeratedArray<CrashReporter::Annotation,CrashReporter::Annotation::Count,nsTString<char>> & aTargetAnnotations, nsIFile * * aMainDumpOut) Zeile 3794 C++
xul.dll!mozilla::ipc::CrashReporterHost::GenerateMinidumpAndPair<mozilla::dom::ContentParent>(mozilla::dom::ContentParent * aToplevelProtocol, const nsTSubstring<char> & aPairName) Zeile 70 C++
> xul.dll!mozilla::dom::ContentParent::GeneratePairedMinidump(const char * aReason) Zeile 4291 C++
xul.dll!mozilla::dom::ContentParent::KillHard(const char * aReason) Zeile 4326 C++
[Inlineframe] xul.dll!nsCOMPtr<nsITimer>::get() Zeile 851 C++
It seems that ContentParent::KillHard does not check if we actually killed something?
Gabriele, can you confirm that the weird crashes we see could be caused by that? Is there something we should check inside ContentParent::GeneratePairedMinidump ?
Comment 204•2 years ago
(In reply to Jens Stutte [:jstutte] from comment #203)
It seems that ContentParent::KillHard does not check if we actually killed something?
It does check what happened on both Linux/macOS and Windows. In both cases, if the process already died, KillProcess() returns false.
Gabriele, can you confirm that the weird crashes we see could be caused by that? Is there something we should check inside ContentParent::GeneratePairedMinidump ?
Minidump generation can fail in weird ways, including empty or truncated minidumps and ATM we don't have good checking for that outside of Linux. As I rewrite this stuff in the coming months we can add diagnostic information to work around this issue, for example by providing rich errors instead of malformed minidumps like we do on Linux using the new oxidized machinery.
Comment 205•2 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #204)
(In reply to Jens Stutte [:jstutte] from comment #203)
It seems that ContentParent::KillHard does not check if we actually killed something?
It does check what happened on both Linux/macOS and Windows. In both cases, if the process already died, KillProcess() returns false.

Hmm, at least in the linked code snippet I just see an NS_WARNING but no other consequences? I assume that means that in either case we transmit the collected crash telemetry?
Gabriele, can you confirm that the weird crashes we see could be caused by that? Is there something we should check inside ContentParent::GeneratePairedMinidump ?
Minidump generation can fail in weird ways, including empty or truncated minidumps and ATM we don't have good checking for that outside of Linux. As I rewrite this stuff in the coming months we can add diagnostic information to work around this issue, for example by providing rich errors instead of malformed minidumps like we do on Linux using the new oxidized machinery.
Thanks for working on this, that sounds very promising!
Comment 206•2 years ago
(In reply to Jens Stutte [:jstutte] from comment #205)
Hmm, at least in the linked code snippet I just see an NS_WARNING but no other consequences? I assume that means that in either case we transmit the collected crash telemetry?

Ah yes, good point: we can still upload the minidump if the process was killed in between the two, or right when the minidump was being written. I'll file a bug to discard those instead of submitting them.
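A hedged sketch of that "discard instead of submit" idea: check that the child minidump actually has content before queueing the report. This is not the real crash-reporter code; the validation (non-empty file starting with the "MDMP" magic) is an assumption about what such a check could look like.

```cpp
// Hypothetical pre-submission check: an empty dump (or one without the
// minidump magic) likely means the child died before the snapshot was taken,
// so the report carries no useful stack and could be discarded.
#include <filesystem>
#include <fstream>
#include <string>

bool LooksLikeValidMinidump(const std::filesystem::path& dump) {
  std::error_code ec;
  if (std::filesystem::file_size(dump, ec) == 0 || ec) {
    return false;  // empty or unreadable dump
  }
  std::ifstream in(dump, std::ios::binary);
  char magic[4] = {};
  in.read(magic, sizeof(magic));
  return in.gcount() == 4 && std::string(magic, 4) == "MDMP";
}

int main(int argc, char** argv) {
  if (argc < 2) return 2;
  return LooksLikeValidMinidump(argv[1]) ? 0 : 1;  // non-zero: discard, don't submit
}
```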
Comment 208•2 years ago
1 6473 46.44 % - NotifiedImpendingShutdown - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected)
2 2043 14.66 % - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected)
3 1627 11.67 % - NotifiedImpendingShutdown - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - content-child-shutdown started - StartForceKillTimer - SendFinishShutdown (sending) - SendFinishShutdown (sent)
4 486 3.49 % - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - content-child-shutdown started - StartForceKillTimer - SendFinishShutdown (sending) - SendFinishShutdown (sent)
Looking at IPC shutdown state I see 4 main cases:
- Tab is closing & Child gives no sign of life after receiving RequestContentJSInterrupt
- Parent shuts down & Child gives no sign of life after receiving RequestContentJSInterrupt
- Tab is closing & Child notified the parent about its shutdown but the parent did not process the notification
- Parent shuts down & Child notified the parent about its shutdown but the parent did not process the notification
Annotations interpretation:
Tab is closing: two NotifiedImpendingShutdown entries; the first is from ContentParent::NotifyTabWillDestroy, the second from ContentParent::SignalImpendingShutdownToContentJS.
Parent shuts down: the only NotifiedImpendingShutdown is from ContentParent::SignalImpendingShutdownToContentJS, coming from ContentParent::BlockShutdown.
Parent did not process the shutdown notification in time: SendFinishShutdown (sent) is present.
Comment 211•2 years ago
It might be early to tell, but it seems that the patches from bug 1837467 had a pretty positive impact on the numbers here.
In summary, that patch has two consequences:
- We ensure that we start the ForceKillTimer only after we have actually told the child process to shut down, preventing slow (unrelated) processing in the parent from hitting us.
- We indirectly increased the time the content process has to do shutdown-related things by adding the second timer for the Browser actor destroy cycle.
It is unclear if just increasing the timeout would have had a similar effect (although I'd expect the slow-parent case to be improved only statistically this way, not systematically). It might be worth experimenting with different timer settings (like splitting the overall time over the two timers with some ratio) to get closer to the former timeouts again.
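A tiny illustration of the experiment suggested above, splitting a fixed overall shutdown budget between the two timers; the 20-second total and the 0.25 share are made-up example values, not actual prefs.

```cpp
// Hypothetical helper: split one overall shutdown budget between the
// Browser-actor-destroy timer and the force-kill timer with a given ratio.
#include <chrono>
#include <cstdio>
#include <utility>

std::pair<std::chrono::milliseconds, std::chrono::milliseconds>
SplitShutdownBudget(std::chrono::milliseconds total, double browserDestroyShare) {
  auto first = std::chrono::milliseconds(
      static_cast<long long>(total.count() * browserDestroyShare));
  return {first, total - first};  // {Browser actor destroy timer, force-kill timer}
}

int main() {
  // Example: a 20 s budget split 25% / 75% gives 5 s + 15 s.
  auto [destroyTimer, forceKill] = SplitShutdownBudget(std::chrono::seconds(20), 0.25);
  std::printf("destroy: %lld ms, force-kill: %lld ms\n",
              static_cast<long long>(destroyTimer.count()),
              static_cast<long long>(forceKill.count()));
}
```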
Comment 214•3 months ago
Hello y'all!
Updating flags per what's visible over at
Thanks!
Alex