Crash in [@ mozilla::dom::JSWindowActor::ReceiveRawMessage]
Categories
(Core :: DOM: Navigation, defect, P3)
Tracking
Release | Tracking | Status |
---|---|---|
firefox-esr60 | --- | unaffected |
firefox-esr68 | --- | unaffected |
firefox-esr78 | --- | wontfix |
firefox67 | --- | unaffected |
firefox67.0.1 | --- | unaffected |
firefox68 | --- | unaffected |
firefox69 | --- | disabled |
firefox70 | --- | disabled |
firefox71 | --- | disabled |
firefox72 | --- | wontfix |
firefox73 | --- | wontfix |
firefox74 | --- | wontfix |
firefox75 | --- | wontfix |
firefox76 | --- | wontfix |
firefox77 | --- | wontfix |
firefox78 | --- | wontfix |
firefox79 | --- | wontfix |
firefox80 | --- | wontfix |
firefox84 | --- | wontfix |
firefox85 | --- | wontfix |
firefox86 | --- | wontfix |
firefox87 | --- | wontfix |
firefox88 | --- | wontfix |
People
(Reporter: calixte, Unassigned)
References
(Depends on 2 open bugs, Blocks 1 open bug, Regression)
Details
(Keywords: crash, regression, Whiteboard: [not-a-fission-bug])
Crash Data
Attachments
(2 files)
This bug is for crash report bp-1ec19da7-8bfc-4eea-9b7e-c7a030190705.
Top 10 frames of crashing thread:
0 libxul.so mozilla::dom::JSWindowActor::ReceiveRawMessage dom/ipc/JSWindowActor.cpp:151
1 libxul.so mozilla::dom::WindowGlobalChild::ReceiveRawMessage dom/ipc/WindowGlobalChild.cpp:304
2 libxul.so mozilla::dom::WindowGlobalChild::RecvRawMessage dom/ipc/WindowGlobalChild.cpp:295
3 libxul.so mozilla::dom::PWindowGlobalChild::OnMessageReceived ipc/ipdl/PWindowGlobalChild.cpp:435
4 libxul.so mozilla::dom::PContentChild::OnMessageReceived ipc/ipdl/PContentChild.cpp:7197
5 libxul.so mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2158
6 libxul.so mozilla::ipc::MessageChannel::RunMessage ipc/glue/MessageChannel.cpp:1939
7 libxul.so mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:295
8 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1225
9 libxul.so <name omitted> xpcom/threads/nsThreadUtils.cpp:486
There is 1 crash in Nightly 69 with build ID 20190705064618. Based on the backtrace, the regression may have been introduced by patch [1] for bug 1541557.
[1] https://hg.mozilla.org/mozilla-central/rev?node=6680278c231b
Comment 1•6 years ago
Neha, this is affecting Beta69 in the wild too. Any chance we can re-prioritize investigation?
Comment 2•6 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #1)
> Neha, this is affecting Beta69 in the wild too. Any chance we can re-prioritize investigation?
This is a diagnostic assert, so it won't impact release or beta. The crashes are probably on DevEdition (or wherever we actually run MOZ_DIAGNOSTIC_ASSERTs).
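For readers unfamiliar with diagnostic asserts, here is a minimal sketch of the behavior, loosely modeled on MOZ_DIAGNOSTIC_ASSERT from mfbt/Assertions.h (the real macro also records the assertion text and source location for crash annotations):

```cpp
// Simplified sketch, not the exact Gecko definition: diagnostic asserts
// crash only in builds where MOZ_DIAGNOSTIC_ASSERT_ENABLED is defined
// (Nightly and DevEdition); elsewhere they compile to nothing, which is
// why this signature never appears on Beta or Release.
#ifdef MOZ_DIAGNOSTIC_ASSERT_ENABLED
#  define MOZ_DIAGNOSTIC_ASSERT(expr) \
     do {                             \
       if (!(expr)) {                 \
         MOZ_CRASH(#expr);            \
       }                              \
     } while (false)
#else
#  define MOZ_DIAGNOSTIC_ASSERT(expr) \
     do {                             \
     } while (false)
#endif
```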
Comment 3•6 years ago
We'll add more logging in 70 for this assert. But this shouldn't block 69.
Comment 4•6 years ago
Ah indeed, the 69 reports are all from DevEdition. Thanks!
Comment 5•5 years ago
John, could you look into adding more logging to debug this further?
Comment 6•5 years ago
(In reply to Neha Kochar [:neha] from comment #5)
> John, could you look into adding more logging to debug this further?
Sure. I'll take a look.
Comment 7•5 years ago
Roll some unfixed bugs from Fission Milestone M4 to M5
Comment 8•5 years ago
John, do you have any updates on this crash? We're still seeing about 10-20 crash reports per day.
kmag thinks we might be trying to send an invalid BrowsingContext. We should add more logging to help diagnose.
Curiously, 99.9% of the reports for this crash signature for the last six months (1419 out of 1420) are x86-64, compared to 83% x86-64 for all other Firefox Nightly crashes.
Comment 9•5 years ago
Hi Chris,
I am going to add more logging to help diagnose. Thank you.
Comment 10•5 years ago
We have shipped our last beta for 71, but since the crash volume is low to medium, I am marking it fix-optional in case a safe uplift is possible as a ride-along in a dot release.
Comment 11•5 years ago
This only crashes on Nightly and DevEdition; diagnostic asserts are disabled on beta and release, so updating the status flags.
Comment 12•5 years ago
(In reply to Chris Peterson [:cpeterson] from comment #8)
> John, do you have any updates on this crash? We're still seeing about 10-20 crash reports per day.
> kmag thinks we might be trying to send an invalid BrowsingContext. We should add more logging to help diagnose.
> Curiously, 99.9% of the reports for this crash signature for the last six months (1419 out of 1420) are x86-64, compared to 83% x86-64 for all other Firefox Nightly crashes.
Bug 1580176 covers adding MOZ_LOG logging for JSWindowActor; we can use it to track all of the JSWindowActor logs. After bug 1580176 is fixed, I can help diagnose this crash.
Comment 13•5 years ago
Nika and kmag say this crash is likely caused by sending a discarded BrowsingContext. Deferring to Fission Nightly (M6) because this crash is low volume.
Some new bugs to help diagnose IPC message crashes like this:
- bug 1623981 to replace MOZ_DIAGNOSTIC_ASSERT with a MozCrashPrintf that reports the name of the crashing message
- bug 1623989 to add a MaybeDiscarded-like wrapper for sending BrowsingContext from JS
Unlinking JSWindowActor logging bug 1580176 because Nika says it won't help diagnose this crash.
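As a rough sketch of the bug 1623981 idea (the helper and its signature are illustrative, not the actual patch): crash via a formatted message that names the actor and message, so crash reports can be bucketed per message. MOZ_CRASH_UNSAFE_PRINTF is the real MFBT macro for formatted crash reasons.

```cpp
#include "mozilla/Assertions.h"  // MOZ_CRASH_UNSAFE_PRINTF
#include "nsString.h"            // nsACString, PromiseFlatCString

// Illustrative only: report which actor/message failed instead of a bare
// MOZ_DIAGNOSTIC_ASSERT, so Socorro can group the crashes by message name.
static void CrashWithMessageInfo(const nsACString& aActorName,
                                 const nsACString& aMessageName) {
  // The format string must be a literal, and the arguments must not
  // contain sensitive user data (hence "UNSAFE" in the macro name).
  MOZ_CRASH_UNSAFE_PRINTF("ReceiveRawMessage: non-decodable data for %s:%s",
                          PromiseFlatCString(aActorName).get(),
                          PromiseFlatCString(aMessageName).get());
}
```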
Comment 15•5 years ago
Hi Calixte,
Is the crash still happening? Do you have a recent crash report for me to investigate? Thank you.
Comment 16•5 years ago
It does; there's a table and links in the "Crash Data" section of the bug page. See e.g. bp-13ce9caa-8ee9-4676-bdd8-9b95c0200616.
Comment 17•5 years ago
I looked through a bunch of these reports and wrote down the actor and message name. The bulk of them were in Conduits, and most of the Conduits messages were RuntimeMessage and RunListener.
Conduits messages:
- RunListener (12 times)
- RuntimeMessage (10 times)
- PortConnect (twice)
- CallResult (once)
Lots of them were in a preallocated process. I don't know if that's meaningful.
Here are some other actors that showed up in these crashes (and their message):
- UnselectedTabHover (Browser:UnselectedTabHover, three times)
- BrowserTab (Browser:Reload, three times)
- BrowserElement (PermitUnload)
- BrowserTab (Browser:AppTab)
- AutoComplete (FormAutoComplete:HandleEnter)
Here's an example of a crash with Conduits and RunListener: https://crash-stats.mozilla.org/report/index/a898751d-4291-4b7e-994d-09bf10200616
Comment 18•5 years ago
Tom, could you look into this further, using :mccr8's info above?
Comment 19•5 years ago
I couldn't find any instance where we send a BrowsingContext in the extension framework.
Other than that, I don't know of anything that we might be sending that would cause crashes. We do send extension-provided data, but we have it serialized into StructuredCloneHolders, so other than the size of those messages, anything that would cause issues on deserialization would presumably throw when we do the serialization in the first place.
I don't have any other leads here, except maybe to prioritize bug 1605098 so that we get a bit more info in crash reports that include message names.
Comment 20•5 years ago
According to the previous comments, we need more information to move this bug forward. There's no clear action we can take, so I'm unassigning John for now. Please feel free to reach out or re-assign.
Comment 23•5 years ago
I looked through a few of the reports here, and a large number of them seem to have the JSOutOfMemory annotation set to "Reported", which somewhat implies that these assertion failures are being caused by an OOM while deserializing the structured clone data.
It would be nice if an OOM while deserializing structured clone data here could produce a different error from other types of deserialization failures, so we can handle them differently.
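Something like the following split is what that would take; the type and names are hypothetical, not an actual Gecko API:

```cpp
// Hypothetical error taxonomy for structured clone deserialization.
// Script-visible failures must still surface as DataCloneError per spec,
// but internal callers could branch on a side channel like this.
enum class StructuredCloneReadFailure {
  OutOfMemory,  // JS heap OOM while materializing the decoded objects;
                // expected under memory pressure, not worth a crash report
  CorruptData,  // malformed or truncated clone buffer; a real bug that we
                // do want diagnostic crashes for
};
```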
Comment 24•4 years ago
Neha asked me to look at this.
Nika's point about JSOutOfMemory being set is a good one, and does seem like a plausible explanation. One line of investigation here would be to get a better error annotation. The structured clone code seems to generate more detailed error information in ReportDataCloneError, but right now we're not recording that in the annotation. Maybe we could even explicitly check if there's a pending OOM exception? I'm not sure where that gets cleaned up. The goal of this line of investigation would be to confirm that these are OOM crashes, or if they aren't, try to figure out what they are. I saw one report that had only 20MB of physical memory free, but I saw another one that seemed to have plenty of memory and pagefile (the latter being common for low memory situations). It would also be nice to know how large the allocation was that failed, but structured clone code lives in JS, and it is hard to find that out in SpiderMonkey, in my experience.
Another line of investigation would be to split up these crashes by actor type. The crash reports already contain JSActorName and JSActorMessage fields, which is good, but as far as I can see these are not indexed in a way that lets you do any aggregation. Maybe we could get the Socorro people to index these fields so a query could facet on these fields. If that proves to be useful, maybe the signature could reflect the actor name and the message. If this is really an OOM crash, maybe some actor is sending way too much data, so it would make more sense to ascribe the crash to the specific actor and not the general window actor infrastructure.
I looked at the first 20 reports, and the actor and message were as follows:
7 BrowserTab, Browser:Reload
5 Conduits, PortMessage
3 AboutReader, Reader:PushState
3 Conduits, RuntimeMessage
1 Conduits, CallResult
1 ExtensionContent, Execute
A third line of investigation would be to look over the actor messages that show up a lot (in this or another sample of crashes) and figure out if there's something we could do to make them smaller.
Comment 25•4 years ago
I guess I looked at the messages before. Browser:Reload is kind of interesting to see because as far as I can see it only sends an integer and a boolean, so maybe the contents of the message aren't to blame for whatever is happening, at least if it is an OOM.
Comment 26•4 years ago
kmag will add payload size so we get more info from these crash reports.
Comment 27•4 years ago
Visiting https://v8.github.io/test262/website/default.html#, clicking Run, clicking Run All, and allowing the tests to run will trigger this crash semi-reliably:
==2160546==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f01e8f7bd5a bp 0x7ffcfa6983f0 sp 0x7ffcfa697fa0 T0)
==2160546==The signal is caused by a WRITE memory access.
==2160546==Hint: address points to the zero page.
#0 0x7f01e8f7bd5a in mozilla::dom::JSActorManager::ReceiveRawMessage(mozilla::dom::JSActorMessageMeta const&, mozilla::dom::ipc::StructuredCloneData&&, mozilla::dom::ipc::StructuredCloneData&&) /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSActorManager.cpp:161:5
#1 0x7f01e8f5693c in mozilla::dom::WindowGlobalChild::RecvRawMessage(mozilla::dom::JSActorMessageMeta const&, mozilla::dom::ClonedMessageData const&, mozilla::dom::ClonedMessageData const&) /builds/worker/checkouts/gecko/dom/ipc/WindowGlobalChild.cpp:561:3
#2 0x7f01e298397f in mozilla::dom::PWindowGlobalChild::OnMessageReceived(IPC::Message const&) /builds/worker/workspace/obj-build/ipc/ipdl/PWindowGlobalChild.cpp:1175:61
#3 0x7f01e22a08ae in mozilla::dom::PContentChild::OnMessageReceived(IPC::Message const&) /builds/worker/workspace/obj-build/ipc/ipdl/PContentChild.cpp:8621:32
#4 0x7f01e20a24e8 in mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&) /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:2150:25
#5 0x7f01e209eba2 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:2074:9
#6 0x7f01e20a084a in mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1922:3
#7 0x7f01e20a0e8d in mozilla::ipc::MessageChannel::MessageTask::Run() /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1953:13
#8 0x7f01e0dd4757 in mozilla::RunnableTask::Run() /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:245:16
#9 0x7f01e0dd0404 in mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:515:26
#10 0x7f01e0dcde65 in mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:374:15
#11 0x7f01e0dce3f7 in mozilla::TaskController::ProcessPendingMTTask(bool) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:171:36
#12 0x7f01e0dd9a71 in operator() /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:85:37
#13 0x7f01e0dd9a71 in mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::$_3>::Run() /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:577:5
#14 0x7f01e0df85c2 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1197:14
#15 0x7f01e0e025f1 in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:513:10
#16 0x7f01e20a9e47 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:87:21
#17 0x7f01e1fc2312 in RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:334:10
#18 0x7f01e1fc2312 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:327:3
#19 0x7f01e1fc2312 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:309:3
#20 0x7f01e987313a in nsBaseAppShell::Run() /builds/worker/checkouts/gecko/widget/nsBaseAppShell.cpp:137:27
#21 0x7f01ed83844f in XRE_RunAppShell() /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:913:20
#22 0x7f01e1fc2312 in RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:334:10
#23 0x7f01e1fc2312 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:327:3
#24 0x7f01e1fc2312 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:309:3
#25 0x7f01ed837ce1 in XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:744:34
#26 0x55ebc515dac8 in content_process_main /builds/worker/checkouts/gecko/browser/app/../../ipc/contentproc/plugin-container.cpp:56:28
#27 0x55ebc515dac8 in main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:304:18
#28 0x7f01f7d03041 in __libc_start_main (/lib64/libc.so.6+0x27041)
#29 0x55ebc50b09a8 in _start (/home/geeknik/firefox/firefox-bin+0xb69a8)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSActorManager.cpp:161:5 in mozilla::dom::JSActorManager::ReceiveRawMessage(mozilla::dom::JSActorMessageMeta const&, mozilla::dom::ipc::StructuredCloneData&&, mozilla::dom::ipc::StructuredCloneData&&)
==2160546==ABORTING
Comment 28•4 years ago
This also appears in the console if helpful:
Assertion failure: false (Should not receive non-decodable data), at /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSActorManager.cpp:161
It appears to be the result-get-matched-err test; it triggers the slow script warning, and the tab will crash if you don't do anything about the warning. Even if you do click Stop It, the tab will likely crash. Whilst the browser "hangs", memory use eventually reaches 100%, at which point the tab crashes. Last image seen before the tab crashed: https://i.imgur.com/7e4dfhs.png
Comment 29•4 years ago
I wonder if bug 1660539 is related to this.
Comment 30•4 years ago
Could be, I haven’t been able to reproduce this since bug 1660539 was fixed.
Comment 31•4 years ago
I just experienced this with Fission on today's Nightly on Linux, on a Google Doc: bp-011a96d9-0075-4352-ae51-d21fb0201201
Comment 32•4 years ago
Sylvestre, were you running out of memory?
In the (non-public) details tab of your crash report, I see
- JSActorMessage Execute
- JSActorName ExtensionContent
- JSOutOfMemory Reported
The "Execute" message is certainly expected to be "decodable", it is a JSON-serializable object.
Comment 33•4 years ago
Unlikely; here's the system on which I had the issue:
% cat /proc/meminfo
MemTotal: 65723420 kB
MemFree: 20876544 kB
MemAvailable: 51710984 kB
Buffers: 1378132 kB
Cached: 30175740 kB
SwapCached: 45928 kB
Active: 28126880 kB
Comment 34•4 years ago
Hello,
A process in my Firefox session just crashed, and via about:crashes I was redirected to this bug. So here are STR; I hope this may help:
- On Windows 64-bit, launch the latest Nightly (in this case, 2020-12-20).
- Launch a live video on YouTube.
- Watch the live stream for an hour or an hour and a half.
Obtained result:
During the live stream, memory usage rose to the maximum available limit (on my PC, around 10 GB out of 12 GB of RAM).
Expected result:
Memory use remains at a nominal level.
This is not the first time I have had this crash on a live video on YouTube, but it never occurs on a "classic" (i.e., not live) video.
Comment 35•4 years ago
I don't see evidence of frequent OOM conditions in the crash reports save for a few (~5% of them). Can you point us to your particular report? If Firefox ran out of memory we might find a memory report attached which could help us diagnose the issue.
Comment 36•4 years ago
Hello Gabriele,
I have been able to reproduce the conditions just before the crash (Firefox uses a lot of memory while a live video on YouTube is played). I have obtained a memory report, so here it is.
Comment 37•4 years ago
Comment 38•4 years ago
Thanks, this is extremely useful. The process playing YouTube isn't particularly large, but the extension process is huge: it's taking almost 1.5 GiB of memory on its own. Looking at the various bits under that, it seems that VideoDownloadHelper has allocated and never freed hundreds of megabytes of strings. So it seems that it's leaking memory somehow; can you try disabling the extension and seeing if the problem goes away? I'll inspect the other crashes to see if they're also using the same extension.
[edit] I misinterpreted the memory report: they're not 2-byte strings, they're TwoByte strings, i.e. non-Latin Unicode strings.
Comment 39•4 years ago
I inspected a few more memory reports in the crashes and I don't see a pattern unfortunately. The report attached as part of comment 37 is definitely a leak, so this crash might be also triggered by OOM-like conditions, or they might make it more likely. I poked a few more crashes for URLs and comments and it seems that pages with videos (Facebook feeds, YouTube and other streaming services) are more common than others, but there's no clear pattern.
Is the value of the data that has been received by JSActorManager::ReceiveRawMessage() important? If it is, I can crack open a few minidumps and see if I can extract some useful samples.
Comment 40•4 years ago
We have bug 1686267 on file regarding VideoDownloadHelper memory spiraling out of control, so you could look at that.
(In reply to Gabriele Svelto [:gsvelto] from comment #39)
> Is the value of the data that has been received by JSActorManager::ReceiveRawMessage() important? If it is, I can crack open a few minidumps and see if I can extract some useful samples.
I don't know if the data per se is important, but we are interested in the size of the message being received, or the size of the structured clone data.
Comment 41•4 years ago
I think we should reconsider tracking bug 1563825 as part of Fission m6c. The crash is showing up in Developer edition, where Fission can't be enabled, and only about 10% of crashes with this signature on Nightly have Fission enabled over the last month. If you look across all crashes on Nightly in the last month, 26% of them have Fission enabled. This might be skewed a bit due to the recent high frequency crash that affected Fission more, but it still suggests this isn't a Fission-specific problem, but rather a problem with infrastructure introduced to support Fission. As such, it feels like it shouldn't block Fission rollout.
Comment 42•4 years ago
(In reply to Andrew McCreight [:mccr8] from comment #41)
> I think we should reconsider tracking bug 1563825 as part of Fission m6c. The crash is showing up in Developer edition, where Fission can't be enabled, and only about 10% of crashes with this signature on Nightly have Fission enabled over the last month. If you look across all crashes on Nightly in the last month, 26% of them have Fission enabled. This might be skewed a bit due to the recent high frequency crash that affected Fission more, but it still suggests this isn't a Fission-specific problem, but rather a problem with infrastructure introduced to support Fission. As such, it feels like it shouldn't block Fission rollout.
Clearing Fission Milestone because Nika says this is not a Fission-specific bug.
26% of crash reports had Fission enabled in the last month, but that's expected for almost any crash because Fission is enabled for about 20-25% of Nightly users.
Comment 44•4 years ago
Too late for 86 RC, but I am keeping the Firefox status as fix-optional, as I would probably take a safe patch in a potential dot release.
Comment 45•4 years ago
I opened up a couple of minidumps but had a hard time figuring out where to find the size of the data being cloned. What field should I be looking at? In one minidump, aData.mStorage.val.mExternalData.bufList_ has a mSize field of ~8KiB; in the other, 1.5KiB. In both cases, drilling down into the error object, I find a mErrorNumber field set to MSG_INVALID_ENUM_VALUE. Is any of this useful?
Comment 48•4 years ago
I'm not going to have time to look at this soon.
Comment 49•3 years ago
These crashes are all fallout from JS OOMs. Even if we ignore the message decoding errors instead of crashing, the browser is probably going to crash elsewhere soon or we will miss an important message, leaving the content process in an inconsistent state.
Comment 50•3 years ago
(In reply to Chris Peterson [:cpeterson] from comment #49)
> These crashes are all fallout from JS OOMs. Even if we ignore the message decoding errors instead of crashing, the browser is probably going to crash elsewhere soon or we will miss an important message, leaving the content process in an inconsistent state.
I assume this makes it less severe.
FWIW, I see very few crashes like https://crash-stats.mozilla.org/report/index/fd18037f-32b8-4cd8-aaaa-926050220301 in release that seem to indicate that, while trying to do something with the mPendingQueries of an actor returned from GetActor, we have a nullptr access. And there seems to be a ::new on that stack, too.
Comment 51•3 years ago
Just hit this crash, in a mozilla.org process that was using >5.8GB of memory (per system sysinfo). It may have happened when I ran about:memory, which requires allocating memory to return a result, which may have triggered a JS OOM.
Comment 52•3 years ago
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #51)
> Just hit this crash, in a mozilla.org process that was using >5.8GB of memory (per system sysinfo). It may have happened when I ran about:memory, which requires allocating memory to return a result, which may have triggered a JS OOM.
Yup, this crash tends to occur due to a JS heap OOM while deserializing.
Comment 53•2 years ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
:janv, could you consider increasing the severity of this top-crash bug?
For more information, please visit auto_nag documentation.
Comment 54•2 years ago
I wonder if we should stop asserting when the failure is because of an OOM. I don't like the idea of continuing with a child process when it's failed to process a message, since that could mean its state could be out of sync with the parent in dangerous ways. But this isn't a release assert, so it isn't helping release users at all. And the number of crashes from OOMs means we don't actually see any reports that failed to deserialize the message for other reasons that we can actually fix...
Comment 55•2 years ago
(In reply to Kris Maglione [:kmag] from comment #54)
> I wonder if we should stop asserting when the failure is because of an OOM. I don't like the idea of continuing with a child process when it's failed to process a message, since that could mean its state could be out of sync with the parent in dangerous ways. But this isn't a release assert, so it isn't helping release users at all. And the number of crashes from OOMs means we don't actually see any reports that failed to deserialize the message for other reasons that we can actually fix...

If I read StructuredCloneHolder::ReadFromBuffer correctly, it seems we get a specific error message from JS but always throw a DataCloneError. IIUC, that makes it difficult to just check for OOM and exclude that case from the assertion, which might still be of interest in other cases?
Comment 57•2 years ago
(In reply to Jens Stutte [:jstutte] from comment #55)
> If I read StructuredCloneHolder::ReadFromBuffer correctly, it seems we get a specific error message from JS but always throw a DataCloneError. IIUC, that makes it difficult to just check for OOM and exclude that case from the assertion, which might still be of interest in other cases?

Yes. I looked into it after I made the suggestion and came to the conclusion that the simplest thing would be to just check whether the OOM-reported flag was set. Unfortunately, even the error we get from the JS engine is not very specific, and the situation isn't very easy to improve. The spec says that we need to throw a DataCloneError, but it would be nice if internal consumers could still get more specific error details when they want them.
Comment 61•2 years ago
Spike in crashes over the last 2 days aligns with the spike in bug 1405521.
Comment 66•2 years ago
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Comment 67•2 years ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
:aiunusov, could you consider increasing the severity of this top-crash bug?
For more information, please visit auto_nag documentation.
Comment 84•2 years ago
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Comment 86•2 years ago
Sorry for removing the keyword earlier, but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:
- Top 10 desktop browser crashes on nightly
- Top 10 content process crashes on beta
For more information, please visit auto_nag documentation.
Comment 88•2 years ago
Ideally, we don't want to continue running a child process if it fails to handle a message from the parent, since that could mean child and parent state could get out of sync. But since this assertion is only a diagnostic assert, it isn't guaranteeing that in release builds anyway. And since the vast majority of the crashes we are seeing in builds with diagnostic asserts enabled appear to be OOMs, we can't really use crash reports to diagnose other issues.

Ideally (again), we'd determine if the failure was caused by an OOM based on the failure code returned by the structured clone decode call. Unfortunately, though, since the spec requires that we return a generic DataCloneError on failure, the structured clone code intentionally hides the specifics of the failure from callers. Propagating out more specific failure reasons for use by privileged callers is nontrivial. So this patch essentially does the same thing as crash reports do, and checks whether an OOM was reported recently and hasn't been recovered from by a successful GC.
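In rough shape, the error-handling tail described above might look like the sketch below, where RecentJSOutOfMemoryReported() is a hypothetical stand-in for whatever the actual patch uses to read the "OOM reported and not yet cleared by a successful GC" state (the same state behind the JSOutOfMemory crash annotation):

```cpp
// Sketch only; `error` stands for the ErrorResult of the structured clone
// read, and RecentJSOutOfMemoryReported() is a hypothetical helper.
if (NS_WARN_IF(error.Failed())) {
  if (RecentJSOutOfMemoryReported()) {
    // The decode failed because the JS heap is out of memory. Crashing
    // here adds no diagnostic value, so drop the message; release builds
    // would carry on past this point anyway.
    return;
  }
  // Any other decode failure is unexpected and worth a crash report on
  // builds with diagnostic asserts enabled.
  MOZ_DIAGNOSTIC_ASSERT(false, "Should not receive non-decodable data");
  return;
}
```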
Comment 89•2 years ago
Comment 90•2 years ago
Comment 91•2 years ago
This is still happening on autoland: https://treeherder.mozilla.org/logviewer?job_id=413723724&repo=autoland
Comment 93•2 years ago
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit BugBot documentation.