Open Bug 1563825 Opened 5 years ago Updated 3 days ago

Crash in [@ mozilla::dom::JSWindowActor::ReceiveRawMessage]

Categories

(Core :: DOM: Navigation, defect, P3)

x86_64
All
defect

Tracking


REOPENED
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox-esr78 --- wontfix
firefox67 --- unaffected
firefox67.0.1 --- unaffected
firefox68 --- unaffected
firefox69 --- disabled
firefox70 --- disabled
firefox71 --- disabled
firefox72 --- wontfix
firefox73 --- wontfix
firefox74 --- wontfix
firefox75 --- wontfix
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- wontfix
firefox79 --- wontfix
firefox80 --- wontfix
firefox84 --- wontfix
firefox85 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix

People

(Reporter: calixte, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression, Whiteboard: [not-a-fission-bug])

Crash Data

Attachments

(2 files)

This bug is for crash report bp-1ec19da7-8bfc-4eea-9b7e-c7a030190705.

Top 10 frames of crashing thread:

0 libxul.so mozilla::dom::JSWindowActor::ReceiveRawMessage dom/ipc/JSWindowActor.cpp:151
1 libxul.so mozilla::dom::WindowGlobalChild::ReceiveRawMessage dom/ipc/WindowGlobalChild.cpp:304
2 libxul.so mozilla::dom::WindowGlobalChild::RecvRawMessage dom/ipc/WindowGlobalChild.cpp:295
3 libxul.so mozilla::dom::PWindowGlobalChild::OnMessageReceived ipc/ipdl/PWindowGlobalChild.cpp:435
4 libxul.so mozilla::dom::PContentChild::OnMessageReceived ipc/ipdl/PContentChild.cpp:7197
5 libxul.so mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2158
6 libxul.so mozilla::ipc::MessageChannel::RunMessage ipc/glue/MessageChannel.cpp:1939
7 libxul.so mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:295
8 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1225
9 libxul.so <name omitted> xpcom/threads/nsThreadUtils.cpp:486

There is 1 crash in nightly 69 with buildid 20190705064618. Based on the backtrace, the regression may have been introduced by patch [1] to fix bug 1541557.

[1] https://hg.mozilla.org/mozilla-central/rev?node=6680278c231b

Flags: needinfo?(kmaglione+bmo)
Fission Milestone: --- → M4
Priority: -- → P2
Component: Mochitest → DOM: Content Processes
Product: Testing → Core
Version: Version 3 → unspecified
Flags: needinfo?(kmaglione+bmo)

Neha, this is affecting Beta69 in the wild too. Any chance we can re-prioritize investigation?

Flags: needinfo?(nkochar)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #1)

Neha, this is affecting Beta69 in the wild too. Any chance we can re-prioritize investigation?

This is a diagnostic assert so it won't impact release or beta. The crashes are probably on devedition (or wherever we actually run MOZ_DIAGNOSTIC_ASSERTs)
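
For context, a minimal C++ sketch (an illustration, not the actual JSWindowActor code) of why this signature only shows up on channels that build with diagnostic asserts enabled:

    // Illustration only: MOZ_DIAGNOSTIC_ASSERT is fatal when
    // MOZ_DIAGNOSTIC_ASSERT_ENABLED is defined (Nightly/DevEdition builds)
    // and is compiled out of optimized Beta/Release builds, so users on
    // those channels never produce this crash signature.
    #include "mozilla/Assertions.h"

    void ReceiveRawMessageSketch(bool aDecodedOk) {
      // Same assertion message as in the crash reports.
      MOZ_DIAGNOSTIC_ASSERT(aDecodedOk, "Should not receive non-decodable data");
    }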

Flags: needinfo?(nkochar)

We'll add more logging in 70 for this assert. But this shouldn't block 69.

Ah indeed, the 69 reports are all from DevEdition. Thanks!

John, could you look into adding more logging to debug this further?

Flags: needinfo?(jdai)

(In reply to Neha Kochar [:neha] from comment #5)

John, could you look into adding more logging to debug this further?

Sure. I'll take a look.

Flags: needinfo?(jdai)
Assignee: nobody → jdai
Status: NEW → ASSIGNED

Roll some unfixed bugs from Fission Milestone M4 to M5

0ee3c76a-bc79-4eb2-8d12-05dc0b68e732

Fission Milestone: M4 → M5

John, do you have any updates on this crash? We're still seeing about 10-20 crash reports per day.

kmag thinks we might be trying to send an invalid BrowsingContext. We should add more logging to help diagnose.

Curiously, 99.9% of the reports for this crash signature over the last six months (1419 out of 1420) are x86-64, compared to 83% x86-64 for all other Firefox Nightly crashes.

Flags: needinfo?(jdai)
OS: Linux → All
Hardware: Unspecified → x86_64

Hi Chris,
I am going to add more logging to help diagnose. Thank you.

Flags: needinfo?(jdai)

We have shipped our last beta for 71, but the crash volume is low to medium, so I am marking this as fix-optional in case a safe uplift is possible in a dot release as a ride-along.

This only crashes on Nightly and DevEdition; diagnostic asserts are disabled on Beta and Release, so updating status.

(In reply to Chris Peterson [:cpeterson] from comment #8)

John, do you have any updates on this crash? We're still seeing about 10-20 crash reports per day.

kmag thinks we might be trying to send an invalid BrowsingContext. We should add more logging to help diagnose.

Curiously, 99.9% of the reports for this crash signature over the last six months (1419 out of 1420) are x86-64, compared to 83% x86-64 for all other Firefox Nightly crashes.

Bug 1580176 is for adding MOZ_LOG support to JSWindowActor; we can use bug 1580176 to track all of the JSWindowActor logging work. Once bug 1580176 is fixed, I can help diagnose this crash.

Depends on: 1623981
Depends on: 1623989

Nika and kmag say this crash is likely caused by sending a discarded BrowsingContext. Deferring to Fission Nightly (M6) because this crash is low volume.

Some new bugs to help diagnose IPC message crashes like this:

  • bug 1623981 to replace MOZ_DIAGNOSTIC_ASSERT with a MozCrashPrintf that reports the name of the crashing message
  • bug 1623989 to add a MaybeDiscarded-like wrapper for sending BrowsingContext from JS

Unlinking JSWindowActor logging bug 1580176 because Nika says it won't help diagnose this crash.
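
A rough sketch of the idea behind bug 1623981 in the list above (the function and wiring here are hypothetical, not the landed patch): replace the bare assert with a crash whose reason string names the failing actor and message, so the crash reports become actionable.

    // Hypothetical illustration of the bug 1623981 idea, not actual Gecko code.
    #include "mozilla/Assertions.h"
    #include "nsString.h"

    void CrashOnUndecodableMessage(const nsCString& aActorName,
                                   const nsCString& aMessageName) {
    #ifdef MOZ_DIAGNOSTIC_ASSERT_ENABLED
      // The formatted string becomes the crash reason, so the report shows
      // which actor/message failed to decode instead of a generic assertion.
      MOZ_CRASH_UNSAFE_PRINTF("Should not receive non-decodable data (%s::%s)",
                              aActorName.get(), aMessageName.get());
    #endif
    }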

Fission Milestone: M5 → M6b
No longer depends on: 1580176
Crash Signature: [@ mozilla::dom::JSWindowActor::ReceiveRawMessage] → [@ mozilla::dom::JSWindowActor::ReceiveRawMessage] [@ mozilla::dom::JSActor::ReceiveRawMessage]

Hi Calixte,
Is the crash still happening? Do you have a recent crash report for me to investigate? Thank you.

Flags: needinfo?(cdenizet)

It does; there's a table and links in the "crash data" section of the bug page. See e.g. bp-13ce9caa-8ee9-4676-bdd8-9b95c0200616

Flags: needinfo?(cdenizet)

I looked through a bunch of these reports and wrote down the actor and message name. The bulk of them were in Conduits, and most of the Conduits messages were RuntimeMessage and RunListener.

Conduits messages:

  • RunListener (12 times)
  • RuntimeMessage (10 times)
  • PortConnect (twice)
  • CallResult (once)

Lots of them were in a preallocated process. I don't know if that's meaningful.

Here are some other actors that showed up in these crashes (and their message):

  • UnselectedTabHover (Browser:UnselectedTabHover, three times)
  • BrowserTab (Browser:Reload, three times)
  • BrowserElement (PermitUnload)
  • BrowserTab (Browser:AppTab)
  • AutoComplete (FormAutoComplete:HandleEnter)

Here's an example of a crash with Conduits and RunListener: https://crash-stats.mozilla.org/report/index/a898751d-4291-4b7e-994d-09bf10200616

Tom, could you look into this further, using :mccr8's info above?

Flags: needinfo?(tomica)

I couldn't find any instance where we send a BrowsingContext in the extension framework.

Other than that, I don't know of anything that we might be sending that would cause crashes. We do send extension-provided data, but we have it serialized into StructuredCloneHolders, so other than the size of those messages, anything that would cause issues on deserialization would presumably throw when we do the serialization in the first place.

I don't have any other leads here, except maybe to prioritize bug 1605098 so that we get a bit more info in crash reports that include message names.

Flags: needinfo?(tomica)
Depends on: 1605098

According to the previous comments, we need more information to move this bug forward. There's no clear action we can take, so I'm unassigning John for now. Please feel free to reach out or re-assign.

Assignee: jdai → nobody
Status: ASSIGNED → NEW
Crash Signature: [@ mozilla::dom::JSWindowActor::ReceiveRawMessage] [@ mozilla::dom::JSActor::ReceiveRawMessage] → [@ mozilla::dom::JSWindowActor::ReceiveRawMessage] [@ mozilla::dom::JSActor::ReceiveRawMessage] [@ mozilla::dom::JSActorManager::ReceiveRawMessage]

Collect JSActorName/Message info

Flags: needinfo?(rjesup)

Nika said she'll look into this.

Flags: needinfo?(nika)

I looked through a few of the reports here, and a large number of them seem to have the JSOutOfMemory annotation set to "Reported" on them, which somewhat implies that these assertion failures are being caused by an OOM while deserializing the structured clone data.

It would be nice if an OOM while deserializing structured clone data here could produce a different error from other types of deserialization failures, so we can handle them differently.
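
A hedged sketch of the distinction being asked for here (the enum and handler are hypothetical, not existing Gecko code): surface the decode-failure reason so OOMs can be treated differently from genuinely malformed data.

    // Hypothetical types and names, purely to illustrate the suggestion above.
    #include "mozilla/Assertions.h"

    enum class DecodeFailure {
      OutOfMemory,  // the JS engine reported an OOM while deserializing
      BadData,      // the structured clone data itself could not be decoded
    };

    void HandleDecodeFailure(DecodeFailure aReason) {
      if (aReason == DecodeFailure::OutOfMemory) {
        // Nothing actionable in the report; drop the message (or use an
        // OOM-specific crash signature) instead of the generic assertion.
        return;
      }
      // Malformed data still indicates a real bug worth investigating.
      MOZ_DIAGNOSTIC_ASSERT(false, "Should not receive non-decodable data");
    }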

Flags: needinfo?(nika)

Neha asked me to look at this.

Nika's point about JSOutOfMemory being set is a good one, and does seem like a plausible explanation. One line of investigation here would be to get a better error annotation. The structured clone code seems to generate more detailed error information in ReportDataCloneError, but right now we're not recording that in the annotation. Maybe we could even explicitly check if there's a pending OOM exception? I'm not sure where that gets cleaned up. The goal of this line of investigation would be to confirm that these are OOM crashes, or if they aren't, try to figure out what they are. I saw one report that had only 20MB of physical memory free, but I saw another one that seemed to have plenty of memory and pagefile (the latter being common for low memory situations). It would also be nice to know how large the allocation was that failed, but structured clone code lives in JS, and it is hard to find that out in SpiderMonkey, in my experience.

Another line of investigation would be to split up these crashes by actor type. The crash reports already contain JSActorName and JSActorMessage fields, which is good, but as far as I can see these are not indexed in a way that lets you do any aggregation. Maybe we could get the Socorro people to index these fields so a query could facet on these fields. If that proves to be useful, maybe the signature could reflect the actor name and the message. If this is really an OOM crash, maybe some actor is sending way too much data, so it would make more sense to ascribe the crash to the specific actor and not the general window actor infrastructure.

I looked at the first 20 reports, and the actor and message were as follows:
  • 7 BrowserTab, Browser:Reload
  • 5 Conduits, PortMessage
  • 3 AboutReader, Reader:PushState
  • 3 Conduits, RuntimeMessage
  • 1 Conduits, CallResult
  • 1 ExtensionContent, Execute

A third line of investigation would be to look over the actor messages that show up a lot (in this or another sample of crashes) and figure out if there's something we could do to make them smaller.

I guess I looked at the messages before. Browser:Reload is kind of interesting to see because as far as I can see it only sends an integer and a boolean, so maybe the contents of the message aren't to blame for whatever is happening, at least if it is an OOM.

kmag will add payload size so we get more info from these crash reports.

Assignee: nobody → kmaglione+bmo
Status: NEW → ASSIGNED
Fission Milestone: M6b → M6c
Flags: needinfo?(rjesup)

Visiting https://v8.github.io/test262/website/default.html# , clicking Run, clicking Run All and allowing the tests to run will trigger this crash semi-reliably:

==2160546==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f01e8f7bd5a bp 0x7ffcfa6983f0 sp 0x7ffcfa697fa0 T0)
==2160546==The signal is caused by a WRITE memory access.
==2160546==Hint: address points to the zero page.
    #0 0x7f01e8f7bd5a in mozilla::dom::JSActorManager::ReceiveRawMessage(mozilla::dom::JSActorMessageMeta const&, mozilla::dom::ipc::StructuredCloneData&&, mozilla::dom::ipc::StructuredCloneData&&) /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSActorManager.cpp:161:5
    #1 0x7f01e8f5693c in mozilla::dom::WindowGlobalChild::RecvRawMessage(mozilla::dom::JSActorMessageMeta const&, mozilla::dom::ClonedMessageData const&, mozilla::dom::ClonedMessageData const&) /builds/worker/checkouts/gecko/dom/ipc/WindowGlobalChild.cpp:561:3
    #2 0x7f01e298397f in mozilla::dom::PWindowGlobalChild::OnMessageReceived(IPC::Message const&) /builds/worker/workspace/obj-build/ipc/ipdl/PWindowGlobalChild.cpp:1175:61
    #3 0x7f01e22a08ae in mozilla::dom::PContentChild::OnMessageReceived(IPC::Message const&) /builds/worker/workspace/obj-build/ipc/ipdl/PContentChild.cpp:8621:32
    #4 0x7f01e20a24e8 in mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&) /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:2150:25
    #5 0x7f01e209eba2 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:2074:9
    #6 0x7f01e20a084a in mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1922:3
    #7 0x7f01e20a0e8d in mozilla::ipc::MessageChannel::MessageTask::Run() /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1953:13
    #8 0x7f01e0dd4757 in mozilla::RunnableTask::Run() /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:245:16
    #9 0x7f01e0dd0404 in mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:515:26
    #10 0x7f01e0dcde65 in mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:374:15
    #11 0x7f01e0dce3f7 in mozilla::TaskController::ProcessPendingMTTask(bool) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:171:36
    #12 0x7f01e0dd9a71 in operator() /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:85:37
    #13 0x7f01e0dd9a71 in mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::$_3>::Run() /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:577:5
    #14 0x7f01e0df85c2 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1197:14
    #15 0x7f01e0e025f1 in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:513:10
    #16 0x7f01e20a9e47 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:87:21
    #17 0x7f01e1fc2312 in RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:334:10
    #18 0x7f01e1fc2312 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:327:3
    #19 0x7f01e1fc2312 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:309:3
    #20 0x7f01e987313a in nsBaseAppShell::Run() /builds/worker/checkouts/gecko/widget/nsBaseAppShell.cpp:137:27
    #21 0x7f01ed83844f in XRE_RunAppShell() /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:913:20
    #22 0x7f01e1fc2312 in RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:334:10
    #23 0x7f01e1fc2312 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:327:3
    #24 0x7f01e1fc2312 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:309:3
    #25 0x7f01ed837ce1 in XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:744:34
    #26 0x55ebc515dac8 in content_process_main /builds/worker/checkouts/gecko/browser/app/../../ipc/contentproc/plugin-container.cpp:56:28
    #27 0x55ebc515dac8 in main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:304:18
    #28 0x7f01f7d03041 in __libc_start_main (/lib64/libc.so.6+0x27041)
    #29 0x55ebc50b09a8 in _start (/home/geeknik/firefox/firefox-bin+0xb69a8)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSActorManager.cpp:161:5 in mozilla::dom::JSActorManager::ReceiveRawMessage(mozilla::dom::JSActorMessageMeta const&, mozilla::dom::ipc::StructuredCloneData&&, mozilla::dom::ipc::StructuredCloneData&&)
==2160546==ABORTING

This also appears in the console if helpful:
Assertion failure: false (Should not receive non-decodable data), at /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSActorManager.cpp:161

It appears to be the result-get-matched-err test: it triggers the slow script warning, and the tab will crash if you don't do anything about it. Even if you do click "Stop It", the tab will likely crash. Whilst the browser "hangs", memory use eventually reaches 100%, at which point the tab crashes? Last image seen before the tab crashed: https://i.imgur.com/7e4dfhs.png

I wonder if bug 1660539 is related to this.

Could be, I haven’t been able to reproduce this since bug 1660539 was fixed.

I just experienced this with Fission enabled on today's Nightly on Linux, on a Google Doc: bp-011a96d9-0075-4352-ae51-d21fb0201201

Sylvestre, were you running out of memory?

In the (non-public) details tab of your crash report, I see

  • JSActorMessage Execute
  • JSActorName ExtensionContent
  • JSOutOfMemory Reported

The "Execute" message is certainly expected to be "decodable"; it is a JSON-serializable object.

Unlikely; this is the system on which I had the issue:

% cat /proc/meminfo
MemTotal:       65723420 kB
MemFree:        20876544 kB
MemAvailable:   51710984 kB
Buffers:         1378132 kB
Cached:         30175740 kB
SwapCached:        45928 kB
Active:         28126880 kB

Hello,
A process on my Firefox session just crashed.
And via about:crashes, I've been redirected to this bug.

So, here are STR (steps to reproduce); I hope this may help:

  • On 64-bit Windows, launch the latest Nightly (in this case, 2020-12-20).
  • Launch a live video on YouTube.
  • Watch this live stream for one hour to one and a half hours.

Obtained result:
During the live stream, memory rose to the maximum available limit (on my PC, around 10 GB on a machine with 12 GB of RAM).

Expected result:
Memory remains at a nominal level.

This is not the first time I have had this crash on a live video on YouTube, but it never occurs on a "classic" (i.e., not live) YouTube video.

I don't see evidence of frequent OOM conditions in the crash reports save for a few (~5% of them). Can you point us to your particular report? If Firefox ran out of memory we might find a memory report attached which could help us diagnose the issue.

Flags: needinfo?(lolo2bdx)
Assignee: kmaglione+bmo → continuation

Hello Gabriele,
I have been able to reproduce the conditions just before the crash (Firefox uses a lot of memory while a live video on YouTube is played), and I have obtained a memory report, so here it is.
And I have obtained a memory report, so here it is.

Flags: needinfo?(lolo2bdx)
Attachment #9197591 - Attachment description: High memory used by Firefox on a live video on outube → High memory used by Firefox on a live video on Youtube

Thanks, this is extremely useful. The process playing YouTube isn't particularly large, but the extension process is huge: it's taking almost 1.5 GiB of memory on its own. Looking at the various bits under that, it seems that VideoDownloadHelper has allocated and never freed hundreds of megabytes of strings. So it seems to be leaking memory somehow; can you try disabling the extension and seeing if the problem goes away? I'll inspect the other crashes to see if they're also using the same extension.

[edit] I misinterpreted the memory report: it's not 2-byte strings, it's TwoByte strings, i.e. non-Latin1 Unicode strings.

I inspected a few more memory reports in the crashes and I don't see a pattern, unfortunately. The report attached in comment 37 is definitely a leak, so this crash might also be triggered by OOM-like conditions, or they might make it more likely. I poked a few more crashes for URLs and comments, and it seems that pages with videos (Facebook feeds, YouTube, and other streaming services) are more common than others, but there's no clear pattern.

Is the value of the data that has been received by JSActorManager::ReceiveRawMessage() important? If it is I can crack open a few minidumps and see if I can extract some useful samples.

We have bug 1686267 on file regarding VideoDownloadHelper memory spiraling out of control, so you could look at that.

(In reply to Gabriele Svelto [:gsvelto] from comment #39)

Is the value of the data that has been received by JSActorManager::ReceiveRawMessage() important? If it is I can crack open a few minidumps and see if I can extract some useful samples.

I don't know if the data per se is important, but we are interested in the size of the message being received, or the size of the structured clone data.

I think we should reconsider tracking bug 1563825 as part of Fission m6c. The crash is showing up in Developer edition, where Fission can't be enabled, and only about 10% of crashes with this signature on Nightly have Fission enabled over the last month. If you look across all crashes on Nightly in the last month, 26% of them have Fission enabled. This might be skewed a bit due to the recent high frequency crash that affected Fission more, but it still suggests this isn't a Fission-specific problem, but rather a problem with infrastructure introduced to support Fission. As such, it feels like it shouldn't block Fission rollout.

Fission Milestone: M6c → ?

(In reply to Andrew McCreight [:mccr8] from comment #41)

I think we should reconsider tracking bug 1563825 as part of Fission m6c. The crash is showing up in Developer edition, where Fission can't be enabled, and only about 10% of crashes with this signature on Nightly have Fission enabled over the last month. If you look across all crashes on Nightly in the last month, 26% of them have Fission enabled. This might be skewed a bit due to the recent high frequency crash that affected Fission more, but it still suggests this isn't a Fission-specific problem, but rather a problem with infrastructure introduced to support Fission. As such, it feels like it shouldn't block Fission rollout.

Clearing Fission Milestone because Nika says this is not a Fission-specific bug.

26% of crash reports had Fission enabled in the last month, but that's expected for almost any crash because Fission is enabled for about 20-25% of Nightly users.

Fission Milestone: ? → ---
Whiteboard: [fission-]

Not a Fission bug

Whiteboard: [fission-] → [not-a-fission-bug]

Too late for 86 RC, but I am keeping the status for Firefox as fix-optional, as I would probably take a safe patch in a potential dot release.

I opened up a couple of minidumps but had a hard time figuring out where to find the size of the data being cloned. What field should I be looking at? In one minidump aData.mStorage.val.mExternalData.bufList_ has a mSize field of ~8KiB, in the other 1.5KiB. In both cases drilling down the error object I find a mErrorNumber field set to MSG_INVALID_ENUM_VALUE. Is any of this useful?

ni?mccr8 in case comment 45 helps.

Flags: needinfo?(continuation)

I don't know what comment 45 means.

Flags: needinfo?(continuation)
Severity: critical → S2

I'm not going to have time to look at this soon.

Assignee: continuation → nobody
Status: ASSIGNED → NEW
QA Whiteboard: qa-not-actionable
Crash Signature: [@ mozilla::dom::JSWindowActor::ReceiveRawMessage] [@ mozilla::dom::JSActor::ReceiveRawMessage] [@ mozilla::dom::JSActorManager::ReceiveRawMessage] → [@ mozilla::dom::JSWindowActor::ReceiveRawMessage] [@ mozilla::dom::JSActor::ReceiveRawMessage] [@ mozilla::dom::JSActorManager::ReceiveRawMessage] [@ OOM | unknown | mozilla::dom::JSActorManager::ReceiveRawMessage]

These crashes are all fallout from JS OOMs. Even if we ignore the message decoding errors instead of crashing, the browser is probably going to crash elsewhere soon or we will miss an important message, leaving the content process in an inconsistent state.

Has Regression Range: --- → yes

(In reply to Chris Peterson [:cpeterson] from comment #49)

These crashes are all fallout from JS OOMs. Even if we ignore the message decoding errors instead of crashing, the browser is probably going to crash elsewhere soon or we will miss an important message, leaving the content process in an inconsistent state.

I assume this makes it less severe.

FWIW, I see very few crashes like https://crash-stats.mozilla.org/report/index/fd18037f-32b8-4cd8-aaaa-926050220301 in release that seem to indicate that while trying to do something with the mPendingQueries of an actor returned from GetActor we have a nullptr access. And there seems to be a ::new on that stack, too.

Severity: S2 → S3
Priority: P2 → P3

Just hit this crash, in a mozilla.org process that was using >5.8 GB of memory (per system sysinfo). It may have happened when I ran about:memory, which requires allocating memory to return a result, which may have triggered a JS OOM.

See Also: → 1764415

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #51)

Just hit this crash, in a mozilla.org process that was using >5.8 GB of memory (per system sysinfo). It may have happened when I ran about:memory, which requires allocating memory to return a result, which may have triggered a JS OOM.

Yup, this crash tends to occur due to a JS heap OOM while deserializing.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

:janv, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jvarga)
Keywords: topcrash

I wonder if we should stop asserting when the failure is because of an OOM. I don't like the idea of continuing with a child process when it's failed to process a message, since that could mean its state could be out of sync with the parent in dangerous ways. But this isn't a release assert, so it isn't helping release users at all. And the number of crashes from OOMs means we don't actually see any reports that failed to deserialize the message for other reasons that we can actually fix...

(In reply to Kris Maglione [:kmag] from comment #54)

I wonder if we should stop asserting when the failure is because of an OOM. I don't like the idea of continuing with a child process when it's failed to process a message, since that could mean its state could be out of sync with the parent in dangerous ways. But this isn't a release assert, so it isn't helping release users at all. And the number of crashes from OOMs means we don't actually see any reports that failed to deserialize the message for other reasons that we can actually fix...

If I read StructuredCloneHolder::ReadFromBuffer correctly, it seems we get a specific error message from JS but always throw a DataCloneError.

IIUC, that makes it difficult to just check for OOM and exclude that case from the assertion, which might still be of interest in other cases?

Duplicate of this bug: 1751391

(In reply to Jens Stutte [:jstutte] from comment #55)

If I read StructuredCloneHolder::ReadFromBuffer correctly, it seems we get a specific error message from JS but always throw a DataCloneError.

IIUC, that makes it difficult to just check for OOM and exclude that case from the assertion, which might still be of interest in other cases?

Yes. I looked into it after I made the suggestion and came to the conclusion that the simplest thing would be to just check whether the OOM reported flag was set. Unfortunately, even the error we get from the JS engine is not very specific, and the situation isn't very easy to improve. The spec says that we need to throw a DataCloneError, but it would be nice if internal consumers could still get more specific error details when they want them.

Spike in crashes over the last 2 days aligns with the spike in bug 1405521.

See Also: → 1405521
See Also: → 1803675

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

:aiunusov, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(aiunusov)
Keywords: topcrash

Still need to collect more information.

Flags: needinfo?(aiunusov)
Component: DOM: Content Processes → DOM: Navigation
Flags: needinfo?(jvarga)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:

  • Top 10 desktop browser crashes on nightly
  • Top 10 content process crashes on beta

For more information, please visit auto_nag documentation.

Keywords: topcrash

Ideally, we don't want to continue running a child process if it fails to
handle a message from the parent, since that could mean child and parent state
could get out of sync. But since this assertion is only a diagnostic assert,
it isn't guaranteeing that in release builds anyway. And since the vast
majority of the crashes we are seeing in builds with diagnostic asserts
enabled appear to be OOMs, we can't really use crash reports to diagnose other
issues.

Ideally (again), we'd determine if the failure was caused by an OOM based on
the failure code returned by the structured clone decode call. Unfortunately,
though, since the spec requires that we return a generic DataCloneError on
failure, the structured clone code intentionally hides the specifics of
failure from callers. Propagating out more specific failure reasons for use by
privileged callers is nontrivial. So this patch essentially does the same
thing as crash reports do, and checks whether an OOM was reported recently,
and hasn't been recovered from by a successful GC.
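
A minimal sketch of the behavior the patch describes (the helper below is a hypothetical stand-in; the real code derives this from the existing JS OOM reporting and GC-recovery state):

    // Sketch only. RecentUnrecoveredJSOOM() stands in for "a JS OOM was
    // reported and no successful GC has recovered from it since".
    #include "mozilla/Assertions.h"

    static bool RecentUnrecoveredJSOOM() {
      return false;  // hypothetical; see the patch description above
    }

    void OnRawMessageDecodeFailure() {
      if (RecentUnrecoveredJSOOM()) {
        // Almost certainly an OOM during structured clone decode; dropping
        // the message avoids a crash report we cannot act on.
        return;
      }
      // Any other decode failure still trips the diagnostic assert on
      // channels that enable it, so real protocol bugs remain visible.
      MOZ_DIAGNOSTIC_ASSERT(false, "Should not receive non-decodable data");
    }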

Assignee: nobody → kmaglione+bmo
Status: NEW → ASSIGNED
Pushed by maglione.k@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/3b1e0a2fb06b
Don't assert on failure to decode message after OOM. r=mccr8
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 114 Branch
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 114 Branch → ---

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash
Flags: needinfo?(kmaglione+bmo)
See Also: → 1836195
Assignee: kmaglione+bmo → nobody