Open Bug 1716849 Opened 3 years ago Updated 16 days ago

Segmentation fault in ExtensionsParent::RecvStateChange during session restore

Categories

(Core :: DOM: Navigation, defect, P3)

Firefox 89

Tracking Status
firefox-esr78 --- unaffected
firefox89 --- affected
firefox90 --- affected
firefox91 --- affected

People

(Reporter: eternaleye, Unassigned)

Details

(Keywords: leave-open)

Attachments

(5 files, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Falkon/3.1.0 Chrome/83.0.4103.122 Safari/537.36

Steps to reproduce:

  • Use Firefox 85.0.2 normally
  • Experience a power loss event
  • Boot computer
  • Attempt to reopen Firefox with 89.0.1

Actual results:

  • Segmentation fault (I have a coredump, and so can debug at will)
  • On prior versions of Firefox (85.0.2, 85.0.1, 86.0, 87.0) I experienced https://bugzilla.mozilla.org/show_bug.cgi?id=1694979 instead
  • Much like the above, this seems to be connected to sessions where a given window has a very large tab count.
  • Stack trace of crashing thread fed through c++filt (consistent across attempts):
#0  0x00007f344793e8d5 raise (libpthread.so.0 + 0x138d5)
#1  0x00007f344146ebf8 nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*) (libxul.so + 0x409cbf8)
#2  0x00007f344793ea40 __restore_rt (libpthread.so.0 + 0x13a40)
#3  0x00007f343dd75feb mozilla::extensions::ExtensionsParent::RecvStateChange(mozilla::dom::MaybeDiscarded<mozilla::dom::BrowsingContext>&&, nsIURI*, nsresult, unsigned int) [clone .cold] (libxul.so + 0x9a3feb)
#4  0x00007f343e4491e6 mozilla::extensions::PExtensionsParent::OnMessageReceived(IPC::Message const&) (libxul.so + 0x10771e6)
#5  0x00007f343e411a0d mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&) (libxul.so + 0x103fa0d)
#6  0x00007f343e337cff mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&) (libxul.so + 0xf65cff)
#7  0x00007f343e33d536 mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) (libxul.so + 0xf6b536)
#8  0x00007f343e33eefa mozilla::ipc::MessageChannel::MessageTask::Run() (libxul.so + 0xf6cefa)
#9  0x00007f343dec4134 mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (libxul.so + 0xaf2134)
#10 0x00007f343dec6093 mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (libxul.so + 0xaf4093)
#11 0x00007f343dec628c mozilla::TaskController::ProcessPendingMTTask(bool) (libxul.so + 0xaf428c)
#12 0x00007f343dec6362 mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::{lambda()#1}>::Run() (libxul.so + 0xaf4362)
#13 0x00007f343dee27f6 nsThread::ProcessNextEvent(bool, bool*) (libxul.so + 0xb107f6)
#14 0x00007f343ded1de8 NS_ProcessNextEvent(nsIThread*, bool) (libxul.so + 0xaffde8)
#15 0x00007f343e32c8ba mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (libxul.so + 0xf5a8ba)
#16 0x00007f343e2f8cf4 MessageLoop::Run() (libxul.so + 0xf26cf4)
#17 0x00007f3440503bd8 nsBaseAppShell::Run() (libxul.so + 0x3131bd8)
#18 0x00007f34413bfe66 nsAppStartup::Run() (libxul.so + 0x3fede66)
#19 0x00007f3441481bd3 XREMain::XRE_mainRun() (libxul.so + 0x40afbd3)
#20 0x00007f3441482548 XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (libxul.so + 0x40b0548)
#21 0x00007f3441482ade XRE_main(int, char**, mozilla::BootstrapConfig const&) (libxul.so + 0x40b0ade)
#22 0x00005599f3a5f0e6 do_main (firefox + 0x100e6)
#23 0x00005599f3a5e52b main (firefox + 0xf52b)
#24 0x00007f34475a1b35 __libc_start_main (libc.so.6 + 0x27b35)
#25 0x00005599f3a5e7fe _start (firefox + 0xf7fe)

Expected results:

Firefox should, hopefully, not crash during session restore

The Bugbug bot thinks this bug should belong to the 'Core::DOM: Navigation' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → DOM: Navigation
Product: Firefox → Core

On prior versions of Firefox (85.0.2, 85.0.1, 86.0, 87.0) I experienced https://bugzilla.mozilla.org/show_bug.cgi?id=1694979 instead

Alex, are you able to reproduce this crash every time? How are you reproducing the power loss event? How do you recover from this crash?

Maybe there is a profile corruption on power failure? We're failing to load WebNavigation.jsm and crashing on this MOZ_RELEASE_ASSERT:

https://searchfox.org/mozilla-central/rev/c114db74a92cf15096dfda02255e125949b0e070/toolkit/components/extensions/ExtensionsParent.cpp#26

Here are some similar crash reports that look like JS out of memory or profile corruption:

bp-b910286b-ae32-455e-b215-bb6bc0210620
bp-147115f6-0861-4c63-a654-616f80210621

Status: UNCONFIRMED → NEW
Crash Signature: [@ mozilla::extensions::ExtensionsParent::RecvStateChange]
Ever confirmed: true
Flags: needinfo?(eternaleye)

Whether I can reproduce it is a bit complex. I've basically been attempting to restore the same session since the initial power loss event; the crash on attempting to restore is deterministic given the session, but I haven't even tried to reproduce the power loss event.

However, what I have done is manually verify that the session restore file is valid, and even attempted bisecting it down (removing windows and tabs) to try and pinpoint the issue. I found that, no matter which subset of windows and tabs I remove, the problem vanishes if I go below a (frustratingly fuzzy) point.

The crash I noted as happening in earlier versions of Firefox is also a MOZ_RELEASE_ASSERT, and the best hypothesis we have for that is that it failed due to a data race (https://bugzilla.mozilla.org/show_bug.cgi?id=1671601) that I (nearly) deterministically lost due to the extremely large number of tabs. I suspect the same may be occurring here.

The problem with it only manifesting with an unminimized session, however, is that I'm not comfortable sharing the core dump or the session file.
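
The bisection described in this comment could be automated roughly as follows. This is only a sketch under a strong assumption: that there is a single tab count above which the restore always crashes and below which it never does, which the "frustratingly fuzzy" behaviour reported here may not satisfy. The function name `find_crash_threshold` and the predicate `crashes` are hypothetical; in practice `crashes(n)` would mean "trim the session to n tabs, start Firefox, and report whether it crashed".

```python
def find_crash_threshold(lo, hi, crashes):
    """Return the smallest n in (lo, hi] for which crashes(n) is True.

    Assumes crashes(lo) is False, crashes(hi) is True, and that crashing
    is monotonic in n (once it starts crashing, larger n also crashes).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(mid):
            hi = mid  # the threshold is at mid or below
        else:
            lo = mid  # the threshold is above mid
    return hi


# Hypothetical stand-in for "restore a session trimmed to n tabs".
def simulated_crash(n):
    return n >= 5


threshold = find_crash_threshold(0, 100, simulated_crash)
print(threshold)
```

If the crash is only nearly deterministic, as the data-race hypothesis suggests, each `crashes(n)` probe would need to be repeated several times before trusting its answer.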

Flags: needinfo?(eternaleye)

Unfortunately, it's probably just an OOM. There aren't many other ways we can fail to load that module other than corruption (which doesn't seem to be the case here). That's also consistent with the crash moving around depending on the version.

I can guarantee it's not an OOM; 32GB of RAM and 128 GB of swap go a long way.

To be more specific, before the failure it had reliably run just fine in under a quarter of RAM at startup, and it gets far enough (in safe mode, which is what I've been doing) to open the windows. It never touches swap at all, either.

kmag is going to look at some crash reports to see if there is a problem with nested event loops (XPCOM spin event loop in Add-ons Manager). He might add a new annotation to collect more details about the error.

Severity: -- → S3
Flags: needinfo?(kmaglione+bmo)
Priority: -- → P3
Flags: needinfo?(kmaglione+bmo)
Keywords: leave-open

There are a number of modules that we import from C++ and can't continue
running without. We have a number of crashes for some of those failed loads. A
lot of them are from OOMs or corruption, but we're not sure about the rest.

This patch adds a crash annotation with the details of the error wherever we
abort for failing to load a module.

Assignee: nobody → kmaglione+bmo
Status: NEW → ASSIGNED
Attachment #9232049 - Flags: data-review?(chutten)

Comment on attachment 9232049 [details]
Request for data collection review form

For Q5: Please list the annotations paying special attention to what category the collections fall into

For Q7: If the collections are to be collected permanently, please identify the individual responsible for the collections.

Attachment #9232049 - Flags: data-review?(chutten) → data-review-

I don't know if this is just a red herring, but looking through the crashes, quite a few of them (but not all) have this in their app notes: "ToShmem failed for Atom: must be a static atom: anonymous-div". That annotation was added in bug 1621773, and see this comment for a mention of that specific atom, which was apparently removed in 2020, and the theory was that it involved the binary getting updated but loading an old style sheet.

See Also: → 1621773

Other common signatures with the atom annotation in their app notes are:

The crashes look similar, at least in terms of being related to some module import failing. I don't know if those signatures should be added to this bug or not.

Attachment #9232049 - Attachment is obsolete: true
Attachment #9232227 - Flags: data-review?(chutten)

Comment on attachment 9232227 [details]
Request for data collection review form

PRELIMINARY NOTES:

Thank you for the fast return on this.

Please confirm that the error message isn't parameterized in a way that could contain information supplied by the user or their system (e.g. no full paths or usernames please). I presume since these are JSMs their names are all known ahead of time so they're unlikely to contain sensitive data. I will proceed with the review with these assumptions.

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Kris Maglione is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does the data collection use a third-party collection tool?

No.


Result: datareview+

Attachment #9232227 - Flags: data-review?(chutten) → data-review+

(In reply to Chris H-C :chutten from comment #14)

Please confirm that the error message isn't parameterized in a way that could contain information supplied by the user or their system (e.g. no full paths or usernames please). I presume since these are JSMs their names are all known ahead of time so they're unlikely to contain sensitive data. I will proceed with the review with these assumptions.

Yes, the URL of the JSM is required to be a static string, so it will never contain any data supplied by the user. The JS error message itself can technically be arbitrary, but given the nature of the call that's generating it (only loading a built-in JSM, none of which do more than basic initialization), it is exceedingly unlikely that it would ever contain system- or user-specific data. The majority of the error messages should just be "out of memory" or syntax errors possibly containing segments of static JS source code with single-bit-flip errors. The rest should still all point to static source locations and have relatively generic error messages.

Pushed by maglione.k@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/ef812b3d914d
Add crash annotation for error when aborting for failed module import. r=mccr8
Pushed by maglione.k@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/78e3c985ebb1
Add crash annotation for error when aborting for failed module import. r=mccr8
See Also: → 1724000

Same here, but after unrelated firefox crash instead of power loss :/
I guess https://crash-stats.mozilla.org/report/index/754b29d5-304d-4bca-960f-c7aa80210902 and some other reports should have the needed annotations.

(consistently reproducible on nightly in safe mode)

Ok, hear me out: firefox just can't handle big sessions.

  1. Create fresh profile
  2. Enable session store
  3. Open over 8000 tabs of https://ojab.ru (it crashed before reaching 9000, sorry)
  4. Boom!
    resulting sessionstore in the attached file

Reproducible here with

  1. Create fresh profile
  2. about:preferences
  3. Check Restore previous session
  4. Close firefox
  5. Copy the attached sessionstore to the profile
  6. Open firefox
  7. Boom!

(In reply to ojab from comment #22)

Created attachment 9239255 [details]
sessionstore.jsonlz4 kaboom

I seem to be afflicted with the same issue. Performing a tab count of my decompressed 'previous.jsonlz4' gives 7992 tabs (working). My sessionstore.jsonlz4 gives 8050 tabs (crashing). Rule of thumb seems to be: don't exceed ~8000 tabs.

Firefox 92.0.1 (Windows 64-bit)
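
A tab count like the one in this comment can be sketched as follows. This is a minimal sketch that assumes the session file has already been decompressed to plain JSON (the decompression step is not shown) and that the JSON has a top-level `windows` array whose entries each carry a `tabs` array. The function name `count_tabs` and the sample data are hypothetical.

```python
import json


def count_tabs(session):
    """Return (total_tabs, tabs_per_window) for a parsed session object.

    Assumes the layout {"windows": [{"tabs": [...]}, ...]}; missing keys
    are treated as empty lists.
    """
    per_window = [len(w.get("tabs", [])) for w in session.get("windows", [])]
    return sum(per_window), per_window


# Hypothetical sample standing in for a decompressed session file.
sample = json.loads('{"windows": [{"tabs": [{}, {}, {}]}, {"tabs": [{}]}]}')
total, per_window = count_tabs(sample)
print(total, per_window)
```

Reporting the per-window counts as well as the total is deliberate, since later comments suggest the number of tabs in a single window may matter more than the overall total.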

Status update on my end - this seems to be specifically connected to the number of tabs in a single window - as of Firefox 94 I still had the issue, but as soon as I manually edited my session to add another window object in the JSON, and moved 2k of the tabs to it, the issue vanished. My current hypothesis is that session restore enters some kind of tight loop when restoring the tabs of a single window, and some timeouts may be initiated before it and then run out during it, causing failures in code that has no other causal relationship.
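
The manual edit described here (moving tabs out of an oversized window into a new window object) could be scripted along these lines. This is a sketch, not a tested recovery tool: it assumes the decompressed session JSON has a top-level `windows` array with a `tabs` array per window, and that a window's `selected` field, if present, is a 1-based index into `tabs`. The function name `split_large_windows` is hypothetical, and every other per-window field is simply copied into each new window.

```python
import copy


def split_large_windows(session, max_tabs):
    """Return a copy of `session` where no window has more than `max_tabs` tabs."""
    if max_tabs < 1:
        raise ValueError("max_tabs must be at least 1")
    result = copy.deepcopy(session)  # never modify the caller's data
    new_windows = []
    for window in result.get("windows", []):
        tabs = window.get("tabs", [])
        if len(tabs) <= max_tabs:
            new_windows.append(window)
            continue
        for start in range(0, len(tabs), max_tabs):
            # Copy every field except the tab list into the new window.
            piece = {k: copy.deepcopy(v) for k, v in window.items() if k != "tabs"}
            piece["tabs"] = tabs[start:start + max_tabs]
            # Assumption: "selected" is a 1-based tab index; reset it so it
            # always refers to a tab that exists in this piece.
            if "selected" in piece:
                piece["selected"] = 1
            new_windows.append(piece)
    result["windows"] = new_windows
    return result


# Hypothetical session: one window holding five placeholder tabs.
session = {"windows": [{"selected": 4, "tabs": [{"id": i} for i in range(5)]}]}
split = split_large_windows(session, 2)
print([len(w["tabs"]) for w in split["windows"]])  # prints [2, 2, 1]
```

Work on a backup copy of the session file; the result would still need to be recompressed into the format Firefox expects before it can be restored.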

Flags: needinfo?(kmaglione+bmo)

Is there some timeout that I can bump in about:config if I'm affected by this (or bug 1724000)?

See Also: → 1751836

(In reply to The 8472 from comment #25)

Is there some timeout that I can bump in about:config if I'm affected by this (or bug 1724000)?

Kris may have thoughts (if this is still relevant).

Flags: needinfo?(kmaglione+bmo)

Well, this crash signature no longer exists since we landed the diagnostic patch. We now have crashes in [@ mozilla::loader::ImportModule] for the failure, and for all of the ones I've looked at, the error is 'Failed to load module "resource://gre/modules/WebNavigation.jsm": ... out of memory', together with the "JSOutOfMemory: Reported" annotation.

Which is fairly strange, since the reports also look like we have plenty of memory available. But there are also separate memory limits for the JS engine, and I suppose it's possible something is filling up that memory and we aren't immediately able to GC it for some reason.

I don't really have any good solutions aside from maybe forcing a full GC/CC and trying again when we OOM trying to load a critical module... It might work.

Andrew, do you have any thoughts?
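
The "force a full GC and try again" idea can be illustrated in general terms as below. This is a language-neutral illustration written in Python, not Gecko code: `load_with_gc_retry` and `flaky_load` are hypothetical names, Python's `MemoryError` stands in for the JS engine's out-of-memory report, and `gc.collect()` stands in for a forced full GC/CC.

```python
import gc


def load_with_gc_retry(load, retries=1):
    """Call load(); on MemoryError, force a full collection and retry.

    After `retries` failed retries the last MemoryError is re-raised, so
    the caller can still abort as it does today.
    """
    for attempt in range(retries + 1):
        try:
            return load()
        except MemoryError:
            if attempt == retries:
                raise  # out of retries: let the failure propagate
            gc.collect()  # free whatever can be freed before retrying


# Hypothetical loader that fails once and then succeeds.
calls = {"n": 0}


def flaky_load():
    calls["n"] += 1
    if calls["n"] == 1:
        raise MemoryError("simulated out of memory")
    return "module"


result = load_with_gc_retry(flaky_load)
print(result, calls["n"])
```

Whether a retry helps depends on whether the memory pressure is actually collectable at that point, which is the open question in this comment.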

Flags: needinfo?(kmaglione+bmo) → needinfo?(continuation)

(In reply to Kris Maglione [:kmag] from comment #27)

Which is fairly strange, since the reports also look like we have plenty of memory available. But there are also separate memory limits for the JS engine, and I suppose it's possible something is filling up that memory and we aren't immediately able to GC it for some reason.

I don't really have any good solutions aside from maybe forcing a full GC/CC and trying again when we OOM trying to load a critical module... It might work.

Andrew, do you have any thoughts?

Or Paul?

Flags: needinfo?(pbone)

The JS engine has a limit of 4GB for the JS heap size. That doesn't include DOM objects (in the jemalloc heap) or things like backing storage for JS arrays (which are also in the jemalloc heap). So while 4GB is a lot smaller than the 32GB of physical memory the user has installed, I'd hope it would be enough since things like arrays are stored elsewhere, which means that if this is hitting the limit it's probably actionable.

Now that we can reproduce this, adding some logging might confirm or refute this. It should be possible for WebNavigation.jsm, or the module that loads it, to ask the JS engine how much memory it has allocated at various stages during the load.

Or just run it in the Firefox profiler. I'd expect to see an emergency GC or two before the crash; you can look at the markers for these and see how big the JS heap is.

Flags: needinfo?(pbone)

I don't have any real ideas, sorry.

Flags: needinfo?(continuation)

I don't know if there is a way to grab Firefox profiler profiles when Firefox ends up crashing, so I instead grabbed a "perf" profile, up to the crash. Hopefully there is something interesting in here: it was recorded while starting Firefox with a profile whose sessionstore always triggers the crash this bug is about.

(In reply to Rob North from bug 1724000 comment #5)

Had an interesting correlation for this bug.
Happened late in startup when restoring the session.
The profile had a silly number of windows and tabs open, but the distinguishing feature was that on startup it downloaded and opened the Bluetooth specification (a PDF file).
If I minimised the window before it finished loading the PDF, Firefox would no longer crash.
Happened about 3-4 times before I tried minimising the window.

Typical report is: 0237ca32-4c85-4b6d-9329-75ef40220727

That's interesting, and in some ways surprising and in other ways unsurprising. If the spec is opening to a PDF.js window, it could conceivably consume enough heap space to cause the OOM. But I'd also expect PDF.js to load in a content process, which shouldn't consume any heap space in the parent process. Minimizing the window may also prevent it from rendering until the window is shown again.

If it opens in an external PDF viewer, something else could be going on, but we still could be delaying it until the window is shown again.

Is the URL of the PDF something that you can share so someone can try to reproduce?

Flags: needinfo?(6jju4k002)

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::extensions::ExtensionsParent::RecvStateChange] → [@ mozilla::extensions::ExtensionsParent::RecvStateChange] [@ mozilla::dom::Promise::AppendNativeHandler] [@ mozilla::loader::ImportModule]

Ok, happened again, and once again successfully recovered by minimising window.
It appears that the window that was open was pointing at the following URL: https://www.bluetooth.org/docman/handlers/downloaddoc.ashx?doc_id=478726, which in turn downloaded the PDF, and opened in a new window (Firefox is set to be pdf viewer).

Over time, with multiple re-starts this has opened new tabs in multiple windows, resulting in 12 instances of the document.

Note that I have never seen a browser crash from the window other than at startup. I can successfully navigate to the minimised window once download & render are complete.

I suppose one obvious issue is that Firefox is downloading files on startup, when arguably it should only be loading pages. Restored download tabs should really point at the original download; since at present they open to a blank page, a restore that would trigger a download should arguably do nothing instead.
But having said that, this may not be the cause of the crash; it's quite likely to be the size of the document, as it is quite large.

I have had this happen with other PDFs, but the Bluetooth PDF seems to be the only one that causes a startup crash.

I will raise a bug regarding the startup behaviour of download pages (if one doesn't already exist).

Crash Signature: [@ mozilla::extensions::ExtensionsParent::RecvStateChange] [@ mozilla::dom::Promise::AppendNativeHandler] [@ mozilla::loader::ImportModule] → [@ mozilla::extensions::ExtensionsParent::RecvStateChange] [@ mozilla::dom::Promise::AppendNativeHandler] [@ mozilla::loader::ImportModule]
Flags: needinfo?(6jju4k002)

I have tried to repeat this in a smaller profile, and can't.
After trying to reload the culprit tabs, it seems that they're no longer causing problems, even when reloading the pages from the original URL. I suspect that some update corrupted some of the download tabs as stored in the profile, and that they are no longer corrupted.
I would still cast suspicion on "pdf.js", as I was ending up with 10 windows with tabs open to PDFs on startup.

There is related bug for startup re-downloading files here: https://bugzilla.mozilla.org/show_bug.cgi?id=560203

The leave-open keyword is there and there is no activity for 6 months.
:kmag, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(kmaglione+bmo)

What landed was some kind of diagnostic thing. The crash signature is still happening, and the patch wasn't expected to fix it.

Flags: needinfo?(kmaglione+bmo)

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: kmaglione+bmo → nobody
Status: ASSIGNED → NEW

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 content process crashes on beta

:sefeng, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(sefeng)
Keywords: topcrash

The bug is three years old and has been stalled for more than a year. I think it's fine to keep the priority and severity, and it looks like we don't have immediate plans to address this.

Flags: needinfo?(sefeng)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash
Duplicate of this bug: 1888305

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::extensions::ExtensionsParent::RecvStateChange] [@ mozilla::dom::Promise::AppendNativeHandler] [@ mozilla::loader::ImportModule] → [@ mozilla::extensions::ExtensionsParent::RecvStateChange] [@ mozilla::dom::Promise::AppendNativeHandler] [@ mozilla::loader::ImportModule] [@ mozilla::loader::ImportESModule]