Closed Bug 1612799 Opened 4 years ago Closed 2 years ago

Crash in [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | ChunkedJSONWriteFunc::AllocChunk]

Categories

(Core :: Gecko Profiler, defect, P3)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
106 Branch
Tracking Status
firefox-esr91 --- wontfix
firefox-esr102 --- wontfix
firefox104 --- wontfix
firefox105 --- wontfix
firefox106 --- fixed

People

(Reporter: gsvelto, Assigned: mozbugz)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(18 files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

This bug is for crash report bp-2b10eba0-0e04-4b10-9aa9-9c77b0200203.

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 mozglue.dll mozalloc_handle_oom memory/mozalloc/mozalloc_oom.cpp:51
2 mozglue.dll moz_xmalloc memory/mozalloc/mozalloc.cpp:54
3 xul.dll ChunkedJSONWriteFunc::AllocChunk tools/profiler/core/ProfileJSONWriter.cpp:84
4 xul.dll SpliceableChunkedJSONWriter::SpliceableChunkedJSONWriter tools/profiler/public/ProfileJSONWriter.h:115
5 xul.dll profiler_get_profile tools/profiler/core/platform.cpp:3799
6 xul.dll mozilla::ProfilerChild::GrabShutdownProfile tools/profiler/gecko/ProfilerChild.cpp:114
7 xul.dll mozilla::ChildProfilerController::ShutdownProfilerChild tools/profiler/gecko/ChildProfilerController.cpp:93
8 xul.dll mozilla::detail::RunnableMethodImpl<RefPtr<mozilla::AbstractCanonical<RefPtr<AudioDeviceInfo> > >, void  xpcom/threads/nsThreadUtils.h:1215
9 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1220

This is an OOM crash happening in the profiler code when a content process is shutting down. The allocation triggering the crash is rather large (2MiB) which is surprising because it seems that the code is just writing out JSON code.

Maybe this can involve a lot of JSON code? Is it possible to fail gracefully here instead of going OOM?

If this is not actionable please go ahead and close this as WONTFIX.

The Profiler can indeed generate big JSON strings, and at that time they coexist with the big Profiler raw buffer, increasing the odds of OOMs!

But we should definitely be able to handle most OOMs. I'll keep this bug (focusing on JSON generation) as a dependency of bug 1558402.

In the meantime, you may be able to lessen the occurrences by reducing the Profiler buffer size (controlled through the add-on or Nightly's built-in popup), of course at the cost of recording a smaller window of time.

Blocks: 1558402
Priority: -- → P3

Super-silly question, could this JSON be streamed out incrementally without using large buffers? We had a couple more bugs like this one which we fixed by streaming out data to disk but not being familiar with the code here I don't know if it's possible.

That JSON is later sent to the web client js code (from https://profiler.firefox.com) in the same Firefox, so I'm afraid it will stick around in memory anyway -- and it can be multi-MB big!
But there are probably options that could help. As you suggested, we could stream the generated JSON to disk, then clean up the Profiler data, and later stream the JSON out of disk to the Profiler client. There are big plans for the future! 😉

Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | ChunkedJSONWriteFunc::AllocChunk] → [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | ChunkedJSONWriteFunc::AllocChunk] [@ OOM | large | mozalloc_abort | moz_xmalloc | ChunkedJSONWriteFunc::AllocChunk ]
Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | ChunkedJSONWriteFunc::AllocChunk] [@ OOM | large | mozalloc_abort | moz_xmalloc | ChunkedJSONWriteFunc::AllocChunk ] → [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | ChunkedJSONWriteFunc::AllocChunk] [@ OOM | large | mozalloc_abort | moz_xmalloc | ChunkedJSONWriteFunc::AllocChunk ] [@ OOM | large | mozalloc_abort | moz_xmalloc | mozilla::baseprofile…

I crashed on this today (https://crash-stats.mozilla.org/report/index/a742e636-1fe9-4fb0-9706-061800220727). So it looks like we still have ways to crash when capturing big profiles. It's strange that we crashed trying to allocate a 2.1MB block when 1.67GB was free. Maybe memory fragmentation?

In recent bugs I've (hopefully) dealt with situations where we hit an arbitrary Firefox limit (e.g.: string or IPC maximum size), but yes there is still the possibility of actual OOMs, where we just can't allocate system memory as needed.
Regardless of the strangeness of not being able to allocate 2MB out of 1.6GB free, we need to handle these runtime errors.
Dependent bug 1558402 should deal with these in general, but this one here gives a good list of crash sites that we could handle first, and see effects in crash-stats.

(In reply to Florian Quèze [:florian] from comment #4)

I crashed on this today (https://crash-stats.mozilla.org/report/index/a742e636-1fe9-4fb0-9706-061800220727). So it looks like we still have ways to crash when capturing big profiles. It's strange that we crashed trying to allocate a 2.1MB block when 1.67GB was free. Maybe memory fragmentation?

Windows clearly bumped against the page file limit (the last error value is ERROR_COMMITMENT_LIMIT). What might have happened is the following:

  1. We hit the commit-space limit, an allocation returned NULL
  2. We crashed Firefox
  3. Windows automatically enlarged the page file in the meantime
  4. The crash report generator picked up the new, larger page file that resulted in the large amount of free commit space

Bug 1716727 should allow avoiding this type of crash.

The FailureLatch interface, and some implementation helpers and classes, will
be used to record the first failure (if any) during some long process.

Assignee: nobody → mozbugz
Status: NEW → ASSIGNED

This removes some redundant initializers in constructors (present and future).

Also some [[nodiscard]]s were added where appropriate, and empty lines for
clarity.

Depends on D155647

Note that at this point, it's using the FailureLatchInfallibleSource singleton,
so operations are still effectively infallible, i.e. they will terminate the
program.
But users already handle future fallible outcomes.

Depends on D155648

Instead of redundantly explaining each static_cast in different
SpliceableChunkedJSONWriter functions, safe accesses are now isolated to only
two functions, making the public function implementations simpler, including
changes in the following patch.

Depends on D155649

For now, all users specify an infallible latch.

Depends on D155650

When something goes wrong, the most likely cause is running out of memory, so
we clear our data to try and free some memory ASAP, to hopefully reduce the
likelihood of a terminating OOM elsewhere.

Depends on D155652

This will help with forwarding the chosen FailureLatch to the code and
structures gathering per-thread samples and markers.

Depends on D155653

Because JITFrameInfo objects are creating during profiling, they need to have
their own fallible FailureLatchSource, so that any error can be safely handled
even before we know which FailureLatch will be used in the final JSON
generation work.

Depends on D155654

Switch from the FailureLatchInfallibleSource singleton to a local fallible FailureLatchSource.

Depends on D155660

Use a fallible FailureLatchSource, and record the failure message (if any).

Depends on D155661

Use a fallible FailureLatchSource, and record the failure message (if any).

Depends on D155662

There is no easy way to inject failures like OOM during the JSON generation
process, except from markers, which have direct access to the
SpliceableJSONWriter that may now use a fallible FailureLatchSource.

Depends on D155663

Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c769477c263b
FailureLatch - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/cef60c38bc1e
Add default initializers for ChunkedJSONWriteFunc member variables - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/7b8313ab7356
Make ChunkedJSONWriteFunc a FailureLatch, with optionally-fallible operations - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/e97b56983a0e
Refactor SpliceableChunkedJSONWriter's access to its ChunkedJSONWriteFunc - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/9a2ac25ecd49
Make Spliceable{,Chunked}JSONWriter a FailureLatch - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/d53fee4addea
Make UniqueJSONStrings a FailureLatch - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/031562a9d9fe
Make UniqueJSONStrings operations fallible - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/a3e3adcadcd2
Make ProcessStreamingContext a FailureLatch - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/d24fe096ec92
JITFrameInfo carries its own FailureLatchSource - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/57886f5885d4
Make UniqueStacks a FailureLatch - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/65e8fde135ac
Make UniqueStacks operations fallible - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/bc03754da595
Make JITFrameInfo operations fallible - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/5c940a7feae6
Handle failures and exit early in main JSON-generation function - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/0326bea6ec5c
Pass a FailureLatch to EntryGetter, to potentially exit early on failure - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/e0fe782f6ad4
Handle failures in nsProfiler - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/6cf4c6aa5296
Handle failures when generating shutdown profile - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/76697984e917
Handle failures when generating child profile - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/cb7dec6613b9
Add gtest to test failure injected by marker - r=canaltinova
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: