Closed Bug 1589493 Opened 5 years ago Closed 4 years ago

Make it possible for the front-end to determine that a frame crash was caused by an OOM

Categories

(Firefox :: Tabbed Browser, task, P3)

Tracking

RESOLVED FIXED
Firefox 76
Fission Milestone M6
Tracking Status
firefox76 --- fixed

People

(Reporter: gsvelto, Assigned: Yoric)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 1 obsolete file)

Currently in the case of an OOM crash of a content process we'll show users the "This tab crashed" interface and ask them to submit a crash report. In a post-Fission world where OOM crashes are more likely it might be less disruptive to silently reload the crashed tab (or display an infobar mentioning that the tab was reloaded because it consumed too much memory).

I'm filing this bug to investigate what would be required to inform the tab that its content process disappeared due to an OOM.

Priority: -- → P3

Are you talking about the process being killed by the OS (as in Linux oom-killer) or about a content process committing ritual self-immolation because it cannot allocate memory?

For the former, there doesn't seem to be a simple way to detect it:

  • Linux distributions write this in logs, but not all in the same file and not always immediately;
  • I haven't found anything for Windows or macOS yet;
  • neither our IPC code nor Chrome (I've checked in their code) seems to detect this.

For the latter, it might be simpler.

Flags: needinfo?(gsvelto)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #1)

Are you talking about the process being killed by the OS (as in Linux oom-killer) or about a content process committing ritual self-immolation because it cannot allocate memory?

Both. We can always detect OOM crashes on Windows, but only some of the time on Linux/macOS.

For the former, there doesn't seem to be a simple way to detect it:

  • Linux distributions write this in logs, but not all in the same file and not always immediately;
  • I haven't found anything for Windows or macOS yet;
  • neither our IPC code nor Chrome (I've checked in their code) seems to detect this.

For the latter, it might be simpler.

We're going to cheat :-) When doing large allocations - and on Windows when any allocation fails - we set the OOMAllocationSize crash annotation to a value that is different from 0. The idea is to use that to detect OOMs. When we hit a content process crash we generate the minidump here:

https://searchfox.org/mozilla-central/rev/131338e5017bc0283d86fb73844407b9a2155c98/dom/ipc/ContentParent.cpp#1685

There the CrashReporterHost object contains the crash annotations (we set them here) so we could use that to inspect the OOMAllocationSize and use it to inform the front-end that this was an OOM.

We're passing a property bag to the front-end here and we could add an entry to flag OOM crashes.

This will always work on Windows but on macOS and Linux only sometimes. On Linux we can improve the situation but it's more work so I'll leave it for another bug. If you're curious about investigating that too, OOM crashes will often look like this:

https://crash-stats.mozilla.com/report/index/b44f5e06-568c-492b-b2d5-0ad6f0191119

The crash reason is SIGBUS and more often than not the crashing address is a multiple of the page size. That's because we were trying to page in something and the kernel couldn't find a free page so it killed us. SIGBUS can also be raised in other scenarios - but they're rare - and the crash address might not be a multiple of the page-size if the access happened in the middle of an object but IMHO it's a good starting point.
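The heuristic above could be sketched roughly as follows. This is only an illustration in Python, not code from the crash reporter: the function name looks_like_linux_oom, its parameters and the 4 KiB default page size are assumptions made for the example. As noted, a SIGBUS at a page-aligned address is a starting point, not proof of an OOM.

```python
def looks_like_linux_oom(crash_reason: str, crash_address: int,
                         page_size: int = 4096) -> bool:
    """Hypothetical helper: guess whether a Linux crash was caused by an OOM.

    Heuristic from the discussion: the crash reason is SIGBUS and the
    crashing address is a multiple of the page size. It can miss OOMs
    (access in the middle of an object) and can give false positives
    (SIGBUS raised in other scenarios).
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Crash reasons may carry a suffix after the signal name, so only
    # the prefix is compared here (an assumption of this sketch).
    is_sigbus = crash_reason.upper().startswith("SIGBUS")
    is_page_aligned = crash_address % page_size == 0
    return is_sigbus and is_page_aligned
```

For example, looks_like_linux_oom("SIGBUS", 0x7f0000001000) is true, while the same address with SIGSEGV, or SIGBUS at 0x7f0000001008, is not flagged.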

Flags: needinfo?(gsvelto)

and on Windows when any allocation fails

Why only on Windows?

The crash reason is SIGBUS and more often than not the crashing address is a multiple of the page size. That's because we were trying to page in something and the kernel couldn't find a free page so it killed us.

So it's a problem of us not always using fallible allocations? Does this mean that we should patch jemalloc to have a flag "allocation crashed recently"?

We're passing a property bag to the front-end here and we could add an entry to flag OOM crashes.

Well, this certainly simplifies the situation :)

Do we have infrastructure to test OOMs somewhere?

Flags: needinfo?(gsvelto)

Also, what you're describing is case 2 (self-immolation) and case 2bis (we should have self-immolated but we forgot so the OS did it for us), but not case 1 (oom-killer), right?

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #4)

Also, what you're describing is case 2 (self-immolation) and case 2bis (we should have self-immolated but we forgot so the OS did it for us), but not case 1 (oom-killer), right?

On Windows failed allocations always return NULL so we always self-immolate when doing infallible allocations. On Linux most allocations succeed (e.g. mmap() almost never fails) so it's the OOM killer that's killing us most of the time. The core difference between the two is that Linux allows processes to overcommit memory while Windows doesn't. So on Linux an allocation might succeed but then we would crash when touching the allocated memory because the kernel isn't able to back it with actual physical memory. That makes it harder to detect OOMs.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #5)

On Windows failed allocations always return NULL so we always self-immolate when doing infallible allocations.

Ok. Is that the most common situation we need to deal with? Rather than whatever oom-killer-ish technique Windows may be employing when the swap grows too large and it needs to start killing random processes?

That makes it harder to detect OOMs [on Linux and probably macOS].

Oh. I thought that Windows also allowed overcommitting. Good point. I don't see any simple way around this.

Perhaps for macOS and Linux we could somehow detect we're encountering a page fault in a page that we have allocated (and not deallocated) with jemalloc? Surely jemalloc must have some list of the pages it owns. Anyway, that sounds like a followup bug.

(In reply to Gabriele Svelto [:gsvelto] from comment #2)

https://crash-stats.mozilla.com/report/index/b44f5e06-568c-492b-b2d5-0ad6f0191119

The crash reason is SIGBUS and more often than not the crashing address is a multiple of the page size. That's because we were trying to page in something and the kernel couldn't find a free page so it killed us. SIGBUS can also be raised in other scenarios - but they're rare

Getting ENOSPC on a mapped file from /dev/shm (e.g., from glibc's shm_open) isn't especially rare, unfortunately, and it's likely the cause if the crash is in graphics code touching texture memory; see bug 1245239 and connected bugs. That's sort of an OOM, in that it's a resource shortage and not a memory unsafety bug, but it doesn't necessarily mean the entire system is out of memory.

Is there evidence that some of our SIGBUS crashes are from non-file-backed memory?

Oh yes, I forgot about bug 1245239. Many common SIGBUS crashes happen when copying, setting or scanning memory at page-aligned addresses, see these ones for example:

[__memmove_avx_unaligned_erms | mozilla::AudioStream::GetUnprocessed]

Or these:

[__memcpy_sse2_unaligned_erms | mozilla::layers::MappedYCbCrChannelData::CopyInto]

I don't think these involve shared segments.

Either way, the goal here is to make this kind of content process crash almost invisible to the user; crashes caused by shared-memory exhaustion might also fit in this category, WDYT?

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #6)

Ok. Is that the most common situation we need to deal with? Rather than whatever oom-killer-ish technique Windows may be employing when the swap grows too large and it needs to start killing random processes?

Yes, in fact what we consider our OOM crash rate is measured this way. It's basically the number of crashes with the OOMAllocationSize crash annotation set over the total number of crashes.
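As a rough illustration of that measurement, here is a minimal Python sketch. It assumes each crash is represented as a mapping of annotation names to string values; that representation and the function names has_oom_annotation and oom_crash_rate are assumptions for the example, not the actual crash-stats code.

```python
from typing import Iterable, Mapping


def has_oom_annotation(annotations: Mapping[str, str]) -> bool:
    """True if OOMAllocationSize is present and different from 0."""
    raw = annotations.get("OOMAllocationSize")
    if raw is None:
        return False
    try:
        return int(raw) != 0
    except ValueError:
        # Unparseable value: treated as "not set" in this sketch.
        return False


def oom_crash_rate(crashes: Iterable[Mapping[str, str]]) -> float:
    """Crashes with the OOM annotation set, over the total number of crashes."""
    crash_list = list(crashes)
    if not crash_list:
        return 0.0
    oom_count = sum(1 for c in crash_list if has_oom_annotation(c))
    return oom_count / len(crash_list)
```

The same non-zero check is what the parent process would use to flag a single content process crash as a likely OOM.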

Oh. I thought that Windows also allowed overcommitting. Good point. I don't see any simple way around this.

Perhaps for macOS and Linux we could somehow detect we're encountering a page fault in a page that we have allocated (and not deallocated) with jemalloc? Surely jemalloc must have some list of the pages it owns. Anyway, that sounds like a followup bug.

Maybe but it's not urgent either. Windows is by far the biggest source of OOM crashes so if we could implement this within the existing machinery it would be more than sufficient. We can improve macOS & Linux OOM crash detection at a later stage.

I'll start working on it as soon as I have received my Windows machine.

(also, my jemalloc-related idea doesn't work)

Ok, if I read code correctly, ContentCrashHandlers.jsm is in charge of displaying about:tabcrashed whenever a <xul:browser> has crashed. Also, this same module already has access to the nsIPropertyBag in which OOMAllocationSize is stored.

Sounds like I should be able to replace about:tabcrashed with a lazily-loaded <xul:browser> for the current page.

Not sure how I can test it, though.

You can try CrashTestUtils, it has a crash-as-an-oom function. We're using it in xpcshell tests, see this one for example.

We'll use this method to expose additional information to the front-end for recovering from OOM.

As Fission currently causes lots of OOM under Windows, to make life tolerable, we need OOM-crashed tabs to reload automatically. This patch attempts to piggyback upon the existing tabbrowser infrastructure to make this happen.

I'd like the input from someone who knows tabbrowser.discardBrowser better than me, though!

Depends on D54130

(In reply to Gabriele Svelto [:gsvelto] from comment #13)

You can try CrashTestUtils, it has a crash-as-an-oom function. We're using it in xpcshell tests, see this one for example.

Thanks, I'll try this. I need to also find how to test that a tab was lazified.

Ok, I should be able to test with browser.linkedTab.

Gabriele, if we immediately restore the currently focused tab, there's a decent chance that we'd go into some kind of infinite OOM loop. That wouldn't be good. How do you want to handle that? One option is to just expose the crash information and let fx-team handle this side, as they have UX resources at hand.

Flags: needinfo?(gsvelto)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #19)

Gabriele, if we immediately restore the currently focused tab, there's a decent chance that we'd go into some kind of infinite OOM loop.

Good point. We should be able to mostly prevent that by adjusting the memory priority of the other content processes (the ones handling background tabs); I'll file a bug for that. That being said we might still be in a scenario where there's only one tab and it's opening a huge website so we keep reloading and crashing. In fact I'm not even sure if reloading the currently focused tab is a good idea; background tabs can certainly be reloaded with minimal disruption but for the foreground one we might want to pay attention.
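To make the trade-off concrete, here is one possible shape for such a policy, sketched in Python. Everything in it is hypothetical: the class OomReloadPolicy, the max_reloads threshold and the returned action names are invented for illustration, and the real behaviour is left to the front-end.

```python
class OomReloadPolicy:
    """Hypothetical policy for reacting to a crashed tab.

    Background tabs that died from a likely OOM are reloaded lazily;
    the foreground tab is reloaded at most max_reloads times per URL
    to avoid an infinite crash-and-reload loop.
    """

    def __init__(self, max_reloads: int = 1) -> None:
        self.max_reloads = max_reloads
        self._reloads = {}  # URL -> number of automatic reloads so far

    def decide(self, url: str, is_likely_oom: bool, is_foreground: bool) -> str:
        if not is_likely_oom:
            # Not an OOM: keep the existing "This tab crashed" interface.
            return "show-crash-page"
        if not is_foreground:
            # Background tabs can be restored on demand with minimal disruption.
            return "lazy-reload"
        count = self._reloads.get(url, 0)
        if count >= self.max_reloads:
            # Already reloaded this page; stop to avoid looping.
            return "show-crash-page"
        self._reloads[url] = count + 1
        return "reload"
```

With the default threshold, a foreground tab that crashes twice on the same URL is reloaded once and then falls back to the crash page.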

That wouldn't be good. How do you want to handle that? One option is to just expose the crash information and let fx-team handle this side, as they have UX resources at hand.

+1 to hand this over to the fx-team; they can certainly find a better solution given the necessary tools.

Flags: needinfo?(gsvelto)

CC'ing :mconley because he might want to know about this.

Putting on the radar for Fission front-end work.

We'll probably want to get some UX help to figure how to communicate the various failure cases to the user.

So, what I'll do in this bug is send a new observable event with the id of the crashing tab so that the front-end can decide whether to display a message, add a lazy reload or reload immediately.

:mconley, any favored name for this observable?

Flags: needinfo?(mconley)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #23)

So, what I'll do in this bug is send a new observable event with the id of the crashing tab so that the front-end can decide whether to display a message, add a lazy reload or reload immediately.

Hm, what ID do you mean?

:mconley, any favored name for this observable?

We already have several signals for when an oop frame crashes, for example:

  1. oop-frameloader-crashed observer notification
  2. oop-browser-crashed event
  3. ipc:content-shutdown observer notification

I'm reluctant to add another unless we really need it. Can we piggyback off one of these instead?

Flags: needinfo?(mconley)

Adding a field to ipc:content-shutdown saying that the crash was an OOM should be trivial.

Summary: Investigate if it's possible to automatically reload tabs that died due to OOM crashes → Make it possible for the front-end to determine that a frame crash was caused by an OOM

(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #24)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #23)

So, what I'll do in this bug is send a new observable event with the id of the crashing tab so that the front-end can decide whether to display a message, add a lazy reload or reload immediately.

Hm, what ID do you mean?

I meant the tab id.

:mconley, any favored name for this observable?

We already have several signals for when an oop frame crashes, for example:

  1. oop-frameloader-crashed observer notification
  2. oop-browser-crashed event
  3. ipc:content-shutdown observer notification

I'm reluctant to add another unless we really need it. Can we piggyback off one of these instead?

Ok, I've rethought it a bit. It's now field isLikelyOOM of the subject of notification ipc:content-shutdown.

BrowserTestUtils.crashFrame now accepts additional options, including a crashType argument that may be "CRASH_OOM" or "CRASH_INVALID_POINTER_DEREF"|null to specify the nature of the crash. The names are taken from CrashTestUtils.jsm, but that module cannot be imported directly because it has non-trivial binary dependencies.

Depends on D54130

Assignee: nobody → dteller
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/421b8d600806
Expose CrashReporterHost::isLikelyOOM();r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/cac468582924
Expose isLikelyOOM to Content crash handlers;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/d2ed39839f83
Extending BrowserTestUtils.crashFrame to allow crashing with an OOM;r=mconley
https://hg.mozilla.org/integration/autoland/rev/9b97128e83d8
Testing ipc:content-shutdown's support for isLikelyOOM;r=gsvelto

Backed out 4 changesets (bug 1589493) for chrome failures at dom/ipc/tests/test_process_error_oom.xul

Backout: https://hg.mozilla.org/integration/autoland/rev/ab306a7d27cd4fe35f985d089a0bd35e18faa26c

Failure push: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=9b97128e83d8f0b2a1e586b9de280e57a55a549d

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=278639151&repo=autoland&lineNumber=3670

[task 2019-11-28T15:34:54.902Z] 15:34:54 INFO - TEST-START | dom/ipc/tests/test_process_error_oom.xul
[task 2019-11-28T15:34:54.909Z] 15:34:54 INFO - GECKO(520) | ++DOMWINDOW == 43 (188E0800) [pid = 4000] [serial = 43] [outer = 275135E0]
[task 2019-11-28T15:34:54.950Z] 15:34:54 INFO - GECKO(520) | ++DOCSHELL 19C90000 == 11 [pid = 4000] [id = {d6ca31f0-ed4a-49ca-a2d9-50f70f0d8a9c}]
[task 2019-11-28T15:34:54.950Z] 15:34:54 INFO - GECKO(520) | ++DOMWINDOW == 44 (24E6D700) [pid = 4000] [serial = 44] [outer = 00000000]
[task 2019-11-28T15:34:54.950Z] 15:34:54 INFO - GECKO(520) | ++DOMWINDOW == 45 (19C91C00) [pid = 4000] [serial = 45] [outer = 24E6D700]
[task 2019-11-28T15:34:54.965Z] 15:34:54 INFO - GECKO(520) | Chrome file doesn't exist: Z:\task_1574954351\build\tests\mochitest\chrome\dom\ipc\tests\process_error.xul
[task 2019-11-28T15:34:55.020Z] 15:34:55 INFO - GECKO(520) | ++DOMWINDOW == 46 (1E55F400) [pid = 4000] [serial = 46] [outer = 24E6D700]
[task 2019-11-28T15:34:55.075Z] 15:34:55 INFO - GECKO(520) | [Parent 4000, Main Thread] WARNING: '!tsi', file z:/build/build/src/dom/base/Document.cpp, line 1426
[task 2019-11-28T15:34:55.075Z] 15:34:55 INFO - GECKO(520) | [Parent 4000, Main Thread] WARNING: Not same origin error!: file z:/build/build/src/dom/base/nsJSEnvironment.cpp, line 523
[task 2019-11-28T15:34:55.075Z] 15:34:55 INFO - GECKO(520) | JavaScript error: chrome://browser/content/aboutNetError.js, line 270: ReferenceError: RPMGetFormatURLPref is not defined
[task 2019-11-28T15:34:55.730Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 45 (2735FC00) [pid = 4000] [serial = 38] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.730Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 44 (27580000) [pid = 4000] [serial = 37] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.730Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 43 (1EA77800) [pid = 4000] [serial = 16] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.730Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 42 (1EA76000) [pid = 4000] [serial = 15] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.730Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 41 (1EA78C00) [pid = 4000] [serial = 17] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.731Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 40 (20805400) [pid = 4000] [serial = 28] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.731Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 39 (2001F000) [pid = 4000] [serial = 26] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.731Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 38 (1E2E1800) [pid = 4000] [serial = 9] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.731Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 37 (1E2DF000) [pid = 4000] [serial = 7] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.738Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 36 (1EA7A000) [pid = 4000] [serial = 18] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:55.738Z] 15:34:55 INFO - GECKO(520) | --DOMWINDOW == 35 (24CC1C00) [pid = 4000] [serial = 32] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:34:56.113Z] 15:34:56 INFO - GECKO(520) | --DOMWINDOW == 34 (0BFD03A0) [pid = 4000] [serial = 1] [outer = 00000000] [url = chrome://gfxsanity/content/sanityparent.html]
[task 2019-11-28T15:34:56.113Z] 15:34:56 INFO - GECKO(520) | --DOMWINDOW == 33 (0BFD0940) [pid = 4000] [serial = 8] [outer = 00000000] [url = chrome://gfxsanity/content/sanitytest.html]
[task 2019-11-28T15:34:56.113Z] 15:34:56 INFO - GECKO(520) | --DOMWINDOW == 32 (1EA9A5E0) [pid = 4000] [serial = 14] [outer = 00000000] [url = moz-extension://75c27674-9b83-422f-bafc-a70ea4a39ceb/_generated_background_page.html]
[task 2019-11-28T15:35:00.052Z] 15:35:00 INFO - GECKO(520) | --DOMWINDOW == 31 (1E2E2C00) [pid = 4000] [serial = 29] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:35:00.052Z] 15:35:00 INFO - GECKO(520) | --DOMWINDOW == 30 (1F0F2400) [pid = 4000] [serial = 24] [outer = 00000000] [url = moz-extension://75c27674-9b83-422f-bafc-a70ea4a39ceb/_generated_background_page.html]
[task 2019-11-28T15:35:00.052Z] 15:35:00 INFO - GECKO(520) | --DOMWINDOW == 29 (19C92400) [pid = 4000] [serial = 2] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:35:00.052Z] 15:35:00 INFO - GECKO(520) | --DOMWINDOW == 28 (1ED85800) [pid = 4000] [serial = 20] [outer = 00000000] [url = chrome://gfxsanity/content/sanitytest.html]
[task 2019-11-28T15:35:04.332Z] 15:35:04 INFO - GECKO(520) | --DOMWINDOW == 27 (19C91C00) [pid = 4000] [serial = 45] [outer = 00000000] [url = about:blank]
[task 2019-11-28T15:35:04.332Z] 15:35:04 INFO - GECKO(520) | --DOMWINDOW == 26 (188D9C00) [pid = 4000] [serial = 42] [outer = 00000000] [url = chrome://mochikit/content/tests/SimpleTest/iframe-between-tests.html]
[task 2019-11-28T15:36:49.336Z] 15:36:49 INFO - GECKO(520) | [Parent 4000, Jump List] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/widget/windows/WinUtils.cpp, line 1346
[task 2019-11-28T15:38:49.328Z] 15:38:49 INFO - GECKO(520) | [Parent 4000, Jump List] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/widget/windows/WinUtils.cpp, line 1346
[task 2019-11-28T15:40:21.211Z] 15:40:21 INFO - TEST-INFO | started process screenshot
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - TEST-INFO | screenshot: exit 0
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - TEST-UNEXPECTED-FAIL | dom/ipc/tests/test_process_error_oom.xul | Test timed out.
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - SimpleTest.ok@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:277:18
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - reportError@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:121:22
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:142:18
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:170:15
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:170:15
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:170:15
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:170:15
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:170:15
[task 2019-11-28T15:40:21.278Z] 15:40:21 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:170:15
[task 2019-11-28T15:40:21.279Z] 15:40:21 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:170:15
[task 2019-11-28T15:40:21.279Z] 15:40:21 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:170:15

Flags: needinfo?(dteller)

We've been unlucky, bug 1595908 converted the .xul tests under dom/ipc to .xhtml before we landed this so the new test (which is still .xul) started failing once landed.

Ah, I was wondering what went wrong. I'll try and update this today.

Flags: needinfo?(dteller)
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/04adc4c18424
Expose CrashReporterHost::isLikelyOOM();r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/94aa25f22d44
Expose isLikelyOOM to Content crash handlers;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/0e09d02e484a
Extending BrowserTestUtils.crashFrame to allow crashing with an OOM;r=mconley
https://hg.mozilla.org/integration/autoland/rev/fb609feb845a
Testing ipc:content-shutdown's support for isLikelyOOM;r=gsvelto
Backout by dvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5fa805b486fb
Backed out 4 changesets for linting failure at builds/worker/checkouts/gecko/dom/ipc/tests/test_process_error_oom.xhtml:12:11. On a CLOSED TREE

Not sure the GeckoView test is related. As far as I can tell, my code is entirely dead in GeckoView (unless we stumble upon an OOM, but that's another problem).

The failure seems to show up only under Windows, so I'll probably wait until I have a Windows machine to reproduce it. It was ordered 2 weeks ago, so it should arrive eventually :)

Flags: needinfo?(dteller)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #36)

Not sure the GeckoView test is related. As far as I can tell, my code is entirely dead in GeckoView (unless we stumble upon an OOM, but that's another problem).

The failure seems to show up only under Windows, so I'll probably wait until I have a Windows machine to reproduce it. It was ordered 2 weeks ago, so it should arrive eventually :)

Can you provide a status update? I opened a feature request a few days ago with a similar goal (https://bugzilla.mozilla.org/show_bug.cgi?id=1611631). We use Firefox as part of a digital signage solution and are hit with "Gah. Your tab just crashed." due to an OOM killing the content process (probably caused by some leaky JavaScript).

At the moment there seems to be no way for us to detect a crashed content process, so some sort of built-in solution would be very useful.

Tentatively tracking for Fission Nightly (M6)

Fission Milestone: --- → M6

(In reply to Kristian Klausen from comment #37)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #36)

Not sure the GeckoView test is related. As far as I can tell, my code is entirely dead in GeckoView (unless we stumble upon an OOM, but that's another problem).

The failure seems to show up only under Windows, so I'll probably wait until I have a Windows machine to reproduce it. It was ordered 2 weeks ago, so it should arrive eventually :)

Can you provide a status update? I opened a feature request a few days ago with a similar goal (https://bugzilla.mozilla.org/show_bug.cgi?id=1611631). We use Firefox as part of a digital signage solution and are hit with "Gah. Your tab just crashed." due to an OOM killing the content process (probably caused by some leaky JavaScript).

At the moment there seems to be no way for us to detect a crashed content process, so some sort of built-in solution would be very useful.

The machine finally arrived. I'll try and get on it next week.

Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c1bc7695e720
Expose CrashReporterHost::isLikelyOOM();r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/3f7c15d29416
Expose isLikelyOOM to Content crash handlers;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/900ec6b447c9
Extending BrowserTestUtils.crashFrame to allow crashing with an OOM;r=mconley
https://hg.mozilla.org/integration/autoland/rev/9dbe0bdd321b
Testing ipc:content-shutdown's support for isLikelyOOM;r=gsvelto
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e8e7bc5c8a00
Expose CrashReporterHost::isLikelyOOM();r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/23601d10e69d
Expose isLikelyOOM to Content crash handlers;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/275b66a8c77a
Extending BrowserTestUtils.crashFrame to allow crashing with an OOM;r=mconley
https://hg.mozilla.org/integration/autoland/rev/59fc685edca2
Testing ipc:content-shutdown's support for isLikelyOOM;r=gsvelto
Flags: needinfo?(dteller)
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f0ab64c7765b
Expose CrashReporterHost::isLikelyOOM();r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/4c242e542545
Expose isLikelyOOM to Content crash handlers;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/3621ad792a38
Extending BrowserTestUtils.crashFrame to allow crashing with an OOM;r=mconley
https://hg.mozilla.org/integration/autoland/rev/46183b72cf37
Testing ipc:content-shutdown's support for isLikelyOOM;r=gsvelto

Backed out 4 changesets (Bug 1589493) for causing mochitest failure at dom/ipc/tests/test_process_error_oom.xhtml

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=290735880&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=46183b72cf37d3851633da983f74d37f7199d8d6

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=290735880&repo=autoland&lineNumber=2646

Backout link: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=90d3b18b906e1dcb973d6d021e7a6d58430238a1

[task 2020-02-27T10:27:42.315Z] 10:27:42     INFO - TEST-INFO | screentopng: exit 0
[task 2020-02-27T10:27:42.316Z] 10:27:42     INFO - Buffered messages logged at 10:22:43
[task 2020-02-27T10:27:42.317Z] 10:27:42     INFO - TEST-PASS | dom/ipc/tests/test_process_error.xhtml | Expected the right browsing context id on the oop-browser-crashed event. 
[task 2020-02-27T10:27:42.318Z] 10:27:42     INFO - TEST-PASS | dom/ipc/tests/test_process_error.xhtml | Received correct observer topic. 
[task 2020-02-27T10:27:42.319Z] 10:27:42     INFO - TEST-PASS | dom/ipc/tests/test_process_error.xhtml | Subject implements nsIPropertyBag2. 
[task 2020-02-27T10:27:42.320Z] 10:27:42     INFO - TEST-PASS | dom/ipc/tests/test_process_error.xhtml | dumpID is present and not an empty string 
[task 2020-02-27T10:27:42.320Z] 10:27:42     INFO - Buffered messages finished
[task 2020-02-27T10:27:42.321Z] 10:27:42     INFO - TEST-UNEXPECTED-FAIL | dom/ipc/tests/test_process_error.xhtml | Test timed out. 
[task 2020-02-27T10:27:42.321Z] 10:27:42     INFO - SimpleTest.ok@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:299:16
[task 2020-02-27T10:27:42.321Z] 10:27:42     INFO - reportError@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:128:22
[task 2020-02-27T10:27:42.321Z] 10:27:42     INFO - TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:150:18
[task 2020-02-27T10:27:42.321Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.321Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.322Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.322Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.322Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.322Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.322Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - TestRunner.runTests/<@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:420:16
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - promise callback*TestRunner.runTests@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:407:48
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - RunSet.runtests@chrome://mochikit/content/tests/SimpleTest/setup.js:218:14
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - RunSet.runall@chrome://mochikit/content/tests/SimpleTest/setup.js:197:12
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - hookupTests@chrome://mochikit/content/tests/SimpleTest/setup.js:294:12
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - parseTestManifest@chrome://mochikit/content/manifestLibrary.js:50:13
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - getTestManifest/req.onload@chrome://mochikit/content/manifestLibrary.js:61:28
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - EventHandlerNonNull*getTestManifest@chrome://mochikit/content/manifestLibrary.js:57:3
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - hookup@chrome://mochikit/content/tests/SimpleTest/setup.js:270:20
[task 2020-02-27T10:27:42.323Z] 10:27:42     INFO - linkAndHookup@chrome://mochikit/content/harness.xhtml:45:3
[task 2020-02-27T10:27:42.324Z] 10:27:42     INFO - parseTestManifest@chrome://mochikit/content/manifestLibrary.js:50:13
[task 2020-02-27T10:27:42.324Z] 10:27:42     INFO - getTestManifest/req.onload@chrome://mochikit/content/manifestLibrary.js:61:28
[task 2020-02-27T10:27:42.324Z] 10:27:42     INFO - EventHandlerNonNull*getTestManifest@chrome://mochikit/content/manifestLibrary.js:57:3
[task 2020-02-27T10:27:42.324Z] 10:27:42     INFO - getTestList@chrome://mochikit/content/chrome-harness.js:258:18
[task 2020-02-27T10:27:42.325Z] 10:27:42     INFO - loadTests@chrome://mochikit/content/harness.xhtml:24:14
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/87c463f53ea9
Expose CrashReporterHost::isLikelyOOM();r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/9f357dded30f
Expose isLikelyOOM to Content crash handlers;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/aed44db455c4
Extending BrowserTestUtils.crashFrame to allow crashing with an OOM;r=mconley
https://hg.mozilla.org/integration/autoland/rev/5b1b813bbdab
Testing ipc:content-shutdown's support for isLikelyOOM;r=gsvelto
Flags: needinfo?(dteller)

Backed out 4 changesets (Bug 1589493) for causing mochitest failures at dom/ipc/tests/test_process_error.xhtml

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=290764426&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=5b1b813bbdab0d4e9f2f52e3c4441c77d41f6a16

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=290766439&repo=autoland&lineNumber=2643

Backout link: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=b72613b5bd7cd334a8fa06bd33debfdfc4346614

[task 2020-02-27T14:07:24.814Z] 14:07:24     INFO - TEST-PASS | dom/ipc/tests/test_process_error.xhtml | dumpID is present and not an empty string 
[task 2020-02-27T14:07:24.815Z] 14:07:24     INFO - Buffered messages finished
[task 2020-02-27T14:07:24.815Z] 14:07:24     INFO - TEST-UNEXPECTED-FAIL | dom/ipc/tests/test_process_error.xhtml | Test timed out. 
[task 2020-02-27T14:07:24.815Z] 14:07:24     INFO - SimpleTest.ok@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:299:16
[task 2020-02-27T14:07:24.815Z] 14:07:24     INFO - reportError@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:128:22
[task 2020-02-27T14:07:24.815Z] 14:07:24     INFO - TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:150:18
[task 2020-02-27T14:07:24.816Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.816Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.816Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.816Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.816Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.816Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.816Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.816Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.817Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.817Z] 14:07:24     INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:184:15
[task 2020-02-27T14:07:24.817Z] 14:07:24     INFO - TestRunner.runTests/<@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:420:16
[task 2020-02-27T14:07:24.817Z] 14:07:24     INFO - promise callback*TestRunner.runTests@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:407:48
[task 2020-02-27T14:07:24.817Z] 14:07:24     INFO - RunSet.runtests@chrome://mochikit/content/tests/SimpleTest/setup.js:218:14
[task 2020-02-27T14:07:24.817Z] 14:07:24     INFO - RunSet.runall@chrome://mochikit/content/tests/SimpleTest/setup.js:197:12
[task 2020-02-27T14:07:24.817Z] 14:07:24     INFO - hookupTests@chrome://mochikit/content/tests/SimpleTest/setup.js:294:12
[task 2020-02-27T14:07:24.817Z] 14:07:24     INFO - parseTestManifest@chrome://mochikit/content/manifestLibrary.js:50:13
[task 2020-02-27T14:07:24.817Z] 14:07:24     INFO - getTestManifest/req.onload@chrome://mochikit/content/manifestLibrary.js:61:28
[task 2020-02-27T14:07:24.818Z] 14:07:24     INFO - EventHandlerNonNull*getTestManifest@chrome://mochikit/content/manifestLibrary.js:57:3
[task 2020-02-27T14:07:24.818Z] 14:07:24     INFO - hookup@chrome://mochikit/content/tests/SimpleTest/setup.js:270:20
[task 2020-02-27T14:07:24.818Z] 14:07:24     INFO - linkAndHookup@chrome://mochikit/content/harness.xhtml:45:3
[task 2020-02-27T14:07:24.818Z] 14:07:24     INFO - parseTestManifest@chrome://mochikit/content/manifestLibrary.js:50:13
[task 2020-02-27T14:07:24.818Z] 14:07:24     INFO - getTestManifest/req.onload@chrome://mochikit/content/manifestLibrary.js:61:28
[task 2020-02-27T14:07:24.818Z] 14:07:24     INFO - EventHandlerNonNull*getTestManifest@chrome://mochikit/content/manifestLibrary.js:57:3
[task 2020-02-27T14:07:24.818Z] 14:07:24     INFO - getTestList@chrome://mochikit/content/chrome-harness.js:258:18
[task 2020-02-27T14:07:24.818Z] 14:07:24     INFO - loadTests@chrome://mochikit/content/harness.xhtml:24:14
[task 2020-02-27T14:07:24.818Z] 14:07:24     INFO - EventListener.handleEvent*@chrome://mochikit/content/harness.xhtml:48:12
[task 2020-02-27T14:07:25.677Z] 14:07:25     INFO - GECKO(2732) | MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
[task 2020-02-27T14:07:25.677Z] 14:07:25     INFO - GECKO(2732) | MEMORY STAT | vsize 3028MB | residentFast 343MB | heapAllocated 116MB
Flags: needinfo?(dteller)
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/716e6cd5245f
Expose CrashReporterHost::isLikelyOOM();r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/8e5458038d38
Expose isLikelyOOM to Content crash handlers;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/4c3d736d0259
Extending BrowserTestUtils.crashFrame to allow crashing with an OOM;r=mconley
https://hg.mozilla.org/integration/autoland/rev/6ab5b9391f95
Testing ipc:content-shutdown's support for isLikelyOOM;r=gsvelto

Hmm, I can't reproduce the issue either locally or on try.

Flags: needinfo?(dteller)
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/47e745b389d0
Expose CrashReporterHost::isLikelyOOM();r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/416b95911b6d
Expose isLikelyOOM to Content crash handlers;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/2c23fed6b2ff
Extending BrowserTestUtils.crashFrame to allow crashing with an OOM;r=mconley
https://hg.mozilla.org/integration/autoland/rev/d914e968de2c
Testing ipc:content-shutdown's support for isLikelyOOM;r=gsvelto

Ok, I finally know why the tests don't always pass.

In some configurations, moz_xmalloc isn't a public symbol. I'll try to find a way around this.

Flags: needinfo?(dteller)
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/73e3711e7849
Expose CrashReporterHost::isLikelyOOM();r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/3e2d218c4f0d
Expose isLikelyOOM to Content crash handlers;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/5afbdf2538dc
Extending BrowserTestUtils.crashFrame to allow crashing with an OOM;r=mconley,froydnj,dmajor
https://hg.mozilla.org/integration/autoland/rev/6a351aef2167
Testing ipc:content-shutdown's support for isLikelyOOM;r=gsvelto
See Also: → 1648953
Blocks: 1648953
See Also: 1648953
Attachment #9110531 - Attachment is obsolete: true