Closed Bug 1786451 Opened 3 years ago Closed 4 months ago

Apply the delay crashing on OOM idea to the JS GC.

Categories

(Core :: JavaScript: GC, enhancement, P2)

x86_64
Windows
enhancement

Tracking

()

RESOLVED FIXED
134 Branch
Tracking Status
firefox134 --- fixed

People

(Reporter: pbone, Assigned: jonco)

References

(Blocks 2 open bugs)

Details

Attachments

(4 files)

In Bug 1716727 gsvelto and rkraesig implemented a delay before crashing from OOM to give windows a chance to find more virtual memory. This seems to be working and since there are VIrtualAlloc calls in JSGC we should probably do the same there.

Alternatively some time ago glandium suggested that we consolidate our calls to VirtualAlloc, especially where we cache memory blocks since JSGC and jemalloc have some similar requirements for these blocks.

More discussion on this topic is happening in bug 1784033 where we proposed to hook VirtualAlloc() (but it might not be a good idea).

See Also: → 1784033
Blocks: GC
Severity: -- → N/A
Priority: -- → P2

(In reply to Paul Bone [:pbone] [PTO until 8/8/22] from comment #0)
Looks like this is very effective at moving crashes from the parent process to content process, but doesn't necessarily reduce the number of overall crashes.

Nevertheless, content process crashes are a better user experience so we should consider doing this.

Assignee: nobody → jcoppeard
Blocks: 1911537

(In reply to Jon Coppeard (:jonco) from comment #2)

Looks like this is very effective at moving crashes from the parent process to content process, but doesn't necessarily reduce the number of overall crashes.

It's worth noting that this has turned out not to be true: it measurably does reduce the number of overall crashes. (Physical memory is effectively a zero-sum game, but delaying often gives Windows a chance to add more virtual memory.)

(In reply to Ray Kraesig [:rkraesig] from comment #3)

It's worth noting that this has turned out not to be true: it measurably does reduce the number of overall crashes.

That's even better then!

This will allow this function to be used by the JS engine too.

We always allocate pages with read/write protection so we can remove this unused parameter.

This uses MozVirtualAlloc in MapAlignedPagesLastDitch which is called when
other allocation methods fail.

Normally we trigger a GC on chunk allocation failure (either last ditch major
GC or minor GC when trying to extend the nursery). We can't recover from
allocation failure during GC however so this adds the stall and retry behaviour
there.

Pushed by jcoppeard@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ba7e6787c58a Part 1: Export MozVirtualAlloc as part of the jemalloc API r=glandium https://hg.mozilla.org/integration/autoland/rev/bcf2f6626b82 Part 2: Remove unused protection parameter to page allocation functions r=sfink https://hg.mozilla.org/integration/autoland/rev/23494a028c18 Part 3: Add a parameter to MapAlignedPages to allow stall and retry on failure r=sfink https://hg.mozilla.org/integration/autoland/rev/6b63840889b6 Part 4: Stall and retry on failure when allocating chunks in GC r=sfink

It would be interesting to measure the impact of this, I'll make a note to compare the OOM crash rate of major releases once this hits the release channel and we have enough data.

(In reply to Gabriele Svelto [:gsvelto] from comment #11)

It would be interesting to measure the impact of this, I'll make a note to compare the OOM crash rate of major releases once this hits the release channel and we have enough data.

Apologies if I'm a bit over eager but is there enough data to see if this has helped a lot or not?

I haven't looked at telemetry in a while, this should be abundantly visible now. NI? myself so I don't forget to check as soon as I have a minute.

Flags: needinfo?(gsvelto)

Looking at legacy telemetry there's been a fairly visible decline in OOM crashes affecting child processes. The decline is not as marked for the main process but given that this should affect JS execution I think it's reasonable that we'd see most of the effect in child processes:

https://sql.telemetry.mozilla.org/queries/105100

Flags: needinfo?(gsvelto)

This per-version query makes the effect more visible:

https://sql.telemetry.mozilla.org/queries/105101

We might check back in a month but it looks like this had a measurable impact.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: