Open Bug 1584266 Opened 3 months ago Updated 18 hours ago

Investigate increase in content crash OOM spike since 2019-09-24

Categories

(Core :: General, defect, P1, critical)

69 Branch
All
Windows
defect

Tracking Status
firefox69 - wontfix
firefox70 + wontfix
firefox71 --- affected
firefox72 --- affected

People

(Reporter: philipp, Unassigned, NeedInfo)

Details

(4 keywords)

Attachments

(3 files)

This bug is for crash report bp-04a06fb1-bd02-4fd7-858a-d327f0190923.

Top 10 frames of crashing thread:

0 xul.dll js::AutoEnterOOMUnsafeRegion::crash js/src/vm/JSContext.cpp:1480
1 xul.dll js::AutoEnterOOMUnsafeRegion::crash js/src/vm/JSContext.cpp:1493
2 xul.dll js::TenuringTracer::traverse<JSObject> js/src/gc/Marking.cpp:2764
3 xul.dll class mozilla::Maybe<JS::Value> js::MapGCThingTyped<`lambda at z:/task_1568726031/build/src/js/src/gc/Marking.cpp:2780:43'> js/public/Value.h:1311
4 xul.dll js::gc::StoreBuffer::SlotsEdge::trace js/src/gc/Marking.cpp:2850
5 xul.dll js::gc::StoreBuffer::MonoTypeBuffer<js::gc::StoreBuffer::SlotsEdge>::trace js/src/gc/Marking.cpp:2799
6 xul.dll js::Nursery::doCollection js/src/gc/Nursery.cpp:932
7 xul.dll js::Nursery::collect js/src/gc/Nursery.cpp:839
8 xul.dll js::gc::GCRuntime::minorGC js/src/gc/GC.cpp:8085
9 xul.dll js::gc::GCRuntime::gcCycle js/src/gc/GC.cpp:7649

these out-of-memory tab crash signatures in firefox spiked up across channels (release, beta) on september 24. there are no clear correlations, but looking at it in more detail it looks like the increase can mostly be tied to reports containing facebook.com as the crashing url.

this table is a breakdown of how crash reports containing a url have developed over the past couple of days:

date                       facebook  non-facebook  % facebook
2019-09-18                       36            46        43.9
2019-09-19                       33            50        39.8
2019-09-20                       42            61        40.8
2019-09-22                       31            45        40.8
2019-09-23                       39            51        43.3
2019-09-24                      106            58        64.6
2019-09-25                      141            59        70.5
2019-09-26 (not full day)        84            32        72.4
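
For anyone who wants to re-check the percentages, here is a minimal Python sketch that recomputes the facebook share from the counts in the table and pools the days before and from 2019-09-24. The counts are copied from the table; the names `ROWS`, `SPIKE_START`, `facebook_share` and `pooled_share` are made up for illustration and are not part of any crash-stats tooling.

```python
# Counts copied from the table above: (date, facebook, non_facebook).
ROWS = [
    ("2019-09-18", 36, 46),
    ("2019-09-19", 33, 50),
    ("2019-09-20", 42, 61),
    ("2019-09-22", 31, 45),
    ("2019-09-23", 39, 51),
    ("2019-09-24", 106, 58),
    ("2019-09-25", 141, 59),
    ("2019-09-26", 84, 32),  # not a full day
]

# Assumption: the spike is taken to start on the date named in the bug summary.
SPIKE_START = "2019-09-24"


def facebook_share(facebook, non_facebook):
    """Percentage of URL-bearing reports whose URL is facebook.com."""
    return 100.0 * facebook / (facebook + non_facebook)


def pooled_share(rows):
    """Facebook share over several days, pooling the raw counts."""
    fb = sum(row[1] for row in rows)
    other = sum(row[2] for row in rows)
    return facebook_share(fb, other)


# ISO dates sort correctly as plain strings.
before = [row for row in ROWS if row[0] < SPIKE_START]
after = [row for row in ROWS if row[0] >= SPIKE_START]

for date, fb, other in ROWS:
    print(f"{date}  {facebook_share(fb, other):5.1f}%")
print(f"before spike: {pooled_share(before):.1f}%")
print(f"from spike:   {pooled_share(after):.1f}%")
```

Pooling the raw counts instead of averaging the daily percentages keeps the partial day from being over-weighted.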

maybe we can reach out to contacts at facebook to inquire if something changed in that particular time-frame that may have affected memory usage?

[Tracking Requested - why for this release]:

Priority: -- → P1

For both signatures in this bug, well over 70% of the crashes come from users running Windows 7. I reached out to the Facebook list today to get an answer to Philipp's question.

Peter is following up in a meeting with FB today.

See Also: → 1584232

Not a blocker for the 70 release, since the crash volume is also bad in 69. I'll continue to track during the 70 release to see if the volume changes.

Emailed mozilla-fb-discuss a second time and am also reaching out through other folks.

Facebook folks are looking (Hi Vladan!)

Andrew, Nika, jonco, any luck investigating this crash spike that hit us in the 69 timeframe? It is very high volume in 69 and is likely going to be about the same in 70 release.

Flags: needinfo?(overholt)
Flags: needinfo?(nika)
Flags: needinfo?(jcoppeard)

I also asked FB if they had data on Large-Allocation usage and if that's failing I guess it could be related to this OOM spike (Nika?).

Flags: needinfo?(overholt)

There are two things here: one is a general increase in this kind of OOM that happened in 69, and the other is the recent spike.

The original increase seems to have occurred in beta release 69.0b3 (I can't tell when it hit nightly).

The recent spike is more concerning though. As noted it affects both beta and release, appearing in 69.0.1 and 70.0b8.

Pushes for these releases are:

https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=2771f6fe489942ec0091773e98ff00f0409f876e&tochange=45c0e8a9df93f545bfacf07e5a78bd69559c6adf

https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=cce4622026ab8e0130a0afc03f829f9b19ca38c2&tochange=bf6ea738ba073f1a70554799a749235136afc93a

I can't see any changes common to both versions.

GC telemetry for beta shows a significant increase in nursery size from the 19th September:

https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=Median!Mean!5th%2520percentile!25th%2520percentile!75th%2520percentile!95th%2520percentile&cumulative=0&end_date=2019-10-10&include_spill=0&keys=!__none__!__none__&max_channel_version=beta%252F70&measure=GC_NURSERY_BYTES_2&min_channel_version=beta%252F70&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2019-09-02&trim=1&use_submission_date=0

This indicates that nursery allocated things are living longer in general and doesn't necessarily indicate that this is a GC issue. There are no GC changes (or any JS engine changes I could see) in the above pushlogs.

One thing that was slightly suspicious in beta is bug 1575216 since it concerns low memory detection and (I think) is Windows only. But this is not present on release yet so couldn't be causing problems there.

Flags: needinfo?(jcoppeard)

(In reply to Andrew Overholt [:overholt] from comment #8)

> I also asked FB if they had data on Large-Allocation usage and if that's failing I guess it could be related to this OOM spike (Nika?).

Might be related? If we're mostly running into this issue on 32-bit Windows, then I could see a connection, as we don't use Large-Allocation anywhere outside of 32-bit Windows.

Have we also been seeing a spike in crashes on non-win32 platforms?

Flags: needinfo?(nika)

no, there was no change in the crash pattern of 64bit firefox versions around 2019-09-24 for those 2 signatures.

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

I hit an OOM on 32-bit Windows with one Facebook app, on both Firefox 59 and Firefox 69.
Bingo Blitz: https://apps.facebook.com/108854979142742

Here is the crash report of Firefox 59.
https://crash-stats.mozilla.org/report/index/9179992c-26ac-4c8d-bfa3-8609b0191016
Here is another OOM on Firefox 69.
https://crash-stats.mozilla.org/report/index/3b59f7d0-c5ae-426c-aaee-cb4fb0191016

I also see "uncaught exception: out of memory" in the web console when running the app.
However, I don't see a Large-Allocation response header in the HTTP log or in the web developer tools.

Thanks, Alphan. Is Bingo Blitz a wasm or asm.js game?

Flags: needinfo?(alchen)

(In reply to Andrew Overholt [:overholt] from comment #14)

> Thanks, Alphan. Is Bingo Blitz a wasm or asm.js game?

Actually, I don't know.
I found this game in the TestRail test suites for the Large-Allocation header.

FB games
https://testrail.stage.mozaws.net/index.php?/cases/view/26033
other web games
https://testrail.stage.mozaws.net/index.php?/cases/view/26104

Flags: needinfo?(alchen)

After setting the pref "dom.largeAllocation.testing.allHttpLoads" to true, I saw the following message in the web console when running Bingo Blitz.

(indexCanvas)
A Large-Allocation header was ignored due to the presence of windows which have a reference to this browsing context through the frame hierarchy or window.opener.

I will update the test again with "dom.largeAllocation.testing.allHttpLoads = true".

The Ecosystem QA folks were asked to help out. I am using an October Win10 32-bit VM in VMware Fusion on my Mac. The link to the custom Firefox build I was asked to try gives me a 404:

https://archive.mozilla.org/pub/firefox/try-builds/michael@thelayzells.com-cd987df02260607e8ecefcefd9ad997a510cf218/try-win32/

Using a recent build of Firefox 69.0.3 (32-bit) I visited the following web sites:

http://www.flashgames247.com/
http://www.miniclip.com/games/8-ball-pool-multiplayer/en/#t-w-c-H
http://ro.y8.com/games/football_legends_2016

I did not see any errors related to Large Allocation headers on any of those sites.

I think the Large-Allocation thing was a red herring :/. Thank you to Alphan and Chris for trying to get a regression range.

Given the increase in facebook.com URLs in comment 0, I'm thinking this might be a change on Facebook's side. I looked at awsy tp6 data [1] and there doesn't appear to be a corresponding increase around September 24-26.

Rob, how hard is it to get a new snapshot of Facebook (for tp6) so we can compare new and old browser builds against new and old Facebook snapshots?

[1]
https://mzl.la/2MoYvLd

Flags: needinfo?(rwood)
Attached image aboutmemory.PNG

Liz, can we ask someone who's submitted a recent report here and given their email address for an anonymized about:memory report?

Flags: needinfo?(lhenry)

Pascal, or Marcia, can I pass that on to you while I get 70 out the door? ty!!!

Flags: needinfo?(pascalc)
Flags: needinfo?(mozillamarcia.knous)
Flags: needinfo?(lhenry)

It would have to be after memory usage has increased but before a crash.

Crash reports can contain anonymized memory reports.

So the first thing to do would be to check for those. I don't remember where to get them, unfortunately.

they're in the raw dump section of a crash report in case you have access to those on crash-stats. i will send over a bunch of them to overholt.

Flags: needinfo?(pascalc)
Flags: needinfo?(mozillamarcia.knous)

(In reply to Andrew Overholt [:overholt] from comment #18)

> Rob, how hard is it to get a new snapshot of Facebook (for tp6) so we can compare new and old browser builds against new and old Facebook snapshots?

Hey Andrew, yes the recorded facebook site that is played back during the tp6 page-load test is quite basic, and I don't believe it has been updated recently. Updating the recording and replacing what is in production is pretty easy. However if you are wanting a 2nd facebook page recording in production along with the first one, so that you can run both and compare old vs new, that would require creating a new test and corresponding taskcluster configs, along with updating the recording itself. Maybe it could just be hacked together for a try push. I'm assuming this is for Firefox desktop (and not android)? We use separate recordings on android. Anyhow :bebe looks after the recordings so adding him to the conversation here, thanks!

Flags: needinfo?(rwood) → needinfo?(fstrugariu)

(In reply to Robert Wood [:rwood] from comment #25)

> (In reply to Andrew Overholt [:overholt] from comment #18)
>
> > Rob, how hard is it to get a new snapshot of Facebook (for tp6) so we can compare new and old browser builds against new and old Facebook snapshots?
>
> Hey Andrew, yes the recorded facebook site that is played back during the tp6 page-load test is quite basic, and I don't believe it has been updated recently. Updating the recording and replacing what is in production is pretty easy.

I'm interested in being able to do comparisons between the snapshots - can that be done locally? If so, that'd work and avoid having to create a new taskcluster config.

I skimmed over a handful of the memory reports. Nothing really sticks out to me. Honestly, these are some of the smallest memory reports I've ever seen. Of course, I would guess that these are fairly memory constrained systems, so they probably never get that high. We only periodically collect memory reports, so perhaps the memory usage suddenly spikes up and causes the crash (like if somebody opened a Facebook game) so the memory reports don't contain any relevant issue. I'll look over the reports again in more depth tomorrow.

:overholt what website do you need me to record? I can make a new recording and a try build as you require.

Flags: needinfo?(fstrugariu) → needinfo?(overholt)

(In reply to Alphan Chen [:alchen] from comment #16)

> I will update the test again with "dom.largeAllocation.testing.allHttpLoads = true".

There are no solid steps to reproduce (STR) for the OOM.
However, with the latest 32-bit Windows Firefox (71) it is easy to hit an OOM by running Bingo Blitz with "dom.largeAllocation.testing.allHttpLoads = true".

I also saw other error messages in the web console.

  1. Error: WebGL warning: texImage2D: Failed to allocate dest buffer.
  2. Error: WebGL warning: texImage2D: Driver ran out of memory during upload. program.min.js:10166:67
    Error: WebGL warning: bufferData: Error from driver: 0x0505
  3. uncaught exception: out of memory

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #28)

> :overholt what website do you need me to record? I can make a new recording and a try build as you require.

Whatever we use for the Facebook tp5 recording. Thanks!

Flags: needinfo?(overholt)

(ni for comment 30)

Flags: needinfo?(fstrugariu)

(In reply to Jon Coppeard (:jonco) from comment #9)

> The recent spike is more concerning though. As noted it affects both beta and release, appearing in 69.0.1 and 70.0b8.
> [...]
> GC telemetry for beta shows a significant increase in nursery size from the 19th September:

Nika mentioned to me that a bunch of the crashes seem to have "OOM allocation size" of 1048576 - does this ring any bells, Jon?

Flags: needinfo?(jcoppeard)

(In reply to Andrew Overholt [:overholt] from comment #32)

> Nika mentioned to me that a bunch of the crashes seem to have "OOM allocation size" of 1048576 - does this ring any bells, Jon?

The JS GC grabs memory in 1MB chunks, so that's the expected behavior. I'd imagine that this just happens to be the largest allocation we regularly make, and so that's what triggers the failure if we run out of address space.
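
To illustrate why the largest regular allocation is the one that fails first, here is a toy Python sketch of a fragmented address space. Everything in it is an assumption made for illustration: the 16 MiB address space, the layout of the used ranges, and the helper names `free_ranges` and `can_allocate`. It only models contiguity and ignores alignment requirements, so it is not how the real allocator works.

```python
MIB = 1024 * 1024   # 1048576, the "OOM allocation size" seen in the reports
KIB = 1024
CHUNK_SIZE = MIB    # GC chunk size as described in this comment


def free_ranges(address_space_size, used):
    """Return the (start, length) gaps left between used (start, length) ranges."""
    gaps = []
    cursor = 0
    for start, length in sorted(used):
        if start > cursor:
            gaps.append((cursor, start - cursor))
        cursor = max(cursor, start + length)
    if cursor < address_space_size:
        gaps.append((cursor, address_space_size - cursor))
    return gaps


def can_allocate(gaps, size):
    """A contiguous allocation needs a single gap of at least `size` bytes."""
    return any(length >= size for _, length in gaps)


# Hypothetical layout: a 512 KiB allocation sits in the middle of every 1 MiB
# of a 16 MiB address space, so half the space is free but badly fragmented.
ADDRESS_SPACE = 16 * MIB
used = [(i * MIB + 256 * KIB, 512 * KIB) for i in range(16)]

gaps = free_ranges(ADDRESS_SPACE, used)
total_free = sum(length for _, length in gaps)
largest_gap = max(length for _, length in gaps)

print(f"total free:  {total_free // KIB} KiB")
print(f"largest gap: {largest_gap // KIB} KiB")
print(f"64 KiB allocation fits: {can_allocate(gaps, 64 * KIB)}")
print(f"1 MiB chunk fits:       {can_allocate(gaps, CHUNK_SIZE)}")
```

In this layout small allocations still succeed while a 1 MiB chunk cannot be placed, even though 8 MiB is free in total.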

Thanks, mccr8.

Another theory that's been floated here is that Large-Allocation isn't working so we're doing the asm.js allocation in the content process and it's succeeding but then subsequent small allocations cause the OOM crash. Maybe Large-Allocation is failing? Maybe the way it's being used by Facebook changed (see the window.opener message that Alphan reported in comment 16)?

Andrew McCreight, I'm all ears for your theories here :)

Flags: needinfo?(jcoppeard)

sorry for the late response.

I took a look over the fb page and it is an old Google page that has no login
The current page for that link would be: https://www.facebook.com/Google

I will record this page and submit it as a test.

Should we be logged in or not when recording?

Flags: needinfo?(fstrugariu) → needinfo?(overholt)

Is a bisection required from manual QA, as there doesn't seem to be a way to consistently reproduce this without flipping the "dom.largeAllocation.testing.allHttpLoads" pref?

Flags: needinfo?(htsai)

Thanks to the new info added by bug 1590034, I looked into the crash reports on 72.0a1. All the crash reasons are "[unhandlable oom] Failed not allocate new chunk during GC." Does that ring any bells, :jonco?

(In reply to Cristian Baica [:cbaica], Release Desktop QA from comment #36)

> Is a bisection required from manual QA, as there doesn't seem to be a way to consistently reproduce this without flipping the "dom.largeAllocation.testing.allHttpLoads" pref?

Indeed, there are no reliable steps to reproduce yet...

Flags: needinfo?(htsai) → needinfo?(jcoppeard)

(In reply to Hsin-Yi Tsai [:hsinyi] from comment #37)
The crash is happening because we're running out of memory when collecting the nursery. It seems more long-lived JS objects have been allocated since 24th September. What we don't know is why this started happening.
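
As a rough illustration of that failure mode, here is a toy Python model of a nursery collection. It is a sketch under stated assumptions, not the SpiderMonkey implementation: `ToyHeap`, `minor_gc`, `make_nursery` and `UnhandlableOOM` are hypothetical names, object sizes and chunk counts are invented, and the only facts taken from this bug are that survivors are moved out of the nursery, that the heap grows in 1 MiB chunks, and that a failed chunk allocation during GC is fatal.

```python
CHUNK_SIZE = 1024 * 1024  # the tenured heap grows in 1 MiB chunks
OBJECT_SIZE = 64 * 1024   # hypothetical size of every nursery object


class UnhandlableOOM(Exception):
    """Stands in for the crash reported as '[unhandlable oom]'."""


class ToyHeap:
    def __init__(self, chunks_available):
        # Number of chunks the process could still obtain (assumed fixed).
        self.chunks_available = chunks_available
        self.tenured_capacity = 0
        self.tenured_used = 0

    def tenure(self, size):
        """Move one surviving nursery object into the tenured heap."""
        while self.tenured_used + size > self.tenured_capacity:
            if self.chunks_available == 0:
                # Assumption: the collection cannot be abandoned half-way,
                # so the model has no recovery path at this point.
                raise UnhandlableOOM("no chunk available during nursery collection")
            self.chunks_available -= 1
            self.tenured_capacity += CHUNK_SIZE
        self.tenured_used += size


def minor_gc(heap, nursery):
    """Collect a nursery given as a list of (size, is_live) pairs.

    Dead objects are dropped for free; live ones need tenured space.
    """
    for size, is_live in nursery:
        if is_live:
            heap.tenure(size)
    nursery.clear()


def make_nursery(live_every):
    """64 objects of OBJECT_SIZE; every `live_every`-th one survives."""
    return [(OBJECT_SIZE, i % live_every == 0) for i in range(64)]


# Few survivors: 4 live objects fit in a single new chunk.
short_lived = ToyHeap(chunks_available=2)
minor_gc(short_lived, make_nursery(live_every=16))
print("short-lived workload used", short_lived.tenured_used, "bytes")

# Everything survives: 4 MiB of survivors do not fit in the 2 chunks left.
long_lived = ToyHeap(chunks_available=2)
try:
    minor_gc(long_lived, make_nursery(live_every=1))
    crashed = False
except UnhandlableOOM:
    crashed = True
print("long-lived workload crashed:", crashed)
```

The same nursery size is harmless when most objects die young and fatal when most of them survive, which is why a page change that keeps more objects alive could show up as this crash without any GC change.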

Flags: needinfo?(jcoppeard)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #35)

> Created attachment 9106115 [details]
> Screenshot_2019-11-04 Google.png
>
> sorry for the late response.
>
> I took a look over the fb page and it is an old Google page that has no login
> The current page for that link would be: https://www.facebook.com/Google
>
> I will record this page and submit it as a test.
>
> Should we be logged in or not when recording?

Since we're not sure what's going on, let's try logged-out for now and see if we can pinpoint a date when the change occurred.

Thanks!

Flags: needinfo?(overholt)
See Also: → 1586236

Hi Vicky, could you please have someone from your team check and compare the tp recording results to see if we can pinpoint the date of the change that caused the OOM spike? See comment 26, comment 39, and comment 41. Thank you.

Flags: needinfo?(vchin)

The recordings won't be able to pinpoint a date, as each one is just a snapshot in time; all we have is a recording of the previous page and a recording of the new page. If it is indeed a change in Facebook that caused the OOM spike, then the date of that change will likely correspond to the date of the OOM spike.

Flags: needinfo?(vchin)
QA Whiteboard: [qa-regression-triage]

Do we have next steps for this?

Flags: needinfo?(htsai)