1584266 - Investigate increase in content crash OOM spike since 2019-09-24

[:philipp]

Reporter

Description

•

6 years ago

•

Edited

This bug is for crash report bp-04a06fb1-bd02-4fd7-858a-d327f0190923.

Top 10 frames of crashing thread:

0 xul.dll js::AutoEnterOOMUnsafeRegion::crash js/src/vm/JSContext.cpp:1480
1 xul.dll js::AutoEnterOOMUnsafeRegion::crash js/src/vm/JSContext.cpp:1493
2 xul.dll js::TenuringTracer::traverse<JSObject> js/src/gc/Marking.cpp:2764
3 xul.dll class mozilla::Maybe<JS::Value> js::MapGCThingTyped<`lambda at z:/task_1568726031/build/src/js/src/gc/Marking.cpp:2780:43'> js/public/Value.h:1311
4 xul.dll js::gc::StoreBuffer::SlotsEdge::trace js/src/gc/Marking.cpp:2850
5 xul.dll js::gc::StoreBuffer::MonoTypeBuffer<js::gc::StoreBuffer::SlotsEdge>::trace js/src/gc/Marking.cpp:2799
6 xul.dll js::Nursery::doCollection js/src/gc/Nursery.cpp:932
7 xul.dll js::Nursery::collect js/src/gc/Nursery.cpp:839
8 xul.dll js::gc::GCRuntime::minorGC js/src/gc/GC.cpp:8085
9 xul.dll js::gc::GCRuntime::gcCycle js/src/gc/GC.cpp:7649

these out-of-memory tab crash signatures in firefox spiked up across channels (release, beta) on september 24. there are no clear correlations, but looking at it in more detail it looks like the increase can mostly be tied to reports containing facebook.com as the crashing url.

this table is a breakdown of how crash reports containing a url have developed over the past several days:

date	facebook	non-facebook	% facebook
2019-09-18	36	46	43.9
2019-09-19	33	50	39.8
2019-09-20	42	61	40.8
2019-09-22	31	45	40.8
2019-09-23	39	51	43.3
2019-09-24	106	58	64.6
2019-09-25	141	59	70.5
2019-09-26 (not full day)	84	32	72.4
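
As a quick check, the "% facebook" column can be recomputed from the two count columns. This is only a sketch: the counts are copied from the table above, and the names `ROWS` and `facebook_share` are made up for illustration.

```python
# Daily counts copied from the table above: (date, facebook, non_facebook).
# ROWS and facebook_share are illustrative names, not part of any Mozilla tool.
ROWS = [
    ("2019-09-18", 36, 46),
    ("2019-09-19", 33, 50),
    ("2019-09-20", 42, 61),
    ("2019-09-22", 31, 45),
    ("2019-09-23", 39, 51),
    ("2019-09-24", 106, 58),
    ("2019-09-25", 141, 59),
    ("2019-09-26", 84, 32),  # not a full day
]


def facebook_share(facebook: int, non_facebook: int) -> float:
    """Percentage of URL-bearing reports whose URL is on facebook.com."""
    return round(100.0 * facebook / (facebook + non_facebook), 1)


shares = {date: facebook_share(fb, other) for date, fb, other in ROWS}

for date, share in shares.items():
    print(f"{date}\t{share:.1f}")
```

The non-facebook counts stay in the same range across the whole period, while the facebook counts roughly triple from 2019-09-24 onward, which is what pushes the share up.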

maybe we can reach out to contacts at facebook to inquire if something changed in that particular time-frame that may have affected memory usage?

[:philipp]

Reporter

Comment 1

•

6 years ago

[Tracking Requested - why for this release]:

status-firefox69: --- → affected

status-firefox70: --- → affected

tracking-firefox69: --- → ?

tracking-firefox70: --- → ?

Jan Varga [:janv]

Updated

•

6 years ago

Priority: -- → P1

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

status-firefox69: affected → wontfix

tracking-firefox69: ? → -

tracking-firefox70: ? → +

Marcia Knous [:marcia]

Comment 2

•

6 years ago

Both of the signatures in this bug show well over 70% of the crashes are users running Windows 7. I reached out to the Facebook list today to get an answer to Philipp's question.

Liz Henry (:lizzard)

Comment 3

•

6 years ago

Peter is following up in a meeting with FB today.

Hsin-Yi Tsai (she/her)[:hsinyi]

Updated

•

6 years ago

Comment 4

•

6 years ago

Not a blocker for 70 release since this is bad crash volume in 69 as well. I'll continue to track during the 70 release to see if the volume changes.

Liz Henry (:lizzard)

Comment 5

•

6 years ago

Emailed mozilla-fb-discuss a second time and am also reaching out through other folks.

Liz Henry (:lizzard)

Comment 6

•

6 years ago

Facebook folks are looking (Hi Vladan!)

Liz Henry (:lizzard)

Comment 7

•

6 years ago

Andrew, Nika, jonco, any luck investigating this crash spike that hit us in the 69 timeframe? It is very high volume in 69 and is likely going to be about the same in 70 release.

Flags: needinfo?(overholt)

Flags: needinfo?(nika)

Flags: needinfo?(jcoppeard)

Andrew Overholt [:overholt]

Comment 8

•

6 years ago

I also asked FB if they had data on Large-Allocation usage and if that's failing I guess it could be related to this OOM spike (Nika?).

Flags: needinfo?(overholt)

Jon Coppeard (:jonco)

Comment 9

•

6 years ago

•

Edited

There are two things here: one is a general increase in this kind of OOM that happened in 69, and the other is the recent spike.

The original increase seems to have occurred in beta release 69.0b3 (I can't tell when it hit nightly).

The recent spike is more concerning though. As noted it affects both beta and release, appearing in 69.0.1 and 70.0b8.

Pushes for these releases are:

https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=2771f6fe489942ec0091773e98ff00f0409f876e&tochange=45c0e8a9df93f545bfacf07e5a78bd69559c6adf

https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=cce4622026ab8e0130a0afc03f829f9b19ca38c2&tochange=bf6ea738ba073f1a70554799a749235136afc93a

I can't see any changes common to both versions.

GC telemetry for beta shows a significant increase in nursery size from the 19th September:

https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=Median!Mean!5th%2520percentile!25th%2520percentile!75th%2520percentile!95th%2520percentile&cumulative=0&end_date=2019-10-10&include_spill=0&keys=!__none__!__none__&max_channel_version=beta%252F70&measure=GC_NURSERY_BYTES_2&min_channel_version=beta%252F70&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2019-09-02&trim=1&use_submission_date=0

This indicates that nursery allocated things are living longer in general and doesn't necessarily indicate that this is a GC issue. There are no GC changes (or any JS engine changes I could see) in the above pushlogs.

One thing that was slightly suspicious in beta is bug 1575216 since it concerns low memory detection and (I think) is Windows only. But this is not present on release yet so couldn't be causing problems there.

Flags: needinfo?(jcoppeard)

Nika Layzell [:nika] (ni? for response)

Comment 10

•

6 years ago

(In reply to Andrew Overholt [:overholt] from comment #8)

I also asked FB if they had data on Large-Allocation usage and if that's failing I guess it could be related to this OOM spike (Nika?).

Might be related? If we're mostly running into this issue on 32-bit windows, then I could see this being related, as we don't run Large-Allocation anything outside of 32-bit Windows.

Have we also been seeing a spike in crashes on non-win32 platforms?

Flags: needinfo?(nika)

[:philipp]

Reporter

Comment 11

•

6 years ago

no, there was no change in the crash pattern of 64bit firefox versions around 2019-09-24 for those 2 signatures.

Liz Henry (:lizzard)

Updated

•

6 years ago

status-firefox71: --- → affected

BugBot [:suhaib / :marco/ :calixte]

Comment 12

•

6 years ago

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

Alphan Chen [:alchen]

Comment 13

•

6 years ago

•

Edited

I hit OOM on 32-bit Windows with one Facebook app, on both Firefox 59 and Firefox 69.
Bingo Blitz: https://apps.facebook.com/108854979142742

Here is the crash report of Firefox 59.
https://crash-stats.mozilla.org/report/index/9179992c-26ac-4c8d-bfa3-8609b0191016
Here is another OOM on Firefox 69.
https://crash-stats.mozilla.org/report/index/3b59f7d0-c5ae-426c-aaee-cb4fb0191016

I also see "uncaught exception: out of memory" in webConsole when running the app.
However, I don't see a "Large-Allocation" response header in the HTTP log or in the web developer tools.

Andrew Overholt [:overholt]

Comment 14

•

6 years ago

Thanks, Alphan. Is Bingo Blitz a wasm or asm.js game?

Flags: needinfo?(alchen)

Alphan Chen [:alchen]

Comment 15

•

6 years ago

(In reply to Andrew Overholt [:overholt] from comment #14)

Thanks, Alphan. Is Bingo Blitz a wasm or asm.js game?

Actually, I don't know.
I found this game from the test suites for LargeAllocation header in testrail.

FB games
https://testrail.stage.mozaws.net/index.php?/cases/view/26033
other web games
https://testrail.stage.mozaws.net/index.php?/cases/view/26104

Flags: needinfo?(alchen)

Alphan Chen [:alchen]

Comment 16

•

6 years ago

•

Edited

After setting the pref "dom.largeAllocation.testing.allHttpLoads" to true, I saw the following log in the web console when running Bingo Blitz.

(indexCanvas)
A Large-Allocation header was ignored due to the presence of windows which have a reference to this browsing context through the frame hierarchy or window.opener.

I will update the test again with "dom.largeAllocation.testing.allHttpLoads = true".

Chris Hartjes [:grumpy][:chartjes]

Comment 17

•

6 years ago

The Ecosystem QA folks were asked to help out. I am using an October Win10 32-bit VM in VMware Fusion on my Mac. The link to a custom build of Firefox to try out gives me a 404:

https://archive.mozilla.org/pub/firefox/try-builds/michael@thelayzells.com-cd987df02260607e8ecefcefd9ad997a510cf218/try-win32/

Using a recent build of Firefox 69.0.3 (32-bit) I visited the following web sites:

http://www.flashgames247.com/
http://www.miniclip.com/games/8-ball-pool-multiplayer/en/#t-w-c-H
http://ro.y8.com/games/football_legends_2016

I did not see any errors related to Large Allocation headers on any of those sites.

Andrew Overholt [:overholt]

Comment 18

•

6 years ago

I think the Large-Allocation thing was a red herring :/. Thank you to Alphan and Chris for trying to get a regression range.

Given the increase in facebook.com URLs in comment 0, I'm thinking this might be a change on Facebook's side. I looked at awsy tp6 data [1] and there doesn't appear to be a corresponding increase around September 24-26.

Rob, how hard is it to get a new snapshot of Facebook (for tp6) so we can compare new and old browser builds against new and old Facebook snapshots?

[1]
https://mzl.la/2MoYvLd

Flags: needinfo?(rwood)

Andrew Overholt [:overholt]

Comment 19

•

6 years ago

Attached image aboutmemory.PNG — Details

Liz, can we ask someone who's submitted a recent report here and given their email address for an anonymized about:memory report?

Flags: needinfo?(lhenry)

Liz Henry (:lizzard)

Comment 20

•

6 years ago

Pascal, or Marcia, can I pass that on to you while I get 70 out the door? ty!!!

Flags: needinfo?(pascalc)

Flags: needinfo?(mozillamarcia.knous)

Flags: needinfo?(lhenry)

Andrew Overholt [:overholt]

Comment 21

•

6 years ago

It would have to be after memory usage has increased but before a crash.

Andrew McCreight [:mccr8]

Comment 22

•

6 years ago

Crash reports can contain anonymized memory reports.

Andrew McCreight [:mccr8]

Comment 23

•

6 years ago

So the first thing to do would be to check for those. I don't remember where to get them, unfortunately.

[:philipp]

Reporter

Comment 24

•

6 years ago

they're in the raw dump section of a crash report in case you have access to those on crash-stats. i will send over a bunch of them to overholt.

Flags: needinfo?(pascalc)

Flags: needinfo?(mozillamarcia.knous)

Robert Wood [:rwood]

Comment 25

•

6 years ago

(In reply to Andrew Overholt [:overholt] from comment #18)

Rob, how hard is it to get a new snapshot of Facebook (for tp6) so we can compare new and old browser builds against new and old Facebook snapshots?

Hey Andrew, yes the recorded facebook site that is played back during the tp6 page-load test is quite basic, and I don't believe it has been updated recently. Updating the recording and replacing what is in production is pretty easy. However if you are wanting a 2nd facebook page recording in production along with the first one, so that you can run both and compare old vs new, that would require creating a new test and corresponding taskcluster configs, along with updating the recording itself. Maybe it could just be hacked together for a try push. I'm assuming this is for Firefox desktop (and not android)? We use separate recordings on android. Anyhow :bebe looks after the recordings so adding him to the conversation here, thanks!

Flags: needinfo?(rwood) → needinfo?(fstrugariu)

Andrew Overholt [:overholt]

Comment 26

•

6 years ago

(In reply to Robert Wood [:rwood] from comment #25)

(In reply to Andrew Overholt [:overholt] from comment #18)

Rob, how hard is it to get a new snapshot of Facebook (for tp6) so we can compare new and old browser builds against new and old Facebook snapshots?

Hey Andrew, yes the recorded facebook site that is played back during the tp6 page-load test is quite basic, and I don't believe it has been updated recently. Updating the recording and replacing what is in production is pretty easy.

I'm interested in being able to do comparisons between the snapshots - can that be done locally? If so, that'd work and avoid having to create a new taskcluster config.

Andrew McCreight [:mccr8]

Comment 27

•

6 years ago

I skimmed over a handful of the memory reports. Nothing really sticks out to me. Honestly, these are some of the smallest memory reports I've ever seen. Of course, I would guess that these are fairly memory constrained systems, so they probably never get that high. We only periodically collect memory reports, so perhaps the memory usage suddenly spikes up and causes the crash (like if somebody opened a Facebook game) so the memory reports don't contain any relevant issue. I'll look over the reports again in more depth tomorrow.

Florin Strugariu [:Bebe]

Comment 28

•

6 years ago

:overholt what website do you need me to record? I can make a new recording and a try build as you require.

Flags: needinfo?(fstrugariu) → needinfo?(overholt)

Alphan Chen [:alchen]

Comment 29

•

6 years ago

•

Edited

(In reply to Alphan Chen [:alchen] from comment #16)

I will update the test again with "dom.largeAllocation.testing.allHttpLoads = true".

There is no solid STR to reproduce the OOM.
However, with the latest 32-bit Windows Firefox (71), it is easy to hit OOM by running Bingo Blitz with "dom.largeAllocation.testing.allHttpLoads = true".

I also saw other error messages in webconsole.

Error: WebGL warning: texImage2D: Failed to allocate dest buffer.
Error: WebGL warning: texImage2D: Driver ran out of memory during upload. program.min.js:10166:67
Error: WebGL warning: bufferData: Error from driver: 0x0505
uncaught exception: out of memory

Andrew Overholt [:overholt]

Comment 30

•

6 years ago

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #28)

:overholt what website do you need me to record? I can make a new recording and a try build as you require.

Whatever we use for the Facebook tp6 recording. Thanks!

Flags: needinfo?(overholt)

Andrew Overholt [:overholt]

Comment 31

•

6 years ago

(ni for comment 30)

Flags: needinfo?(fstrugariu)

Andrew Overholt [:overholt]

Comment 32

•

6 years ago

(In reply to Jon Coppeard (:jonco) from comment #9)

The recent spike is more concerning though. As noted it affects both beta and release, appearing in 69.0.1 and 70.0b8.
[...]
GC telemetry for beta shows a significant increase in nursery size from the 19th September:

Nika mentioned to me that a bunch of the crashes seem to have "OOM allocation size" of 1048576 - does this ring any bells, Jon?

Flags: needinfo?(jcoppeard)

Andrew McCreight [:mccr8]

Comment 33

•

6 years ago

(In reply to Andrew Overholt [:overholt] from comment #32)

Nika mentioned to me that a bunch of the crashes seem to have "OOM allocation size" of 1048576 - does this ring any bells, Jon?

The JS GC grabs memory in 1MB chunks, so that's the expected behavior. I'd imagine that this just happens to be the largest allocation we regularly make, and so that's what triggers the failure if we run out of address space.
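
To illustrate why the 1 MB request is the one that tends to fail, here is a toy model of a fragmented address space. It is only a sketch: the free-region sizes are invented, `can_allocate` is a made-up helper, and the real allocator is far more involved. The only figure taken from this bug is the 1048576-byte allocation size.

```python
KIB = 1024
# 1048576 bytes: the "OOM allocation size" from comment 32, i.e. one 1 MB GC chunk.
CHUNK_SIZE = 1024 * KIB


def can_allocate(free_regions, size):
    """Return True if any single contiguous free region can hold `size` bytes."""
    return any(region >= size for region in free_regions)


# Hypothetical fragmented address space: plenty of free bytes in total,
# but no single region is as large as one GC chunk.
free_regions = [768 * KIB, 512 * KIB, 900 * KIB, 64 * KIB]

total_free = sum(free_regions)
small_ok = can_allocate(free_regions, 64 * KIB)
chunk_ok = can_allocate(free_regions, CHUNK_SIZE)

print(total_free, small_ok, chunk_ok)
```

In this model small allocations keep succeeding while the chunk-sized request fails, even though the total free space is more than twice the chunk size.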

Andrew Overholt [:overholt]

Comment 34

•

6 years ago

Thanks, mccr8.

Another theory that's been floated here is that Large-Allocation isn't working so we're doing the asm.js allocation in the content process and it's succeeding but then subsequent small allocations cause the OOM crash. Maybe Large-Allocation is failing? Maybe the way it's being used by Facebook changed (see the window.opener message that Alphan reported in comment 16)?

Andrew McCreight, I'm all ears for your theories here :)

Flags: needinfo?(jcoppeard)

Hsin-Yi Tsai (she/her)[:hsinyi]

Updated

•

6 years ago

Keywords: regressionwindow-wanted, steps-wanted

Florin Strugariu [:Bebe]

Comment 35

•

6 years ago

Attached image Screenshot_2019-11-04 Google.png — Details

sorry for the late response.

I took a look at the fb page, and it is an old Google page that has no login.
The current page for that link would be: https://www.facebook.com/Google

I will record this page and submit it as a test.

Should we be logged in or not when recording?

Flags: needinfo?(fstrugariu) → needinfo?(overholt)

Cristian Baica [:cbaica], Release Desktop QA

Comment 36

•

6 years ago

Is a bisection required from manual QA, as there doesn't seem to be a way to consistently reproduce this without flipping the "dom.largeAllocation.testing.allHttpLoads" pref?

Flags: needinfo?(htsai)

Hsin-Yi Tsai (she/her)[:hsinyi]

Comment 37

•

6 years ago

Thanks to the new info added by bug 1590034, I was looking into the crash reports on 72.0a1; all the crash reasons are "[unhandlable oom] Failed to allocate new chunk during GC." Does that ring any bells, :jonco?

(In reply to Cristian Baica [:cbaica], Release Desktop QA from comment #36)

Is a bisection required from manual QA, as there doesn't seem to be a way to consistently reproduce this without flipping the "dom.largeAllocation.testing.allHttpLoads" pref?

Indeed, there's no reliable step to reproduce yet...

Flags: needinfo?(htsai) → needinfo?(jcoppeard)

Jon Coppeard (:jonco)

Comment 38

•

6 years ago

(In reply to Hsin-Yi Tsai [:hsinyi] from comment #37)
The crash is happening because we're running out of memory when collecting the nursery. It seems that more long-lived JS objects have been allocated since 24th September. What we don't know is why this started happening.
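
A toy model of that failure mode, for readers less familiar with generational collection. This is a sketch under loud assumptions: `minor_gc`, `UnhandlableOOM`, and the capacity counter are invented for illustration and do not mirror the SpiderMonkey implementation. The point is only that survivors of a nursery collection need tenured memory, and there is no graceful way out if that memory cannot be obtained mid-collection.

```python
class UnhandlableOOM(RuntimeError):
    """Stands in for the crash raised when tenuring cannot get memory."""


def minor_gc(nursery, tenured, tenured_capacity):
    """Move live nursery objects into `tenured`, then empty the nursery.

    Raises UnhandlableOOM if a survivor does not fit; a half-finished
    collection cannot be unwound, so there is no graceful failure path.
    """
    for name, is_live in nursery:
        if not is_live:
            continue  # dead objects are simply dropped
        if len(tenured) >= tenured_capacity:
            raise UnhandlableOOM(f"no room to tenure {name!r}")
        tenured.append(name)
    nursery.clear()


# Mostly short-lived objects: the collection succeeds.
tenured = []
minor_gc([("a", False), ("b", True), ("c", False)], tenured, tenured_capacity=2)

# More survivors than the tenured heap can take: the collection crashes.
crashed = False
try:
    minor_gc([("d", True), ("e", True)], tenured, tenured_capacity=2)
except UnhandlableOOM:
    crashed = True

print(tenured, crashed)
```

With the same memory budget, a workload where more objects survive the nursery is the one that hits the crash path, which matches the observation that longer-lived objects coincide with the spike.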

Flags: needinfo?(jcoppeard)

Andrew Overholt [:overholt]

Comment 39

•

6 years ago

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #35)

Created attachment 9106115 [details]
Screenshot_2019-11-04 Google.png

sorry for the late response.

I took a look at the fb page, and it is an old Google page that has no login.
The current page for that link would be: https://www.facebook.com/Google

I will record this page and submit it as a test.

Should we be logged in or not when recording?

Since we're not sure what's going on, let's try logged-out for now and see if we can pinpoint a date when the change occurred.

Thanks!

Flags: needinfo?(overholt)

Florin Strugariu [:Bebe]

Comment 40

•

6 years ago

Attached file Bug 1584266 - Investigate increase in content crash OOM spike since 2019-09-24 (obsolete) — Details

Florin Strugariu [:Bebe]

Comment 41

•

6 years ago

Created the recording and a patch to have it in treeherder:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fcaa823050720de918c96c9104d871e372c3f816

Liz Henry (:lizzard)

Updated

•

6 years ago

status-firefox70: affected → wontfix

status-firefox72: --- → affected

Hsin-Yi Tsai (she/her)[:hsinyi]

Updated

•

6 years ago

Comment 42

•

6 years ago

Hi Vicky, could you please have someone from your team check and compare the tp recording results to see if we can pinpoint a date when the change that caused the OOM spike occurred? See comment 26, comment 39, and comment 41. Thank you.

Flags: needinfo?(vchin)

Vicky Chin [:vchin]

Comment 43

•

6 years ago

The recordings won't be able to pinpoint a date as it's just a snapshot in time, all we have is the recording of the previous page and a recording of the new page. If it is indeed a change in Facebook that caused the OOM spike, then the date the change likely occurred will correspond to the spike in OOM date.

Flags: needinfo?(vchin)

Bogdan Maris, Desktop Test Engineering

Updated

•

6 years ago

QA Whiteboard: [qa-regression-triage]

Julien Cristau [:jcristau]

Comment 44

•

6 years ago

Do we have next steps for this?

Flags: needinfo?(htsai)

Pascal Chevrel:pascalc

Comment 45

•

6 years ago

It doesn't seem actionable for 71 as this is unassigned and we are half way to 72, marking as wontfix for the release channel.

status-firefox71: affected → wontfix

status-firefox73: --- → affected

status-firefox-esr68: --- → affected

Hsin-Yi Tsai (she/her)[:hsinyi]

Comment 46

•

6 years ago

•

Edited

(In reply to Julien Cristau [:jcristau] from comment #44)

Do we have next steps for this?

I'm afraid I don't have any further thoughts. Edgar has been helping take another look at the reports. Hopefully he can get back with fresh ideas in the next few days. I have also been hoping that after bug 1586236 lands, we will get more information. That bug is in progress, but I'm NI'ing Gabriele to see if he can think of something to help us move this bug along sooner.

Flags: needinfo?(htsai) → needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Comment 47

•

6 years ago

Bug 1586236 has been stalled on issues with the devtools tests on 32-bit builds. Unfortunately I wasn't able to make any more progress there because of that. I'm hopeful that we'll be able to do more work on top of bug 1589493 to hide the crashes from the user as much as possible.

Flags: needinfo?(gsvelto)

Marcia Knous [:marcia]

Comment 48

•

6 years ago

Won't fixing this for 72 to get it off the radar for now. Keeping 73 as affected as work continues on this.

status-firefox72: affected → wontfix

Ryan VanderMeulen [:RyanVM]

Comment 49

•

6 years ago

It looks like the massive spike we saw in late September went away in late December (kinda correlates with the timing of bug 1604655).

status-firefox73: affected → fix-optional

status-firefox74: --- → affected

status-firefox-esr68: affected → wontfix

Phabricator Automation

Updated

•

6 years ago

Attachment #9106489 - Attachment is obsolete: true

Jens Stutte [:jstutte]

Comment 50

•

6 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #49)

It looks like the massive spike we saw in late September went away in late December (kinda correlates with the timing of bug 1604655).

:mccr8, I would assume, we can lower the priority here now?

Flags: needinfo?(continuation)

Andrew McCreight [:mccr8]

Comment 51

•

6 years ago

Sure.

Status: NEW → RESOLVED

Closed: 6 years ago

Flags: needinfo?(continuation)

Resolution: --- → DUPLICATE

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

status-firefox73: fix-optional → fixed

status-firefox74: affected → fixed

aboutmemory.PNG 6 years ago Andrew Overholt [:overholt] 14.78 KB, image/png		Details
Screenshot_2019-11-04 Google.png 6 years ago Florin Strugariu [:Bebe] 658.78 KB, image/png		Details
Bug 1584266 - Investigate increase in content crash OOM spike since 2019-09-24 6 years ago Florin Strugariu [:Bebe] 47 bytes, text/x-phabricator-request		Details \| Review