Closed Bug 1519129 Opened 6 years ago Closed 6 years ago

[@ OOM | small] browser crash spike hitting 32bit installations on beta channel since 2019-01-10

Categories

(Core :: General, defect, P3)

65 Branch
x86
Windows
defect

Tracking


RESOLVED INCOMPLETE
Tracking Status
firefox-esr60 --- unaffected
firefox64 --- unaffected
firefox65 --- wontfix
firefox66 + wontfix

People

(Reporter: philipp, Unassigned)

Details

(Keywords: crash, regression; Whiteboard: [MemShrink])

Crash Data

This bug was filed from the Socorro interface and is for
crash report bp-f60bc4b9-4083-4d38-89e9-246e80190110.

Top 10 frames of crashing thread:

0 xul.dll xul.dll@0x14716bc
1 xul.dll xul.dll@0x5d19f0
2 xul.dll xul.dll@0x5d003d
3 xul.dll xul.dll@0x5cf7a8
4 xul.dll xul.dll@0x3ae973
5 xul.dll xul.dll@0x5b54a4
6 xul.dll xul.dll@0x17a5d3e
7 xul.dll xul.dll@0x1523a59
8 xul.dll xul.dll@0x1522f67
9 xul.dll xul.dll@0x3c52caa

=============================================================

There's an increase in OOM | small crashes in the browser process from 32-bit installations of Firefox on the beta channel since 2019-01-08.

The spike is made up of reports containing unsymbolized xul.dll frames in their stack traces. Beyond that, I didn't spot any other obvious correlations (it happens across multiple versions of Windows, and no particular add-ons, locales, or URLs stand out in the reports). This query should catch most of the problem:
https://bit.ly/2Fn6029
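For reference, a query like the shortened link above can be expressed against Socorro's public Super Search API. This is a sketch only: the endpoint and parameter names follow crash-stats.mozilla.org's documented API, but the exact filters are my approximation of the bit.ly query, which I can't expand here.

```python
# Sketch: build a Socorro Super Search URL for this crash spike.
# Parameter names follow the crash-stats.mozilla.org Super Search API;
# the specific filter values are an assumption, not the original query.
from urllib.parse import urlencode

def supersearch_url(**params):
    """Build a Super Search API URL; doseq=True expands list-valued params."""
    base = "https://crash-stats.mozilla.org/api/SuperSearch/"
    return base + "?" + urlencode(params, doseq=True)

url = supersearch_url(
    product="Firefox",
    release_channel="beta",
    platform="Windows",
    signature="~OOM | small",      # "~" is the "contains" operator in Super Search
    date=">=2019-01-08",
    _facets=["version", "build_id"],
    _results_number=0,             # facets only, skip individual reports
)
print(url)
```

Fetching that URL (e.g. with `requests.get`) returns JSON with per-version crash counts under `facets`, which is enough to see which betas carry the spike.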

As far as I can tell, only the beta channel is affected by this spike. It started a few days after 65.0b8 was released to the beta audience, but that version is still heavily affected as well, so I'd start from the assumption that some external change on the affected systems is causing the increase in crashes.

Summary: Crash in OOM | small → [@ OOM | small] browser crash spike hitting 32bit installations on beta channel since 2019-01-10

It looks like about a five-fold increase which is potentially concerning - thanks for flagging, Philipp! Do we know of anything we turned on in that build that could lead to an increase in OOMs from small allocations?

Flags: needinfo?(erahm)

This is one of those things that's hard to track down. The crashing point for OOM | small is usually the fallout of something else making large allocations. That said, we should probably loop a Socorro person in to figure out why we're getting trashed stacks.
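To illustrate the point above, here is a toy model (not Firefox's actual allocator) of why OOM | small signatures are misleading: the crash is attributed to the small allocation that finally fails, not to the large allocations that exhausted the address space beforehand.

```python
# Toy model: OOM | small blames the allocation that *fails*, not the
# large allocations that used up the address space first.
class AddressSpace:
    def __init__(self, limit_mb):
        self.limit_mb = limit_mb
        self.used_mb = 0

    def alloc(self, size_mb, infallible=False):
        """Fallible allocations return None on failure; infallible ones abort."""
        if self.used_mb + size_mb > self.limit_mb:
            if infallible:
                raise MemoryError(f"OOM | small ({size_mb} MB)")
            return None
        self.used_mb += size_mb
        return size_mb

vm = AddressSpace(limit_mb=4000)   # roughly a 32-bit process budget
for _ in range(40):
    vm.alloc(100)                  # large allocations quietly succeed

crash_reason = None
try:
    vm.alloc(1, infallible=True)   # a tiny infallible allocation hits the wall
except MemoryError as e:
    crash_reason = str(e)
print(crash_reason)                # → OOM | small (1 MB)
```

The crash report names the 1 MB allocation, even though the forty 100 MB allocations are the real story, which is why the triage below hunts for large-allocation regressions rather than the code at the crash address.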

I took a look at the memory report from bp-245fdb4-073b-47ca-841d-0d9e40190110; it had a ton of heap-unclassified and was on its way to VM exhaustion:

2,604.36 MB (100.0%) -- explicit
├──2,452.58 MB (94.17%) ── heap-unclassified

4,095.94 MB (100.0%) -- address-space
├──3,493.80 MB (85.30%) -- commit
│  ├──2,638.82 MB (64.43%) -- private
│  │  ├──2,633.57 MB (64.30%) ── readwrite(segments=3624)
│  │  └──────5.25 MB (00.13%) -- (5 tiny)
│  │         ├──3.39 MB (00.08%) ── readwrite+stack(segments=144)
│  │         ├──1.47 MB (00.04%) ── readwrite+guard(segments=144)
│  │         ├──0.38 MB (00.01%) ── execute-read(segments=5)
│  │         ├──0.01 MB (00.00%) ── readonly(segments=2)
│  │         └──0.00 MB (00.00%) ── noaccess(segments=1)
│  ├────629.44 MB (15.37%) -- mapped
│  │    ├──598.86 MB (14.62%) ── readonly(segments=1577)
│  │    └───30.58 MB (00.75%) -- (2 tiny)
│  │        ├──30.31 MB (00.74%) ── readwrite(segments=123)
│  │        └───0.27 MB (00.01%) ── writecopy(segments=1)
│  └────225.55 MB (05.51%) -- image
│       ├──161.08 MB (03.93%) ── execute-read(segments=208)
│       ├───61.42 MB (01.50%) ── readonly(segments=440)
│       └────3.05 MB (00.07%) -- (2 tiny)
│            ├──1.80 MB (00.04%) ── readwrite(segments=225)
│            └──1.25 MB (00.03%) ── writecopy(segments=57)
├────354.11 MB (08.65%) -- reserved
│    ├──325.72 MB (07.95%) ── private(segments=1143)
│    └───28.39 MB (00.69%) ++ (2 tiny)
└────248.03 MB (06.06%) ── free(segments=1899)

2,696.71 MB ── resident-unique
3,847.91 MB ── vsize
   99.44 MB ── vsize-max-contiguous
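A quick back-of-the-envelope check on the report above makes the two problems explicit: nearly all explicit memory is heap-unclassified, and the address space is badly fragmented (the numbers below are copied from the report).

```python
# Figures taken from the about:memory report in this comment.
explicit_mb = 2604.36
heap_unclassified_mb = 2452.58
free_mb = 248.03
max_contiguous_mb = 99.44

# Share of explicit memory that the memory reporters can't classify.
unclassified_share = 100 * heap_unclassified_mb / explicit_mb
print(f"heap-unclassified: {unclassified_share:.2f}% of explicit")

# Fragmentation: even ~248 MB of "free" address space can't satisfy a
# single mapping bigger than the largest contiguous hole (~99 MB).
print(f"free: {free_mb} MB, but largest contiguous run: {max_contiguous_mb} MB")
```

So any single allocation over roughly 99 MB fails outright here, regardless of the total free space, which is consistent with an OOM | small crash on a 32-bit install.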

Also, I saw an RDD process in there; it looks like that was uplifted in b5, though, so it may not be related.

RyanVM, is there some way we can get a list of the bugs uplifted to 65.0b8 (particularly for Windows)?

Flags: needinfo?(erahm) → needinfo?(ryanvm)

Lots of potentially interesting patches in there. New expat?

Adding ni for Peter to look into the possibility of expat upgrade (bug 1374012) being a cause.

Flags: needinfo?(peterv)

Eric, would you mind taking a look at the list of pushes in comment 3 and seeing if anything sticks out?

A crazy idea tossed out there was to run a Shield study with two different versions of expat to see whether expat was indeed the cause. I know it's a stretch for the new expat version to even be the cause. Of course, I can't see how we'd get that done super soon, which leads us to the bigger question of whether we want to hold 65 to figure this out.

Interestingly, Philipp said the overall number of crashes didn't seem to change; only the proportion of them that are OOM | small crashes increased. I'm not sure what that means :)

Flags: needinfo?(erahm)

The expat upgrade fix (bug 1374012) went into beta on Dec 31, and https://calendar.google.com/calendar/embed?src=bW96aWxsYS5jb21fZGJxODRhbnI5aTh0Y25taGFiYXRzdHY1Y29AZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ&pli=1 shows the Beta 8 build on Jan 3, but the spike in OOM | small crashes has been reported as starting after the Beta 9 release on Jan 8, so the expat upgrade doesn't look like it can be the cause. We should probably be looking at the patches that are new in Beta 9.

Flags: needinfo?(peterv)

When the crashes are broken down by version, b8 was already somewhat affected:
https://screenshots.firefox.com/BXBhm9g6OvlEMdLp/crash-stats.mozilla.com

Neha's analysis looks correct; the spike is in b9. Let's look at the patches in it:

  • Bug 1516426 - this is a null check, probably fine
  • Bug 1510204, bug 1516289 - this looks okay, but it's large enough we might want emilio to confirm
  • Bug 1515793 is WR related, so probably not it
  • Bug 1464003 is Android
  • Bug 1515658 seems okay, although it's assigning a uint64_t into an int64_t. I'm pretty sure this is okay in this case and worst case scenario we just end up recalculating the size which is the old behavior. Maybe ping baku to see if CalculateStreamLength has side-effects? It's marked const so in theory it shouldn't.
  • Bug 1515463 looks okay
  • Bug 1517275 is android
  • Bug 1517221 is fixing a stack overflow by releasing memory earlier, so probably not, but maybe ping olli to see if this could cause CC-related weirdness
  • Bug 1513304 is a null pointer check; previously we would just crash, but now we bail. It's possible that exposed a leak, but given the prevalence of those crashes (maybe 2 per day) it doesn't seem to correlate
  • Bug 1517710 seems okay; in theory it was Linux-only (?). It did remove taking a strong ref to mGraph, which seems like it could be an issue, but probably not related to leaks. Maybe ping :pehrsons (he had a few changes in this list)
  • Bug 1513973 this is another :pehrsons change, there are enough patches that it might be worth looking at
  • Bug 1516738 this is just a codegen change
  • Bug 1515873 :pehrsons again, but if anything it looks like it's avoiding allocating
  • Bug 1514192 this is VR, probably not related
  • Bug 1513232 this is clamping viewport sizes, probably not related

:pehrsons, your patches show up the most in this beta release. Can you double-check whether they could plausibly cause an uptick in OOMs?

Olli, is there any chance bug 1517221 might be involved?

Flags: needinfo?(erahm)
Flags: needinfo?(bugs)
Flags: needinfo?(apehrson)

I don't expect any of my fixes to cause any large allocations.

That said, bug 1517711 is a followup to bug 1517710 that fixed a remaining nullptr issue but only landed in b10. If we're back to normal in b10, bug 1517711 could be the explanation.

Both those bugs fixed regressions from bug 1513973 which is in the b9 list in comment 9.

Flags: needinfo?(apehrson)

Bug 1517221 is unlikely to be involved. That issue has basically been seen only on one intranet site, and the bug fixed it. With the fix, we end up releasing memory sooner.

Flags: needinfo?(bugs)
Priority: -- → P3

The particular crash pattern described in comment #0 has receded again. Since there isn't much tangible information to go on for a potential fix, I'll just close the bug...

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE