Closed Bug 1081790 Opened 10 years ago Closed 10 years ago

Large spike in fake OOM|small (via PushNewDT) in 20141011030203

Categories

(Core :: Graphics, defect)

35 Branch
All
Windows NT
defect
Not set
critical

Tracking

RESOLVED FIXED
Tracking Status
firefox34 --- unaffected
firefox35 --- affected

People

(Reporter: away, Unassigned)

Details

(Keywords: crash)

Crash Data

This bug was filed from the Socorro interface and is 
report bp-57d7bf8b-3e8d-4622-9e60-d24282141011.
=============================================================

OOM|small rates by build on the nightly channel:

Rank 	Build ID 	Count 	Share
10 	20141008030202 	13 	0.47 %
8 	20141008065430 	24 	0.86 %
7 	20141009030201 	41 	1.48 %
6 	20141010030201 	42 	1.51 %
1 	20141011030203 	1748 	62.95 %
2 	20141012030203 	729 	26.25 %

5c6980f9caff	Nicolas Silva — Bug 1064107 - Ensure that gfxPlatform is initialized by the time we create the compositor. r=Bas
367b155c5b5e	Benoit Jacob — Bug 1080137 - WebGL2: misc fixes to make new tex formats and sized internalformats actually work - r=jgilbert
0a69bc9e746c	Bas Schouten — Bug 1078693: Correctly indicate validity of a SourceSurfaceD2D1 and deal with failed surface creation. r=jrmuizel
216915390f9b	jdashg — Bug 1066280 - Handle dirtying in BasicCanvasLayer. - r=mattwoodrow
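
To put the per-build numbers listed above in perspective, here is a rough sketch in Python. The counts are copied from the table; treating all builds before 20141011030203 as the baseline is an assumption made for illustration, and the raw counts are not normalized by the number of users on each build.

```python
# Per-build OOM|small counts and shares, copied from the table in this comment.
rates = {
    "20141008030202": (13, 0.47),
    "20141008065430": (24, 0.86),
    "20141009030201": (41, 1.48),
    "20141010030201": (42, 1.51),
    "20141011030203": (1748, 62.95),
    "20141012030203": (729, 26.25),
}

# Assumption: every build before the spike build counts as baseline.
SPIKE_BUILD = "20141011030203"
baseline = [count for build, (count, _) in rates.items() if build < SPIKE_BUILD]
baseline_mean = sum(baseline) / len(baseline)

spike_count = rates[SPIKE_BUILD][0]
spike_factor = spike_count / baseline_mean

print(f"baseline mean: {baseline_mean:.1f} crashes per build")
print(f"spike build:   {spike_count} crashes, about {spike_factor:.0f}x the baseline")
```

Build IDs are timestamps, so comparing them as strings orders them chronologically.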

I don't think we should let 35 go to Aurora like this. nical/bas, is there an obvious culprit here?

Logs: Failed to create similar cairo surface! Size: Size(35,15) Status: 1
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bas)
My guess would be that it comes from bug 1066280, which has a lot of changes. However, that bug appears to affect basic layers specifically, and only 5 out of 15 crashes are using basic compositing, so it doesn't explain everything.

I am certain that Bug 1064107 can't have caused this.

I would be surprised if bug 1078693 were the cause, because I would expect "Failed to create software bitmap" or "Failed to readback into software bitmap" to appear in the app notes if that were the case.
Flags: needinfo?(nical.bugzilla)
Blocks: 1066280
Status: 1 is CAIRO_STATUS_NO_MEMORY, which makes this seem like a real OOM. Has there been a spike in other OOM crash sites too?
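
For anyone reading the app notes, a minimal sketch of decoding that line in Python. The status table mirrors only the first two values of cairo_status_t from cairo.h; the regular expression, the function name parse_app_note, and the 4-bytes-per-pixel figure are assumptions for illustration, not anything Socorro or Gecko provides.

```python
import re

# First two values of cairo_status_t as declared in cairo.h (not the full enum).
CAIRO_STATUS_NAMES = {
    0: "CAIRO_STATUS_SUCCESS",
    1: "CAIRO_STATUS_NO_MEMORY",
}

# Pattern for the app-note line quoted in this bug; other notes may differ.
NOTE_RE = re.compile(
    r"Failed to create similar cairo surface! "
    r"Size: Size\((\d+),(\d+)\) Status: (\d+)"
)

def parse_app_note(note):
    """Return (width, height, status_name), or None if the note does not match."""
    m = NOTE_RE.search(note)
    if m is None:
        return None
    width, height, status = (int(g) for g in m.groups())
    return width, height, CAIRO_STATUS_NAMES.get(status, f"unknown status {status}")

note = "Logs: Failed to create similar cairo surface! Size: Size(35,15) Status: 1"
width, height, status_name = parse_app_note(note)

# Assumption: 4 bytes per pixel (a 32-bit surface format).
approx_bytes = width * height * 4
print(status_name, f"{width}x{height}", f"~{approx_bytes} bytes")
```

Under that assumption the failing surface is only a couple of kilobytes, which fits the "small" part of the OOM|small signature.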
No, there hasn't been such a spike in other OOM signatures, and the OOM|small signature is essentially all PushNewDT. 

Here's the 'app notes' facet (it's strange to split by word, but you get the idea):
8 	size 	6381 	98.56 %
9 	create 	6341 	97.95 %
10 	surface 	6335 	97.85 %
11 	status 	6335 	97.85 %
12 	cairo 	6335 	97.85 %
13 	1 	6335 	97.85 %
14 	failed 	6329 	97.76 %
15 	similar 	6323 	97.67 %
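
A by-word facet like the one above can be approximated as follows. This is only a sketch: the tokenization rule and the sample notes are hypothetical, and Socorro's actual tokenizer may behave differently.

```python
import re
from collections import Counter

def word_facet(app_notes):
    """Count, per lowercase word, how many notes contain it at least once."""
    counts = Counter()
    for note in app_notes:
        # Assumption: a word is a run of ASCII letters or digits.
        counts.update(set(re.findall(r"[a-z0-9]+", note.lower())))
    return counts

# Hypothetical sample: three reports carrying the note from this bug, one without.
sample = [
    "Logs: Failed to create similar cairo surface! Size: Size(35,15) Status: 1",
] * 3 + ["AdapterVendorID: 0x8086"]

facet = word_facet(sample)
for word in ("size", "cairo", "1"):
    share = 100.0 * facet[word] / len(sample)
    print(f"{word}\t{facet[word]}\t{share:.2f} %")
```

Because every word of the same log line lands in its own row, the rows for "size", "cairo", "status" and "1" end up with nearly identical counts, as in the facet above.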

Available virtual/physical/page memory numbers aren't low, so I don't think this is a 'regular' OOM. Could that error code also indicate lack of video ram?
Is this crash spike specific to windows users that don't have direct2d?

There are a few other reasons for a CAIRO_STATUS_NO_MEMORY that could come from cairo-win32-surface.c, but none of them really stand out.

One is a GDI error (which should have been printed to stderr). I'm not sure how Jeff's patches could have triggered GDI errors when drawing content (not WebGL), though.

We try to allocate in video memory (GDI DDB), but if that fails we fall back to system memory (DIB), so that shouldn't be the issue here.
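
The fallback order described here, sketched generically in Python. Both allocator functions and the AllocationError type are hypothetical stand-ins; the real logic is C code in cairo-win32-surface.c and is not reproduced here.

```python
class AllocationError(Exception):
    """Raised by the stand-in allocators when they cannot allocate."""

def create_surface(width, height, alloc_ddb, alloc_dib):
    """Try video memory (DDB) first, then fall back to system memory (DIB)."""
    try:
        return alloc_ddb(width, height)
    except AllocationError:
        # Only if this second attempt also fails does the caller see an error.
        return alloc_dib(width, height)

def failing_ddb(width, height):
    # Stand-in for a DDB allocation that fails for lack of video memory.
    raise AllocationError("no video memory")

def working_dib(width, height):
    # Stand-in for a DIB allocation that succeeds in system memory.
    return ("DIB", width, height)

surface = create_surface(35, 15, failing_ddb, working_dib)
print(surface)
```

With a fallback of this shape, running out of video memory alone would not surface as an allocation failure, which is the point being made above.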
(In reply to Matt Woodrow (:mattwoodrow) from comment #6)
> Is this crash spike specific to windows users that don't have direct2d?

Yes, these are all Windows and >90% are "D3D11 Layers-".
I guess it's possible that we're exhausting GDI handles or similar, but I don't see any changes that would cause us to leak that sort of thing.
(In reply to Nicolas Silva [:nical] from comment #3)
> My guess would be that it comes from bug 1066280 which has a lot of changes.
> Although it looks like that bug affects basic layers specifically and only 5
> out of 15 crashes are using basic compositing so it doesn't explain
> everything.
> 
> I am certain that Bug 1064107 can't have caused this.
> 
> I would be surprised that bug 1078693 be the cause, because I would expect
> "Failed to create software bitmap" or "Failed to readback into software
> bitmap" to appear in the app notes if it was the case.

So is the appnote "Logs: Failed to create similar cairo surface! Size: Size(35,15) Status: 1" unrelated to this?

I would really like to avoid backing the ShSurf changes out. It seems like it should be simpler to try backing bug 1078693 out, and that bug does touch moz2d failure cases.

I could see this being caused by something in the ShSurf patches, but I really don't touch content, and content is what's failing.
Flags: needinfo?(nical.bugzilla)
(In reply to Jeff Gilbert [:jgilbert] from comment #9)

> So is the appnote "Logs: Failed to create similar cairo surface! Size:
> Size(35,15) Status: 1" unrelated to this?

This is the place where we fail to allocate a surface, log the error in the app notes and return null, just before crashing in the caller (gfxContext::PushClip or friend). So this appnote is a symptom of failure to allocate memory, but not a cause.

> 
> I would really like to avoid backing the ShSurf changes out. It seems like
> it should be simpler to try backing bug 1078693 out, and that bug does touch
> moz2d failure cases.

We can try, Bas what do you think?

> 
> I could see this being caused by something in the ShSurf patches, but I
> really don't touch content, and content is what's failing.

Allocating memory is what's failing (Status: 1 means the error was CAIRO_STATUS_NO_MEMORY). Statistically this happens a lot in gfxContext::PushClip because that's where we tend to do a lot of allocations and, more importantly, often large ones. But anything in the browser could be eating up memory before we end up crashing in here.
Flags: needinfo?(nical.bugzilla)
(In reply to Jeff Gilbert [:jgilbert] from comment #9)
> (In reply to Nicolas Silva [:nical] from comment #3)
> > My guess would be that it comes from bug 1066280 which has a lot of changes.
> > Although it looks like that bug affects basic layers specifically and only 5
> > out of 15 crashes are using basic compositing so it doesn't explain
> > everything.
> > 
> > I am certain that Bug 1064107 can't have caused this.
> > 
> > I would be surprised that bug 1078693 be the cause, because I would expect
> > "Failed to create software bitmap" or "Failed to readback into software
> > bitmap" to appear in the app notes if it was the case.
> 
> So is the appnote "Logs: Failed to create similar cairo surface! Size:
> Size(35,15) Status: 1" unrelated to this?
> 
> I would really like to avoid backing the ShSurf changes out. It seems like
> it should be simpler to try backing bug 1078693 out, and that bug does touch
> moz2d failure cases.
> 
> I could see this being caused by something in the ShSurf patches, but I
> really don't touch content, and content is what's failing.

Error 1 is an actual OOM error in cairo, I believe. So I'd look for the cause of this in something that can actually cause runaway memory usage. Backing out 1078693 could be done as an experiment, but it seems unlikely to me that it would be the cause of the problem. Never say never though.
Flags: needinfo?(bas)
I backed out bug 1066280 from Aurora. It's still on Trunk while I nail down some known regressions.
See Also: → 1084696
This might be bug 1084696.
> This might be bug 1084696.
That fix hasn't reached a nightly yet, but the OOMs are back to normal volume beginning with build 20141017030201. Perhaps it was fixed by bug 1081363.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED