1261483 - (Pretty pattern) garbage in a single tile in about:support

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Description

•

8 years ago

Attached image Screen Shot 2016-04-01 at 16.48.44 .png — Details

Nightly, 10.11.2, retina, default content/canvas (both skia.)

Opened about:support, and the top middle tile had garbage in it.  Pretty pattern though.
Once I scrolled, it turned into a black tile, and remained that way until I switched tabs and back.
Recall that opening about:support switches to discrete GPU.

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 1

•

8 years ago

Reproduced again, this time I got different tiles, that contained "old content" from what I could tell - maybe tabs and images from other pages.  Making security sensitive because of the content leak, to be a bit paranoid.

Group: gfx-core-security

Flags: needinfo?(mchang)

Flags: needinfo?(lsalzman)

Flags: needinfo?(jmuizelaar)

Summary: Garbage in a tile in about:support → (Pretty pattern) garbage in a single tile in about:support

Whiteboard: [gfx-noted]

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 2

•

8 years ago

The second time it happened, it seemed to go away to black on its own, after some amount of time (1-2s perhaps), but remained black as in the first case, until the tab change.

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 4

•

8 years ago

Talked to Jeff about this.  We sort of need a reproducible case.  On the other hand, I managed to also crash once with the "go to about:support" workflow: https://crash-stats.mozilla.com/report/index/84987bfa-ad7f-417f-bc80-07ec62160404

Not a lot of crashes in the wild, but going back as far as 44 (perhaps earlier, I'm only looking at the last four weeks.)  Note that following the crash above will only give us 10.11.2 crashes, probably because of the driver signature.

Side note - this seems to occur when switching from integrated to discrete graphics; I haven't seen it while plugged into the external monitor which forces you to stay on discrete.

Flags: needinfo?(mchang)

Flags: needinfo?(lsalzman)

Flags: needinfo?(jmuizelaar)

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 5

•

8 years ago

Also, note https://bugzilla.mozilla.org/show_bug.cgi?id=1254206#c10 suggesting this may not be Firefox.

Mason Chang [Inactive] [:mchang]

Comment 6

•

8 years ago

> Side note - this seems to occur when switching from integrated to discrete
> graphics; I haven't seen it while plugged into the external monitor which
> forces you to stay on discrete.

This seems to mimic my report as well, - https://bugzilla.mozilla.org/show_bug.cgi?id=1254206#c5

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Updated

•

8 years ago

Comment 7

•

8 years ago

As per our discussion in the critsmash meeting, I've been trying to get valgrind working under OSX to see if it would be able to catch/report anything while reproducing this issue which happens often on my MPB. However, it looks like valgrind doesn't really play well with newer versions of OSX so it's definitely been a process [1]. Julian from the performance team will take a look and see he can find the reason why valgrind keeps crashing with fx.

[1] https://pastebin.mozilla.org/8867428

Daniel Veditz [:dveditz]

Updated

•

8 years ago

Keywords: sec-moderate

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 8

•

8 years ago

I've been seeing similar issues on Fennec Nightly, where I open a link in a new tab and it initially renders with improperly positioned tiles from the previous tab. No real STR unfortunately, I'll keep an eye on it.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 9

•

8 years ago

Also seeing it in Fennec Beta (46.0b1)

Jan de Mooij [:jandem]

Comment 10

•

8 years ago

(In reply to Kamil Jozwiak [:kjozwiak] from comment #7)
> As per our discussion in the critsmash meeting, I've been trying to get
> valgrind working under OSX to see if it would be able to catch/report
> anything while reproducing this issue which happens often on my MPB.

Have you considered an ASan build? It should be possible to do an ASan Firefox build on OS X.

Liz Henry (:lizzard) (relman/hg->git project)

Comment 11

•

8 years ago

46 and 47 are affected as well. Wontfix for 46, it is too late in the cycle.

status-firefox46: --- → wontfix

status-firefox47: --- → affected

Jan de Mooij [:jandem]

Comment 12

•

8 years ago

(In reply to Milan Sreckovic [:milan] from comment #4)
> We sort of need a reproducible case.

I can reproduce crashes quite reliably on my MacBook Pro with the following STR:

(1) Start Firefox 45.0.2 with a clean profile.

(2) Install the Tab Auto Reload add-on, this one: https://addons.mozilla.org/en-US/firefox/addon/tab-auto-reload/ and restart Firefox.

(3) Confirm OS X is using integrated graphics (the gfxCardStatus app is useful for this - it shows 'i' or 'd' in the OS X status bar)

(4) Go to about:support, right click on the tab -> click "User defined" -> "5 seconds". Now OS X switches between integrated <-> discrete graphics every 5-10 seconds or so.

(5) Open YouTube or other websites in new tabs and scroll/click etc. It often crashes even if you just have Firefox open in the background though, and often another Firefox instance (like Nightly) will crash instead :)

It may take a few tries. If it doesn't reproduce for anyone else, I'd be happy to test patches or do some IRC debugging with someone.

https://crash-stats.mozilla.com/report/index/ff0018df-7f55-469f-8c45-3c6e42160420
https://crash-stats.mozilla.com/report/index/cf287d59-91c5-4f4c-b81b-c66742160420
https://crash-stats.mozilla.com/report/index/2c858b40-0726-41bc-9d06-ec6982160420
https://crash-stats.mozilla.com/report/index/f8f6db65-c98d-454b-83da-5dda02160420

Jan de Mooij [:jandem]

Comment 13

•

8 years ago

Hm I just looked at these crash reports and noticed I had Firefox.app set to open in 32-bit mode, but that doesn't make a difference. 64-bit also crashed within a few minutes:

https://crash-stats.mozilla.com/report/index/5cdaaf3e-83bd-4bfa-9630-218262160420

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 14

•

8 years ago

Markus, I have a feeling this is widget-y issue,; either we somehow get to the wrong context or hold on to things we shouldn't be holding on it, or something like that, if not a bug in the drivers.  Thoughts?  Looking at the five crashes from comment 12 and comment 13, it's either widgets or texture upload.

Flags: needinfo?(mstange)

Markus Stange [:mstange]

Comment 15

•

8 years ago

It's not clear to me what we might be doing wrong, or whether we even are doing anything wrong. The Apple docs about GPU switching are here:
https://developer.apple.com/library/mac/qa/qa1734/_index.html#//apple_ref/doc/uid/DTS40010791
https://developer.apple.com/library/mac/technotes/tn2229/_index.html#//apple_ref/doc/uid/DTS40008924
https://developer.apple.com/library/mac/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_contexts/opengl_contexts.html#//apple_ref/doc/uid/TP40001987-CH216-SW5

We don't need to throw away the context. It looks like there are only three rules we need to follow when a GPU switch happens:
 - Call -[NSOpenGLContext update]
 - Re-render the "scene" because our front buffer contents are now garbage
 - Re-query GL abilities and don't rely on abilities that only the old GPU had.

The last one might be the one that we're not doing correctly. Are the crashes happening when we switch from the integrated to the discrete GPU, or when we switch back from the discrete to the integrated one?

Flags: needinfo?(mstange)

Jan de Mooij [:jandem]

Comment 16

•

8 years ago

(In reply to Markus Stange [:mstange] from comment #15)
> Are the crashes happening when we switch from the integrated to the discrete GPU, or
> when we switch back from the discrete to the integrated one?

I think I just got one when switching from discrete to integrated:

https://crash-stats.mozilla.com/report/index/3f784ccc-8c43-43cf-8139-c308b2160423

Note that this one has AppleIntelHD5000GraphicsGLDriver on the stack, and some other crashes have GeForceGLDriver on the stack. So I wouldn't be surprised if it crashed when switching in either direction.

Jan de Mooij [:jandem]

Comment 17

•

8 years ago

Note that sometimes it crashes with a different signature: CoreFoundation@0xe9aa instead of libsystem_kernel.dylib@0x16f06

https://crash-stats.mozilla.com/report/index/e3320d02-70eb-47f1-8907-148132160423

The libsystem_kernel.dylib crash is a topcrash on Mac, and I've a hunch this might be responsible for some weird and impossible to repro memory corruption crashes we're seeing in JS, mostly on OS X (that's the #1 crash on Mac). So it'd be great to have this fixed. I'd be happy to test (diagnostic) patches.

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 18

•

8 years ago

(In reply to Markus Stange [:mstange] from comment #15)
> 
> The last one might be the one that we're not doing correctly. Are the
> crashes happening when we switch from the integrated to the discrete GPU, or
> when we switch back from the discrete to the integrated one?

I'd say discrete to integrate for me as well.  about:support gets you into discrete (from integrated), but after a couple of seconds bounces you back, and that's when bad stuff happens.

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Updated

•

8 years ago

Flags: needinfo?(milan)

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Updated

•

8 years ago

Assignee: nobody → mchang

Flags: needinfo?(milan)

Mason Chang [Inactive] [:mchang]

Updated

•

8 years ago

Updated

•

8 years ago

Comment 19

•

8 years ago

What's the status of this?

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 20

•

8 years ago

I haven't seen it recently.  I don't think we ever got close to figuring out what was going on.

Markus Stange [:mstange]

Comment 21

•

8 years ago

(In reply to Markus Stange [:mstange] from comment #15)
> We don't need to throw away the context. It looks like there are only three
> rules we need to follow when a GPU switch happens:
>  - Call -[NSOpenGLContext update]

I thought we were already doing this on GPU switches, but it turns out we're not. We only call it on window resizes.
I'll write a patch.

Matt Woodrow (:mattwoodrow)

Assignee

Comment 22

•

8 years ago

All of these crashes (plus the ones I looked at from my local list) happen when the main thread is within OpenGL code called from -[ChildView updateGLContext], and the compositor thread is within other OpenGL code (usually texture upload or swap buffer, not surprising given that these are longer running operations).

OpenGL isn't threadsafe, so this isn't particularly surprising.

It looks like updateGLContext uses CGLLockContext/CGLUnlockContext in an attempt to synchronise things, but we don't actually use this anywhere else so it won't have any effect.

I think we need to make sure every time we access OpenGL from the compositor it is within the same locking.

Mason Chang [Inactive] [:mchang]

Updated

•

8 years ago

Assignee: mchang → nobody

Matt Woodrow (:mattwoodrow)

Assignee

Comment 23

•

8 years ago

Steven Michaud managed to symbolicate one of these crashes in bug 1176404 comment 2.

I managed to reproduce the texture upload crash locally (using the STR from comment 12) and got this stack:

thread #43: tid = 0x9fe4ca, 0x00007fff84a650ae libsystem_kernel.dylib`__pthread_kill + 10, name = 'Compositor'
    frame #0: 0x00007fff84a650ae libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff86921500 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff8b39237b libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff86540e5c libGPUSupportMercury.dylib`gpusGenerateCrashLog + 158
  * frame #4: 0x00001234402211a7 GeForceGLDriver`___lldb_unnamed_function6443$$GeForceGLDriver + 9
    frame #5: 0x00007fff86542204 libGPUSupportMercury.dylib`gpusSubmitDataBuffers + 162
    frame #6: 0x00001234403151ac GeForceGLDriver`___lldb_unnamed_function11158$$GeForceGLDriver + 335
    frame #7: 0x00001234402e593e GeForceGLDriver`___lldb_unnamed_function10581$$GeForceGLDriver + 263
    frame #8: 0x00001234405121be GeForceGLDriver`___lldb_unnamed_function17702$$GeForceGLDriver + 239
    frame #9: 0x00001234405123e1 GeForceGLDriver`___lldb_unnamed_function17703$$GeForceGLDriver + 349
    frame #10: 0x0000123440512ac5 GeForceGLDriver`___lldb_unnamed_function17705$$GeForceGLDriver + 31
    frame #11: 0x00001234402ca901 GeForceGLDriver`___lldb_unnamed_function10221$$GeForceGLDriver + 695
    frame #12: 0x00001234402cb28a GeForceGLDriver`___lldb_unnamed_function10226$$GeForceGLDriver + 837
    frame #13: 0x0000123440336515 GeForceGLDriver`___lldb_unnamed_function11366$$GeForceGLDriver + 6010
    frame #14: 0x00007fff8695c70c GLEngine`glTexSubImage2D_Exec + 1723
    frame #15: 0x00007fff876b359e libGL.dylib`glTexSubImage2D + 77

I can't find a way to get symbols for the unnamed_function entries, but it looks like it might be the same.

I also found this in my machine's logs:

2/06/16 1:37:00.000 PM kernel[0]: NVDA: Error in allocation list processing (malformed alloc list)
2/06/16 1:37:00.000 PM kernel[0]: NVDA: Calling gpusKillClient for task <ptr>

 
> It looks like updateGLContext uses CGLLockContext/CGLUnlockContext in an
> attempt to synchronise things, but we don't actually use this anywhere else
> so it won't have any effect.

This wasn't quite true. preRender/postRender in ChildView.mm also lock, so we should be holding the lock while compositing.

Most of the crashes occur while uploading textures, which is a time when the lock isn't held. Unfortunately Jan's second crash report (from comment 12) has us crashing in SwapBuffers. The main thread is blocked in CGLLockContext for this one, so we're not actually racing on anything useful.

It's possible that we have two bugs here, and making sure we don't race on updateGLContext will fix the upload crash, but not the SwapBuffers one. If they are the same bug, then it won't help.

I'm going to have a go at writing a patch that will fix the race, and make us call update before SwapBuffers (if necessary).

Matt Woodrow (:mattwoodrow)

Assignee

Comment 24

•

8 years ago

Attached patch child-view-race — Details — Splinter Review

I can't get it to crash with this patch, but it was pretty hard to do to begin with.

It's almost certain that this won't fix the crash in SwapBuffers, but I think we should give it a shot anyway and see if the volume drops.

Attachment #8759005 - Flags: review?(mstange)

Markus Stange [:mstange]

Comment 25

•

8 years ago

Comment on attachment 8759005 [details] [diff] [review]
child-view-race

Review of attachment 8759005 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/cocoa/nsChildView.h
@@ +204,5 @@
>    // last received event so that, when we receive one of the events, we make sure
>    // to send its pair event first, in case we didn't yet for any reason.
>    BOOL mExpectingWheelStop;
>  
> +  BOOL mNeedsGLUpdate;

Please add a comment.

::: widget/cocoa/nsChildView.mm
@@ +3422,4 @@
>  
>    if (!mGLContext) {
>      [self setGLContext:aGLContext];
>      [self updateGLContext];

This if calls updateGLContext outside the lock, and the one you're adding calls it inside the lock. Can we combine them? setGLContext is only called in this one place, so we might as well get rid of it as a separate method.
It seems like CGLLockContext is re-entrant, otherwise I'd expect a deadlock. But let's not rely on that.

@@ +3668,2 @@
>      [mGLContext setView:self];
>      [mGLContext update];

Please add a comment to updateGLContext saying that it should only be called on the compositor thread. Oh, except there is the force update stuff... should we try removing that again?

Attachment #8759005 - Flags: review?(mstange) → review+

Matt Woodrow (:mattwoodrow)

Assignee

Comment 26

•

8 years ago

(In reply to Markus Stange [:mstange] from comment #25)

> Please add a comment to updateGLContext saying that it should only be called
> on the compositor thread. Oh, except there is the force update stuff...
> should we try removing that again?

Probably, yes. Hopefully that'll be gone now that we only support 10.7. We should do that separately since we'll probably want to uplift this if it works.

Matt Woodrow (:mattwoodrow)

Assignee

Comment 27

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/151f1d8697dc

Keywords: leave-open

Carsten Book [:Tomcat]

Comment 28

•

8 years ago

https://hg.mozilla.org/mozilla-central/rev/151f1d8697dc

Target Milestone: --- → mozilla49

Ryan VanderMeulen [:RyanVM]

Comment 29

•

8 years ago

Anything left to do here?

Assignee: nobody → matt.woodrow

status-firefox47: affected → wontfix

status-firefox48: affected → wontfix

Flags: needinfo?(matt.woodrow)

Target Milestone: mozilla49 → ---

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Updated

•

7 years ago

Priority: -- → P3

Matt Woodrow (:mattwoodrow)

Assignee

Updated

•

7 years ago

Status: NEW → RESOLVED

Closed: 7 years ago

Flags: needinfo?(matt.woodrow)

Resolution: --- → FIXED

Daniel Veditz [:dveditz]

Updated

•

7 years ago

Group: gfx-core-security → core-security-release

Markus Stange [:mstange]

Comment 30

•

7 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #26)
> (In reply to Markus Stange [:mstange] from comment #25)
> 
> > Please add a comment to updateGLContext saying that it should only be called
> > on the compositor thread. Oh, except there is the force update stuff...
> > should we try removing that again?
> 
> Probably, yes. Hopefully that'll be gone now that we only support 10.7. We
> should do that separately since we'll probably want to uplift this if it
> works.

This has now been done in bug 1394654.

Daniel Veditz [:dveditz]

Updated

•

4 years ago

Group: core-security-release

Screen Shot 2016-04-01 at 16.48.44 .png 8 years ago Milan Sreckovic [:milan] (needinfo for best results) 1.69 MB, image/png		Details
child-view-race 8 years ago Matt Woodrow (:mattwoodrow) 2.94 KB, patch	mstange : review+	Details \| Diff \| Splinter Review