Closed Bug 1261483 Opened 6 years ago Closed 5 years ago

(Pretty pattern) garbage in a single tile in about:support

Categories

(Core :: Graphics, defect, P3)

Unspecified
macOS
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox46 --- wontfix
firefox47 --- wontfix
firefox48 --- wontfix

People

(Reporter: milan, Assigned: mattwoodrow)

References

Details

(Keywords: leave-open, sec-moderate, Whiteboard: [gfx-noted])

Attachments

(2 files)

Nightly, 10.11.2, retina, default content/canvas (both skia.)

Opened about:support, and the top middle tile had garbage in it.  Pretty pattern though.
Once I scrolled, it turned into a black tile, and remained that way until I switched tabs and back.
Recall that opening about:support switches to discrete GPU.
Reproduced again, this time I got different tiles, that contained "old content" from what I could tell - maybe tabs and images from other pages.  Making security sensitive because of the content leak, to be a bit paranoid.
Group: gfx-core-security
Flags: needinfo?(mchang)
Flags: needinfo?(lsalzman)
Flags: needinfo?(jmuizelaar)
Summary: Garbage in a tile in about:support → (Pretty pattern) garbage in a single tile in about:support
Whiteboard: [gfx-noted]
The second time it happened, it seemed to go away to black on its own, after some amount of time (1-2s perhaps), but remained black as in the first case, until the tab change.
Duplicate of this bug: 1254206
Talked to Jeff about this.  We sort of need a reproducible case.  On the other hand, I managed to also crash once with the "go to about:support" workflow: https://crash-stats.mozilla.com/report/index/84987bfa-ad7f-417f-bc80-07ec62160404

Not a lot of crashes in the wild, but going back as far as 44 (perhaps earlier, I'm only looking at the last four weeks.)  Note that following the crash above will only give us 10.11.2 crashes, probably because of the driver signature.

Side note - this seems to occur when switching from integrated to discrete graphics; I haven't seen it while plugged into the external monitor which forces you to stay on discrete.
Flags: needinfo?(mchang)
Flags: needinfo?(lsalzman)
Flags: needinfo?(jmuizelaar)
> Side note - this seems to occur when switching from integrated to discrete
> graphics; I haven't seen it while plugged into the external monitor which
> forces you to stay on discrete.

This seems to mimic my report as well, - https://bugzilla.mozilla.org/show_bug.cgi?id=1254206#c5
As per our discussion in the critsmash meeting, I've been trying to get valgrind working under OSX to see if it would be able to catch/report anything while reproducing this issue which happens often on my MPB. However, it looks like valgrind doesn't really play well with newer versions of OSX so it's definitely been a process [1]. Julian from the performance team will take a look and see he can find the reason why valgrind keeps crashing with fx.

[1] https://pastebin.mozilla.org/8867428
I've been seeing similar issues on Fennec Nightly, where I open a link in a new tab and it initially renders with improperly positioned tiles from the previous tab. No real STR unfortunately, I'll keep an eye on it.
Also seeing it in Fennec Beta (46.0b1)
(In reply to Kamil Jozwiak [:kjozwiak] from comment #7)
> As per our discussion in the critsmash meeting, I've been trying to get
> valgrind working under OSX to see if it would be able to catch/report
> anything while reproducing this issue which happens often on my MPB.

Have you considered an ASan build? It should be possible to do an ASan Firefox build on OS X.
46 and 47 are affected as well. Wontfix for 46, it is too late in the cycle.
(In reply to Milan Sreckovic [:milan] from comment #4)
> We sort of need a reproducible case.

I can reproduce crashes quite reliably on my MacBook Pro with the following STR:

(1) Start Firefox 45.0.2 with a clean profile.

(2) Install the Tab Auto Reload add-on, this one: https://addons.mozilla.org/en-US/firefox/addon/tab-auto-reload/ and restart Firefox.

(3) Confirm OS X is using integrated graphics (the gfxCardStatus app is useful for this - it shows 'i' or 'd' in the OS X status bar)

(4) Go to about:support, right click on the tab -> click "User defined" -> "5 seconds". Now OS X switches between integrated <-> discrete graphics every 5-10 seconds or so.

(5) Open YouTube or other websites in new tabs and scroll/click etc. It often crashes even if you just have Firefox open in the background though, and often another Firefox instance (like Nightly) will crash instead :)

It may take a few tries. If it doesn't reproduce for anyone else, I'd be happy to test patches or do some IRC debugging with someone.

https://crash-stats.mozilla.com/report/index/ff0018df-7f55-469f-8c45-3c6e42160420
https://crash-stats.mozilla.com/report/index/cf287d59-91c5-4f4c-b81b-c66742160420
https://crash-stats.mozilla.com/report/index/2c858b40-0726-41bc-9d06-ec6982160420
https://crash-stats.mozilla.com/report/index/f8f6db65-c98d-454b-83da-5dda02160420
Hm I just looked at these crash reports and noticed I had Firefox.app set to open in 32-bit mode, but that doesn't make a difference. 64-bit also crashed within a few minutes:

https://crash-stats.mozilla.com/report/index/5cdaaf3e-83bd-4bfa-9630-218262160420
Markus, I have a feeling this is widget-y issue,; either we somehow get to the wrong context or hold on to things we shouldn't be holding on it, or something like that, if not a bug in the drivers.  Thoughts?  Looking at the five crashes from comment 12 and comment 13, it's either widgets or texture upload.
Flags: needinfo?(mstange)
It's not clear to me what we might be doing wrong, or whether we even are doing anything wrong. The Apple docs about GPU switching are here:
https://developer.apple.com/library/mac/qa/qa1734/_index.html#//apple_ref/doc/uid/DTS40010791
https://developer.apple.com/library/mac/technotes/tn2229/_index.html#//apple_ref/doc/uid/DTS40008924
https://developer.apple.com/library/mac/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_contexts/opengl_contexts.html#//apple_ref/doc/uid/TP40001987-CH216-SW5

We don't need to throw away the context. It looks like there are only three rules we need to follow when a GPU switch happens:
 - Call -[NSOpenGLContext update]
 - Re-render the "scene" because our front buffer contents are now garbage
 - Re-query GL abilities and don't rely on abilities that only the old GPU had.

The last one might be the one that we're not doing correctly. Are the crashes happening when we switch from the integrated to the discrete GPU, or when we switch back from the discrete to the integrated one?
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #15)
> Are the crashes happening when we switch from the integrated to the discrete GPU, or
> when we switch back from the discrete to the integrated one?

I think I just got one when switching from discrete to integrated:

https://crash-stats.mozilla.com/report/index/3f784ccc-8c43-43cf-8139-c308b2160423

Note that this one has AppleIntelHD5000GraphicsGLDriver on the stack, and some other crashes have GeForceGLDriver on the stack. So I wouldn't be surprised if it crashed when switching in either direction.
Note that sometimes it crashes with a different signature: CoreFoundation@0xe9aa instead of libsystem_kernel.dylib@0x16f06

https://crash-stats.mozilla.com/report/index/e3320d02-70eb-47f1-8907-148132160423

The libsystem_kernel.dylib crash is a topcrash on Mac, and I've a hunch this might be responsible for some weird and impossible to repro memory corruption crashes we're seeing in JS, mostly on OS X (that's the #1 crash on Mac). So it'd be great to have this fixed. I'd be happy to test (diagnostic) patches.
(In reply to Markus Stange [:mstange] from comment #15)
> 
> The last one might be the one that we're not doing correctly. Are the
> crashes happening when we switch from the integrated to the discrete GPU, or
> when we switch back from the discrete to the integrated one?

I'd say discrete to integrate for me as well.  about:support gets you into discrete (from integrated), but after a couple of seconds bounces you back, and that's when bad stuff happens.
Flags: needinfo?(milan)
Assignee: nobody → mchang
Flags: needinfo?(milan)
See Also: → 1271528
See Also: → 1272138
What's the status of this?
I haven't seen it recently.  I don't think we ever got close to figuring out what was going on.
(In reply to Markus Stange [:mstange] from comment #15)
> We don't need to throw away the context. It looks like there are only three
> rules we need to follow when a GPU switch happens:
>  - Call -[NSOpenGLContext update]

I thought we were already doing this on GPU switches, but it turns out we're not. We only call it on window resizes.
I'll write a patch.
All of these crashes (plus the ones I looked at from my local list) happen when the main thread is within OpenGL code called from -[ChildView updateGLContext], and the compositor thread is within other OpenGL code (usually texture upload or swap buffer, not surprising given that these are longer running operations).

OpenGL isn't threadsafe, so this isn't particularly surprising.

It looks like updateGLContext uses CGLLockContext/CGLUnlockContext in an attempt to synchronise things, but we don't actually use this anywhere else so it won't have any effect.

I think we need to make sure every time we access OpenGL from the compositor it is within the same locking.
Assignee: mchang → nobody
Steven Michaud managed to symbolicate one of these crashes in bug 1176404 comment 2.

I managed to reproduce the texture upload crash locally (using the STR from comment 12) and got this stack:

thread #43: tid = 0x9fe4ca, 0x00007fff84a650ae libsystem_kernel.dylib`__pthread_kill + 10, name = 'Compositor'
    frame #0: 0x00007fff84a650ae libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff86921500 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff8b39237b libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff86540e5c libGPUSupportMercury.dylib`gpusGenerateCrashLog + 158
  * frame #4: 0x00001234402211a7 GeForceGLDriver`___lldb_unnamed_function6443$$GeForceGLDriver + 9
    frame #5: 0x00007fff86542204 libGPUSupportMercury.dylib`gpusSubmitDataBuffers + 162
    frame #6: 0x00001234403151ac GeForceGLDriver`___lldb_unnamed_function11158$$GeForceGLDriver + 335
    frame #7: 0x00001234402e593e GeForceGLDriver`___lldb_unnamed_function10581$$GeForceGLDriver + 263
    frame #8: 0x00001234405121be GeForceGLDriver`___lldb_unnamed_function17702$$GeForceGLDriver + 239
    frame #9: 0x00001234405123e1 GeForceGLDriver`___lldb_unnamed_function17703$$GeForceGLDriver + 349
    frame #10: 0x0000123440512ac5 GeForceGLDriver`___lldb_unnamed_function17705$$GeForceGLDriver + 31
    frame #11: 0x00001234402ca901 GeForceGLDriver`___lldb_unnamed_function10221$$GeForceGLDriver + 695
    frame #12: 0x00001234402cb28a GeForceGLDriver`___lldb_unnamed_function10226$$GeForceGLDriver + 837
    frame #13: 0x0000123440336515 GeForceGLDriver`___lldb_unnamed_function11366$$GeForceGLDriver + 6010
    frame #14: 0x00007fff8695c70c GLEngine`glTexSubImage2D_Exec + 1723
    frame #15: 0x00007fff876b359e libGL.dylib`glTexSubImage2D + 77

I can't find a way to get symbols for the unnamed_function entries, but it looks like it might be the same.

I also found this in my machine's logs:

2/06/16 1:37:00.000 PM kernel[0]: NVDA: Error in allocation list processing (malformed alloc list)
2/06/16 1:37:00.000 PM kernel[0]: NVDA: Calling gpusKillClient for task <ptr>

 
> It looks like updateGLContext uses CGLLockContext/CGLUnlockContext in an
> attempt to synchronise things, but we don't actually use this anywhere else
> so it won't have any effect.

This wasn't quite true. preRender/postRender in ChildView.mm also lock, so we should be holding the lock while compositing.

Most of the crashes occur while uploading textures, which is a time when the lock isn't held. Unfortunately Jan's second crash report (from comment 12) has us crashing in SwapBuffers. The main thread is blocked in CGLLockContext for this one, so we're not actually racing on anything useful.

It's possible that we have two bugs here, and making sure we don't race on updateGLContext will fix the upload crash, but not the SwapBuffers one. If they are the same bug, then it won't help.

I'm going to have a go at writing a patch that will fix the race, and make us call update before SwapBuffers (if necessary).
Attached patch child-view-raceSplinter Review
I can't get it to crash with this patch, but it was pretty hard to do to begin with.

It's almost certain that this won't fix the crash in SwapBuffers, but I think we should give it a shot anyway and see if the volume drops.
Attachment #8759005 - Flags: review?(mstange)
Comment on attachment 8759005 [details] [diff] [review]
child-view-race

Review of attachment 8759005 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/cocoa/nsChildView.h
@@ +204,5 @@
>    // last received event so that, when we receive one of the events, we make sure
>    // to send its pair event first, in case we didn't yet for any reason.
>    BOOL mExpectingWheelStop;
>  
> +  BOOL mNeedsGLUpdate;

Please add a comment.

::: widget/cocoa/nsChildView.mm
@@ +3422,4 @@
>  
>    if (!mGLContext) {
>      [self setGLContext:aGLContext];
>      [self updateGLContext];

This if calls updateGLContext outside the lock, and the one you're adding calls it inside the lock. Can we combine them? setGLContext is only called in this one place, so we might as well get rid of it as a separate method.
It seems like CGLLockContext is re-entrant, otherwise I'd expect a deadlock. But let's not rely on that.

@@ +3668,2 @@
>      [mGLContext setView:self];
>      [mGLContext update];

Please add a comment to updateGLContext saying that it should only be called on the compositor thread. Oh, except there is the force update stuff... should we try removing that again?
Attachment #8759005 - Flags: review?(mstange) → review+
(In reply to Markus Stange [:mstange] from comment #25)

> Please add a comment to updateGLContext saying that it should only be called
> on the compositor thread. Oh, except there is the force update stuff...
> should we try removing that again?

Probably, yes. Hopefully that'll be gone now that we only support 10.7. We should do that separately since we'll probably want to uplift this if it works.
Anything left to do here?
Assignee: nobody → matt.woodrow
Flags: needinfo?(matt.woodrow)
Target Milestone: mozilla49 → ---
Priority: -- → P3
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(matt.woodrow)
Resolution: --- → FIXED
Group: gfx-core-security → core-security-release
(In reply to Matt Woodrow (:mattwoodrow) from comment #26)
> (In reply to Markus Stange [:mstange] from comment #25)
> 
> > Please add a comment to updateGLContext saying that it should only be called
> > on the compositor thread. Oh, except there is the force update stuff...
> > should we try removing that again?
> 
> Probably, yes. Hopefully that'll be gone now that we only support 10.7. We
> should do that separately since we'll probably want to uplift this if it
> works.

This has now been done in bug 1394654.
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.