Closed Bug 1245241 Opened 4 years ago Closed 4 years ago

Categories

(Core :: Graphics, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla47
Tracking Status
e10s ? ---
firefox47 --- fixed

People

(Reporter: gerard-majax, Assigned: lsalzman)

References

Details

(Keywords: crash)

Attachments

(7 files, 3 obsolete files)

18.34 KB, text/plain
Details
33.64 KB, patch
billm
: review+
Details | Diff | Splinter Review
10.64 KB, patch
billm
: review+
Details | Diff | Splinter Review
11.94 KB, patch
billm
: review+
Details | Diff | Splinter Review
17.02 KB, patch
nical
: review+
Details | Diff | Splinter Review
772 bytes, patch
lsalzman
: review+
Details | Diff | Splinter Review
4.34 KB, patch
lsalzman
: review+
Details | Diff | Splinter Review
This is 100% repro on my profile with uptodate nightlies. A debug build with a clean profile shows no crash.
Disabling all the addons on my profile does not help.
Unable to reproduce on my profile when running Nightly (from mozilla.org) under GDB :/
(In reply to Alexandre LISSY :gerard-majax from comment #1)
> This is 100% repro on my profile with uptodate nightlies. A debug build with
> a clean profile shows no crash.

Ok, the difference is loading mozilla-inbound (does not crash) VS mozilla-central (crash). When loading mozilla-central with a local debug build, crash gets reproduced. When loading inbound with nightly release and my profile, no problem.
Summary: Crash when loading treeherder.mozilla.org → Crash when loading https://treeherder.mozilla.org/#/jobs?repo=mozilla-central
My local debug build has no symbol so for now the gdb backtrace is meaningless.
Attached file gdb stacktrace
Flags: needinfo?(nical.bugzilla)
This is on a Ubuntu 15.10, and nothing has changed.
Looks like this is not reproduced in a non e10s window
Summary: Crash when loading https://treeherder.mozilla.org/#/jobs?repo=mozilla-central → (e10s) Crash when loading https://treeherder.mozilla.org/#/jobs?repo=mozilla-central
Blocks: e10s
That's a failure to allocate shared memory on the client side. Usually this means we're out of memory. Treeherder is (in my experience) pretty heavy on both JS and gfx so it's not too surprising that we hit the limits on this page more often than on other pages if we are already close to the edge (which might not be the case, though).
Flags: needinfo?(nical.bugzilla)
With 20GB RAM (at least 12GB-13GB free when testing) plus 16GB of swap (mostly unused) ?
Milan, we're seeing a lot of people crashing pretty reliably on treeherder lately. Bug 1245679 is another issue. Can you get someone to look into this? It's hitting a lot of Mozilla devs.
Flags: needinfo?(milan)
Will do, but we're thin on the linux side, with some 45 blockers in the pipeline, and :nical away until 2/15.  This is linux only, right?
Flags: needinfo?(nical.bugzilla)
I can reproduce this pretty easily by doing |ulimit -n 300| and then loading treeherder in a new Firefox session. I'm not sure why other people are getting up to 1000, but maybe my computer is just a little faster or something.
So, is ulimit -n 2000 a workaround?
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #14)
> So, is ulimit -n 2000 a workaround?

Not really.  I found on my machine 4096 wasn't high enough; su'ing to root, bumping to 32768, then su back to me and running avoids it.   The spike can be very large, depending heavily on timing/loading/phase-of-the-moon.  And we can't ask people to bump the fd limit just to run Firefox of course.

rr in normal mode doesn't crash.  rr using the 'chaos' branch did crash for me.  Even gdb with a breakpoint with an "ignore N 500" on shmem allocation stopped it from failing.
Component: General → Graphics
Product: Firefox → Core
I've found that with 1024 on my system, the beta repo in treeherder is almost insta-crash (in a local linux64 opt build)
This appears to have been regressed by bug 1180942. Milan, can we back that out until Nical returns?

Also, Randell, can you confirm that setting layers.use-image-offscreen-surfaces to false makes the problem go away?
Flags: needinfo?(rjesup)
Flags: needinfo?(milan)
Also means we're in trouble with bug 1241832
Blocks: 1241832
It seems like we should be able to just close the files after mapping them to avoid hitting the limit.
(In reply to Bill McCloskey (:billm) from comment #17)
> This appears to have been regressed by bug 1180942. Milan, can we back that
> out until Nical returns?
> 
> Also, Randell, can you confirm that setting
> layers.use-image-offscreen-surfaces to false makes the problem go away?

Yes - instacrash with it true, no crash with it false.
Flags: needinfo?(rjesup)
(In reply to Bill McCloskey (:billm) from comment #17)
> This appears to have been regressed by bug 1180942. Milan, can we back that
> out until Nical returns?
> 
> Also, Randell, can you confirm that setting
> layers.use-image-offscreen-surfaces to false makes the problem go away?

I don't see that pref being defined at all on my system. Also, toggling gfx.xrender.enabled to false still crashes on treeherder mozilla-central.
> > Also, Randell, can you confirm that setting
> > layers.use-image-offscreen-surfaces to false makes the problem go away?
> 
> I don't see that pref being defined at all on my system. Also, toggling
> gfx.xrender.enabled to false still crashes on treeherder mozilla-central.

You have to use New->boolean in about:config.  I don't crash with the layers pref set; I'm not sure the gfx pref has the same effect.
Sounds like we should re-enable xrender, disable layers.use-image-offscreen-surfaces, reopen relevant bugs (bug 1180942, etc.) until we sort this out.  Lee or Jeff, can you do this?
Flags: needinfo?(milan)
Flags: needinfo?(lsalzman)
Flags: needinfo?(jmuizelaar)
It looks like we're close to improving the situation properly and layers.use-image-offscreen-surfaces has been true since Thu Jan 21 so hopefully the problem isn't too widespread. If we can't find a fix by the end of the week we'll revert the change.
Flags: needinfo?(jmuizelaar)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #25)
> It looks like we're close to improving the situation properly

How so? Is there a different bug?
Lee has a patch to close the file descriptor for a mapping after we map:
https://hg.mozilla.org/try/rev/a8cf60d1410b

It seems to solve the problem but we're running into a different problem with it on try.
Depends on: 1247801
The main side-effect of this patch is that it closes fds after mmaping, which allows us to function even with much lower fd limits.

In the process, while trying to figure out the shmem code, it was hard to understand what was going on with the same code duplicated in 4 places (debug vs. non-debug, basic vs. sysv).

So I attempted to remove a lot of the static dispatching and replace it with better use of virtuals so that we it is much easier to have different types of shmems now.
Assignee: nobody → lsalzman
Status: NEW → ASSIGNED
Flags: needinfo?(lsalzman)
Attachment #8719046 - Flags: review?(wmccloskey)
Fix accidental widget/gtk bug regarding SysV shm.
Attachment #8719046 - Attachment is obsolete: true
Attachment #8719046 - Flags: review?(wmccloskey)
Attachment #8719106 - Flags: review?(wmccloskey)
Comment on attachment 8719106 [details] [diff] [review]
Close Shmem file handles after mapping them when possible to reduce exhaustion issues.

Review of attachment 8719106 [details] [diff] [review]:
-----------------------------------------------------------------

This looks pretty good, but I see two issues:

1. We have an unused set of functions for adopting shared memory. Those will break with this patch, I think, because they depend on the fd remaining open. So I think you should remove the code to support this.

2. Although your new way of handling SYSV shared memory is nicer, we don't actually use SYSV shared memory anymore. So let's remove that too.

::: ipc/chromium/src/base/shared_memory_posix.cc
@@ +257,5 @@
>  }
>  
>  
> +void SharedMemory::Close(bool unmap_view) {
> +  if(unmap_view) {

Should be |if (unmap_view)| (need a space).

::: ipc/glue/SharedMemory.h
@@ +57,5 @@
>  
>    virtual bool Create(size_t size) = 0;
>    virtual bool Map(size_t nBytes) = 0;
>  
> +  virtual void Close() {}

Could this be = 0 instead?
Attachment #8719106 - Flags: review?(wmccloskey)
Can we please resolve this?  Permacrashing on treeherder is getting annoying.... ;-)
Fixed the formatting error, made Close pure virtual, and renamed it to CloseHandle for clarity's sake.
Attachment #8719106 - Attachment is obsolete: true
Attachment #8720384 - Flags: review?(wmccloskey)
This removes the ability to create TYPE_SYSV Shmems with IPDL and associated tests.

SharedMemorySysV is still used by widget/gtk but that is removed in a further patch.
Attachment #8720385 - Flags: review?(wmccloskey)
We're not actually using AdoptShmem anywhere at all, and it becomes trickier to use as a result of this file handle trickery... so here we just remove it instead, as requested.
Attachment #8720386 - Flags: review?(wmccloskey)
Since SharedMemorySysV is no longer used in IPDL or anywhere else in ipc, besides its sole use in nsShmImage for widget/gtk, the tiny few lines of code relating to actually setting up sysv shm segments should just be moved into nsShmImage itself. This way we can just totally remove SharedMemorySysV.
Attachment #8720428 - Flags: review?(nical.bugzilla)
Attachment #8720386 - Flags: review?(wmccloskey) → review+
Attachment #8720384 - Flags: review?(wmccloskey) → review+
Attachment #8720385 - Flags: review?(wmccloskey) → review+
Attachment #8720428 - Flags: review?(nical.bugzilla) → review+
Keywords: crash
Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a42e2bd00c6

The pre-existing mochitest gl perma-orange is not from this change.
Duplicate of this bug: 1247948
Nightly with https://hg.mozilla.org/mozilla-central/rev/69ec3dc408a2a720cb2b8210fea33e3504aeec22 (which seems to contain the patches from this bug) and I am still experiencing crashes.
Flags: needinfo?(lsalzman)
(In reply to Alexandre LISSY :gerard-majax from comment #40)
> Nightly with
> https://hg.mozilla.org/mozilla-central/rev/
> 69ec3dc408a2a720cb2b8210fea33e3504aeec22 (which seems to contain the patches
> from this bug) and I am still experiencing crashes.

Crash report: bp-3d8c199d-c418-4e82-82b1-64c852160220
Nightly built from https://hg.mozilla.org/mozilla-central/rev/af6356a3e8c56036b74ba097395356d9c6e6c5a3, still crashing.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Alexandre LISSY :gerard-majax from comment #42)
> Nightly built from
> https://hg.mozilla.org/mozilla-central/rev/
> af6356a3e8c56036b74ba097395356d9c6e6c5a3, still crashing.

Sadly, I do not get any tab crash report dialog to submit anything.
(In reply to Alexandre LISSY :gerard-majax from comment #43)
> (In reply to Alexandre LISSY :gerard-majax from comment #42)
> > Nightly built from
> > https://hg.mozilla.org/mozilla-central/rev/
> > af6356a3e8c56036b74ba097395356d9c6e6c5a3, still crashing.
> 
> Sadly, I do not get any tab crash report dialog to submit anything.

Was this a local build? If so can you attach gdb and get a stack? If it's not a local build can you file a bug about not getting a tab crash dialog?
Flags: needinfo?(lissyx+mozillians)
No, it's not a local build.
Flags: needinfo?(lissyx+mozillians)
Just to be clear, the tabs turns to "Hm that tab crashed" but I don't get the UI to send a report.
Flags: needinfo?(jmuizelaar)
(In reply to Alexandre LISSY :gerard-majax from comment #46)
> Just to be clear, the tabs turns to "Hm that tab crashed" but I don't get
> the UI to send a report.

That still sounds like a bug to me.
Flags: needinfo?(jmuizelaar)
Filed as bug 1249995
Flags: needinfo?(lsalzman)
Attachment #8722108 - Flags: review?(wmccloskey)
Attachment #8722108 - Flags: review?(wmccloskey) → review+
Small fix to check against -1 instead of 0.
Attachment #8722108 - Attachment is obsolete: true
Attachment #8722127 - Flags: review+
Lee's going to try to turn off xrender to prevent further suffering while we try to figure this out.
We're still running out of fds. I can crash pretty repeatedly by zooming out a lot and then loading mozilla-central in treeherder. If I bump the fd limit to 60000, I don't crash.
Here's my guess about what's happening here:

When we load treeherder, we send a lot of shared memory segments from the child process to the parent. The parent first reads these messages on the I/O thread and then passes them to the compositor thread. When loading treeherder, the compositor thread seems to be really busy doing stuff in Cairo. I don't know what it's doing, but the main thread of the chrome process is locked for several seconds in SendUpdate waiting for the compositor.

File descriptors are initially used up when messages are read on the I/O thread. They don't get closed until the message is passed to the compositor thread (which is when ReadSegment will be called). So this is basically a case where the content process and the I/O thread get ahead of the compositor thread.

It would really help me if someone could explain why we're quickly creating and then destroying so many textures. My understanding is that the textures for layers live as long as the layer. Are these layers getting immediately destroyed, or is this a different sort of texture?
https://hg.mozilla.org/mozilla-central/rev/23a64d29029c
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
My understanding is that we are still having issues on treeherder with this patch. I am hacking something that will throttle shmem creation passed a certain number by sending a sync message to the compositor. It should fix the remaining issues, and if it doesn't work we'll reenable xrender.
Status: RESOLVED → REOPENED
Flags: needinfo?(nical.bugzilla)
Resolution: FIXED → ---
This fixes the issue for me locally. It makes sure we don't allocate more than 256 shmems within a layers transaction without at least waiting for the compositor, and spits out a performance warning when the limit is hit.
Attachment #8722566 - Flags: review?(lsalzman)
Attachment #8722566 - Flags: review?(lsalzman) → review+
Comment on attachment 8722566 [details] [diff] [review]
Force the main thread to sync with the compsoitor when it tries to allocate silly amounts of shmems

Review of attachment 8722566 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/ShadowLayers.cpp
@@ +49,5 @@
> +
> +static void ShmemAllocated(LayerTransactionChild* aProtocol)
> +{
> +  sShmemCreationCounter++;
> +  if (sShmemCreationCounter > 256) {

As the dust settles, if this stays around, make it a pref?

@@ +51,5 @@
> +{
> +  sShmemCreationCounter++;
> +  if (sShmemCreationCounter > 256) {
> +    aProtocol->SendSyncWithCompositor();
> +    ResetShmemCounter();

Should we reset the counter before or after making this sync call?  I guess it doesn't really matter.
Also (in a followup bug), we may want to consider reporting this beyond the performance warning, perhaps in telemetry.
(In reply to Pulsebot from comment #58)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/6cb44006032d

Bill, can you see if this resolves the issue for you?
Flags: needinfo?(wmccloskey)
It's not crashing for me with nical's patch!
Flags: needinfo?(wmccloskey)
https://hg.mozilla.org/mozilla-central/rev/6cb44006032d
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Not crashing anymore with nightly from feb 25th.
No longer blocks: 1263222
You need to log in before you can comment on or make changes to this bug.