Closed Bug 1338771 Opened 7 years ago Closed 2 years ago

[e10s] Crash in libyuv::ARGBSetRow_X86

Categories

(Core :: Graphics, defect, P3)

51 Branch
Unspecified
Linux
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix

People

(Reporter: kristian, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

This bug was filed from the Socorro interface and is 
report bp-1369f601-b6f1-46a9-abb5-3bc272170210.
=============================================================
Firefox running in Docker with help from Xvfb and llvmpipe.
The tab crashes when visiting http://version2.dk. I haven't been able to find other sites that also crash the tab, but if I disable e10s (browser.tabs.remote.autostart.2 set to false) it stops crashing.
Changing the Docker ipc namespace to host seems to solve this, but costs some security.
Could it be a shared memory usage issue?
Switching on xrender also solves the issue.
Keywords: crash
(In reply to Kristian Klausen from comment #1)
> Changing the Docker ipc namespace to host seems to solve this, but cost some
> security.
> Could it be a shared memory usage issue?

Yes, it seems to be related to a shmem issue. shmem is not used when e10s is disabled. The shmem allocation seems to have failed, but ShmemTextureData::Create thought it succeeded.
  > https://hg.mozilla.org/releases/mozilla-release/file/327e081221b0/gfx/layers/BufferTexture.cpp#l558
See Also: → 1350721
Bill, it seems that running in e10s mode inside Docker breaks unless you use Docker's --ipc=host flag. Is there something we could do about this?
Flags: needinfo?(wmccloskey)
See Also: 1350721
Sorry, I have no idea. Maybe Jed can think of something.
Flags: needinfo?(wmccloskey) → needinfo?(jld)
The IPC namespace controls SysV IPC.  I could understand needing to use --ipc=host if Firefox were running inside a container and connecting to an X11 server on the host, because of the MIT-SHM extension, but if the X client and server are in the same container (which is what I'd expect for a test setup) then that shouldn't matter.  Possibly there's something besides X that's trying to use SysV IPC; I don't know if there's a good way to find out what that might be other than using strace.

But this gets weirder.  The crash stacks mention "shared memory", but that's Gecko IPC's shared memory (ipc::Shmem); as far as I can tell from searchfox, that doesn't ever use SysV shm: it opens a file and uses mmap().  The crashes are also SIGBUS, which is unusual on x86 Linux; one possible cause, which seems to be the most likely in this context, is accessing past the end of a memory-mapped file.

That shouldn't be possible, because ShmemTextureData::Create appears to be passing the same size to AllocUnsafeShmem and InitBuffer, but apparently it is.  Maybe AllocUnsafeShmem (or whatever it eventually winds up calling) returns a shared memory area that isn't big enough?
Flags: needinfo?(jld)
See Also: → 1342573
I have a reproducible example of this crash in Bug 1323701, in case it is helpful to anyone. This is blocking us from updating our test suite to Selenium 3 / Marionette, so I am happy to try to provide info if I can.
Jed, paging you for comment #8? :-)
Flags: needinfo?(jld)
It turns out I'm wrong about Docker: --ipc *does* affect /dev/shm as well as SysV IPC; see https://github.com/moby/moby/pull/12159.  (The documentation does have the word POSIX wedged in in one place, I now see, but the rest of it seemed to be talking about SysV IPC so I didn't realize it would also affect the filesystem.)

What I think is going on is that /dev/shm runs out of space — Docker's default is 64M — and we're not actually allocating space when the file is created, so ENOSPC happens in the page fault handler and we get SIGBUS.
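
To illustrate the "not actually allocating space when the file is created" part, here is a minimal Python sketch (not Gecko code). It assumes the backing file is sized with ftruncate, which is my assumption about how such files are typically created; `sparse_file_stats` is a hypothetical helper name. It creates a temporary file, extends it, and compares the apparent size with the space actually reserved.

```python
import os
import tempfile


def sparse_file_stats(size, directory=None):
    """Extend a fresh temp file to `size` bytes with ftruncate.

    Returns (apparent_size, allocated_bytes). Hypothetical helper for
    illustration only; it is not part of Gecko.
    """
    fd, path = tempfile.mkstemp(prefix="shm-sketch-", dir=directory)
    try:
        os.unlink(path)          # keep the file anonymous, like a shm segment
        os.ftruncate(fd, size)   # changes the length, reserves no blocks
        st = os.fstat(fd)
        # On Linux, st_blocks is counted in 512-byte units.
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
```

On a typical Linux filesystem the second value stays far below the first until pages are actually written, which is when a full /dev/shm would turn into SIGBUS rather than an error at creation time.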

(The CrossProcessSemaphore SIGBUS crashes were the big clue here — that's a small fixed-size allocation, not a potentially large array, so out-of-bounds access didn't make sense as the cause.)

So, one part of this (if I'm right) is to raise the --shm-size in the test containers.  
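
A quick way to check how much room /dev/shm has inside a container is to ask the filesystem directly. This is a minimal sketch; `shm_usage` and `shm_has_room` are hypothetical helper names, and the default path is an assumption about where the shared memory files live.

```python
import os


def shm_usage(path="/dev/shm"):
    """Return (total_bytes, free_bytes) for the filesystem holding `path`."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize   # space available to unprivileged users
    return total, free


def shm_has_room(needed_bytes, path="/dev/shm"):
    """True if at least `needed_bytes` are currently free at `path`."""
    _, free = shm_usage(path)
    return free >= needed_bytes
```

With Docker's default, the total reported for /dev/shm should be about 64M; a larger value would indicate that --shm-size (or a mounted volume) took effect.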

The other thing that could happen is to allocate space (with posix_fallocate) when creating shared memory items and handle failure somehow, even if that's just by immediately crashing with appropriate metadata — at least then we'd see the allocation site, not something else later on, and we could include these in any statistics on OOM crashes.  I don't know if anything in graphics that uses shared memory is actually expecting fallible allocation, but that could also be done.
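
The pre-allocation idea can be sketched in Python as follows. This is not the patch from any bug, just an illustration under stated assumptions: `create_preallocated_shm` is a hypothetical name, and the file is created with mkstemp rather than whatever Gecko IPC really uses. The point is that posix_fallocate reserves the blocks up front, so a full filesystem shows up as an OSError (for example ENOSPC) at the allocation site instead of a SIGBUS on first touch.

```python
import mmap
import os
import tempfile


def create_preallocated_shm(size, directory="/dev/shm"):
    """Create an anonymous file of `size` bytes with space reserved, and map it.

    `size` must be greater than zero. Raises OSError if the space cannot be
    reserved. Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp(prefix="shm-sketch-", dir=directory)
    try:
        os.unlink(path)                   # file goes away once it is unmapped
        os.posix_fallocate(fd, 0, size)   # reserve blocks and extend the file
        return mmap.mmap(fd, size)        # shared, read/write mapping on Unix
    finally:
        os.close(fd)                      # mmap holds its own duplicate of fd
```

A caller could catch the OSError and either fall back to a smaller buffer or crash immediately with useful metadata, which are the two options described above.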
Flags: needinfo?(jld)
It turns out there's already a bug about pre-allocating shared memory, for exactly this reason: bug 1245239, which has a patch, which I r+ed, but it caused breakage on Try and didn't land.
See Also: → 1245239
There is also an issue for geckodriver, where people see crashes with Docker and Selenium:
https://github.com/mozilla/geckodriver/issues/285

One of our affected users mentioned that attaching the /dev/shm volume to the Docker container fixed it for him.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #11)
> It turns out there's already a bug about pre-allocating shared memory, for
> exactly this reason: bug 1245239, which has a patch, which I r+ed, but it
> caused breakage on Try and didn't land.

It looks like this patch was submitted over a year ago, and the bug has not seen much movement since. Is this likely to get any movement?
I think I might have accidentally found out what the problem with bug 1245239 was.

Keep in mind that fixing bug 1245239 just means that things will crash (or otherwise fail) in a more friendly way.  The real fix is to use a larger /dev/shm.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #14)
> I think I might have accidentally found out what the problem with bug
> 1245239 was.
> 
> Keep in mind that fixing bug 1245239 just means that things will crash (or
> otherwise fail) in a more friendly way.  The real fix is to use a larger
> /dev/shm.

I see, understood - thanks for the follow up :)
Whiteboard: [gfx-noted]
Blocks: 1410363
See Also: → 1432333

This doesn't happen in any of the recent versions of Firefox.

Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME