Crash in je_free | swrast_dri.so@0x438a90

VERIFIED FIXED in Firefox 51

Status

()

defect
--
major
VERIFIED FIXED
3 years ago
3 years ago

People

(Reporter: mboldan, Assigned: jld)

Tracking

({crash, regression})

50 Branch
mozilla51
All
Linux
Points:
---

Firefox Tracking Flags

(e10s-, firefox47 unaffected, firefox48 unaffected, firefox49 unaffected, firefox50 disabled, firefox51 verified)

Details

(Whiteboard: sb+, crash signature)

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
This bug was filed from the Socorro interface and is report bp-57b785b8-0756-43de-9da7-40dce2160801.
=============================================================

[Note]:
This crash is reproducible only on Ubuntu, on the Firefox 50.0a1 build, and only with E10s enabled.

[Affected versions]:
Firefox 50.0a1 (2016-07-31)

[Affected platforms]:
Ubuntu 16.04 x64

[Steps to reproduce]:
1. Visit https://s3.amazonaws.com/mozilla-games/tmp/2015-08-28-emunittest_0.4-AngryBots-u5.1.3f1_hg-e1.34.6-release-prof/index.html?playback

[Expected result]:
The video plays correctly.

[Actual result]:
The tab crashes after the Unity page is loaded.

[Regression range]:
Last good revision: c676d55b6b006a2edb37c7c29c64e69f7cb8012a
First bad revision: 23140396a80eb27ff586c41fdc1cad62c875c9b1
Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c676d55b6b006a2edb37c7c29c64e69f7cb8012a&tochange=23140396a80eb27ff586c41fdc1cad62c875c9b1

It looks like the following bug contains the changes that introduced the regression:
https://bugzilla.mozilla.org/show_bug.cgi?id=742434
Note: apparently caused by turning on seccomp-bpf on Linux.
The crash appears to be in mozilla::WebGLContext::DrawElements(), apparently in mSymbols.fDrawRangeElements()
-> WebGL, but sandboxing is the more likely culprit

NI to various people
Component: Web Audio → Canvas: WebGL
Flags: needinfo?(milan)
Flags: needinfo?(julian.r.hector)
Crash reason is SIGILL, and the crash address isn't within a mapped file — jitcode, maybe?  If someone with permission to look at full minidumps could disassemble the code around %rip (I think that's in the minidump?) that might help understand what's going on.
Note that the je_free frame is from stack scanning, so it (and everything else there besides the anonymous code at %rip) could be from old stack frames that have since returned, or could be stack-spilled variables that happen to point into a text section, etc.  (This also means the crash signature could be completely meaningless.)

I tried reproducing this in a VM, but it doesn't crash for me.  It might depend on CPU model, or maybe GPU.

Updated

3 years ago
tracking-e10s: --- → ?
Let's deal with this before we start building on top of bug 742434 (I see a few bugs depending on it), in case we don't find a fix and need to back it out.
Component: Canvas: WebGL → Security: Process Sandboxing
Flags: needinfo?(milan)
Mihai - what's the machine configuration (especially GFX card/driver info and OS version)?  Can you attach the about:support data?
Flags: needinfo?(mihai.boldan)
I took a look at the minidump file (I had to convert it to a core file).

I couldn't disassemble the code around RIP (memory was not accessible from the core file). The violating RIP address (0x7f2aeada80a2) seems to be an anonymous r-x mapping.

> 7f2aeada3000-7f2aeada4000 r--s 00000000 08:07 278562    [censored]
> 7f2aeada7000-7f2aeadb0000 r-xp 00000000 00:00 0 
> 7f2aeadb0000-7f2aeadc4000 r--p 00000000 08:07 658617    [censored]

The backtrace isn't any good either; the return address points somewhere into the stack...
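The lookup described above (which mapping contains the faulting %rip) can be sketched in a few lines. This is only an illustration: the helper names `parse_maps_line` and `find_mapping` are made up here, and the input is assumed to be in the /proc/<pid>/maps style shown in the three quoted lines.

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


def parse_maps_line(line: str) -> Mapping:
    # Assumed layout: start-end perms offset dev inode [path]
    fields = line.split(None, 5)
    start_s, end_s = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(int(start_s, 16), int(end_s, 16), fields[1], path)


def find_mapping(maps_text: str, addr: int) -> Optional[Mapping]:
    # Return the first mapping whose [start, end) range contains addr.
    for line in maps_text.splitlines():
        line = line.strip()
        if not line:
            continue
        mapping = parse_maps_line(line)
        if mapping.start <= addr < mapping.end:
            return mapping
    return None


# The three mappings quoted in the comment above.
MAPS = (
    "7f2aeada3000-7f2aeada4000 r--s 00000000 08:07 278562    [censored]\n"
    "7f2aeada7000-7f2aeadb0000 r-xp 00000000 00:00 0\n"
    "7f2aeadb0000-7f2aeadc4000 r--p 00000000 08:07 658617    [censored]\n"
)

hit = find_mapping(MAPS, 0x7F2AEADA80A2)
print(hit)
```

An empty path together with r-xp permissions is what marks the region as anonymous executable memory, which is consistent with code generated at run time rather than code loaded from a file.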
Flags: needinfo?(julian.r.hector)
(Reporter)

Comment 9

3 years ago
(In reply to Randell Jesup [:jesup] from comment #7)
> Mihai - what's the machine configuration (especially GFX card/driver info
> and OS version)?  Can you attach the about:support data?

Here is the requested info:
OS: Ubuntu 16.04 x64
About support data: http://pastebin.com/GCAEYjNQ

product: Sky Lake Integrated Graphics [8086:1916]
vendor: Intel Corporation [8086]
bus info: pci@0000:00:02.0
version: 07
width: 64 bits
clock: 33MHz
capabilities:
	vga_controller,
	bus mastering,
	PCI capabilities listing,
	extension ROM
configuration:
	driver: i915_bpo
	latency: 0
resources:
	irq: 128
	memory: c1000000-c1ffffff
	memory: a0000000-afffffff
	ioport: 5000(size=64)
Flags: needinfo?(mihai.boldan)

Updated

3 years ago
Whiteboard: sb+
Are you still seeing this in the latest nightlies?
Flags: needinfo?(mihai.boldan)
I looked at the minidump with the "minidump_dump" tool, which prints all the metadata structures and hexdumps all the memory regions, and there is a region around the bad instruction.  I un-hexdump'ed it and ran the binary through this command (the offset is also from the minidump_dump output) to wrap it in ELF:

objcopy -B i386 -I binary -O elf64-x86-64 --adjust-vma=0x7f2aeada8022 --rename-section=.data=.text,code,contents bug1290896.bin bug1290896.o

And disassembled it, after finding a new enough version of binutils:

    7f2aeada8099:       40 18 f6                sbb    %sil,%sil
    7f2aeada809c:       81 e6 01 00 00 00       and    $0x1,%esi
*** 7f2aeada80a2:       c5 f8 92 c6             kmovw  %esi,%k0
    7f2aeada80a6:       c5 f9 92 ca             kmovb  %edx,%k1
    7f2aeada80aa:       c5 f4 45 c0             korw   %k0,%k1,%k0

kmov is an AVX-512 instruction.  And it's not JS jitcode (judging by a grep in js/src; also I think we allocate larger memory regions than that?) but I'm guessing it's llvmpipe jitcode (http://www.mesa3d.org/llvmpipe.html).
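For anyone repeating the un-hexdump step, a rough sketch follows. It assumes the hexdump has already been reduced to space-separated hex bytes (the exact minidump_dump output format is not reproduced here), and the helper names are made up. The base address is the --adjust-vma value from the objcopy command, and the sample bytes are the ones in the disassembly listing.

```python
REGION_BASE = 0x7F2AEADA8022  # --adjust-vma value from the objcopy command
FAULT_RIP = 0x7F2AEADA80A2    # address of the instruction marked with ***


def unhexdump(hex_text: str) -> bytes:
    # Assumes space-separated hex byte pairs; bytes.fromhex skips the spaces.
    return bytes.fromhex(hex_text)


def offset_in_region(addr: int, base: int) -> int:
    # Byte offset of addr inside a memory region that starts at base.
    if addr < base:
        raise ValueError("address is below the region base")
    return addr - base


# Bytes copied from the listing, starting at 0x7f2aeada8099.
LISTING_BASE = 0x7F2AEADA8099
listing = unhexdump(
    "40 18 f6 81 e6 01 00 00 00 "
    "c5 f8 92 c6 c5 f9 92 ca c5 f4 45 c0"
)

region_offset = offset_in_region(FAULT_RIP, REGION_BASE)
listing_offset = offset_in_region(FAULT_RIP, LISTING_BASE)
fault_bytes = listing[listing_offset:listing_offset + 4]

print(hex(region_offset), fault_bytes.hex(" "))
# To produce a file for objcopy, the full region's bytes would be written
# out, for example with pathlib.Path("bug1290896.bin").write_bytes(data).
```

The faulting instruction sits 0x80 bytes into the dumped region, and the four bytes at that address are the kmovw encoding shown in the listing.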

Which leaves me with some questions:

1. Does the CPU actually support AVX-512?  Apparently Skylake CPU models that are “Xeon” branded support it and others don't, so this would need the CPU model number.

2a. If it doesn't, then why is something deciding that it does, and how does it make that determination?

2b. If it *does*, then why is it not enabled when that instruction is executed?

3. Is this reproducible on the same / a similar model of CPU?
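One way to get at question 1 on the affected machine, sketched under assumptions: on x86 Linux, /proc/cpuinfo has a "flags" line, and the AVX-512 foundation feature shows up there as avx512f. The function names and the shortened sample flag lists below are hypothetical, not taken from the reporter's machine. Note that this only shows what the kernel reports; it says nothing about how LLVM reached its own conclusion.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    # Return the feature flags of the first CPU listed in cpuinfo-style text.
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512f(cpuinfo_text: str) -> bool:
    return "avx512f" in cpu_flags(cpuinfo_text)


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    # Linux-only; call this on the machine under test.
    with open(path) as f:
        return f.read()


# Hypothetical, heavily shortened samples for illustration only.
SAMPLE_WITHOUT = "model name\t: Example CPU\nflags\t\t: fpu sse2 avx avx2\n"
SAMPLE_WITH = "model name\t: Example CPU\nflags\t\t: fpu sse2 avx avx2 avx512f\n"

print(has_avx512f(SAMPLE_WITHOUT), has_avx512f(SAMPLE_WITH))
```

On the real machine the call would be has_avx512f(read_cpuinfo()).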

Possible things to troubleshoot (I can post patches for the ones that need code changes if need be):

* Trying the same build with content sandboxing disabled (MOZ_DISABLE_CONTENT_SANDBOX in the environment, or setting the pref security.sandbox.content.level to 0 and restarting).

* The only syscalls I'm seeing where the policy says to return an error instead of allowing or crashing are readlink/readlinkat.  Changing those to Allow() could be informative, especially if the answer to question 1 is “no” and it's a case of broken CPU feature probing. 

* Replacing the *entire* sandbox policy with Allow() and seeing if that's different from no seccomp-bpf.  (And, if it is, trying with a handwritten BPF program that really accepts everything, instead of having the architecture check prologue that the Chromium seccomp-bpf compiler inserts.)  If the answer to question 1 is “yes”, I'm wondering if there might be a bug in syscall handling that gets the FPU state wrong somehow in the slower path used by seccomp-bpf, or something like that.
(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #11)
> 1. Does the CPU actually support AVX-512?  Apparently Skylake CPUs models
> that are ”Xeon” branded support it and others don't, so this would need the
> CPU model number.

I think I misread something there — it looks like AVX-512 was maybe planned for Skylake at some point but didn't happen, and actual shipping hardware with AVX-512 hasn't quite happened yet.

Which leaves me confused about how LLVM would misread the CPU features — I've found what looks like the code for it, and it's doing the usual kind of thing with CPUID, which shouldn't be affected by anything to do with system calls.
(Reporter)

Comment 13

3 years ago
(In reply to Jim Mathies [:jimm] from comment #10)
> are you still seeing this in latest nightlies?

The issue is still reproducible on Firefox 51.0a1 (2016-08-04) on Ubuntu 16.04 x64.
Note that the internet connection is through Wi-Fi.
Flags: needinfo?(mihai.boldan)

Comment 14

3 years ago
This is a bug in LLVM 3.8 -- https://bugs.archlinux.org/task/49518. LLVM thought Skylake chips did support AVX-512, cf. https://github.com/llvm-mirror/llvm/blob/release_38/lib/Target/X86/X86.td#L487-L523.
So why would sandboxing trigger this bug?

My guess would be that we're blocking something that causes the real graphics driver to fail to load, and the user gets llvmpipe as a fallback, which then ends up crashing.

Unfortunately I don't know an easy way to see which renderer is backing MESA. Here's what we can try:
Run pmap <pid-of-the-main-firefox-process> [1] and pastebin the output. Do this on the revision that works and on the current ones that are failing. Maybe we can see whether the renderer switches.

[1] I would've expected to write content process here, but on my system I only see the gfx driver loaded into chrome.
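Comparing the two pmap outputs by eye is error-prone, so a small filter can help. This is a sketch under assumptions: it assumes the DRI driver libraries follow the *_dri.so naming seen in this bug's crash signature (swrast_dri.so), and the sample pmap lines and the i965_dri.so name are invented for illustration, not taken from the reporter's pastebins.

```python
import re

# A library name ending in _dri.so at the end of a pmap or maps line.
DRI_LIB = re.compile(r"([^/\s]+_dri\.so)\s*$")


def dri_drivers(map_text: str) -> set:
    # Collect the distinct *_dri.so library names mentioned in the text.
    names = set()
    for line in map_text.splitlines():
        match = DRI_LIB.search(line)
        if match:
            names.add(match.group(1))
    return names


# Invented sample lines in pmap's default "address size mode mapping" layout.
GOOD_RUN = (
    "00007f10a0000000   8192K r-x-- i965_dri.so\n"
    "00007f10a2000000    132K r-x-- libc-2.23.so\n"
)
BAD_RUN = (
    "00007f2ae0000000   5120K r-x-- swrast_dri.so\n"
    "00007f2ae2000000    132K r-x-- libc-2.23.so\n"
)

print(dri_drivers(GOOD_RUN), dri_drivers(BAD_RUN))
```

If the failing revision shows only swrast_dri.so where the working one shows a hardware driver, that would support the fallback theory. An empty set from both runs means the driver is not mapped in the inspected process at all.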
Flags: needinfo?(mihai.boldan)
(Reporter)

Comment 16

3 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #15)
> So why would sandboxing trigger this bug?
> 
> My guess would be that we're blocking something that causes the real
> graphics driver to fail to load, and the user gets llvmpipe as a fallback,
> which then ends up crashing.
> 
> Unfortunately I don't know an easy way to see which renderer is backing
> MESA. Here's what we can try:
> do a pmap <pid-of-the-main-firefox-process> [1] and pastebin it. Do this on
> the revision that works and the current ones that are failing. Maybe we can
> see if the renderer switches.
> 
> [1] I would've expected to write content process here, but on my system I
> only see the gfx driver loaded into chrome.

Here are the pmap results:
- Results from latest Nightly - http://pastebin.com/1ULayQ8F
- Results from Firefox 48.0a1 (2016-03-21) - http://pastebin.com/tw5U4fsL
Flags: needinfo?(mihai.boldan)
Unfortunately I don't see any gfx driver libs in there. Might be reproducible with a VM on a Skylake machine.
(Reporter)

Comment 20

3 years ago
(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #19)
> Can you try to reproduce the bug with this Try build?
> https://archive.mozilla.org/pub/firefox/try-builds/jedavis@mozilla.com-12522edae0a88a52800d2e7c8504d31709214581/try-linux64/firefox-51.0a1.en-US.linux-x86_64.tar.bz2
> 
> I changed the sandbox rules that I suspect might be involved:
> https://hg.mozilla.org/try/rev/12522edae0a88a52800d2e7c8504d31709214581

The crash is no longer reproducible on this build. 
The tests were performed on the same laptop with Ubuntu 16.04 x64 OS.
Flags: needinfo?(mihai.boldan)
Assignee: nobody → jld
Attachment #8779507 - Flags: review?(gpascutto)
Attachment #8779507 - Flags: review?(gpascutto) → review+
Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=12522edae0a88a52800d2e7c8504d31709214581

There's some orange and a bunch of tests that still haven't started, but so far nothing that looks like my fault.

(Self-ni? for uplift request when landed and stable.)
Flags: needinfo?(jld)
Keywords: checkin-needed
(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #22)
> (Self-ni? for uplift request when landed and stable.)

Ignore that; seccomp-bpf being on by default is Nightly-only.
Flags: needinfo?(jld)

Comment 24

3 years ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f416db46e66e
Allow readlink() in desktop Linux content processes. r=gps
Keywords: checkin-needed

Comment 25

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/f416db46e66e
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
(Reporter)

Comment 26

3 years ago
The crash is no longer reproducible on Firefox 51.0a1 (2016-08-17), using STR from Comment 0.
The tests were performed under Ubuntu 16.04 x64, Mac OS X 10.11.1, and Windows 10 x64.
I am marking this issue Verified Fixed.
Status: RESOLVED → VERIFIED
I was able to reproduce this issue again on the latest Nightly 52.0a1 (2016-10-19), using the STR from comment 0. It is also reproducible by playing one of the WebGL Samples (e.g. "Aquarium") from here: http://webglsamples.org/
Note: I used the same machine and OS mentioned by Mihai in comment 9. 

See Crash Signature: https://crash-stats.mozilla.com/report/index/81282b10-329c-4b50-8a4e-fb9182161020

-- I ran mozregression to find a regression range, and these are the results (although I'm not sure what regressed this):
 Last good revision: da986c9f1f723af1e0c44f4ccd4cddd5fb6084e8
 First bad revision: d8e1f5cf0a70a53e8a5532809096a0a5bf729196
 Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=da986c9f1f723af1e0c44f4ccd4cddd5fb6084e8&tochange=d8e1f5cf0a70a53e8a5532809096a0a5bf729196
Given that we're talking about Ubuntu, perhaps the sandboxing work and preference change from bug 1289718?
Flags: needinfo?(gpascutto)
Yes, please file a new bug. The Intel driver is refusing to load and it's falling back to a broken MESA LLVM.
Flags: needinfo?(gpascutto)