Closed Bug 1357734 Opened 7 years ago Closed 7 years ago

Crash in mozilla::wr::DisplayListBuilder::PushBorder

Categories

(Core :: Graphics: WebRender, defect)

55 Branch
Unspecified
Windows 10
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 --- disabled
firefox56 --- fixed

People

(Reporter: marcia, Assigned: sotaro)

References

(Depends on 1 open bug)

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(5 files)

This bug was filed from the Socorro interface and is 
report bp-212d466d-cc37-405e-85ca-666070170419.
=============================================================

Seen while looking at nightly crash data - these crashes started using 20170418030220 with 26 crashes/5 installations: http://bit.ly/2pBAYsU

The App notes doesn't show the WR pref as being enabled on the reports I looked at.
(In reply to Marcia Knous [:marcia - use ni] from comment #0)
> The App notes doesn't show the WR pref as being enabled on the reports I
> looked at.

The WR annotation in the app notes only shows up in parent process crashes, I think. It seems that each process has independent app notes, and we only run the gfxPlatform code that creates the annotation in the parent process [1].

Also if crashes showed up in the Apr 18 nightly it's likely from something the graphics -> m-c merge [2] that we did the previous day.

[1] http://searchfox.org/mozilla-central/rev/214345204f1e7d97abb571b7992b6deedb5ff98f/gfx/thebes/gfxPlatform.cpp#714-716
[2] https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=2b6a66a98e25
Nothing that pushlog seems relevant to the crash stack, though :(
(In reply to Brian Carpenter [:geeknik] from comment #3)
> Created attachment 8859703 [details]
> firefox-debug.log
> 
> I'm seeing this crash after enabling WebRender in Nightly when I click
> around Yahoo, specifically https://www.yahoo.com/news/politics/:
> 
> https://crash-stats.mozilla.com/report/index/3c4a473f-179e-42cb-a81b-
> 5a1820170419
> https://crash-stats.mozilla.com/report/index/aa11e7a5-e7df-4edf-a1cb-
> 89b9b0170419
> https://crash-stats.mozilla.com/report/index/6d82c284-a965-4fa9-b4c1-
> 435340170419
> 
> Since it's easy to repro, I was able to get a log from windbg.

Sync with today's Nightly build, plus enabling WebRender, I can't reproduce this issue after surfing yahoo link many times. The platform I tried was on Windows 10.

Looked into the crash report, a null pointer deference happens in [1].  

[1]:https://searchfox.org/mozilla-central/rev/d8ac097a1de2544f0e51d186bc850662c5b7430a/gfx/webrender_bindings/WebRenderAPI.cpp#652 

From mini-dump, it seems that wr_dp_push_border didn't point to a null address. Based on the current information, I don't have any idea to think what is the root cause.
Yeah the crash here doesn't make much sense. We probably need to inspect the disassembly to see where the 0xffffffffffffffff is that's trying to read from.
[Tracking Requested - why for this release]:

This is the #4 Windows topcrash in Nightly 20170421030241.
Keywords: topcrash
Ethan, can you take a look to see if there's anything that jumps out at you here?
Flags: needinfo?(ethlin)
Whiteboard: [gfx-noted]
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> Ethan, can you take a look to see if there's anything that jumps out at you
> here?

Sorry, I couldn't reproduce the problem on my Mac or Windows10 devices. I also reviewed the code around the function and checked the minidump file, but I have no idea right now.
Flags: needinfo?(ethlin)
Tracking 55+ for this top crash.
I downloaded the build [1] for the crash report in comment 0 and ran dumpbin.exe on the xul.dll to get the disassembly. I found the disasm for the PushBorder function at offset 119ee18, just as predicted by the .sym file. I'm attaching the disassembly. The crash is supposedly at offset 0x52 into the function, so I marked that in the file.

Assuming the ip is incremented before executing the instruction, the crash happens on this instruction: 

movaps      xmm1,xmmword ptr [r8+30h]

which is an aligned 128-bit read from the address at r8+0x30. The raw dump from the crash report says r8 is 0x000000ff5b5fd230 which means this should be reading from 0x000000ff5b5fd260. And the crash report indicates that the instruction tried to an invalid read from 0xffffffffffffffff. I don't really know how that's possible.

[1] https://archive.mozilla.org/pub/firefox/nightly/2017/04/2017-04-18-03-02-20-mozilla-central/firefox-55.0a1.en-US.win64.zip
If we give user the nightly debug build, does it help?
Ah, the 0xffffffffffffffff is bogus, see bug 974420. Looking at the debug log that :geeknik attached we see this:

00007ffc`cc48555e 0f2800          movaps  xmm0,xmmword ptr [rax] ds:0000000f`73ffd108=3f8000003f41c1c33f41c1c33f41c1c3

Firstly it looks like that's the instruction after the one I pegged - it's reading from [rax] instead of [r8+30h]. And secondly the debugger provided the actual value of what that resolves to, and it's not 0xffffffffffffffff. So it's just a regular invalid read, probably of free'd memory. Considering the PushBorder function just delegates to wr_dp_push_border [1], the most likely thing that's happening here is that the DisplayListBuilder object has already been freed, and we're trying to read the mWrState from a dead object.

[1] http://searchfox.org/mozilla-central/rev/baf47b352e873d4516d7656186d6d7c7447d3873/gfx/webrender_bindings/WebRenderAPI.cpp#652
Looking at the call stack, the DisplayListBuilder is allocated on the stack just a few frames up, so there's no way it could have been freed. However maybe the stack is getting corrupted. Looking at the WR calls that happens just before, I see there are potentially a couple of calls to build clip regions at [1] which pass a rect and no mask. That trickles down to [2] where it passes nullptr for the complex clip region pointer and then to [3] which tries to make a slice out of (nullptr, 0). Guess what the slice::from_raw_parts documentation says is not allowed? "p must be non-null, even for zero-length slices." [4]. Bug!

Not really sure that this is the root cause of the observed crash, but it's something that needs fixing at least.

[1] http://searchfox.org/mozilla-central/rev/baf47b352e873d4516d7656186d6d7c7447d3873/layout/painting/nsCSSRenderingBorders.cpp#3563-3565
[2] http://searchfox.org/mozilla-central/rev/baf47b352e873d4516d7656186d6d7c7447d3873/gfx/webrender_bindings/WebRenderAPI.cpp#739
[3] http://searchfox.org/mozilla-central/rev/baf47b352e873d4516d7656186d6d7c7447d3873/gfx/webrender_bindings/src/bindings.rs#1119
[4] https://doc.rust-lang.org/std/slice/fn.from_raw_parts.html
This code has now changed so that an empty slice is now properly created from a nullptr[1]. I'm not sure if this fixed the crash or not.

[1] http://searchfox.org/mozilla-central/source/gfx/webrender_bindings/src/bindings.rs#81
Ah, I see that this was handled in the dependent bug 1359462. Sorry for the noise.
I don't see any crash reports matching this signature on a buildid more recent than 20170421030241. I'm going to assume it's fixed, although it was probably not fixed by bug 1359462 as that would only have showed up in the April 28 nightly. The crashes stopped a week before that.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
This signature appeared again in the nightly of 20170601030206, although
it's only a single installation.
I looked at the last 6 months of crash data for this signature and here are the build ids it shows up on:

Rank    BuildId         Count   Percent
3       20170418030220  47      21.08 %
1       20170419030223  87      39.01 %
2       20170421030241  67      30.04 %
5       20170524030204  7       3.14 %   <-- reappears in may 24
6       20170529030204  6       2.69 %
4       20170601030206  9       4.04 %

So it seems that after going away in april, it reappeared in 20170524030204. However looking at the "install time" of the may/june crash reports, I suspect all the may/june reports are from a single user. Which likely means it's something in their environment that's triggering this.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Kats, I see two different users with this crash in Nightly 20170609030207:

https://crash-stats.mozilla.com/report/index/246fc936-4490-4a60-a0b0-069570170610
https://crash-stats.mozilla.com/report/index/1682e345-231c-4815-b1db-1619a0170609

Does that change your conclusion?
Flags: needinfo?(bugmail)
Yes, I agree - this is probably a bug in our code. Problem is, I have no idea what it could be.
Flags: needinfo?(bugmail)
Try enabling webrendest on top of webrender. I'm seeing some crashes that only show up with both enabled.
20170629030206 @ Windows 10 x64 with (only) webrender enabled

The new newtab page has gotten some elements from the acticity stream project. I clicked on the top-right gear and wanted to switch from "blank page" to "most visited sites". Got browser crashes:

Meldungs-ID 	Sendedatum
bp-0c6fa79f-21bb-4934-888d-111310170629	29.06.2017	19:18
bp-d360b20a-39f4-4fc2-ba67-ed66d0170629	29.06.2017	19:18
bp-c05f6bf0-da97-4b29-b3dc-a812b0170629	29.06.2017	19:18
(In reply to Darkspirit from comment #22)
With browser.newtabpage.enabled;false at least the newtabpage is fine. But if a click on a bug id inside a crash report and want to login into bugzilla, crash. Switch tabs. Crash. If you can't reproduce it with this build, I would give you my whole config or test whatever debug binary you'd gave me.
Nightly 56 Build 20170629030206 @ Windows 10 x64 (Radeon RX 480)

STR
1. Edit Nightly's Desktop shortcut to "C:\Program Files\Nightly\firefox.exe" -P
2. create a fresh profile
3. about:config > set gfx.webrender.enabled;true
4. open a new tab. browser crash.

bp-404c3a25-3d6f-4555-a72a-34eb30170629	29.06.2017 21:37
bp-479c5e59-34d4-4828-8278-4a8e70170629	29.06.2017 21:37
bp-13b22d13-c567-4218-9ecd-7b73e0170629	29.06.2017 21:36
bp-553d9b71-f970-41f1-81fd-e7a090170629	29.06.2017 21:36
bp-838cb60d-9434-469e-bcf2-60e610170629	29.06.2017 21:36
bp-2a0cf12b-a7b9-4061-9cb8-d4f3f0170629	29.06.2017 21:35
I also saw the crash with nightly, but I did not see the crash with my local build.
On this bugzilla page I only get a crash by mouseover those slim "Status", "People", "Tracking" etc. bars at the top. And by visiting http://www.dict.cc I get immediately a crash.
Attached video 2017-06-29 23-17-01.mp4
(In reply to Darkspirit from comment #26)
> On this bugzilla page I only get a crash by mouseover those slim "Status", "People", "Tracking" etc. bars at the top. And by visiting http://www.dict.cc I get immediately a crash.

While making the first video of this, I discovered a visible bug regarding borders (and when I mouseover them, I get a crash).

1. about:config, enabled webrender
2. Shift+F2 restart
3. switched to this tab. Recording of this video starts.
4. You see borders around the bug's meta fields ("Status", "People", "Tracking" etc).
5. At 3s I scroll down, the borders disappear immediately (bug).
6. I scroll up, mouseover on of those borders: crash (bug).
7. Crash on opening dict.cc

Meldungs-ID 	Sendedatum
bp-b5ec7870-f8fd-468c-aac3-870260170629 29.06.2017 23:17
bp-b1bf98cc-4834-467e-85b4-ce9110170629 29.06.2017 23:17
bp-a589df7e-d08e-43a9-87ff-8bb420170629 29.06.2017 23:17
bp-f29fecc5-ddbe-41d1-bf18-005f80170629 29.06.2017 23:17
bp-b3c12979-8d32-4a74-adc5-455590170629 29.06.2017 23:17
I debugged on nightly with :jrmuizel.

At first, "Access violation reading location 0xFFFFFFFFFFFFFFFF" seemed to wrong suggestion.

The exception happened at "movaps      xmm0,xmmword ptr [rcx]" and rcx was 0x0000005cdf1fc928. Instead, the exception happened because rcx=0x0000005cdf1fc928 was not aligned to 16bit. movaps requests that operand is aligned to 16bit boundary.
   http://x86.renejeschke.de/html/file_module_x86_id_180.html
It is not clear why compiler uses movaps for a pointer. It might be caused by compiler's bug.
> The exception happened at "movaps      xmm0,xmmword ptr [rcx]" and rcx was 0x0000005cdf1fc928.

My local build uses "movups      xmm0,xmmword ptr [rcx] ". movups was used instead of movaps.
(In reply to Darkspirit from comment #24)
> Nightly 56 Build 20170629030206 @ Windows 10 x64 (Radeon RX 480)
Missed something: Enable webrendest, immediate crash. No escape. Had to edit prefs.js.
Attached video 2017-06-30 00-24-01.mp4
32 bit works.
Win32 Build 20170629030206
downloaded from http://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central-l10n/firefox-56.0a1.de.win32.zip
does not crash (even not with webrendest enabled), but shows us solid borders at the left and right side on bugzilla which is a bug.
(In reply to Darkspirit from comment #32)
> but shows us solid borders at the left and right side on bugzilla which is a bug.
I am too fast in submitting, sorry. Those borders are fine with webrendest disabled.

It's only about 32 bit (works) vs. 64 bit (crashes).
(In reply to Sotaro Ikeda [:sotaro] from comment #29)
> It is not clear why compiler uses movaps for a pointer. It might be caused
> by compiler's bug.

There was similar problem in Visual Studio 2013, but it seems already fixed.
https://connect.microsoft.com/VisualStudio/feedback/details/956733/misaligned-memory-access-with-sse-optimization-resulting-in-an-application-crash-visual-c
http://www.beta.microsoft.com/VisualStudio/feedbackdetail/view/766477/c-optimizer-generates-aligned-read-movaps-for-possibly-unaligned-pointer-to-member-function
Sorry. It seems to be clear for one longer moment, then I comment here, then it changes.

(In reply to Darkspirit from comment #27)
> 5. At 3s I scroll down, the borders disappear immediately (bug).
Maybe I should've said something like box shadow.

This visual bug behaves the same with only webrender enabled on Nightly win32 (no crash) and win64 (crash on mouseover).
(In reply to Sotaro Ikeda [:sotaro] from comment #30)
> > The exception happened at "movaps      xmm0,xmmword ptr [rcx]" and rcx was 0x0000005cdf1fc928.
> 
> My local build uses "movups      xmm0,xmmword ptr [rcx] ". movups was used
> instead of movaps.

My local build uses Visual Studio 2015 Update 3. Official nightly windows build seems also use Visual Studio 2015 Update 3. So, I am not sure what causes the difference.
> 
> (In reply to Darkspirit from comment #27)
> > 5. At 3s I scroll down, the borders disappear immediately (bug).
> Maybe I should've said something like box shadow.

It seems like Bug 1372083.
The crash seems to happen only on PGO build. When I downloaded x64 opt and x64 pgo from the following, the crash happened only on x64 pgo.
   https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&selectedJob=110898139
visual studio update 2 seemed to fix "PGO build generates movaps to unaligned address". But seems not addressed actually:(
 https://blogs.msdn.microsoft.com/vcblog/2016/03/31/visual-c-2015-update-2-bug-fixes/
By disabling PGO under webrender_bindings, the crash seemed to be addressed.
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=876da1cdf1f4dafecc644e03a6f0f44c4955786f
Default build for comparison.
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=888f6b684935e491451dbf0bbdee67219f15b564

Default win x64 pgo build caused the crash.
Modified argument of DisplayListBuilder::PushBorder() also bypassed the problem.
 https://treeherder.mozilla.org/#/jobs?repo=try&revision=4e7efc4f46d0fa10afd488afe021e27f4178993e
Attachment #8882656 - Flags: review?(jmuizelaar)
Attachment #8882656 - Flags: review?(jmuizelaar) → review+
Pushed by sikeda@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5954289fc962
Change argument of DisplayListBuilder::PushBorder() to bypass compiler bug r=jrmuizel
https://hg.mozilla.org/mozilla-central/rev/5954289fc962
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Last crash seen with 20170701030203.
Does not crash anymore with STR of comment 27 with Nightly 56 x64 20170702030204 @ Win10 x64 (Radeon RX480).
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: