Closed
Bug 1274450
Opened 9 years ago
Closed 8 years ago
Win PGO tests are red with crash in nsLayoutUtils::DrawString
Categories
(Firefox Build System :: General, defect)
Firefox Build System
General
Tracking
(firefox49 fixed)
RESOLVED
FIXED
mozilla49
Tracking | Status | |
---|---|---|
firefox49 | --- | fixed |
People
(Reporter: dholbert, Assigned: dbaron)
Details
Attachments
(1 file)
2.04 KB,
patch
|
sfink
:
review+
|
Details | Diff | Splinter Review |
Mozilla-inbound is currently closed, due to Win PGO crashtests & reftests crashing in nsLayoutUtils::DrawString.
Here's the current bad treeherder run:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f186693ee23aa8444edc705ebce11667a8a0c1c1
Here's the last treeherder run that had a Windows PGO build (which had green crashtest/reftest runs):
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a32fdbf877e09609cf4894cf6fa27711795839e2
So, that yields this "blame" pushlog for whatever introduced this crash:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a32fdbf877e0&tochange=f186693ee23a
Reporter | ||
Comment 1•9 years ago
|
||
From IRC:
<KWierso> these same win32 pgo failures popped up for a few runs yesterday and then mysteriously went away after some unrelated backouts
<KWierso> none of those unrelated backouts have re-landed
<KWierso> now the failure's back
<KWierso> wheeeee
Reporter | ||
Comment 2•9 years ago
|
||
So, comment 1 may mean my "blame" pushlog link is bogus; or, maybe it's "real" but only tangentially related (e.g. codesize is tripping over some threshold, or probably something more bizarrely side-effect-ish like that)...
So, the crash has FrameLayerBuilder on the stack, which bug 1271432 touched.
DrawSTring uses nsIFrame, which bug 1271392 touched (along with other nsLayoutUtils things).
Any chance either of those bugs could have caused this?
These WinPGO failures popped up for a bit yesterday then went away after some unrelated bugs got backed out.
I'm guessing the profiling is upsetting something occasionally.
Flags: needinfo?(mats)
Flags: needinfo?(bugmail.mozilla)
Comment 4•9 years ago
|
||
I don't think bug 1271432 could have caused this, (except of course tangentially as dholbert mentioned in comment 2). However Ralph's patches for rust compilation in that pushlog target win32 platforms, and those are exactly the platforms affected, so that seems suspicious.
Flags: needinfo?(bugmail.mozilla) → needinfo?(giles)
Reporter | ||
Comment 5•9 years ago
|
||
For reference, the first treeherder runs from yesterday with this same crash was:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=7fb6f9a2d7fee76746c73fa5ba1a5c091073e199
The previous treeherder run had a PGO build as well, and did not have this issue. So superficially, this looks like (in yesterday's case) this was a regression from bug 1141468, which was some changes to layers code (which were later backed out, which maybe (?) is what made this go back to being OK).
We also had some changes to layers code today, in the "blame" range of my pushlog from comment 0 -- bug 1272398. I wonder if that might be the proximal cause of this, for today's instance? (I'm assuming there's some more subtle root cause, but it seems at the moment like "changes to layers code can trigger this" is a decent enough theory...)
Reporter | ||
Comment 6•9 years ago
|
||
Also: I downloaded and ran the "bad" PGO build from my first TreeHerder link in comment 0...
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win32-pgo/1463689833/firefox-49.0a1.en-US.win32.zip
It is indeed *SUPER* crashy, particularly with XUL stuff. I insta-crash if I do any of the following, in a fresh profile with this pgo build:
- visit about:config and click the "I accept the risk" button. CRASH
- Open the bookmark/history manager (Bookmarks | Show all Bookmarks) CRASH
- View the search pane of the Firefox Options UI (Tools|Options, click "Search") CRASH
Here are some crash reports from these crashes:
bp-9c8effca-829a-4fea-b5e6-374f02160520
bp-869b76d0-059e-4741-bf42-56fbf2160520
bp-9f83256b-e276-4372-a336-38d0d2160520
Unfortunately, the crash reports don't have symbols (because I don't have an official build), and I can't attach a debugger because MSVC Community Edition is spamming a "30 day trial expired" modal dialog at me, which only has options for "sign-in" (which is broken) and "Quit". (This is an indication of how infrequently I develop/debug under Windows.) Hopefully someone who has a working Windows+MSVC installation can download the above-linked PGO build and attach a debugger and see if there's any info they can grab from the crash...?
Assignee | ||
Comment 7•9 years ago
|
||
I got closer, but I can't figure out how to get the MSVC debugger to use the downloaded symbols from the file next to the build. (I tried both adding a directory with them to the symbol path, directly loading the file via "Load symbols" on the stack trace, and various things with renaming/reorganizing the structure from unzipping the file.)
Reporter | ||
Comment 8•9 years ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #5)
> We also had some changes to layers code today, in the "blame" range of my
> pushlog from comment 0 -- bug 1272398. I wonder if that might be the
> proximal cause of this, for today's instance? (I'm assuming there's some
> more subtle root cause, but it seems at the moment like "changes to layers
> code can trigger this" is a decent enough theory...)
I backed out bug 1272398 to test this ^^ theory.
(Meanwhile, there's an in-progress PGO build on an intermediate cset (f70b8561b479) which hasn't yet completed; it's possible that this PGO build will have green test results, and my backout will prove to be needless. We'll find out in a while, I guess.)
Assignee | ||
Comment 9•9 years ago
|
||
That said, we're crashing pretty early in DrawString:
62737C6F 57 push edi
62737C70 8B FA mov edi,edx
62737C72 85 F6 test esi,esi
62737C74 75 03 jne 62737C79
62737C76 8B 70 18 mov esi,dword ptr [eax+18h]
62737C79 56 push esi
62737C7A 8D 4D 17 lea ecx,[ebp+17h]
62737C7D E8 0F 50 B5 FE call 6128CC91
62737C82 8B CE mov ecx,esi
62737C84 0F B6 00 movzx eax,byte ptr [eax]
62737C87 83 E0 01 and eax,1
62737C8A 88 47 55 mov byte ptr [edi+55h],al <== CRASH HERE
62737C8D E8 9E 1A B8 FE call 612B9730
with registers:
EAX = 00000000 EBX = 80004005 ECX = 196B0DE8 EDX = 1951F648 ESI = 196B0DE8 EDI = 00000000 EIP = 62737C8A ESP = 0018D9E8 EBP = 0018D9F8 EFL = 00010244
C72 and C74 are the null check of aStyleContext (which is in ESI)
C76 is the aStyleContext = aFrame->StyleContext(), which says aFrame is in EAX
C79 is passing the aStyleContext to the WritingMode constructor
C7A is passing this to the WritingMode constructor in ECX
C7D is calling the WritingMode constructor
C82 puts aStyleContext in ECX as well
C84 dereferences something resulting from the WritingMode constructor, I guess
C87 is testing the bit for the IsVertical test
C8A is thus presumably storing to nsFontMetrics::mVertical. (I didn't check the offset, but 0x55 makes sense.)
So it looks like the problem here is that we're calling DrawString with aFontMetrics being null.
Reporter | ||
Comment 10•9 years ago
|
||
(The in-progress PGO build hinted at in comment 8 ended up busted as well. Sadly, there is no PGO build scheduled on my backout-push, and I am not sure how to add one. [the "backfill" treeherder UI seems to be broken for me, per bug 1259494] Hopefully a sheriff or someone who can successfully trigger arbitrary TreeHerder jobs can add PGO coverage for my backout push, and perhaps other intermediate pushes as well.)
(In reply to Daniel Holbert [:dholbert] from comment #10)
> (The in-progress PGO build hinted at in comment 8 ended up busted as well.
> Sadly, there is no PGO build scheduled on my backout-push, and I am not sure
> how to add one. [the "backfill" treeherder UI seems to be broken for me, per
> bug 1259494] Hopefully a sheriff or someone who can successfully trigger
> arbitrary TreeHerder jobs can add PGO coverage for my backout push, and
> perhaps other intermediate pushes as well.)
'v' menu button in the desired push's header, BuildAPI.
Copy your push's revision from the address bar, scroll to the bottom of the BuildAPI page, and pasted it into the textbox for triggering PGO builds, click the button to trigger them.
Comment 12•9 years ago
|
||
With the fastcall convention, ecx and edx are used to pass the first two parameters of a function, so it looks like edx at the start of the function stores aFontMetrics, and it is then copied to edi, so edi should refer to aFontMetrics. Hence it seems like yes we are calling this function with a null aFontMetrics.
Comment 13•9 years ago
|
||
inbound and fx-team are closed for this issue. The issue popped up on fx-team after a green pgo run and without a merge from inbound: https://treeherder.mozilla.org/#/jobs?repo=fx-team&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&autoclassify&fromchange=c62705a833ff9c985c71ddf4c07054de113525c3&tochange=00ed91d9c2c83fec8c61565ef1f73ed7a7df724a&filter-searchStr=windows%20pgo
Severity: normal → blocker
Comment 14•9 years ago
|
||
Now that the issue suddenly showed up on fx-team, is that a caching or a pgo issue (a .h file got changed in the last push to fx-team)? Needinfoing ted for the latter part.
Component: Layout → Buildduty
Flags: needinfo?(ted)
Product: Core → Release Engineering
QA Contact: bugspam.Callek
Version: Trunk → unspecified
Comment 15•9 years ago
|
||
John, could this related to your changes yesterday? This bug keeps the trees currently closed, so please prioritize an answer.
Flags: needinfo?(jhford)
Comment 16•9 years ago
|
||
I'm not sure how to read the logs posted here. What was deployed yesterday is essentially a proxy which redirects artifacts being downloaded from the queue. If a request to retreive an artifact is received by the queue from ec2 in us-west-1 or us-east-1, the queue redirects to cloud-mirror.taskcluster.net/v1/redirect/s3/:region/:url. Requests received in us-west-2 are served directly from in-region S3 and requests from outside us-west-1, us-west-2 and us-east-1 are redirected to cloudfront.
When redirected to cloud mirror, cloud-mirror will either:
1. copy an s3 object into us-west-1/us-east-1 if it's never been requested in that region before then 302 redirect to it
2. if the object is already in the requested region, will 302 redirect to it
3. if neither of these operations is successful, it will 302 redirect to the original S3 url.
It looks like the host that's having trouble here is an IX machine. Those are physical boxes and are not located in a region that Cloud Mirror would be related to. Further, cloud-mirror is only involved in downloading of artifacts. It looks like the tests are running, which suggests that they are able to download the file.
I'm not seeing anything on my end that would suggest that this is related. Can you reply with the specific failure and I could give you a much better answer here.
Dustin, if this is what we spoke about last night regarding the weird error messages, can you please clarify here?
Flags: needinfo?(jhford) → needinfo?(dustin)
Comment 17•9 years ago
|
||
Here's an example of what cloud-mirror is:
http://cloud-mirror.taskcluster.net/v1/redirect/s3/us-west-1/https%3A%2F%2Fwww.mozilla.org%2Fmedia%2Fimg%2Fhome%2Fvoices%2Fmosaic-background.565019b6e6bc.jpg
Comment 18•9 years ago
|
||
(In reply to John Ford [:jhford] from comment #16)
> It looks like the host that's having trouble here is an IX machine.
Because either many or almost none (= normal level) tests fail, it's more likely an issue with the builder. E.g. e: b-2008-spot-102 produced the latest build with the bunch on failures on fx-team: https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=00ed91d9c2c83fec8c61565ef1f73ed7a7df724a
The 'broken' build's application.ini contained:
>SourceRepository=https://hg.mozilla.org/integration/fx-team
>SourceStamp=00ed91d9c2c83fec8c61565ef1f73ed7a7df724a
The parent build's application.ini contained:
>SourceRepository=https://hg.mozilla.org/integration/mozilla-inbound
>SourceStamp=c62705a833ff9c985c71ddf4c07054de113525c3
As RyanVM noticed, the first Windows XP pgo build directory linked from https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=00ed91d9c2c83fec8c61565ef1f73ed7a7df724a has only a log file.
Comment 19•9 years ago
|
||
Bug 1271392 only touched code that execute if you use display:grid, with one exception:
in part 3 I messed with the [Excess]OverflowContainers frame child lists in
nsContainerFrame::DrainExcessOverflowContainersList which affects all frame types that
can fragment. It seems unlikely for that to lead to super-crashy builds though,
since those child lists are only used in rare edge cases.
Flags: needinfo?(mats)
Comment 20•9 years ago
|
||
On IRC, we confirmed that all failures were occuring in us-west-2. Cloud-mirror is completely absent from the codepath in us-west-2, so we can effectively rule it out as a culprit.
Updated•9 years ago
|
Flags: needinfo?(dustin)
Comment 21•9 years ago
|
||
I'm not sure how this could be due to the build infrastructure. Not sure where to re-file this though.
Component: Buildduty → Build Config
Product: Release Engineering → Firefox
QA Contact: bugspam.Callek
Reporter | ||
Updated•9 years ago
|
Flags: needinfo?(giles)
Assignee | ||
Comment 22•9 years ago
|
||
This seems somewhat like PGO just blowing up due to hitting some sort of size limit.
Does anybody know how to get the pdb files, or convert the sym files into pdb files, so that we can debug it?
Assignee | ||
Comment 23•8 years ago
|
||
10:47:50 <ted> Aryx, dbaron: you can't convert sym files into pdb files, but we could do a try push with http://hg.mozilla.org/users/tmielczarek_mozilla.com/mq/raw-file/tip/enable-full-symbols
10:47:54 <ted> which gives you pdb files in the symbols.zip
10:48:14 <ted> (we also have a bug on just making that the default now that disk space for storing builds isn't at a premium)
10:48:26 <ted> https://bugzilla.mozilla.org/show_bug.cgi?id=1253070
Comment 24•8 years ago
|
||
I'm not sure if this is related, but I can basically perma startup crash with Nightly on 5-20-2016 and 05-19-2016 worked fine.
Crash report: https://crash-stats.mozilla.com/report/index/d4f23c48-ccc2-4e14-9e64-b9e412160520
Manually building the m-c revisions locally for 05/19 and 05/20 nightly both do not reproduce the issue for me, but only on Windows 7. Windows 10 seems to work.
STR:
1) Go to about:config
2) Set the preference "gfx.content.azure.backends" to "skia"
3) Restart nightly and permacrash.
Comment 25•8 years ago
|
||
>Crash report: https://crash-stats.mozilla.com/report/index/d4f23c48-ccc2-4e14-9e64-b9e412160520
After looking at the stacktrace: Fallout from bug 1274145 / https://hg.mozilla.org/integration/mozilla-inbound/rev/9107ca27ed84 ?
Comment 26•8 years ago
|
||
I'm doing a PGO build on my office desktop. Since dbaron works at the desk next to mine, perhaps he'll find things easier to debug with a full dev environment...
Reporter | ||
Comment 27•8 years ago
|
||
I suspect comment 24 is independent of this bug, & deserves its own bug for a more thorough investigation & potential backouts/fixes.
Flags: needinfo?(mchang)
Comment 28•8 years ago
|
||
The earlier occurrence of this issue mentioned in comment 3 seem to be these 3 pushes: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&filter-searchStr=windows%20pgo&fromchange=7fb6f9a2d7fee76746c73fa5ba1a5c091073e199&tochange=017017aeacf14642ea1406468eb045fa5f1b8310
Assignee | ||
Comment 29•8 years ago
|
||
So I debugged a bit of what's happening on gps's machine.
This really seems like some sort of miscompilation in nsTreeBodyFrame::PaintText.
In particular, when fontMet is initialized, it ends up both in ESI and saved to the stack at ESP+34h (at instruction 52198160 in gps's build). The initial stuff that uses fontMet operates off the copy in ESI. However, a single instruction after saving ESI to the stack, the instruction immediately following (52198164) writes a null to the same address by writing to EAX.
Then, later in the function, it restores fontMet from ESP+34h, which is now null because of the instruction that wrote to EAX immediately after saving it to the stack.
In particular, at the time we're saving fontMet to the stack, we have:
EAX = 0018D904 EBX = 00000000 ECX = 00000000 EDX = 00000000 ESI = 0BDD2700
EDI = 0018D8FC EIP = 52198129 ESP = 0018D8D0 EBP = 1704EEF0 EFL = 00000200
and the code is:
textRect.Deflate(textMargin);
52198118 lea eax,[esp+40h]
5219811C push eax
5219811D lea ecx,[esp+20h]
52198121 call
mozilla::gfx::BaseRect<int,nsRect,nsPoint,nsSize,nsMargin>::Deflate
(50C49E36h)
52198126 xorps xmm0,xmm0
// Adjust the rect for its border and padding.
nsMargin bp(0,0,0,0);
GetBorderPadding(textContext, bp);
52198129 lea edx,[esp+60h]
5219812D mov ecx,esi
5219812F movups xmmword ptr [esp+60h],xmm0
52198134 call GetBorderPadding (52191294h)
textRect.Deflate(bp);
52198139 lea eax,[esp+60h]
5219813D push eax
5219813E lea ecx,[esp+20h]
52198142 call
mozilla::gfx::BaseRect<int,nsRect,nsPoint,nsSize,nsMargin>::Deflate
(50C49E36h)
// Compute our text size.
RefPtr<nsFontMetrics> fontMet =
nsLayoutUtils::GetFontMetricsForStyleContext(textContext);
52198147 movss xmm2,dword ptr [__real@3f800000 (52B8F07Ch)]
5219814F lea ecx,[esp+34h]
52198153 push ebx
52198154 mov edx,esi
52198156 call nsLayoutUtils::GetFontMetricsForStyleContext
(50CF5055h)
5219815B pop ecx
5219815C mov esi,dword ptr [eax]
nscoord height = fontMet->MaxHeight();
5219815E mov ecx,esi
// Compute our text size.
RefPtr<nsFontMetrics> fontMet =
nsLayoutUtils::GetFontMetricsForStyleContext(textContext);
52198160 mov dword ptr [esp+34h],esi
52198164 mov dword ptr [eax],ebx
nscoord height = fontMet->MaxHeight();
52198166 call nsFontMetrics::MaxHeight (50F63210h)
nscoord baseline = fontMet->MaxAscent();
5219816B mov ecx,esi
5219816D mov edi,eax
5219816F call nsFontMetrics::MaxAscent (50C49C3Bh)
52198174 mov dword ptr [esp+5Ch],eax
I still haven't worked out what this instruction at 164 is supposed to be doing.
Assignee | ||
Comment 30•8 years ago
|
||
So I think that assignment to 0 (I think ebx is basically a 0 register there, used both for passing aVariantWidth to GetFontMetricsForStyleContext and for this) is essentially the null-assignment that happens in the already_AddRefed::take() call from RefPtr<T>::operator=(already_AddRefed<I>&), since it looks like EAX is part of the return-a-struct calling convention for GetFontMetricsForStyleContext. (It's rather disappointing that already_AddRefed return values work that way instead of more efficiently, though!) So it seems like the compiler has decided to:
* use the same stack storage space for the already_AddRefed result of GetFontMetricsForStyleContext and for fontMet (which are actually the same thing)
* then not optimize away the store-into-fontMet and the nulling-out of mRawPtr in already_AddRefed<T>::take(), and, worse, do them in the wrong order
Assignee | ||
Comment 31•8 years ago
|
||
MozReview-Commit-ID: GL7gz3vuvAD
Attachment #8755028 -
Flags: review?(sphink)
Updated•8 years ago
|
Attachment #8755028 -
Flags: review?(sphink) → review+
Comment 33•8 years ago
|
||
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4)
> However Ralph's patches
> for rust compilation in that pushlog target win32 platforms, and those are
> exactly the platforms affected, so that seems suspicious.
Those changes would only be invoked in mp4 file playback, which seems an unlikely source of startup crashes. If it's anything I've done it's setting -arch:SSE2 (the MSVC default!) from bug 1271794. However, that landed earlier and made it into central, which seems fine.
Comment 34•8 years ago
|
||
I performed a PGO build with 30b83fcb0009 and about:config didn't crash like it did before \o/
Reporter | ||
Updated•8 years ago
|
Flags: needinfo?(ted)
Flags: needinfo?(mchang)
Reporter | ||
Comment 35•8 years ago
|
||
[sorry, meant to leave open ni for mchang per comment 27]
Flags: needinfo?(mchang)
Assignee | ||
Comment 36•8 years ago
|
||
Hopeful fix landed on mozilla-central in: https://hg.mozilla.org/mozilla-central/rev/30b83fcb0009b524cf3982a8aeb2bc0fc6d2d5b7
RyanVM also points out that https://hg.mozilla.org/mozilla-central/rev/154b951082e387e015c8f553ed93124b10a25d72 is a plausible cause of this.
Assignee | ||
Comment 37•8 years ago
|
||
Also, "seems fine" isn't necessarily a great indication, since it seems to have been intermittent.
I've switched inbound and fx-team to approval-required just to switch things up a bit. Not sure if I should just reopen things on the assumption that that patch fixed things, but I don't know what else to do other than just wait for it to intermittently happen again.
I'll probably reopen things before I head out tonight assuming nothing breaks in the meantime.
Inbound and fx-team reopened at 5pm Pacific since I don't think holding things closed is going to prove anything. Lets reopen this if the pgo failures pop up again.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 49
Comment 40•8 years ago
|
||
(In reply to Sebastian H. [:aryx][:archaeopteryx] from comment #25)
> >Crash report: https://crash-stats.mozilla.com/report/index/d4f23c48-ccc2-4e14-9e64-b9e412160520
> After looking at the stacktrace: Fallout from bug 1274145 /
> https://hg.mozilla.org/integration/mozilla-inbound/rev/9107ca27ed84 ?
Nope, today's nightly still permacrashing :(.
Flags: needinfo?(mchang)
Reporter | ||
Comment 41•8 years ago
|
||
Mason, let's get your permacrash tracked in its own bug - as I noted in comment 27, I suspect it's distinct from this bug here.
Comment 42•8 years ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #41)
> Mason, let's get your permacrash tracked in its own bug - as I noted in
> comment 27, I suspect it's distinct from this bug here.
Filed as bug 1275030.
Updated•7 years ago
|
Assignee: nobody → dbaron
Updated•6 years ago
|
Component: Build Config → General
Product: Firefox → Firefox Build System
Updated•6 years ago
|
Target Milestone: Firefox 49 → mozilla49
You need to log in
before you can comment on or make changes to this bug.
Description
•