Closed Bug 1274450 Opened 9 years ago Closed 8 years ago

Win PGO tests are red with crash in nsLayoutUtils::DrawString

Categories

(Firefox Build System :: General, defect)

defect
Not set
blocker

Tracking

(firefox49 fixed)

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: dholbert, Assigned: dbaron)

Details

Attachments

(1 file)

Mozilla-inbound is currently closed, due to Win PGO crashtests & reftests crashing in nsLayoutUtils::DrawString. Here's the current bad treeherder run: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f186693ee23aa8444edc705ebce11667a8a0c1c1 Here's the last treeherder run that had a Windows PGO build (which had green crashtest/reftest runs): https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a32fdbf877e09609cf4894cf6fa27711795839e2 So, that yields this "blame" pushlog for whatever introduced this crash: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a32fdbf877e0&tochange=f186693ee23a
From IRC: <KWierso> these same win32 pgo failures popped up for a few runs yesterday and then mysteriously went away after some unrelated backouts <KWierso> none of those unrelated backouts have re-landed <KWierso> now the failure's back <KWierso> wheeeee
So, comment 1 may mean my "blame" pushlog link is bogus; or, maybe it's "real" but only tangentially related (e.g. codesize is tripping over some threshold, or probably something more bizarrely side-effect-ish like that)...
So, the crash has FrameLayerBuilder on the stack, which bug 1271432 touched. DrawSTring uses nsIFrame, which bug 1271392 touched (along with other nsLayoutUtils things). Any chance either of those bugs could have caused this? These WinPGO failures popped up for a bit yesterday then went away after some unrelated bugs got backed out. I'm guessing the profiling is upsetting something occasionally.
Flags: needinfo?(mats)
Flags: needinfo?(bugmail.mozilla)
I don't think bug 1271432 could have caused this, (except of course tangentially as dholbert mentioned in comment 2). However Ralph's patches for rust compilation in that pushlog target win32 platforms, and those are exactly the platforms affected, so that seems suspicious.
Flags: needinfo?(bugmail.mozilla) → needinfo?(giles)
For reference, the first treeherder runs from yesterday with this same crash was: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=7fb6f9a2d7fee76746c73fa5ba1a5c091073e199 The previous treeherder run had a PGO build as well, and did not have this issue. So superficially, this looks like (in yesterday's case) this was a regression from bug 1141468, which was some changes to layers code (which were later backed out, which maybe (?) is what made this go back to being OK). We also had some changes to layers code today, in the "blame" range of my pushlog from comment 0 -- bug 1272398. I wonder if that might be the proximal cause of this, for today's instance? (I'm assuming there's some more subtle root cause, but it seems at the moment like "changes to layers code can trigger this" is a decent enough theory...)
Also: I downloaded and ran the "bad" PGO build from my first TreeHerder link in comment 0... http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win32-pgo/1463689833/firefox-49.0a1.en-US.win32.zip It is indeed *SUPER* crashy, particularly with XUL stuff. I insta-crash if I do any of the following, in a fresh profile with this pgo build: - visit about:config and click the "I accept the risk" button. CRASH - Open the bookmark/history manager (Bookmarks | Show all Bookmarks) CRASH - View the search pane of the Firefox Options UI (Tools|Options, click "Search") CRASH Here are some crash reports from these crashes: bp-9c8effca-829a-4fea-b5e6-374f02160520 bp-869b76d0-059e-4741-bf42-56fbf2160520 bp-9f83256b-e276-4372-a336-38d0d2160520 Unfortunately, the crash reports don't have symbols (because I don't have an official build), and I can't attach a debugger because MSVC Community Edition is spamming a "30 day trial expired" modal dialog at me, which only has options for "sign-in" (which is broken) and "Quit". (This is an indication of how infrequently I develop/debug under Windows.) Hopefully someone who has a working Windows+MSVC installation can download the above-linked PGO build and attach a debugger and see if there's any info they can grab from the crash...?
I got closer, but I can't figure out how to get the MSVC debugger to use the downloaded symbols from the file next to the build. (I tried both adding a directory with them to the symbol path, directly loading the file via "Load symbols" on the stack trace, and various things with renaming/reorganizing the structure from unzipping the file.)
(In reply to Daniel Holbert [:dholbert] from comment #5) > We also had some changes to layers code today, in the "blame" range of my > pushlog from comment 0 -- bug 1272398. I wonder if that might be the > proximal cause of this, for today's instance? (I'm assuming there's some > more subtle root cause, but it seems at the moment like "changes to layers > code can trigger this" is a decent enough theory...) I backed out bug 1272398 to test this ^^ theory. (Meanwhile, there's an in-progress PGO build on an intermediate cset (f70b8561b479) which hasn't yet completed; it's possible that this PGO build will have green test results, and my backout will prove to be needless. We'll find out in a while, I guess.)
That said, we're crashing pretty early in DrawString: 62737C6F 57 push edi 62737C70 8B FA mov edi,edx 62737C72 85 F6 test esi,esi 62737C74 75 03 jne 62737C79 62737C76 8B 70 18 mov esi,dword ptr [eax+18h] 62737C79 56 push esi 62737C7A 8D 4D 17 lea ecx,[ebp+17h] 62737C7D E8 0F 50 B5 FE call 6128CC91 62737C82 8B CE mov ecx,esi 62737C84 0F B6 00 movzx eax,byte ptr [eax] 62737C87 83 E0 01 and eax,1 62737C8A 88 47 55 mov byte ptr [edi+55h],al <== CRASH HERE 62737C8D E8 9E 1A B8 FE call 612B9730 with registers: EAX = 00000000 EBX = 80004005 ECX = 196B0DE8 EDX = 1951F648 ESI = 196B0DE8 EDI = 00000000 EIP = 62737C8A ESP = 0018D9E8 EBP = 0018D9F8 EFL = 00010244 C72 and C74 are the null check of aStyleContext (which is in ESI) C76 is the aStyleContext = aFrame->StyleContext(), which says aFrame is in EAX C79 is passing the aStyleContext to the WritingMode constructor C7A is passing this to the WritingMode constructor in ECX C7D is calling the WritingMode constructor C82 puts aStyleContext in ECX as well C84 dereferences something resulting from the WritingMode constructor, I guess C87 is testing the bit for the IsVertical test C8A is thus presumably storing to nsFontMetrics::mVertical. (I didn't check the offset, but 0x55 makes sense.) So it looks like the problem here is that we're calling DrawString with aFontMetrics being null.
(The in-progress PGO build hinted at in comment 8 ended up busted as well. Sadly, there is no PGO build scheduled on my backout-push, and I am not sure how to add one. [the "backfill" treeherder UI seems to be broken for me, per bug 1259494] Hopefully a sheriff or someone who can successfully trigger arbitrary TreeHerder jobs can add PGO coverage for my backout push, and perhaps other intermediate pushes as well.)
(In reply to Daniel Holbert [:dholbert] from comment #10) > (The in-progress PGO build hinted at in comment 8 ended up busted as well. > Sadly, there is no PGO build scheduled on my backout-push, and I am not sure > how to add one. [the "backfill" treeherder UI seems to be broken for me, per > bug 1259494] Hopefully a sheriff or someone who can successfully trigger > arbitrary TreeHerder jobs can add PGO coverage for my backout push, and > perhaps other intermediate pushes as well.) 'v' menu button in the desired push's header, BuildAPI. Copy your push's revision from the address bar, scroll to the bottom of the BuildAPI page, and pasted it into the textbox for triggering PGO builds, click the button to trigger them.
With the fastcall convention, ecx and edx are used to pass the first two parameters of a function, so it looks like edx at the start of the function stores aFontMetrics, and it is then copied to edi, so edi should refer to aFontMetrics. Hence it seems like yes we are calling this function with a null aFontMetrics.
Now that the issue suddenly showed up on fx-team, is that a caching or a pgo issue (a .h file got changed in the last push to fx-team)? Needinfoing ted for the latter part.
Component: Layout → Buildduty
Flags: needinfo?(ted)
Product: Core → Release Engineering
QA Contact: bugspam.Callek
Version: Trunk → unspecified
John, could this related to your changes yesterday? This bug keeps the trees currently closed, so please prioritize an answer.
Flags: needinfo?(jhford)
I'm not sure how to read the logs posted here. What was deployed yesterday is essentially a proxy which redirects artifacts being downloaded from the queue. If a request to retreive an artifact is received by the queue from ec2 in us-west-1 or us-east-1, the queue redirects to cloud-mirror.taskcluster.net/v1/redirect/s3/:region/:url. Requests received in us-west-2 are served directly from in-region S3 and requests from outside us-west-1, us-west-2 and us-east-1 are redirected to cloudfront. When redirected to cloud mirror, cloud-mirror will either: 1. copy an s3 object into us-west-1/us-east-1 if it's never been requested in that region before then 302 redirect to it 2. if the object is already in the requested region, will 302 redirect to it 3. if neither of these operations is successful, it will 302 redirect to the original S3 url. It looks like the host that's having trouble here is an IX machine. Those are physical boxes and are not located in a region that Cloud Mirror would be related to. Further, cloud-mirror is only involved in downloading of artifacts. It looks like the tests are running, which suggests that they are able to download the file. I'm not seeing anything on my end that would suggest that this is related. Can you reply with the specific failure and I could give you a much better answer here. Dustin, if this is what we spoke about last night regarding the weird error messages, can you please clarify here?
Flags: needinfo?(jhford) → needinfo?(dustin)
(In reply to John Ford [:jhford] from comment #16) > It looks like the host that's having trouble here is an IX machine. Because either many or almost none (= normal level) tests fail, it's more likely an issue with the builder. E.g. e: b-2008-spot-102 produced the latest build with the bunch on failures on fx-team: https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=00ed91d9c2c83fec8c61565ef1f73ed7a7df724a The 'broken' build's application.ini contained: >SourceRepository=https://hg.mozilla.org/integration/fx-team >SourceStamp=00ed91d9c2c83fec8c61565ef1f73ed7a7df724a The parent build's application.ini contained: >SourceRepository=https://hg.mozilla.org/integration/mozilla-inbound >SourceStamp=c62705a833ff9c985c71ddf4c07054de113525c3 As RyanVM noticed, the first Windows XP pgo build directory linked from https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=00ed91d9c2c83fec8c61565ef1f73ed7a7df724a has only a log file.
Bug 1271392 only touched code that execute if you use display:grid, with one exception: in part 3 I messed with the [Excess]OverflowContainers frame child lists in nsContainerFrame::DrainExcessOverflowContainersList which affects all frame types that can fragment. It seems unlikely for that to lead to super-crashy builds though, since those child lists are only used in rare edge cases.
Flags: needinfo?(mats)
On IRC, we confirmed that all failures were occuring in us-west-2. Cloud-mirror is completely absent from the codepath in us-west-2, so we can effectively rule it out as a culprit.
Flags: needinfo?(dustin)
I'm not sure how this could be due to the build infrastructure. Not sure where to re-file this though.
Component: Buildduty → Build Config
Product: Release Engineering → Firefox
QA Contact: bugspam.Callek
Flags: needinfo?(giles)
This seems somewhat like PGO just blowing up due to hitting some sort of size limit. Does anybody know how to get the pdb files, or convert the sym files into pdb files, so that we can debug it?
10:47:50 <ted> Aryx, dbaron: you can't convert sym files into pdb files, but we could do a try push with http://hg.mozilla.org/users/tmielczarek_mozilla.com/mq/raw-file/tip/enable-full-symbols 10:47:54 <ted> which gives you pdb files in the symbols.zip 10:48:14 <ted> (we also have a bug on just making that the default now that disk space for storing builds isn't at a premium) 10:48:26 <ted> https://bugzilla.mozilla.org/show_bug.cgi?id=1253070
I'm not sure if this is related, but I can basically perma startup crash with Nightly on 5-20-2016 and 05-19-2016 worked fine. Crash report: https://crash-stats.mozilla.com/report/index/d4f23c48-ccc2-4e14-9e64-b9e412160520 Manually building the m-c revisions locally for 05/19 and 05/20 nightly both do not reproduce the issue for me, but only on Windows 7. Windows 10 seems to work. STR: 1) Go to about:config 2) Set the preference "gfx.content.azure.backends" to "skia" 3) Restart nightly and permacrash.
I'm doing a PGO build on my office desktop. Since dbaron works at the desk next to mine, perhaps he'll find things easier to debug with a full dev environment...
I suspect comment 24 is independent of this bug, & deserves its own bug for a more thorough investigation & potential backouts/fixes.
Flags: needinfo?(mchang)
So I debugged a bit of what's happening on gps's machine. This really seems like some sort of miscompilation in nsTreeBodyFrame::PaintText. In particular, when fontMet is initialized, it ends up both in ESI and saved to the stack at ESP+34h (at instruction 52198160 in gps's build). The initial stuff that uses fontMet operates off the copy in ESI. However, a single instruction after saving ESI to the stack, the instruction immediately following (52198164) writes a null to the same address by writing to EAX. Then, later in the function, it restores fontMet from ESP+34h, which is now null because of the instruction that wrote to EAX immediately after saving it to the stack. In particular, at the time we're saving fontMet to the stack, we have: EAX = 0018D904 EBX = 00000000 ECX = 00000000 EDX = 00000000 ESI = 0BDD2700 EDI = 0018D8FC EIP = 52198129 ESP = 0018D8D0 EBP = 1704EEF0 EFL = 00000200 and the code is: textRect.Deflate(textMargin); 52198118 lea eax,[esp+40h] 5219811C push eax 5219811D lea ecx,[esp+20h] 52198121 call mozilla::gfx::BaseRect<int,nsRect,nsPoint,nsSize,nsMargin>::Deflate (50C49E36h) 52198126 xorps xmm0,xmm0 // Adjust the rect for its border and padding. nsMargin bp(0,0,0,0); GetBorderPadding(textContext, bp); 52198129 lea edx,[esp+60h] 5219812D mov ecx,esi 5219812F movups xmmword ptr [esp+60h],xmm0 52198134 call GetBorderPadding (52191294h) textRect.Deflate(bp); 52198139 lea eax,[esp+60h] 5219813D push eax 5219813E lea ecx,[esp+20h] 52198142 call mozilla::gfx::BaseRect<int,nsRect,nsPoint,nsSize,nsMargin>::Deflate (50C49E36h) // Compute our text size. RefPtr<nsFontMetrics> fontMet = nsLayoutUtils::GetFontMetricsForStyleContext(textContext); 52198147 movss xmm2,dword ptr [__real@3f800000 (52B8F07Ch)] 5219814F lea ecx,[esp+34h] 52198153 push ebx 52198154 mov edx,esi 52198156 call nsLayoutUtils::GetFontMetricsForStyleContext (50CF5055h) 5219815B pop ecx 5219815C mov esi,dword ptr [eax] nscoord height = fontMet->MaxHeight(); 5219815E mov ecx,esi // Compute our text size. RefPtr<nsFontMetrics> fontMet = nsLayoutUtils::GetFontMetricsForStyleContext(textContext); 52198160 mov dword ptr [esp+34h],esi 52198164 mov dword ptr [eax],ebx nscoord height = fontMet->MaxHeight(); 52198166 call nsFontMetrics::MaxHeight (50F63210h) nscoord baseline = fontMet->MaxAscent(); 5219816B mov ecx,esi 5219816D mov edi,eax 5219816F call nsFontMetrics::MaxAscent (50C49C3Bh) 52198174 mov dword ptr [esp+5Ch],eax I still haven't worked out what this instruction at 164 is supposed to be doing.
So I think that assignment to 0 (I think ebx is basically a 0 register there, used both for passing aVariantWidth to GetFontMetricsForStyleContext and for this) is essentially the null-assignment that happens in the already_AddRefed::take() call from RefPtr<T>::operator=(already_AddRefed<I>&), since it looks like EAX is part of the return-a-struct calling convention for GetFontMetricsForStyleContext. (It's rather disappointing that already_AddRefed return values work that way instead of more efficiently, though!) So it seems like the compiler has decided to: * use the same stack storage space for the already_AddRefed result of GetFontMetricsForStyleContext and for fontMet (which are actually the same thing) * then not optimize away the store-into-fontMet and the nulling-out of mRawPtr in already_AddRefed<T>::take(), and, worse, do them in the wrong order
MozReview-Commit-ID: GL7gz3vuvAD
Attachment #8755028 - Flags: review?(sphink)
Attachment #8755028 - Flags: review?(sphink) → review+
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4) > However Ralph's patches > for rust compilation in that pushlog target win32 platforms, and those are > exactly the platforms affected, so that seems suspicious. Those changes would only be invoked in mp4 file playback, which seems an unlikely source of startup crashes. If it's anything I've done it's setting -arch:SSE2 (the MSVC default!) from bug 1271794. However, that landed earlier and made it into central, which seems fine.
I performed a PGO build with 30b83fcb0009 and about:config didn't crash like it did before \o/
Flags: needinfo?(ted)
Flags: needinfo?(mchang)
[sorry, meant to leave open ni for mchang per comment 27]
Flags: needinfo?(mchang)
Also, "seems fine" isn't necessarily a great indication, since it seems to have been intermittent.
I've switched inbound and fx-team to approval-required just to switch things up a bit. Not sure if I should just reopen things on the assumption that that patch fixed things, but I don't know what else to do other than just wait for it to intermittently happen again. I'll probably reopen things before I head out tonight assuming nothing breaks in the meantime.
Inbound and fx-team reopened at 5pm Pacific since I don't think holding things closed is going to prove anything. Lets reopen this if the pgo failures pop up again.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 49
(In reply to Sebastian H. [:aryx][:archaeopteryx] from comment #25) > >Crash report: https://crash-stats.mozilla.com/report/index/d4f23c48-ccc2-4e14-9e64-b9e412160520 > After looking at the stacktrace: Fallout from bug 1274145 / > https://hg.mozilla.org/integration/mozilla-inbound/rev/9107ca27ed84 ? Nope, today's nightly still permacrashing :(.
Flags: needinfo?(mchang)
Mason, let's get your permacrash tracked in its own bug - as I noted in comment 27, I suspect it's distinct from this bug here.
(In reply to Daniel Holbert [:dholbert] from comment #41) > Mason, let's get your permacrash tracked in its own bug - as I noted in > comment 27, I suspect it's distinct from this bug here. Filed as bug 1275030.
Assignee: nobody → dbaron
Component: Build Config → General
Product: Firefox → Firefox Build System
Target Milestone: Firefox 49 → mozilla49
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: