Closed Bug 473013 Opened 11 years ago Closed 11 years ago

Need stack traces for crashing test boxes

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: vlad, Assigned: joduinn)

Details

(I'm not sure if there are already bugs filed for these; if there are, I can't find them and they don't seem to be linked from anywhere.)

- qm-plinux-trunk02 has been crashing for the past 2 days; we need a stack trace from the crash.

- qm-pmac-trunk02 has had intermittent crashes; need stack trace from the crashes.

- qm-linux-fast03 had a sunspider crash recently; doesn't seem to be recurring, but it's terrifying; not sure what data we can grab there.

- 'mozilla-central unit test' has had intermittent crashtest failures; these look like some test failures waiting for onload to fire though, not actual crashes.
We've been down this road before (bug 461020). To recap: Talos is testing optimised builds, where we only store symbols for nightlies (bug 385785), and Talos disables the Breakpad crash reporter anyway. How do you get a useful stack in that situation?

More importantly, qm-plinux-trunk02 is part of a matched set of three machines (with 01 and 03), and the other two boxes have been continuously green for the last four days. Given that qm-plinux-trunk02 seems to die towards the end of the Tp cycle, I would suggest there is something wrong with the hardware of only this machine (overheating?), or that it needs to be reimaged. Alice, is this a familiar failure path?

Much the same argument applies to qm-pmac-trunk02. We could hide these boxes from the tbox waterfall until Alice can comment.

qm-linux-fast03 I doubt we can do much about. The build is gone from the ftp now (more than 24 hours ago), the source stamp was 
  http://hg.mozilla.org/mozilla-central/rev/1203433cd9a7
a layout change, which was built on moz2-linux-slave06 at 2009/01/10 08:41. Can't think of anything else we can do apart from watching for more crashes to corroborate.

For the unittest crashtest problems, dbaron put this comment on the tbox waterfall - "Either somebody broke onload or this tinderbox is really slow.  I'm guessing the latter." IT recently added capacity to our set of VM hosts but it's possible the 100% slave utilisation last week may have been causing problems. There are also onload timeouts on mac (ie not VMs), eg 
 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1231606043.1231615341.28267.gz&fulltext=1#err0,
REFTEST TEST-UNEXPECTED-FAIL | file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/base/crashtests/403175-1.html | timed out waiting for reftest-wait to be removed (after onload fired)
so we shouldn't discount a problem in layout or a change impacting the test harnesses perf in general.
cc-ing alice directly for weigh-in on comment #1.
Priority: -- → P1
(In reply to comment #1)

Vlad;


> qm-linux-fast03 I doubt we can do much about. The build is gone from the ftp
> now (more than 24 hours ago), the source stamp was 
>   http://hg.mozilla.org/mozilla-central/rev/1203433cd9a7
> a layout change, which was built on moz2-linux-slave06 at 2009/01/10 08:41.
> Can't think of anything else we can do apart from watching for more crashes to
> corroborate.

Has this been happening again, or can we drop this from the list?


> For the unittest crashtest problems, dbaron put this comment on the tbox
> waterfall - "Either somebody broke onload or this tinderbox is really slow. 
> I'm guessing the latter." IT recently added capacity to our set of VM hosts but
> it's possible the 100% slave utilisation last week may have been causing
> problems. There are also onload timeouts on mac (ie not VMs), eg 
> 
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1231606043.1231615341.28267.gz&fulltext=1#err0,
> REFTEST TEST-UNEXPECTED-FAIL |
> file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/base/crashtests/403175-1.html
> | timed out waiting for reftest-wait to be removed (after onload fired)
> so we shouldn't discount a problem in layout or a change impacting the test
> harnesses perf in general.

Looks like the problem is happening on different machines, including physical mac hardware. Can you reproduce the crash if you send the same job to try server?
I'm wary of qm-plinux-trunk02, it's a frequent offender in terms of poor
behavior - it's been re-imaged at least 3 times now and has had its physical
box switched once and is still not consistent.  I'm not sure if the spot in the
colo is cursed, but I would take it into account when looking at results for
that machine.

That said, the Talos setup is a performance testing harness, not a crash
tester.  We can't get stack traces from the boxes, at least not until we figure
out what to do with symbols for hourly builds.  In the past, when we've hit
something that we were interested in tracing, a machine was loaned out to a
developer in the MV office to configure to collect the necessary information.
This was for an intermittent crash that was occurring in all machines
associated with a platform.  Having these crashes only on single boxes makes me
lean pretty heavily towards machine failure.

I'd like to see better history on the crashes in question and any information
on whether they are occurring on any other machines.  Otherwise, I would slate
the boxes in question for re-image and try to get them reporting consistently.
If we're going to re-image, can we preserve the current image somewhere, so that we can compare the before-and-after?  If we're getting system corruption that's fixed by re-imaging, it seems like we should get to the bottom of that.  If it's flaky hardware, then let's just replace the hardware rather than wait to lose more developer time to it failing again later.  (We can dream up some torture test for it to pass before it's re-admitted, perhaps.)

Before that, can we pull the machine out of rotation (since it's unreliable anyway) and run talos repeatedly on a symbol-laden build in a debugger?
Looks like three Talos boxes are in question (qm-plinux-trunk02, qm-linux-fast03 and qm-pmac-trunk02). Which one did you want pulled out of rotation?  Is there a developer that we can give access to who is interested in running these tests?

The usual steps are to do a re-image, monitor for a few days, then decide if a change in machine is necessary.
(In reply to comment #6)
> The usual steps are to do a re-image, monitor for a few days, then decide if a
> change in machine is necessary.

Do we have any historical record of what necessitates re-imaging?  Like, what files were replaced by the re-image with something different, so that we can figure out how they got corrupted or changed?
We have not tracked re-imaging in terms of differences on the busted box compared to the clean image.  The only information I have is results, such as number changes up/down or inconsistent high-variance reports.
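For illustration only, the before/after comparison asked about in comment #7 might look something like the sketch below. Nothing like this is described as existing in this bug; the function names `build_manifest` and `diff_manifests` are hypothetical, and it assumes the old image and the fresh image are both available as mounted directory trees.

```python
import hashlib
import os


def build_manifest(root):
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: fh.read(65536), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, root)] = digest.hexdigest()
    return manifest


def diff_manifests(before, after):
    """Report which relative paths were added, removed, or changed in content."""
    before_paths, after_paths = set(before), set(after)
    return {
        "added": sorted(after_paths - before_paths),
        "removed": sorted(before_paths - after_paths),
        "changed": sorted(
            p for p in before_paths & after_paths if before[p] != after[p]
        ),
    }
```

Running `diff_manifests(build_manifest(old_root), build_manifest(new_root))` would give a list of files to look at first. Logs, caches and profile data would be expected to differ, so the interesting entries are changed system or test-harness files.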
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1231819807.1231827661.14613.gz for another tsspider fail on nochrome qm-plinux-trunk07. Do we actually know that that's a crash, as opposed to something like the intermittent onload failures the unit test boxes have, making the harness think "omg it crashed"?
I talked with shaver and vlad about this yesterday, let me take this and see what they still want to do here.
Assignee: nobody → joduinn
Per discussion with damon/vlad: 

1) This has been working since 10 Jan. Removing P1/blocker, as this was set over the weekend while trying to figure out the escalation process. This intermittent crash needs to be figured out, but it is not a blocker.

2) To debug this intermittent crash, let's try:
* have vlad/damon submit the same code to try server, multiple times in a row, just like any usual user of try server.
* have RelEng standing by waiting to take that try talos machine out of production, and give vlad/damon access to that talos machine to look around. 

We used this same approach before, when mrbkap was chasing an intermittent problem, and it worked nicely both to quickly confirm whether it's a machine-specific problem or an intermittent code/test problem, and to quickly reproduce the intermittent race condition by trying the same thing again-and-again-and-again...
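As a rough sketch of the reasoning in this comment (not an existing RelEng tool), the outcome of resubmitting the same revision several times could be summarised as follows. The function name `classify_failures` is hypothetical, the input is assumed to be a list of (machine name, passed) pairs collected by hand from the waterfall, and the result is only a first guess.

```python
from collections import defaultdict


def classify_failures(runs):
    """Summarise repeated runs of one revision.

    runs: iterable of (machine_name, passed) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for machine, passed in runs:
        totals[machine] += 1
        if not passed:
            failures[machine] += 1

    failing = sorted(failures)
    if not failing:
        return "no failures reproduced"
    if len(totals) == 1:
        # With a single machine there is nothing to compare against.
        return "inconclusive: only one machine sampled"
    if len(failing) == 1:
        return "machine-specific suspect: " + failing[0]
    return "failures on several machines: suspect code or test"


# Made-up example data, using machine names from this bug.
example = [
    ("qm-plinux-trunk01", True),
    ("qm-plinux-trunk02", False),
    ("qm-plinux-trunk03", True),
    ("qm-plinux-trunk02", False),
]
print(classify_failures(example))
```

A single failing box among otherwise green siblings points at the machine; failures spread over several boxes point back at the code or the test.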
Severity: blocker → normal
Priority: P1 → P2
John, sounds like Comment #11 is the way to go.  Do we need to follow up otherwise?  Doesn't look like it.
(In reply to comment #12)
> John, sounds like Comment #11 is the way to go.  Do we need to follow up
> otherwise?  Doesn't look like it.

Damon: ok, cool - in that case, no need for meeting. 

From comment #3 and comment #11, it feels like the next step is to have someone from Dev who knows what the crash looks like attempt to reproduce the problem on try server.
FWIW, the test mentioned in comment 1 (403175-1.html) just timed out again:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1233086333.1233092326.19544.gz
OS X 10.5.2 mozilla-central unit test on 2009/01/27 11:58:53
So, we need to get stack traces in general, which is what I think this bug is about, really.  See bug 387555.  Do we need to dupe this to that?
This bug mentions both Talos and unit test crashes, which are distinct issues currently.
This got fixed in bug 481732 and bug 480577. Unit tests and Talos boxes can now both dump stacks on crash. Awesome!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering