Closed Bug 461020 Opened 16 years ago Closed 16 years ago

Need a stack trace to aid debugging

Categories

(Release Engineering :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mrbkap, Assigned: anodelman)

Details

When speculative script parsing (bug 364315) landed, we started seeing intermittent orange on the Talos boxes due to crashes during Tp. Eventually, I disabled speculative parsing so that the trees would be green for beta1. However, it would be nice to re-enable speculative parsing.

The problem is that it isn't at all clear why the Talos boxes are crashing. As far as I know, nobody reported any speculative parsing related crashes or was able to reproduce the crashiness. I did try to install standalone Talos; however, I was unable to reproduce the crash with the regular pageset and installing the "full" pageset into my standalone talos caused it to not work at all (as far as I could tell, the browser refused to start at all, becoming a defunct process immediately).

I was able to hack something up to run through the pageset, but after ~40 cycles across 4 machines, I crashed only once in seemingly unrelated code.

It would be really nice if I could get some sort of crash stack for the crashes that Talos is seeing so I could direct my debugging efforts directly at the problems that Talos is running into (instead of staring at the code, trying to come up with any possible problem).

If you need any other information, please ask.
I really need this to move forward on speculative parsing...
Severity: normal → major
What operating systems were the crashes observed on?

We don't really have any system in place for being able to pull stack traces from builds and there is no crash reporting/symbols in tinderbox builds.

Would having access to a fully configured talos box that you could do your own testing on get you moving forward?  I suggest this because it is doubtful that we could pull together a full automated system in any sort of timely fashion.
(In reply to comment #2)
> What operating systems were the crashes observed on?

We were seeing crashes on all 3 platforms.

> Would having access to a fully configured talos box that you could do your own
> testing on get you moving forward?  I suggest this because it is doubtful that
> we could pull together a full automated system in any sort of timely fashion.

That would be brilliant. All I really need here is a one-off for a couple of hours (possibly) to get a stack trace. Having a full-on automated way of doing this is far more work than I'm asking for.
Stuart: Is this on mozilla-central or on the tracemonkey branch? We have Talos machines on each.

Alice: Is it OK to take one of the existing slaves out of production for a manual experiment? Or would it be safer to set up a new Mac mini as a clone of an existing production slave? Or... any other ideas?
Summary: Need a stack trace to aid debugging → Need a stack trace to aid debugging intermittent crash when running in Talos
John, this is mozilla-central.
(In reply to comment #4)
> Alice: Is it OK to take one of the existing slaves out of production for a
> manual experiment? Or would it be safer to set up a new Mac mini as a clone
> of an existing production slave? Or... any other ideas?

After talking to Alice, we do not want to take an existing talos slave out of production, or risk destabilizing the talos cluster with any manual experiments being carried out in the colo. 

Instead, do you have a Mac mini on your desk that we can put a Talos slave image onto? This would still need some manual setup afterwards (Apache config, Talos install, and pageloader install spring to mind right now), but it's a lot safer than risking polluting the production Talos cluster.
I don't have a mac mini.
In a meeting with bkaplan: this crash can be reproduced on the try server Talos.
It would be good if we could reproduce on try server talos - but we can't get stack traces/core dumps off of those machines.
Summary: Need a stack trace to aid debugging intermittent crash when running in Talos → Need a stack trace to aid debugging
If you can crash on the win32 try server talos, and we can configure it to save a minidump (which shouldn't be hard), we do in fact upload symbols for the Windows builds there.
There was supposed to be a new followup bug filed as a result of a meeting yesterday, but I don't think I've seen it go by. John, did I miss a bugmail?

IIRC the conclusion was that a mac mini configured with Talos would end up with my ldap credentials on it, so I could make the changes necessary to get a minidump and reproduce the crash.
If you're going to do this on Try talos, or on some other Talos machine, either way, you should set
MOZ_CRASHREPORTER_NO_REPORT=1
and
MOZ_CRASHREPORTER=1
in the environment of the Windows box.
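
A minimal sketch of applying the two variables above to a single launched process, in Python. The helper name `crashreporter_env` is hypothetical and is not part of Talos; only the two variable names and their values come from this comment.

```python
import os


def crashreporter_env(base=None):
    """Return a copy of an environment with the two variables above set.

    `base` defaults to the current process environment. The input mapping
    is copied, so the caller's environment is left unmodified.
    """
    env = dict(os.environ if base is None else base)
    env["MOZ_CRASHREPORTER"] = "1"
    env["MOZ_CRASHREPORTER_NO_REPORT"] = "1"
    return env
```

Passing the result as `env=` to `subprocess.Popen` would apply the settings to one browser launch rather than to the whole machine. How Talos itself launches the browser is not shown here.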

If we did this for Try talos, mrbkap could just submit his patch a bunch of times and when it crashes, someone could grab the minidump for him, and he could debug it using the symbol server.
(In reply to comment #10)
> If you can crash on the win32 try server talos, and we can configure it to save
> a minidump (which shouldn't be hard), we do in fact upload symbols for the
> Windows builds there.

mrbkap: Did you see the symbols from the Windows runs? Are they enough?


(In reply to comment #11)
> There was supposed to be a new followup bug filed as a result of a meeting
> yesterday, but I don't think I've seen it go by. John, did I miss a bugmail?
> 
> IIRC the conclusion was that a mac mini configured with Talos would end up with
> my ldap credentials on it, so I could make the changes necessary to get a
> minidump and reproduce the crash.
Ah, OK. Sorry, I thought that was being filed by you, but I can do that if you
still need it after looking at Windows (above)?
Talos auto-kills the browser upon noticing a crash. The Talos code is checked out per run. To solve this we'd need to:

1. Patch the Talos code not to kill a crashed browser and to wait for manual intervention.
2. Patch the try Talos buildbot to use the patched Talos code.
(In reply to comment #13)
> mrbkap: did you see the symbols from windows runs? Are they enough?

Which symbols were those? Any symbols from any talos enabled build would be fine.

> Ah, ok. Sorry, I thought that was being filed by you, but I can do that if you
> still need it after looking at windows (above)??

I asked if I should, but was told that since RelEng needs to give IT the name of the image to flash (or whatever the proper verb is there) that I wasn't the right person to file.
(In reply to comment #14)
> Talos auto-kills upon noticing a browser crash.  The talos code is checked out
> per run.  To solve this we'd need to:
> 
> 1.  patch the talos code not to kill a crashed browser and to wait for manual
> intervention
> 2.  patch try talos buildbot to use the patched talos code

Alice, are you saying we need to do this in addition to what Ted says in comment 12?
Could we just get Blake on the phone with Alice, Ben or Ted to get this solved? It is a high priority to get Blake what he needs to debug, so let's debug in real time, not over bugmail. Who's the point person on this from build, and can you call Blake?
Summary of phone call:

- The win32 symbols are a red herring; dropped.
- Cloning another Mac mini would take too long (and possibly need a colo trip); dropped.
- We're pulling an OS X Mac mini from try Talos and changing the keys on it, so Blake can access it without risk of contaminating graphserver or leaking passwords. This should be done before Blake heads out for dinner.
Assignee: nobody → anodelman
Priority: -- → P1
(In reply to comment #18)
> Summary of phone call:
> 
> - The win32 symbols are a red herring; dropped.
> - Cloning another Mac mini would take too long (and possibly need a colo
> trip); dropped.
> - We're pulling an OS X Mac mini from try Talos and changing the keys on it,
> so Blake can access it without risk of contaminating graphserver or leaking
> passwords. This should be done before Blake heads out for dinner.

Note: This means that the try server has no Talos coverage on OS X for 'n' days. Also, once mrbkap is finished experimenting, we'll have to reimage this machine before putting it back into use on the try server.
mrbkap was given access to qm-ptiger-try01.  It will be re-imaged once he's done with it.
Assignee: anodelman → nobody
Status: NEW → RESOLVED
Closed: 16 years ago
Priority: P1 → --
Resolution: --- → FIXED
Assignee: nobody → anodelman
Priority: -- → P1
Component: Release Engineering: Talos → Release Engineering
Product: mozilla.org → Release Engineering