Closed
Bug 629527
Opened 14 years ago
Closed 14 years ago
win7 jetpack suite causes hung builder
Categories
(Add-on SDK Graveyard :: General, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: lsblakk, Assigned: cmtalbert)
References
Details
Attachments
(1 file)
4.97 KB, patch, avarma: review+
I'm filing this bug to track work on the win7 hangs. RelEng is loaning out a builder for debugging:
talos-r3-w7-001 (login info will be emailed to Myk separately)
The log for a jetpack suite run looked like this before the jobs were disabled:
https://bugzilla.mozilla.org/show_bug.cgi?id=627070#c0
Reporter
Comment 1•14 years ago
And, in breaking news: ctalbert has made some forward progress on the hang and has m-c/jetpack patches that need landing. Cc'ing him and holding off on the loaner for now; we can continue discussion/tracking of this issue in this bug.
So, I tried reproducing this on my win7 box yesterday and today. I hit several things:
* first attempt, rather old nightly build:
** Encountered a crash; the crash reporter dialog was displayed, and I hit a hang very much like what Armen described in his write-up of the win7 jetpack hang issue.
** Decided to backport our crash reporter fix from mozrunner 1.5.2 to jetpack to fix this, and verified that it fixed the hang.
* 2nd attempt, updated to yesterday's nightly:
** Jetpack ran fine up to some of the test-panel tests. At that point, Firefox disappeared (from the screen and the process list) and the cfx output simply stopped.
** This seems to be a test or browser issue (see below), but because Firefox itself was no longer running, I imagine this would not cause buildbot to hang; buildbot would time out the hung cfx process and reboot the slave, I believe.
* 3rd attempt, today's nightly:
** Tried to repro yesterday's hang, but accidentally updated my nightly before doing so.
** The cfx test harness did not hang, but test-windows.testWindowTabsObject failed due to timing out.
So, I'm hoping this patch, which updates the Jetpack mozrunner fork with the crash-reporter-disabling fix from 1.5.2 (bug 616383), will fix the hang we were seeing on Windows 7.
We need to figure out how to enable this either in staging or on try server to see if this change (once reviewed and committed to jetpack repo) fixes the hang.
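For context, the usual way a harness suppresses that dialog is by setting Mozilla's crash-reporter environment variables before launching the browser. Below is a minimal illustrative sketch of that idea in Python; the variable names are documented Mozilla switches, but whether the mozrunner 1.5.2 backport uses exactly this mechanism is an assumption; see the attached patch for the real change.

```python
import os
import subprocess

def launch_firefox_no_crash_dialog(binary, extra_args=()):
    """Launch Firefox with the crash reporter suppressed so a crash cannot
    leave a modal dialog open and wedge the test harness.

    Illustrative only: the env var names are documented Mozilla switches,
    but the actual mozrunner 1.5.2 patch may differ in detail."""
    env = os.environ.copy()
    env['MOZ_CRASHREPORTER_DISABLE'] = '1'    # don't start the crash reporter at all
    env['MOZ_CRASHREPORTER_NO_REPORT'] = '1'  # if it does start, never show the submit UI
    return subprocess.Popen([binary] + list(extra_args), env=env)
```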
Attachment #507724 - Flags: review?(avarma)
Comment 3•14 years ago
(In reply to comment #2)
> We need to figure out how to enable this either in staging or on try server to
> see if this change (once reviewed and committed to jetpack repo) fixes the
> hang.
My suggestion is to have jetpack enabled for all platforms on a twig repo and push changes there to trigger jobs.
This way we ensure that we won't get too many slaves hung if more than this fix is needed. (RelEng doesn't currently have an easy way to spot hung slaves; it's a manual process.)
Comment 4•14 years ago
Clint: pushing this bug to you, as it looks like you're doing the actual work here.
Armen: good point. If we don't get these hangs resolved soon, we should disable these jobs on mozilla-central. The worry is not just about wasting CPU cycles; it's also about taking production slaves offline in a way that we cannot yet automatically detect, which would be bad for all tests on all branches. We can quickly re-enable the test when it's not a risk to the slave pool. Let's give clint/avarma a day or two to try out their fix-in-hand before we start disabling on m-c and moving this detective work to a rentable branch.
Assignee: nobody → ctalbert
OS: Mac OS X → All
Comment 5•14 years ago
ctalbert mentioned a fix that landed on Friday that could help mozmill and jetpack.
ctalbert: if I recall correctly, I currently (as of this morning) have tons of mozmill-all jobs hung.
I would like to disable mozmill-all, at least on the tryserver, until we stop hanging, since we are taking a lot of slaves out of action and dramatically hurting wait times.
I noticed one difference: instead of having to close a crash reporter dialog twice, I had to close Minefield twice. Minefield wasn't really doing anything special.
Armen, is the hang you're seeing with mozmill on Windows? If that is happening, then I'm thinking that something has changed inside of buildbot to cause this; because of the way that our patched twistd library works, it should be *impossible* for buildbot *not* to kill anything that mozmill generates.
The twistd version we use on the slaves has been patched to launch everything in a job object on Windows, and child processes are not allowed to break out of that job object. Therefore, when that job object is terminated (this occurs when buildbot does its timeout), every single child process is terminated by Windows.
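For readers unfamiliar with the mechanism, here is a rough pywin32 sketch of the job-object trick described above. This is illustrative only, not the actual twistd patch running on the slaves.

```python
import win32con
import win32job
import win32process

def spawn_in_kill_on_close_job(command_line):
    """Sketch: launch a child inside a Windows job object configured so that
    terminating (or closing) the job handle kills every descendant process.
    Illustrative only; the real buildbot/twistd patch differs in detail."""
    job = win32job.CreateJobObject(None, "")
    limits = win32job.QueryInformationJobObject(
        job, win32job.JobObjectExtendedLimitInformation)
    limits['BasicLimitInformation']['LimitFlags'] |= \
        win32job.JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
    win32job.SetInformationJobObject(
        job, win32job.JobObjectExtendedLimitInformation, limits)

    # Start the child suspended so it cannot spawn grandchildren before it is
    # inside the job, assign it to the job, then let it run.
    si = win32process.STARTUPINFO()
    hproc, hthread, pid, tid = win32process.CreateProcess(
        None, command_line, None, None, False,
        win32con.CREATE_SUSPENDED, None, None, si)
    win32job.AssignProcessToJobObject(job, hproc)
    win32process.ResumeThread(hthread)
    return job, hproc  # terminating `job` kills pid and all of its children
```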
Jetpack is not in the configs in our auto-tools staging environment. Can we put this on a project branch or something so we can determine whether this fixes the issue? It solved the hang I was seeing on my box, but I'm starting to think that the buildbot environment is factoring into this more than I had anticipated.
Comment 8•14 years ago
Comment on attachment 507724 [details] [diff] [review]
disable crash reporter fix, backport from 1.5.2
This looks fine to me.
One question I have, though, is whether these hangs are at all related to bug 629197. Although that bug doesn't seem to mention any bugs that actually report Jetpack crashing due to NSJetpack (which enables out-of-process add-ons), my understanding is that the code is in need of repair and will likely be disabled for Firefox 4. Not sure if that is what's responsible for things hanging on win7 or not, but thought I'd mention it.
Attachment #507724 - Flags: review?(avarma) → review+
Comment 9•14 years ago
(In reply to comment #6)
> Armen, is the hang you're seeing with mozmill on windows? If that is
> happening, then I'm thinking that something has changed inside of buildbot to
> cause this, because of the way that our patched twistd library works, it should
> be *impossible* for buildbot *not* to kill anything that mozmill generates.
>
It was happening on all platforms for mozmill (even though we are discussing this in the jetpack bug, which is about a specific platform).
Nothing has landed since the 5th:
http://hg.mozilla.org/build/buildbot/rev/ceb8097ada17
The issue only started getting aggravated since Friday.
> The twistd version we use on the slaves has been patched to launch everything
> in a job object on windows, and child processes are not allowed to break out of
> that job object. Therefore, when that job object is terminated (this occurs
> when buildbot does its timeout) every single child object is terminated by
> Windows.
Can you please point me to the twistd file you guys use? Could you paste it for each platform? I could compare them with ours. Let me know if you would like me to paste ours.
>
> Jetpack is not in the configs on our auto-tools staging envirionment. Can we
> put this on a project branch or something so we can determine if this fixes
> this issue? It solved the hang I was seeing on my box, but I'm starting to
> think that the buildbot environment is factoring into this more than I had
> anticipated.
We could enable mozmill and jetpack in twig repos and staging. Could you please file a bug when you are ready?
Assignee
Comment 10•14 years ago
(In reply to comment #9)
> (In reply to comment #6)
> Nothing has landed since the 5th:
> http://hg.mozilla.org/build/buildbot/rev/ceb8097ada17
> The issue only started getting aggravated since Friday.
What do you mean, this issue only got aggravated on Friday? I don't understand; you've said this has been happening for a while. Is it just that now, instead of crash dialogs, you have Minefield left up? My point is more that buildbot itself has changed over the duration of mozmill being active, because the one change to mozmill is very minor and only serves to turn off the crash reporter.
>
> > The twistd version we use on the slaves has been patched to launch everything
> > in a job object on windows, and child processes are not allowed to break out of
> > that job object. Therefore, when that job object is terminated (this occurs
> > when buildbot does its timeout) every single child object is terminated by
> > Windows.
> Can you please point me to the twistd file you guys use? could you paste it for
> each platform? I could compare it with them. Let me if you would like me to
> paste ours.
>
This is your platform; we did not patch twistd. This is something we discovered when we first tried putting mozmill into the buildbot environment (we had to remove mozmill's management of job objects so that the programs it spawned would be killable). It comes from some patches in alice's user repo: http://people.mozilla.org/~anodelman/killprocess/
Comment 11•14 years ago
(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #6)
>
> > Nothing has landed since the 5th:
> > http://hg.mozilla.org/build/buildbot/rev/ceb8097ada17
> > The issue only started getting aggravated since Friday.
> What do you mean this issue only got aggravated friday? I don't understand.
Before, *some* jobs would get the crash reporter to appear and buildbot would not be able to kill them.
Since Friday (if I am correct), *most* or *all* mozmill jobs hang.
> You've said this has been happening for a while. Is it just that now instead
> of crash dialogs you have minefield up, right?
mozmill had been failing to finish in some cases for a while, but after Friday (IIUC) it got really bad. In fact, we even had wait times on the weekend, when we normally don't.
> I'm more talking that buildbot
> over the course of the duration of mozmill being active has changed. Because
> the one change to mozmill is very minor and only serves to turn off the
> crashreporter.
I am not sure what exactly you mean by "during the duration of mozmill".
Do you have some time tomorrow to chat on the phone to see if it helps us understand each other? I think we might be saying the same things. I have 10:30-11:00 and 2:00-2:30 PDT booked if you need to talk. Give me a call on my cell if needed.
> > > The twistd version we use on the slaves has been patched to launch everything
> > > in a job object on windows, and child processes are not allowed to break out of
> > > that job object. Therefore, when that job object is terminated (this occurs
> > > when buildbot does its timeout) every single child object is terminated by
> > > Windows.
> > Can you please point me to the twistd file you guys use? could you paste it for
> > each platform? I could compare it with them. Let me if you would like me to
> > paste ours.
> >
> This is your platform. We did not patch twistd, this is something we
> discovered when we first tried putting mozmill into the buildbot environment
> (we had to remove mozmill's management of job objects so that programs it
> spawned would be killable). It comes from some patches on alice's user repo:
> http://people.mozilla.org/~anodelman/killprocess/
I was about to take that change for the WinXP unit tests, but I found that it was not necessary; adding wbem to the PATH was good enough.
According to my comment in https://bugzilla.mozilla.org/show_bug.cgi?id=614955#c9 the win7 slaves already have these patches in place.
The patches you point out are only for Windows, so I am not sure what we can do for the other platforms: https://bugzilla.mozilla.org/show_bug.cgi?id=630551#c4 (talking about mozmill, not jetpack).
Is bug 600736 still an issue? Is it related?
Whimboo pointed me to this bug as well -> bug 630258.
Comment 12•14 years ago
I landed attachment 507724 [details] [diff] [review]:
https://github.com/mozilla/addon-sdk/compare/b45513a...ef98944
But I'm leaving this bug open until we confirm that it fixes the problem.
Comment 13•14 years ago
Myk, is this a fix for win7 only, or for other platforms as well?
AFAIK we are not running the win7 jobs on any branch (the other platforms are).
Do you want us to enable it for a twig branch and trigger jobs there?
Comment 14•14 years ago
(In reply to comment #13)
> Myk is this a fix for win7 or for other platforms?
Clint is in a better position to answer this question, but I think it's Windows 7-specific (or possibly just Windows-specific).
> AFAIK we are not running the win7 jobs on any branch (the other platforms are).
> Do you want us to enable it for a twig branch and trigger jobs there?
I'm not very familiar with the RelEng infrastructure. Can you explain what a "twig branch" is and also what you mean by "platform" in this case?
Reporter
Comment 15•14 years ago
> I'm not very familiar with the RelEng infrastructure. Can you explain what a
> "twig branch" is and also what you mean by "platform" in this case?
He is referring to https://wiki.mozilla.org/DisposableProjectBranches, which is a loanable branch for developers to iterate on a project. Basically, you make an IT request for any repo you like to be cloned to our temporary branch's repo, then push to that to get the full automation. So this means we would have to:
* turn on jetpack/mozmill-all for a twig branch, say Maple
* someone in Jetpack's team would do the IT request for a clone of mozilla-central to the twig repo
* then you push changes to that repo (and the addon-sdk repo) as needed to watch the output
Here's what I don't like about this scenario:
* Unlike tryserver, you get full automation - all platforms, all tests
* It runs the risk of diverging from other changes to mozilla-central that might be part of the problem/solution
So I still think the best solution is to push to try and specify those two test suites (jetpack, mozmill-all) once Clint's troubleshooting on his personal win7 box is fairly stable.
Comment 16•14 years ago
I got on a call with ctalbert, and after a couple of hours I think I figured out what the problem is.
What I am about to describe does not necessarily have anything to do with Win7, because I am reporting what I figured out on a hung 10.5 slave.
Every time jetpack has test failures, it calls sys.exit(1) [1].
jetpack (aka python bin/cfx testall -a firefox -b $APP_PATH) [2] gets called by run_jetpack.sh.
If I modify the script:
+ echo "DEBUG: About to call python bin/cfx"
+ python bin/cfx testall -a firefox -b $APP_PATH >/dev/null 2>&1
+ echo "DEBUG: I should have finished"
I would expect to see the second message, but I actually don't (note that I am using set -ex):
>+ echo 'DEBUG: About to call python bin/cfx'
>DEBUG: About to call bin/cfx > /dev/null
>+python bin/cfx testall -a firefox -b /Users/cltbld/Desktop/jetpack/build/MinefieldDebug.app
>talos-r3-leopard-024:build cltbld$
I have verified that if I change it to sys.exit(0), run_jetpack.sh finishes.
The problem is bash calling python and not handling the sys.exit(1): with set -e, the non-zero exit aborts the script immediately.
Therefore buildbot no longer has a process to kill, since the script died before finishing!
NOTE: on top of that, we call bash within bash: "bash -c '/Users/cltbld/Desktop/jetpack/tools/buildfarm/utils/run_jetpack.sh macosx'", but I don't think it is related.
[1] http://hg.mozilla.org/projects/addon-sdk/file/e5ec22f3ff31/python-lib/cuddlefish/__init__.py#l278
[2] http://hg.mozilla.org/build/tools/file/tip/buildfarm/utils/run_jetpack.sh#l46
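A stripped-down way to see the same behaviour outside buildbot, as a hypothetical repro script rather than the real run_jetpack.sh: with set -e, any command that exits non-zero (including python's sys.exit(1)) terminates the script on the spot, so the trailing echo never prints.

```sh
#!/bin/bash
# Hypothetical repro of the behaviour described in comment 16.
set -ex

echo "DEBUG: About to call python bin/cfx"
python -c 'import sys; sys.exit(1)'     # stand-in for: python bin/cfx testall ...
echo "DEBUG: I should have finished"    # never reached: set -e aborts the script here

# One way to keep the script alive and still propagate the failure would be:
#   python bin/cfx testall -a firefox -b "$APP_PATH" || rc=$?
#   echo "DEBUG: cfx exited with ${rc:-0}"
#   exit ${rc:-0}
```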
Comment 17•14 years ago
If you want to reproduce you can use the builds and instructions provided in here:
http://people.mozilla.com/~armenzg/builds/jetpack/run_jetpack.txt
Comment 18•14 years ago
So I can reproduce hanging a slave with a simple builder that has one step:
ShellCommand(command=['sh', '-c', 'sleep 1000000000'])
(sh for me is dash)
Once that starts, kill -9 the sh instance.
Now the slave is stuck. The current build can't be killed, and no new jobs can be accepted.
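For reference, a minimal builder along those lines might look like this in a 0.8-era master.cfg; the builder and slave names are placeholders, and `c` is the usual BuildmasterConfig dict.

```python
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

# One step that just sleeps. kill -9 the spawned `sh` on the slave and, per
# comment 18, the build can no longer be killed and no new jobs are accepted.
f = BuildFactory()
f.addStep(ShellCommand(command=['sh', '-c', 'sleep 1000000000'],
                       timeout=1200))

c['builders'] = [BuilderConfig(name='hang-repro',
                               slavenames=['example-slave'],
                               factory=f)]
```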
Comment 19•14 years ago
Many many years ago, the original buildbot used PTY groups to spawn processes, so that the children and grandchildren of the ShellCommand (and basically anything that didn't explicitly daemonize itself) could all be killed with a single os.kill(-pid). This worked well on Solaris, but for some reason it caused enough problems on other platforms that we changed the default to stop using PTYs and just use a regular pair of pipes. Without that trick, it's important to cut down on the amount of forking done by the child process. In some cases that means having one process os.exec() another instead of spawning it.
Is the problem that 'cfx testall' isn't tracking the Firefox/etc process that it spawned and making sure that's dead before it exits itself? Or is it that run_jetpack.sh is terminating early? I think having 'cfx test' exit with rc=1 upon failure is reasonable.
I'd assumed that it was a FF process that got left running, so I'm not clear on how it'd help to prevent run_jetpack.sh from terminating early on cfx error. Does that retain a chain-of-parenthood that could make it easier to clean up all the lingering processes?
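To make the process-group idea concrete, here is a rough POSIX sketch (not buildbot's actual code): run the ShellCommand's child in its own process group, then signal the whole group so Firefox and any other descendants go with it.

```python
import os
import signal
import subprocess

# Start the command in its own session/process group so the harness and any
# browser it spawns share one group id.
proc = subprocess.Popen(['sh', '-c', 'sleep 1000000000'],
                        preexec_fn=os.setsid)

def kill_whole_group(p):
    """Signal every process still in the child's group, which is what the
    old PTY-based buildbot achieved with os.kill(-pid)."""
    try:
        os.killpg(os.getpgid(p.pid), signal.SIGKILL)
    except OSError:
        pass  # the group is already gone
```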
Comment 20•14 years ago
Is this still happening? I seem to recall someone telling me recently that the hangs have been resolved.
Whiteboard: [triage:followup]
Assignee
Comment 21•14 years ago
I thought this was finished?
--> WFM/FIXED?
Comment 22•14 years ago
For now let's call it WFM and re-open if we see it happen again.
Note that we currently only run jetpack on projects/addon-sdk, not on m-c.
As we have seen in other bugs, for now there are discrepancies between how it is run on m-c vs. as a project job.
We will tackle those differences once we see those jobs green on projects/addon-sdk and want to enable them for m-c and other branches.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Updated•14 years ago
Whiteboard: [triage:followup]