Closed Bug 1147271 Opened 9 years ago Closed 7 years ago

Intermittent Windows build subprocess.CalledProcessError: Command '['ssh', '-o', 'IdentityFile=/c/Users/cltbld/.ssh/ffxbld_rsa', '-o', 'BatchMode=yes', 'ffxbld@upload.ffxbld.productdelivery.prod.mozaws.net', 'mktemp -d']' returned non-zero exit status 255

Categories

(Release Engineering :: General, defect, P1)

x86_64
Windows Server 2003

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: catlee)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1145507 +++
seems that got pretty frequent in the last days :(
{'builders': [('WINNT 5.2 mozilla-inbound build', 23),
              ('WINNT 5.2 mozilla-inbound leak test build', 20),
              ('WINNT 6.1 x86-64 mozilla-inbound build', 19),
              ('WINNT 6.1 x86-64 mozilla-inbound leak test build', 17),
              ('WINNT 5.2 mozilla-inbound pgo-build', 16),
              ('WINNT 5.2 b2g-inbound pgo-build', 12),
              ('WINNT 6.1 x86-64 fx-team leak test build', 11),
              ('WINNT 5.2 fx-team leak test build', 11),
              ('WINNT 5.2 fx-team build', 10),
              ('WINNT 6.1 x86-64 mozilla-inbound pgo-build', 9),
              ('WINNT 5.2 fx-team pgo-build', 8),
              ('WINNT 6.1 x86-64 fx-team build', 8),
              ('WINNT 6.1 x86-64 b2g-inbound pgo-build', 7),
              ('WINNT 5.2 b2g-inbound leak test build', 4),
              ('WINNT 6.1 x86-64 b2g-inbound leak test build', 4),
              ('WINNT 6.1 x86-64 mozilla-aurora build', 4),
              ('Win32 Mulet mozilla-inbound build', 4),
              ('WINNT 5.2 mozilla-aurora build', 4),
              ('WINNT 6.1 x86-64 fx-team pgo-build', 3),
              ('WINNT 6.1 x86-64 try leak test build', 3),
              ('WINNT 6.1 x86-64 mozilla-central leak test build', 3),
              ('WINNT 5.2 try leak test build', 3),
              ('WINNT 5.2 mozilla-central leak test build', 2),
              ('Win32 Mulet try build', 2),
              ('WINNT 5.2 mozilla-central pgo-build', 2),
              ('WINNT 5.2 mozilla-central nightly', 2),
              ('WINNT 6.1 x86-64 b2g-inbound build', 2),
              ('b2g_emulator_vm mozilla-inbound opt test mochitest-6', 1),
              ('WINNT 5.2 mozilla-central build', 1),
              ('WINNT 5.2 mozilla-aurora leak test build', 1),
              ('WINNT 6.1 x86-64 oak leak test build', 1),
              ('WINNT 6.1 x86-64 mozilla-central nightly', 1),
              ('Win32 Mulet fx-team build', 1),
              ('WINNT 5.2 try build', 1),
              ('WINNT 5.2 b2g-inbound build', 1),
              ('WINNT 6.1 x86-64 mozilla-aurora leak test build', 1),
              ('WINNT 6.1 x86-64 mozilla-central build', 1)],
 'repos': [('mozilla-inbound', 109),
           ('fx-team', 52),
           ('b2g-inbound', 30),
           ('mozilla-central', 12),
           ('mozilla-aurora', 10),
           ('try', 9),
           ('oak', 1)],
 'slaves': [('b-2008-ix-0069', 5),
            ('b-2008-ix-0109', 5),
            ('b-2008-ix-0171', 5),
            ('b-2008-ix-0140', 5),
            ('b-2008-ix-0144', 5),
            ('b-2008-ix-0137', 4),
            ('b-2008-ix-0107', 4),
            ('b-2008-ix-0101', 4),
            ('b-2008-ix-0148', 4),
            ('b-2008-ix-0119', 4),
            ('b-2008-ix-0072', 4),
            ('b-2008-ix-0151', 4),
            ('b-2008-ix-0005', 4),
            ('b-2008-ix-0084', 4),
            ('b-2008-ix-0135', 3),
            ('b-2008-ix-0134', 3),
            ('b-2008-ix-0139', 3),
            ('b-2008-ix-0147', 3),
            ('b-2008-ix-0102', 3),
            ('b-2008-ix-0118', 3),
            ('b-2008-ix-0113', 3),
            ('b-2008-ix-0170', 3),
            ('b-2008-ix-0163', 3),
            ('b-2008-ix-0078', 3),
            ('b-2008-ix-0076', 3),
            ('b-2008-ix-0075', 3),
            ('b-2008-ix-0093', 3),
            ('b-2008-ix-0096', 3),
            ('b-2008-ix-0145', 3),
            ('b-2008-ix-0065', 3),
            ('b-2008-ix-0156', 3),
            ('b-2008-ix-0015', 3),
            ('b-2008-ix-0006', 3),
            ('b-2008-ix-0002', 3),
            ('b-2008-ix-0125', 3),
            ('b-2008-ix-0086', 3),
            ('b-2008-ix-0131', 2),
            ('b-2008-ix-0130', 2),
            ('b-2008-ix-0108', 2),
            ('b-2008-ix-0106', 2),
            ('b-2008-ix-0105', 2),
            ('b-2008-ix-0100', 2),
            ('b-2008-ix-0115', 2),
            ('b-2008-ix-0116', 2),
            ('b-2008-ix-0166', 2),
            ('b-2008-ix-0167', 2),
            ('b-2008-ix-0070', 2),
            ('b-2008-ix-0071', 2),
            ('b-2008-ix-0142', 2),
            ('b-2008-ix-0146', 2),
            ('b-2008-ix-0068', 2),
            ('b-2008-ix-0067', 2),
            ('b-2008-ix-0066', 2),
            ('b-2008-ix-0154', 2),
            ('b-2008-ix-0157', 2),
            ('b-2008-ix-0012', 2),
            ('b-2008-ix-0013', 2),
            ('b-2008-ix-0014', 2),
            ('b-2008-ix-0088', 2),
            ('b-2008-ix-0007', 2),
            ('b-2008-ix-0083', 2),
            ('b-2008-ix-0009', 2),
            ('b-2008-ix-0126', 2),
            ('b-2008-ix-0127', 2),
            ('b-2008-ix-0120', 2),
            ('b-2008-ix-0128', 2),
            ('b-2008-ix-0033', 1),
            ('b-2008-ix-0030', 1),
            ('b-2008-ix-0037', 1),
            ('b-2008-ix-0039', 1),
            ('b-2008-ix-0136', 1),
            ('b-2008-ix-0149', 1),
            ('b-2008-ix-0117', 1),
            ('b-2008-ix-0111', 1),
            ('b-2008-ix-0110', 1),
            ('b-2008-ix-0112', 1),
            ('b-2008-ix-0051', 1),
            ('tst-linux64-spot-1837', 1),
            ('b-2008-ec2-0003', 1),
            ('b-2008-ec2-0002', 1),
            ('b-2008-ix-0169', 1),
            ('b-2008-ix-0161', 1),
            ('b-2008-ix-0164', 1),
            ('b-2008-ix-0165', 1),
            ('b-2008-ix-0079', 1),
            ('b-2008-ix-0074', 1),
            ('b-2008-ix-0090', 1),
            ('b-2008-ix-0091', 1),
            ('b-2008-ix-0094', 1),
            ('b-2008-ix-0097', 1),
            ('b-2008-ix-0098', 1),
            ('b-2008-ix-0143', 1),
            ('b-2008-ix-0141', 1),
            ('b-2008-ix-0061', 1),
            ('b-2008-ix-0062', 1),
            ('b-2008-ix-0011', 1),
            ('b-2008-ix-0017', 1),
            ('b-2008-ix-0001', 1),
            ('b-2008-ix-0121', 1),
            ('b-2008-ix-0122', 1),
            ('b-2008-ix-0123', 1),
            ('b-2008-ix-0081', 1)],
 'times': {'2015-03-28': 1,
           '2015-03-29': 1,
           '2015-03-30': 3,
           '2015-04-01': 2,
           '2015-04-02': 1,
           '2015-04-03': 1,
           '2015-04-07': 2,
           '2015-04-08': 1,
           '2015-04-13': 3,
           '2015-04-14': 1,
           '2015-04-15': 2,
           '2015-04-16': 1,
           '2015-04-17': 2,
           '2015-04-18': 1,
           '2015-04-20': 1,
           '2015-04-24': 2,
           '2015-05-05': 1,
           '2015-05-06': 5,
           '2015-05-07': 4,
           '2015-05-08': 1,
           '2015-05-12': 2,
           '2015-05-15': 1,
           '2015-05-18': 1,
           '2015-05-20': 4,
           '2015-05-21': 2,
           '2015-05-22': 1,
           '2015-05-25': 1,
           '2015-05-26': 14,
           '2015-05-27': 60,
           '2015-05-28': 1,
           '2015-05-29': 1,
           '2015-06-01': 1,
           '2015-06-02': 1,
           '2015-06-03': 1,
           '2015-06-04': 4,
           '2015-06-05': 1,
           '2015-06-10': 3,
           '2015-06-11': 1,
           '2015-06-12': 1,
           '2015-06-15': 2,
           '2015-06-16': 1,
           '2015-06-18': 1,
           '2015-06-23': 1,
           '2015-06-24': 3,
           '2015-06-26': 3,
           '2015-06-30': 1,
           '2015-07-01': 3,
           '2015-07-03': 1,
           '2015-07-06': 1,
           '2015-07-07': 3,
           '2015-07-10': 1,
           '2015-07-11': 1,
           '2015-07-13': 1,
           '2015-07-14': 4,
           '2015-07-15': 2,
           '2015-07-16': 2,
           '2015-07-20': 1,
           '2015-07-28': 1,
           '2015-08-04': 4,
           '2015-08-05': 40,
           '2015-08-06': 9}}
This is currently #3 on OrangeFactor and is causing Windows builds to fail at an unacceptably high rate. I'm going to have to close trees at the rate things are failing if we can't find anybody to actively work on this.
Flags: needinfo?(jgriffin)
Flags: needinfo?(coop)
I can file the suite of bugs required to get python upgraded to 2.7.10 across all our builders, but as we discussed on IRC, that doesn't really address the issue of what caused this to start happening in the first place.

I *think* the earliest incarnation I can see is https://bugzilla.mozilla.org/show_bug.cgi?id=1145507#c1 Bisecting a Windows build timeout sounds like the worst thing ever, so maybe we could try to bisect the OS X version (bug 1191877) instead? Who has the bisection tools needed to do this?
Flags: needinfo?(coop)
Ted, see gps' comments in bug 1191877. Something in the symbolstore is causing us extreme build timeout heartburn lately across all platforms, though Windows appears to the worst-affected.
Flags: needinfo?(ted)
Most of these latest occurrences aren't related to the symbolstore, but they _do_ show some kind of problem with the build.  Ted, can you look at this or suggest someone else who would be more appropriate?
Flags: needinfo?(jgriffin)
I think the patches in bug 1164816 might make this go away, they rewrite the job handling in symbolstore.py. I'll fix the thing that was preventing me from landing them and get them landed.
See Also: → 1193287
From the two logs I looked at (comment 613, comment 612), something is hanging during configure. It would be nice if we could get a screenshot when this happens so we can see if there are any dialogs hanging out. (I'm pretty sure there's a bug filed on that.)
Flags: needinfo?(ted)
I have try runs that are permafailing with this error:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2496e0cf7df4
https://treeherder.mozilla.org/#/jobs?repo=try&revision=833184627d44

At this point, I'm tempted to use inbound as try for this particular patch.
Severity: normal → major
See Also: → 1318765
Depends on: 1337807
:gbrown, do you believe that bug 1337807 will resolve most of these errors?
I don't know that we have enough knowledge to even guess at that. It's a very small fix that can fix certain kinds of issues that could manifest this way, so we should look at the stats here after we get it landed. If it's still occurring regularly we can look into other fixes.
Ted we have a pretty consistent pattern of failures here, what other options should we look into.  Luckily these are not happening so much that we see them every other push, but we do see failures daily.
Flags: needinfo?(ted)
Whiteboard: [stockwell needswork]
(In reply to Joel Maher ( :jmaher) from comment #692)
> Ted we have a pretty consistent pattern of failures here, what other options
> should we look into.  Luckily these are not happening so much that we see
> them every other push, but we do see failures daily.

I don't have any other potential solutions at hand, but I think our best course of action here would be to figure out how to get screenshots when these failures occur. Since these are all on Windows my best guess would be that we're still getting some sort of error dialog that's blocking execution. If we can determine whether that's the case then we can deal with it. If not, we'll have to investigate other options.

Looking at a few of the logs from orangefactor in comment 691 I note that there is an odd failure mode. The mozprocess call to invoke the build via `mach build` is timing out and erroring:
11:19:32     INFO - Automation Error: mozprocess timed out after 4800 seconds running ['c:\\mozilla-build\\python27\\python.exe', 'mach', '--log-no-times', 'build', '-v']
11:19:32    ERROR - timed out after 4800 seconds of no output

... but then buildbot is timing out anyway:
command timed out: 10800 seconds without output running ['c:/mozilla-build/python27/python', '-u', 'scripts/scripts/fx_desktop_build.py', '--config', 'builds/releng_base_windows_32_builds.py', '--custom-build-variant-cfg', 'debug', '--config', 'balrog/production.py', '--branch', 'autoland', '--build-pool', 'production'], attempting to kill

... and then the buildbot client errors:
remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: exceptions.RuntimeError: SIGKILL failed to kill process
]

This pattern was repeated in 3 different logs I looked at. Clearly something is failing or hanging in a way that's causing timeouts, but something in our automation isn't able to handle these timeouts and exit cleanly.
Flags: needinfo?(ted)
In terms of getting a screenshot, we do build a screenshot tool as part of the build:
https://dxr.mozilla.org/mozilla-central/source/testing/tools/screenshot/moz.build

If it built successfully it should be in `$(DIST)/bin/screenshot.exe`. I note that the logs show a post-run mozharness step running, maybe we could add one for Windows builds that looks to see if screenshot.exe exists, and if so run it to capture a screenshot? I don't think we'll be able to upload it anywhere, since this can happen either after the upload step has finished, or the upload step may not run at all due to the build failing, but we could have the Python code read the screenshot file and output a data: URL in the log, which should at least be usable.
no failures in the last week and only a few the week before- is this fixed?
Whiteboard: [stockwell needswork] → [stockwell unknown]
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
:ted, sorry to keep asking you to look at windows build failures, but we have way too many of them looking at this bug and bug 1311861- is there someone else we should be asking to load balance?
Flags: needinfo?(ted)
Whiteboard: [stockwell unknown] → [stockwell needswork]
I definitely don't have cycles to look into this right now. You could check with other build peers to see if they can handle this.
Flags: needinfo?(ted)
Chris, would you have some time to look at this failure (which might be related to bug 1311861)?
Flags: needinfo?(cmanchester)
Looking at a handful of these, they all occur in the automation steps, and the conspicuous missing step in the output is from the upload step. We're using --output-sync=target here, and according to the gmake docs for it that means "Output from the entire recipe for each target is grouped and printed once the target is complete", so a hang there might explain why we're not seeing any output for it.

I'll push to try with --output-sync=none to see if this reproduces and there's any interesting output in there.

From looking at a handful of examples of bug 1311861 as well these might as well be the same bug. The very little I can see of a pattern in the output for this failure is the same between them all.
Flags: needinfo?(cmanchester)
This reproduced twice out of several retriggers in: https://treeherder.mozilla.org/#/jobs?repo=try&revision=597575de448439465ec6d6c7b3bb3abfb967c195

This seems to confirm we're hanging in `make upload`: in both cases we see the command line to invoke upload.py but none of the output from output.py, so it seems to be getting stuck fairly early on.
By adding "-o BatchMode=yes" (thanks to Ted for the suggestion) to our ssh command line in upload.py these fail quickly with:

Host key verification failed.

so I guess we need to fix known_hosts on some of our builders.

I'll post a patch to make these fail quickly and at least stop tying up windows builders in hours long cascades of timeouts.
Comment on attachment 8866884 [details]
Bug 1147271 - Add ssh option to prevent hanging on password prompt in upload.py

https://reviewboard.mozilla.org/r/138488/#review141786

Nice find! So in cases where we have a bad config or something, we'll just fail earlier rather than hit a timeout, right?
Attachment #8866884 - Flags: review?(mshal) → review+
Pushed by cmanchester@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ccd0ac82b8bd
Add ssh option to prevent hanging on password prompt in upload.py r=mshal
Comment on attachment 8866884 [details]
Bug 1147271 - Add ssh option to prevent hanging on password prompt in upload.py

https://reviewboard.mozilla.org/r/138488/#review141786

Right, this option disables passphrase/password querying.
We're sorry - something has gone wrong while rewriting or rebasing your commits. The commits being pushed no longer match what was requested. Please file a bug.
With a patch to print the contexts of ~/.ssh/known_hosts from upload.py, a passing build had about 100 entries, including upload.trybld.productdelivery.prod.mozaws.net, and a failing build couldn't find ~/.ssh/known_hosts.

I'm out of my depth here, catlee, can you find someone to help track this down the rest of the way?
Flags: needinfo?(catlee)
(In reply to Chris Manchester (:chmanchester) from comment #712)
> With a patch to print the contexts of ~/.ssh/known_hosts from upload.py, a
> passing build had about 100 entries, including
> upload.trybld.productdelivery.prod.mozaws.net, and a failing build couldn't
> find ~/.ssh/known_hosts.

Pretty sure this is supposed to get installed via puppet or whatever, so this is some sort of provisioning issue.
Summary: Intermittent Windows build Automation Error: mozprocess timed out after 4800 seconds running ['c:\\mozilla-build\\python27\\python.exe', 'mach', '--log-no-times', 'build', '-v'] → Intermittent Windows build subprocess.CalledProcessError: Command '['ssh', '-o', 'IdentityFile=/c/Users/cltbld/.ssh/ffxbld_rsa', '-o', 'BatchMode=yes', 'ffxbld@upload.ffxbld.productdelivery.prod.mozaws.net', 'mktemp -d']' returned non-zero exit status 255
:arr, do you know about known hosts on windows and how we install those?
Flags: needinfo?(arich)
Whiteboard: [stockwell needswork] → [stockwell infra]
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #714)
> (In reply to Chris Manchester (:chmanchester) from comment #712)
> > With a patch to print the contexts of ~/.ssh/known_hosts from upload.py, a
> > passing build had about 100 entries, including
> > upload.trybld.productdelivery.prod.mozaws.net, and a failing build couldn't
> > find ~/.ssh/known_hosts.
> 
> Pretty sure this is supposed to get installed via puppet or whatever, so
> this is some sort of provisioning issue.

Is it possible that the file is getting deleted at some point somehow?
Flags: needinfo?(catlee)
with this happening 50+ times/day, can we get someone to own this?  This is causing a lot of failures for all developers.
Flags: needinfo?(arich) → needinfo?(mcornmesser)
As far as the known host file, I verified that is being deployed in its entirety to the b-2008 golden instance, and it is on the instances that spawn from the golden AMI.

I can take a deeper look into this, but I will need an instance in which the build failed on and the corresponding log.
Flags: needinfo?(mcornmesser)
Nice catch with BatchMode=yes - we have that in ~/.ssh/config for a couple of wildcard hostnames, but they don't catch the *.mozaws.net upload hosts. Thunderbird jobs still use the buildbot factories so they still hang up.

Based on comment #723 we're seeing this on instances in both use1 and usw2, it started on May 1 and is getting worse (or we're getting better at starring with this bug). 

I found b-2008-spot-032 running (and actually with a green history) and tried running this in cmd.exe
 ssh -o IdentityFile=/c/Users/cltbld/.ssh/ffxbld_rsa -o BatchMode=yes ffxbld@upload.ffxbld.productdelivery.prod.mozaws.net -vvv
which reproduced the Host key verification error. It tried to read known_hosts at lots of locations (eg in /etc) but not /c/Users/cltbld/.ssh/known_hosts. Setting HOME=/c/Users/cltbld let the ssh command work. 

But if you look in a failing log (like https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-win32-pgo/1494736229/autoland-win32-pgo-bm77-build1-build28.txt.gz) you can see HOME is set correctly when fx_desktop_build.py is invoked, but there is still an upload failure. So this may well be false hope.

I'll try to catch a slave soon after an error and investigate it.
A little more progress. https://treeherder.mozilla.org/logviewer.html#?job_id=99012742&repo=try&lineNumber=5634 has a ssh -vvv on try, and ssh did try to read /c/users/cltbld/.ssh/known_hosts but still hit the host verification error. This seems to rule out $HOME not being set, and the next step is to check if that file was actually there (aka catlee's idea about purge_builds removing things it shouldn't).

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3add78e536d9c81e6828018435d6247491f6673e has a go at that, lots of jobs requested and very likely to spawn new instances. Unclear yet if the latter makes any difference - AFAICT they either come up working or not, rather than intermittent once up, but spot instances can be recycled fairly quickly so it's not easy to tell.
Taking a look at the instance that had a failed build; It appears that much of the contents of c:\users\cltbld\.ssh had been deleted.
A few notes from debugging today:
- instances can come up working and then change to a broken state later
- it looks like we have ssh-agent environment variables in some of the logs, which is weird

According to OF, this error seems to have spiked around May 4th/5th. The timing lines up with bug 1321168 and bug 1362356. We've backed out the c4.4xlarge patch for now to see if that makes a difference.
Remaining c4 instances were stopped:
15:06 PDT <arr> everything should be c3s from now on
Assignee: nobody → catlee
Priority: -- → P1
this looks to be fixed :)
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
(In reply to OrangeFactor Robot from comment #733)
> 74 failures in 777 pushes (0.095 failures/push) were associated with this
> bug in the last 7 days. 
Is there way to determine if these errors occurred before or after last Tuesday?  


> This is the #10 most frequent failure this week.  
> 
> ** This failure happened more than 30 times this week! Resolving this bug is
> a high priority. **
> 
> ** Try to resolve this bug as soon as possible. If unresolved for 2 weeks,
> the affected test(s) may be disabled. ** 
> 
> Repository breakdown:
> * autoland: 39
> * mozilla-inbound: 28
> * mozilla-esr52: 5
> * mozilla-beta: 2
> 
> Platform breakdown:
> * windowsxp: 40
> * windows8-64: 27
> * windows2012-32: 3
> * linux64: 3
> * windows2012-64: 1
> 
> For more details, see:
> https://brasstacks.mozilla.com/orangefactor/
> ?display=Bug&bugid=1147271&startday=2017-05-15&endday=2017-05-21&tree=all
Click through on the brasstacks link - in the table there are no instances after May 15, so we're good.
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: