Closed Bug 398113 Opened 17 years ago Closed 17 years ago

fx-win32-tbox & tbnewref-win32-tbox having occasional trouble with ssh connections, takes 2 hours to time out

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Attachments

(2 files)

The fx-win32-tbox builds that started at 18:31 on 2007-09-03 and at 14:52 2007-09-29 took around two and a quarter hours, where the usual build takes 15 minutes. Both builds end with a "Read from remote host stage.mozilla.org: Connection reset by peer" after the "ssh -2 -l cltbld stage.mozilla.org chmod -R 775 ..." line.

With talos and bl-bldxp01 both taking 45-50 minutes after the end of a build, those get us into a 3 hour regression window, which is tolerable enough on weekends, but pretty evil during busy checkin times.
Could this be related to the recent stage migration work late Thurs night / early Friday morning?
Priority: -- → P1
(In reply to comment #0)
> The fx-win32-tbox builds that started at 18:31 on 2007-09-03 and ...
> 
To clarify, did you mean 2007-09-30 or 2007-09-03?
* Looks from the Firefox tinderbox page that 2007-09-30 18:31 is the build in question

* in the ssh log on stage I see the connection but no errors, and the command seems to be timing out after 2 hours

* more data points - tbnewref-win32- build starting at 
  * 2007/09/30 22:46 
  * 2007/09/30 07:48
  * 2007/09/29 02:52
 have a similar problem (chmod'ing thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk)
Assignee: build → nrthomas
fx-win32-tbox had 482 ssh processes running, all for connections to the cvs server in the last three hours. I've killed the processes off and restarted the tinderbox. tbnewref-win32-tbox doesn't have any stray processes, so that may not be the smoking gun.

Still looking for other boxes with this problem, but so far it appears to be limited to the latest windows reference platform.

A new variation in the build starting at 2007/10/01 08:57 - cvs was hung checking out client.mk (right at the start of the build). netstat reported connection was ESTABLISHED, nothing else obviously wrong.

I restarted both of these boxes; possibly ssh got its nose tweaked out of joint when it couldn't contact stage during the migration (Friday morning).

Leaving this open for a day or two to see if we get any more events.
Summary: fx-win32-tbox having occasional trouble with its connection to stage, takes forever to time out → fx-win32-tbox & tbnewref-win32-tbox having occasional trouble with ssh connections, takes 2 hours to time out
No events since rebooting, closing.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1191457680.1191465873.27492.gz (Started 17:28 20071003)

"Read from remote host stage.mozilla.org: Connection reset by peer"

Sorry.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch: Local patch
Must have jinxed it :-( Thunderbird had the same problem at 2007/10/03 22:58, so here's a local patch for these two boxes, to try to get more info.
(In reply to comment #8)
> Created an attachment (id=283535) [details]
> Local patch

s/time/date/g on that patch, so that it actually works.
Some data from a couple of events:

fx-win32-tbox, build starting 2007/10/06 07:04 

ssh -2 -l cltbld stage.mozilla.org date; chmod -Rv 775 /home/ftp/pub/firefox/tinderbox-builds/FX-WIN32-TBOX-trunk; date
Sat Oct  6 07:20:11 PDT 2007
mode of `/home/ftp/pub/firefox/tinderbox-builds/FX-WIN32-TBOX-trunk' changed to 0775 (rwxrwxr-x)
mode of `/home/ftp/pub/firefox/tinderbox-builds/FX-WIN32-TBOX-trunk/firefox-3.0a9pre.en-US.win32.installer.exe' changed to 0775 (rwxrwxr-x)
mode of `/home/ftp/pub/firefox/tinderbox-builds/FX-WIN32-TBOX-trunk/firefox-3.0a9pre.en-US.win32.zip' changed to 0775 (rwxrwxr-x)
Sat Oct  6 07:20:11 PDT 2007

Read from remote host stage.mozilla.org: Connection reset by peer
----------------------------------------------------------------------------

tbnewref-win32-tbox, build starting 2007/10/05 23:17

ssh -2 -l cltbld stage.mozilla.org date;chmod -Rv 775 /home/ftp/pub/thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk;date
Fri Oct  5 23:36:52 PDT 2007
mode of `/home/ftp/pub/thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk' changed to 0775 (rwxrwxr-x)
mode of `/home/ftp/pub/thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk/thunderbird-3.0a1pre.en-US.win32.installer.exe' changed to 0775 (rwxrwxr-x)
mode of `/home/ftp/pub/thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk/thunderbird-3.0a1pre.en-US.win32.zip' changed to 0775 (rwxrwxr-x)
Fri Oct  5 23:36:52 PDT 2007

Read from remote host stage.mozilla.org: Connection reset by peer
----------------------------------------------------------------------------

So it appears the remote command completes without problems, but that ssh fails to "hang up".
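
One way to limit the damage while the root cause is unknown is to put a deadline on the ssh call, so a connection that never hangs up fails in minutes rather than two hours. The following is only a minimal sketch, assuming Python is available on the build box (the tinderbox scripts themselves are Perl); `run_with_deadline` and the 300-second deadline are hypothetical names and values, not part of the attached patch.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_seconds):
    """Run cmd and give up after deadline_seconds.

    Returns (exit_code, timed_out). exit_code is None when the deadline hit.
    """
    try:
        completed = subprocess.run(cmd, timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so no stray process is left behind by this call.
        return None, True
    return completed.returncode, False


# Hypothetical usage, mirroring the command from the build log:
#   run_with_deadline(
#       ["ssh", "-2", "-l", "cltbld", "stage.mozilla.org", "date"],
#       deadline_seconds=300,
#   )
```

This only bounds how long the build waits; it does not explain why ssh fails to disconnect.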
philor caught tbnewref-win32-tbox in the act, so we were able to do some debugging. Of the three ssh connections (ensuring the tinderbox-builds/TB... dir exists, scp'ing the builds up, fixing the permissions), it was actually the second that appeared to be hung, but at some point after the two files were transferred. Although that doesn't quite make sense, since the commands are not backgrounded and should be called consecutively. Attaching to sshd with strace (on stage) gave only a fragmentary "read(5,".

I've sprinkled some more date's around, and put some 2>&1 in so we actually catch stderr in the build log (thanks tinderbox!). That's on tbnewref-win32-tbox and fx-win32-tbox.
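
For anyone wanting the same kind of instrumentation outside the Perl script, here is a rough sketch of the idea in Python: log a timestamp before and after a command, and fold stderr into stdout (the equivalent of `2>&1`) so both end up in the same log. `run_logged` and its `log` parameter are hypothetical names; this is not the local patch.

```python
import subprocess
import sys
from datetime import datetime


def _stamp():
    """Current local time, to the second, for log prefixes."""
    return datetime.now().isoformat(timespec="seconds")


def run_logged(cmd, log=print):
    """Run cmd, logging start and end times plus all of its output."""
    log(f"[{_stamp()}] start: {' '.join(cmd)}")
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1
        text=True,
    )
    for line in completed.stdout.splitlines():
        log(line)
    log(f"[{_stamp()}] end: exit code {completed.returncode}")
    return completed.returncode
```

Note that this sketch buffers the output until the command exits, so for a command that hangs forever only the "start" line would be logged.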
QA Contact: mozpreed → build
This problem continues to occur occasionally. 

Here's a diff between the debug output when it does work ok, and when it doesn't. There's nothing obvious there AFAICT.
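
For reference, a comparison like this can be reproduced with a few lines of Python. This is only a sketch: `diff_logs` and the file labels `good.log` and `bad.log` are hypothetical, and the inputs are assumed to be lists of log lines with any timestamps already stripped so they do not show up as spurious differences.

```python
import difflib


def diff_logs(good_lines, bad_lines):
    """Return a unified diff (as a list of lines) between two debug logs."""
    return list(
        difflib.unified_diff(
            good_lines,
            bad_lines,
            fromfile="good.log",
            tofile="bad.log",
            lineterm="",  # inputs are assumed to have no trailing newlines
        )
    )
```

An empty result means the two runs produced identical debug output.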

Bug 400846 might be the same issue - it suggests the mkdir call is the culprit.
Have a look at: https://bugzilla.mozilla.org/show_bug.cgi?id=355309#c41

It looks similar, although it's interesting that the win32 trunk slave hung *every* time, while here it only hangs "occasionally".

Can you try the same workaround here, to see if it helps solve this bug too?


Oh, and when cvs hung for us on win32 trunk, it did so regardless of whether the directories already existed or had to be created. It would even hang if there were no files to update. So I suspect the mkdir comment above is a red herring, and the problem is to do with a process ending and ssh not disconnecting...
The try server had similar problems, upgrading msys ssh fixed it. There are details here:
https://bugzilla.mozilla.org/show_bug.cgi?id=400846#c2
Assignee: nrthomas → nobody
Status: REOPENED → NEW
Depends on: 402848
Priority: P1 → P3
bhearsum updated these two boxes to MozillaBuild v1.2, with its much later version of ssh; I've removed the local logging patches to post-mozilla-rel.pl.

Please reopen this bug if the problem recurs.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering