Closed
Bug 398113
Opened 17 years ago
Closed 17 years ago
fx-win32-tbox & tbnewref-win32-tbox having occasional trouble with ssh connections, takes 2 hours to time out
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
Attachments
(2 files)
886 bytes, patch
12.61 KB, text/plain
The fx-win32-tbox builds that started at 18:31 on 2007-09-03 and at 14:52 2007-09-29 took around two and a quarter hours, where the usual build takes 15 minutes. Both builds end with a "Read from remote host stage.mozilla.org: Connection reset by peer" after the "ssh -2 -l cltbld stage.mozilla.org chmod -R 775 ..." line. With talos and bl-bldxp01 both taking 45-50 minutes after the end of a build, those get us into a 3 hour regression window, which is tolerable enough on weekends, but pretty evil during busy checkin times.
Comment 1•17 years ago
Could this be related to the recent stage migration work late Thurs night / early Friday morning?
Priority: -- → P1
Comment 2•17 years ago
(In reply to comment #0)
> The fx-win32-tbox builds that started at 18:31 on 2007-09-03 and ...
>
To clarify, did you mean 2007-09-30 or 2007-09-03?
Comment 3•17 years ago
* Looks from the Firefox tinderbox page that 2007-09-30 18:31 is the build in question
* in the ssh log on stage I see the connection but no errors, and the command seems to be timing out after 2 hours
* more data points - tbnewref-win32- build starting at
  * 2007/09/30 22:46
  * 2007/09/30 07:48
  * 2007/09/29 02:52
  have a similar problem (chmod'ing thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk)
Assignee: build → nrthomas
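Since the hung command only gives up after about two hours, one option is to put an upper bound on the upload step so that a stall fails fast. This is a minimal sketch, not what the tinderbox scripts do: it assumes GNU coreutils `timeout` is available on the build machine (not confirmed anywhere in this bug), and `run_bounded` is a hypothetical helper name.

```shell
#!/bin/sh
# run_bounded LIMIT CMD [ARGS...]
# Runs CMD under GNU coreutils `timeout` (assumed to be installed).
# Returns CMD's own exit status, or 124 if LIMIT was reached.
run_bounded() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # Report the stall on stderr so it shows up next to other errors.
        echo "command exceeded ${limit}: $*" >&2
    fi
    return "$status"
}

# Hypothetical use in the upload step (300 is an arbitrary limit in seconds,
# not a value taken from this bug):
#   run_bounded 300 ssh -2 -l cltbld stage.mozilla.org chmod -R 775 ...
```

A bound like this would only shorten the regression window; it would not explain why ssh fails to disconnect.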
Comment 4•17 years ago
fx-win32-tbox had 482 ssh processes running, all for connections to the cvs server in the last three hours. I've killed the processes off and restarted the tinderbox. tbnewref-win32-tbox doesn't have any stray processes, so that may not be the smoking gun. Still looking for other boxes with this problem, but so far it appears to be limited to the latest Windows reference platform.
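A check like the one described above (counting stray ssh processes on a box) can be sketched as follows. `count_named` is a hypothetical helper, and the `ps -e -o comm=` invocation in the usage comment assumes a POSIX-style `ps`; the MSYS `ps` on these Windows machines may need different flags.

```shell
#!/bin/sh
# count_named NAME
# Reads one command name per line on stdin and prints how many lines
# are exactly NAME (fixed-string, whole-line match).
count_named() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so normalise the exit status for callers.
    grep -c -x -F -- "$1" || true
}

# Hypothetical use on a box with a POSIX-style ps:
#   ps -e -o comm= | count_named ssh
```

Matching the whole line keeps `sshd` and similar names out of the count.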
Comment 5•17 years ago
A new variation in the build starting at 2007/10/01 08:57: cvs was hung checking out client.mk (right at the start of the build). netstat reported the connection was ESTABLISHED, and nothing else was obviously wrong. I restarted both these boxes; possibly ssh got its nose tweaked out of joint when it couldn't contact stage during the migration (Friday morning). Leaving this open for a day or two to see if we get any more events.
Summary: fx-win32-tbox having occasional trouble with its connection to stage, takes forever to time out → fx-win32-tbox & tbnewref-win32-tbox having occasional trouble with ssh connections, takes 2 hours to time out
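The netstat check mentioned in comment 5 can be narrowed to ssh connections with a small filter. This is a sketch under assumptions: `established_ssh` is a hypothetical helper, it expects `netstat -an`-style lines where the state is the last column and the remote address is the one before it, and it treats port 22 as the ssh port.

```shell
#!/bin/sh
# established_ssh
# Reads netstat-style lines on stdin and prints those whose last field
# is ESTABLISHED and whose remote address (second-to-last field) ends in :22.
established_ssh() {
    # Strip carriage returns first, since Windows tools may emit CRLF.
    tr -d '\r' | awk '$NF == "ESTABLISHED" && $(NF-1) ~ /:22$/'
}

# Hypothetical use:
#   netstat -an | established_ssh
```

As the comment notes, the hung cvs connection still showed as ESTABLISHED, so this only narrows down which connections to inspect; it does not show that they are healthy.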
Comment 6•17 years ago
No events since rebooting, closing.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Reporter
Comment 7•17 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1191457680.1191465873.27492.gz (Started 17:28 20071003)
"Read from remote host stage.mozilla.org: Connection reset by peer"
Sorry.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 8•17 years ago
Must have jinxed it :-( Thunderbird had the same problem at 2007/10/03 22:58, so here's a local patch for these two boxes, to try to get more info.
Comment 9•17 years ago
(In reply to comment #8)
> Created an attachment (id=283535) [details]
> Local patch
s/time/date/g on that patch, so that it actually works.
Comment 10•17 years ago
Some data from a couple of events:

fx-win32-tbox, build starting 2007/10/06 07:04
ssh -2 -l cltbld stage.mozilla.org date; chmod -Rv 775 /home/ftp/pub/firefox/tinderbox-builds/FX-WIN32-TBOX-trunk; date
Sat Oct 6 07:20:11 PDT 2007
mode of `/home/ftp/pub/firefox/tinderbox-builds/FX-WIN32-TBOX-trunk' changed to 0775 (rwxrwxr-x)
mode of `/home/ftp/pub/firefox/tinderbox-builds/FX-WIN32-TBOX-trunk/firefox-3.0a9pre.en-US.win32.installer.exe' changed to 0775 (rwxrwxr-x)
mode of `/home/ftp/pub/firefox/tinderbox-builds/FX-WIN32-TBOX-trunk/firefox-3.0a9pre.en-US.win32.zip' changed to 0775 (rwxrwxr-x)
Sat Oct 6 07:20:11 PDT 2007
Read from remote host stage.mozilla.org: Connection reset by peer
----------------------------------------------------------------------------
tbnewref-win32-tbox, build starting 2007/10/05 23:17
ssh -2 -l cltbld stage.mozilla.org date;chmod -Rv 775 /home/ftp/pub/thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk;date
Fri Oct 5 23:36:52 PDT 2007
mode of `/home/ftp/pub/thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk' changed to 0775 (rwxrwxr-x)
mode of `/home/ftp/pub/thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk/thunderbird-3.0a1pre.en-US.win32.installer.exe' changed to 0775 (rwxrwxr-x)
mode of `/home/ftp/pub/thunderbird/tinderbox-builds/TBNEWREF-WIN32--trunk/thunderbird-3.0a1pre.en-US.win32.zip' changed to 0775 (rwxrwxr-x)
Fri Oct 5 23:36:52 PDT 2007
Read from remote host stage.mozilla.org: Connection reset by peer
----------------------------------------------------------------------------

So it appears the remote command completes without problems, but that ssh fails to "hang up".
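The remote `date` calls in the logs above show that the chmod finished within a second on stage, but they cannot show how long the local ssh client then waited. A local timestamp on either side of the ssh call would capture that. The following is a minimal sketch, not the attached patch: `stamp_run` is a hypothetical helper, and the quoting in the usage comment is an assumption about how the remote command string is passed.

```shell
#!/bin/sh
# stamp_run LABEL CMD [ARGS...]
# Prints a local timestamp before and after CMD, plus CMD's exit status,
# and returns that status to the caller.
stamp_run() {
    label=$1
    shift
    echo "[$label] start $(date)"
    "$@"
    status=$?
    echo "[$label] end   $(date) (exit $status)"
    return "$status"
}

# Hypothetical use: the inner `date` calls run on stage, the outer
# stamps run on the build box (<dir> stands for the tinderbox-builds path):
#   stamp_run chmod ssh -2 -l cltbld stage.mozilla.org 'date; chmod -Rv 775 <dir>; date'
```

A gap between the last remote timestamp and the local "end" line would measure how long ssh took to return after the remote command finished.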
Comment 11•17 years ago
philor caught tbnewref-win32-tbox in the act, so we were able to do some debugging. Of the three ssh connections (ensuring the tinderbox-builds/TB... dir exists, scp'ing the builds up, fixing the permissions), it was actually the second that appeared to be hung, but at some point after the two files were transferred. That doesn't quite make sense, though, since the commands are not backgrounded and should be called consecutively. Attaching to sshd with strace (on stage) gave only a fragmentary "read(5,". I've sprinkled some more date calls around, and put some 2>&1 in so we actually catch stderr in the build log (thanks tinderbox!). That's on tbnewref-win32-tbox and fx-win32-tbox.
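The `2>&1` change matters because ssh diagnostics normally go to stderr, which a log that only captures stdout will miss. A minimal sketch of the redirection, using a hypothetical `log_both` helper rather than the actual change to the build scripts:

```shell
#!/bin/sh
# log_both LOGFILE CMD [ARGS...]
# Appends both stdout and stderr of CMD to LOGFILE.
log_both() {
    logfile=$1
    shift
    # Order matters: stdout is pointed at the file first, then stderr is
    # duplicated onto it. Writing `2>&1 >>file` would leave stderr on the
    # original stream instead.
    "$@" >>"$logfile" 2>&1
}

# Hypothetical use (build.log is an illustrative file name):
#   log_both build.log ssh -2 -l cltbld stage.mozilla.org date
```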
Updated•17 years ago
QA Contact: mozpreed → build
Comment 12•17 years ago
This problem continues to occur occasionally. Here's a diff between the debug output when it does work ok, and when it doesn't. There's nothing obvious there AFAICT. Bug 400846 might be the same issue - it suggests the mkdir call is the culprit.
Comment 13•17 years ago
Have a look at: https://bugzilla.mozilla.org/show_bug.cgi?id=355309#c41
It looks similar, although it's interesting that the win32 trunk slave hung *every* time, while here it only hangs "occasionally". Can you try the same workaround here, to see if it helps with this bug too?

Oh, and when cvs hung for us on win32 trunk, it did so regardless of whether the directories already existed or had to be created. It would even hang if there were no files to update. So I suspect the mkdir comment above is a red herring, and the problem is to do with the ending of a process and ssh not disconnecting...
Comment 14•17 years ago
The try server had similar problems; upgrading msys ssh fixed it. There are details here: https://bugzilla.mozilla.org/show_bug.cgi?id=400846#c2
Comment 15•17 years ago
bhearsum updated these two boxes to MozillaBuild v1.2, with its much later version of ssh; I've removed the local logging patches to post-mozilla-rel.pl. Please reopen this bug if the problem recurs.
Status: NEW → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → FIXED
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering