Closed
Bug 781860
Opened 13 years ago
Closed 12 years ago
pesky keepalives for buildbot
Categories
(Release Engineering :: General, defect, P3)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cransom, Assigned: catlee)
Details
(Whiteboard: [buildbot][reit])
Attachments
(2 files, 1 obsolete file)
patch, 1.33 KB (rail: review+, catlee: checked-in+)
patch, 1.78 KB (dustin: review+, catlee: checked-in+)
We keep butting heads against firewalls that time out stale sessions (30 minutes with no traffic) while the buildbot connection is still alive. There are several provisions to keep this from happening; after chatting with Dustin, the most useful one is built into buildbot but is broken in the currently deployed build.
The next option is enabling kernel/TCP-level keepalives, which is independent of the application code (as long as the code makes the necessary syscall to enable it) and, from a cursory glance, is being called in Twisted. If it is being set, a one-line change on the build masters would lower the keepalive interval to 10 minutes (or anything less than 30) and we'd never have a problem with a stateful device again. I tried verifying this on a buildmaster by checking whether SO_KEEPALIVE was set on the established connections, but lsof's -T f option is not supported on this platform.
The extreme win would be enabling the same tweaks on the client side: if the client is driving the connection, then even a severe network disturbance (a firewall reboot or anything else that drops all state) would just cause the sessions to be rebuilt on the firewall without dropping the connection to the buildmaster. But I would be ecstatic if we got server-side keepalives started, and I'm more than happy to give you whatever time I have next week (and beyond) to work on this.
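(As a reference sketch only, assuming Python on the master: the "necessary syscall" mentioned above boils down to one setsockopt call per connection; on Linux, how soon and how often probes are then sent comes from the kernel's net.ipv4.tcp_keepalive_* sysctls. This is not the patch that eventually landed in this bug.)
import socket

def enable_keepalive(sock):
    # The one per-connection call the application has to make; the probe
    # schedule then comes from the kernel (net.ipv4.tcp_keepalive_time etc.
    # on Linux), which is what the sysctl tweak discussed below adjusts.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)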
Comment 1•13 years ago
A read through the source on the master shows no calls to setTcpKeepAlive, although it is called on the slave. The problem with the sysctl tweak is that not all of the slaves are Linux!
So, this may require a small code change, below. I'll verify this with the in-use version of Buildbot before suggesting the patch.
diff --git a/master/buildbot/buildslave.py b/master/buildbot/buildslave.py
index e7e1d65..65ed03a 100644
--- a/master/buildbot/buildslave.py
+++ b/master/buildbot/buildslave.py
@@ -270,6 +270,8 @@ class AbstractBuildSlave(config.ReconfigurableServiceMixin, pb.Avatar,
if self.slave_status:
self.slave_status.recordConnectTime()
+ # make sure we're using TCP keepalives
+ mind.broker.transport.setTcpKeepAlive(1)
if self.isConnected():
# duplicate slave - send it to arbitration
in addition to a sysctl.conf entry, something like:
net.ipv4.tcp_keepalive_time = 240
I'm going to add something similar to the above patch in the latest buildbot. The backport should be simple.
Assignee: server-ops-releng → dustin
Comment 2•13 years ago
OK, I've verified that the masters are *not* setting keepalive, so the above patch would be required.
Here's the patch I committed upstream:
https://github.com/buildbot/buildbot/commit/29a30c7157ffc53f997b747894f4c3f8215285e0
So, my recommendation would be to apply the patch in comment 1 to hgmo/build/buildbot, and to apply the sysctl in comment 1 using puppet.
Assignee: dustin → nobody
Component: Server Operations: RelEng → Release Engineering: Automation (General)
QA Contact: arich → catlee
Reporter
Comment 3•13 years ago
Is there any idea when we can get a few minutes of someone's time to apply the very tiny patch?
Assignee
Updated•13 years ago
OS: Mac OS X → All
Priority: -- → P3
Hardware: x86 → All
Whiteboard: [buildbot]
Comment 4•13 years ago
Ping on this.
Comment 5•13 years ago
A few notes from my digging back into this:
This bug is *strictly* about TCP keepalives, not application-level keepalives.
We need at least one end of the connection to have SO_KEEPALIVE set *and* have the keepalive duration shorter than 30m (1800s). We know how to do that on Linux, but not really on OS X, much less Windows. The suggestion in comment 1 is to set SO_KEEPALIVE and change the duration (using sysctl) on the masters, since they're all Linux. 240s is well under the 1800s limit, smaller than strictly required.
(In reply to Ravi Pina [:ravi] from comment #4)
> Ping on this.
Ravi: due to bug 817597 I believe we now understand the entire picture better; planning for which long-term solution we use will happen in bug 817838.
(It sounds like we have some per-datacenter configurations in place now, and it makes sense to sort this out before the next wave of test machines arrives.)
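(Regarding the OS X / Windows point above, a hedged sketch of the per-socket knobs, for reference only; none of this is what landed in this bug, and availability of the non-Linux constants depends on the Python and OS version.)
import socket

def tune_keepalive(sock, idle=240, interval=75, count=9):
    # Always safe: turn keepalive probing on for this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Linux: override the idle time, probe interval and probe count per socket.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    elif hasattr(socket, "TCP_KEEPALIVE"):
        # OS X: only the idle time is settable per socket this way.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, idle)
    elif hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: idle time and interval are given in milliseconds; the probe
        # count cannot be set through this ioctl.
        sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, idle * 1000, interval * 1000))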
Assignee
Comment 7•12 years ago
Attachment #688984 - Flags: review?(rail)
Assignee
Comment 8•12 years ago
So I tried applying the given patch to my master. I stopped the slave partway through a build and waited to see what happened.
The build step gets interrupted with:
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
However, the socket is still active according to netstat:
tcp 0 705 10.12.48.14:9049 10.134.65.114:34762 ESTABLISHED 26583/python
Assignee
Comment 9•12 years ago
The socket finally closed itself 30 minutes later.
/proc/sys/net/ipv4/tcp_keepalive_time is set to 240.
Assignee
Comment 10•12 years ago
...which makes sense given that /proc/sys/net/ipv4/tcp_keepalive_probes is 9.
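(As a rough back-of-the-envelope, assuming standard Linux keepalive semantics: an idle connection to a dead peer is torn down roughly tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl seconds after the last traffic. The intvl value on these masters isn't quoted in this bug; the sketch below just reads whatever the host has configured.)
def keepalive_knob(name):
    # The keepalive tunables live under /proc/sys/net/ipv4/ on Linux.
    with open("/proc/sys/net/ipv4/" + name) as f:
        return int(f.read())

idle = keepalive_knob("tcp_keepalive_time")       # 240 on these masters
probes = keepalive_knob("tcp_keepalive_probes")   # 9 here
interval = keepalive_knob("tcp_keepalive_intvl")  # stock Linux default is 75

print("worst case to drop a dead idle peer: ~%ds" % (idle + probes * interval))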
Assignee
Comment 11•12 years ago
The patch for our version of buildbot looks like this:
diff -r 39c79564a0dd master/buildbot/master.py
--- a/master/buildbot/master.py Wed Dec 05 09:09:46 2012 -0500
+++ b/master/buildbot/master.py Wed Dec 05 19:31:31 2012 -0800
@@ -321,6 +321,12 @@
# record when this connection attempt occurred
sl.recordConnectTime()
+ # try to use TCP keepalives
+ try:
+ mind.broker.transport.setTcpKeepAlive(1)
+ except:
+ pass
+
if sl.isConnected():
# uh-oh, we've got a duplicate slave. The most likely
# explanation is that the slave is behind a slow link, thinks we
Comment 12•12 years ago
Comment on attachment 688984 [details] [diff] [review]
hacky sysctl replacement for old puppet
Review of attachment 688984 [details] [diff] [review]:
-----------------------------------------------------------------
::: modules/sysctl/manifests/init.pp
@@ +1,5 @@
> +define sysctl($value) {
> + exec {
> + "sysctl-$name":
> + command => "/sbin/sysctl $name=$value",
> + onlyif => "/sbin/sysctl $name | grep -q -v $value";
A nit. You can use sysctl -n to print only values.
Attachment #688984 - Flags: review?(rail) → review+
Assignee
Comment 13•12 years ago
Comment on attachment 688984 [details] [diff] [review]
hacky sysctl replacement for old puppet
Landed (with nit addressed) as http://hg.mozilla.org/build/puppet-manifests/rev/a04b7a57a19f
Attachment #688984 - Flags: checked-in+
Comment 14•12 years ago
(In reply to Chris AtLee [:catlee] from comment #10)
> ...which makes sense given that /proc/sys/net/ipv4/tcp_keepalive_probes is 9.
That explains the mystery of why I suggested 240s. Thanks :)
Assignee
Comment 15•12 years ago
Committed http://hg.mozilla.org/build/buildbot/rev/878fe8b798d6 and merged to the production-0.8 branch.
Comment 16•12 years ago
So far I have gracefully restarted the following masters:
buildbot-master30.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master12.build.scl1.mozilla.com,scl1,build
buildbot-master32.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,build
Comment 17•12 years ago
Updated list of restarted masters:
buildbot-master34.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master16.build.scl1.mozilla.com,scl1,tests
buildbot-master21.build.scl1.mozilla.com,scl1,tests
buildbot-master06.build.scl1.mozilla.com,scl1,tests
buildbot-master25.build.scl1.mozilla.com,scl1,build
buildbot-master04.build.scl1.mozilla.com,scl1,tests
buildbot-master31.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master33.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master49.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master30.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master12.build.scl1.mozilla.com,scl1,build
buildbot-master32.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,build
Still to be done:
buildbot-master11.build.scl1.mozilla.com,scl1,tests
buildbot-master15.build.scl1.mozilla.com,scl1,tests
buildbot-master17.build.scl1.mozilla.com,scl1,tests
buildbot-master18.build.scl1.mozilla.com,scl1,tests
buildbot-master19.build.mtv1.mozilla.com,mtv1,tests
buildbot-master20.build.mtv1.mozilla.com,mtv1,tests
buildbot-master22.build.mtv1.mozilla.com,mtv1,tests
buildbot-master10.build.mtv1.mozilla.com,mtv1,tests
buildbot-master23.build.scl1.mozilla.com,scl1,tests
buildbot-master24.build.scl1.mozilla.com,scl1,tests
buildbot-master29.build.scl1.mozilla.com,scl1,tests
buildbot-master39.build.scl1.mozilla.com,scl1,tests
buildbot-master40.build.scl1.mozilla.com,scl1,tests
buildbot-master41.build.scl1.mozilla.com,scl1,tests
buildbot-master42.build.scl1.mozilla.com,scl1,tests
buildbot-master43.build.scl1.mozilla.com,scl1,tests
buildbot-master44.build.scl1.mozilla.com,scl1,tests
buildbot-master45.build.scl1.mozilla.com,scl1,tests
buildbot-master46.build.scl1.mozilla.com,scl1,tests
buildbot-master47.build.scl1.mozilla.com,scl1,tests
buildbot-master48.build.scl1.mozilla.com,scl1,tests
buildbot-master37.srv.releng.scl3.mozilla.com,scl3,tests
buildbot-master38.srv.releng.scl3.mozilla.com,scl3,tests
buildbot-master14.build.scl1.mozilla.com,scl1,try
buildbot-master13.build.scl1.mozilla.com,scl1,build
buildbot-master49.srv.releng.scl3.mozilla.com,scl3,build
Comment 18•12 years ago
I've just checked the looong running console. Looks like all masters except disabled 39-48 have been restarted. \o/
Reporter
Comment 19•12 years ago
I checked the sessions for buildbot-master30.srv.releng.scl3 and all the sessions are being refreshed:
{primary:node0}
cransom@fw1.console.releng.scl3.mozilla.net> show security flow session destination-prefix 10.26.48.17 node 1 | grep Timeout
Session ID: 20008042, Policy name: srv--buildbot/639, State: Active, Timeout: 86400, Valid
Session ID: 20025542, Policy name: srv--buildbot/628, State: Active, Timeout: 86384, Valid
Session ID: 20026039, Policy name: srv--buildbot/639, State: Active, Timeout: 86252, Valid
Session ID: 20033856, Policy name: srv--buildbot/639, State: Active, Timeout: 86396, Valid
Session ID: 20045716, Policy name: srv--buildbot/639, State: Active, Timeout: 86384, Valid
Session ID: 20048100, Policy name: srv--buildbot/639, State: Active, Timeout: 86334, Valid
Session ID: 20052266, Policy name: srv--buildbot/628, State: Active, Timeout: 86398, Valid
Session ID: 20054563, Policy name: srv--buildbot/628, State: Active, Timeout: 86398, Valid
Session ID: 20055319, Policy name: srv--buildbot/639, State: Active, Timeout: 86378, Valid
Session ID: 20056382, Policy name: srv--buildbot/639, State: Active, Timeout: 86380, Valid
Session ID: 20057310, Policy name: srv--buildbot/628, State: Active, Timeout: 86392, Valid
Session ID: 20058677, Policy name: srv--buildbot/639, State: Active, Timeout: 86400, Valid
Session ID: 20061768, Policy name: srv--buildbot/628, State: Active, Timeout: 86400, Valid
Session ID: 20077167, Policy name: srv--buildbot/639, State: Active, Timeout: 86396, Valid
Session ID: 20086502, Policy name: srv--buildbot/639, State: Active, Timeout: 86226, Valid
Session ID: 20090456, Policy name: srv--buildbot/639, State: Active, Timeout: 86180, Valid
Session ID: 20104611, Policy name: srv--buildbot/639, State: Active, Timeout: 86386, Valid
Session ID: 20124304, Policy name: srv--buildbot/628, State: Active, Timeout: 86206, Valid
Session ID: 20128454, Policy name: srv--buildbot/628, State: Active, Timeout: 86388, Valid
Session ID: 20133859, Policy name: srv--buildbot/628, State: Active, Timeout: 86334, Valid
Session ID: 20134289, Policy name: srv--buildbot/639, State: Active, Timeout: 86258, Valid
Session ID: 20143284, Policy name: srv--buildbot/628, State: Active, Timeout: 86232, Valid
Session ID: 20159358, Policy name: srv--buildbot/639, State: Active, Timeout: 86350, Valid
Session ID: 20166053, Policy name: srv--buildbot/628, State: Active, Timeout: 86234, Valid
Session ID: 20197499, Policy name: srv--buildbot/639, State: Active, Timeout: 86166, Valid
Session ID: 20205137, Policy name: srv--buildbot/639, State: Active, Timeout: 86254, Valid
Session ID: 20227633, Policy name: srv--buildbot/628, State: Active, Timeout: 86192, Valid
Session ID: 20243742, Policy name: srv--buildbot/639, State: Active, Timeout: 86368, Valid
Session ID: 20251673, Policy name: srv--buildbot/639, State: Active, Timeout: 86236, Valid
Session ID: 20291262, Policy name: srv--buildbot/628, State: Active, Timeout: 86338, Valid
Session ID: 20295651, Policy name: srv--buildbot/628, State: Active, Timeout: 86400, Valid
Session ID: 20339685, Policy name: srv--buildbot/628, State: Active, Timeout: 86168, Valid
Session ID: 20344187, Policy name: srv--buildbot/628, State: Active, Timeout: 86354, Valid
Session ID: 20364402, Policy name: srv--buildbot/639, State: Active, Timeout: 86218, Valid
Session ID: 20365422, Policy name: srv--buildbot/628, State: Active, Timeout: 86400, Valid
Session ID: 20369174, Policy name: srv--buildbot/639, State: Active, Timeout: 86220, Valid
Session ID: 20372030, Policy name: srv--buildbot/639, State: Active, Timeout: 86186, Valid
Session ID: 20374719, Policy name: srv--buildbot/628, State: Active, Timeout: 86276, Valid
Session ID: 20382909, Policy name: srv--buildbot/628, State: Active, Timeout: 86368, Valid
Session ID: 20396374, Policy name: srv--buildbot/639, State: Active, Timeout: 86288, Valid
86400 is the max timeout value, so that looks good so far.
The catalyst for this bug was dev-master01.build.scl1.mozilla.com, which hasn't been updated yet as far as I can tell. Is that on the list to be done as well?
Reporter
Comment 20•12 years ago
Can anyone confirm whether the releng-side fixes are complete so I can verify and close this?
Assignee
Comment 21•12 years ago
Yes, all the releng-side changes are complete. Users with instances on dev-master01 may need to update their buildbot master code with the fix, but we don't need to block on that.
Assignee
Comment 22•12 years ago
We should take other sysctl settings from http://hg.mozilla.org/build/puppet/rev/ef848dbe991c and http://hg.mozilla.org/build/puppet/rev/a382ac1232a6 as well.
Assignee
Comment 23•12 years ago
Attachment #715926 - Flags: review?(dustin)
Assignee
Comment 24•12 years ago
Attachment #715926 - Attachment is obsolete: true
Attachment #715926 - Flags: review?(dustin)
Attachment #715928 - Flags: review?(dustin)
Updated•12 years ago
Attachment #715928 - Flags: review?(dustin) → review+
Assignee
Updated•12 years ago
Attachment #715928 - Flags: checked-in+
Assignee
Updated•12 years ago
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 25•12 years ago
Is it possible this might have made bug 710942 worse? On 2/22 the incidence jumped from less than 0.5/day to 4, then on 2/23 to 9, and it has stayed high since. philor says it might be load-related from the tests, but would we need to be starving the Windows kernel of CPU time for 180s for that to be plausible?
Flags: needinfo?(catlee)
Comment 26•12 years ago
Bug 844648 is also suspiciously well lined up with attachment 715928 [details] [diff] [review] being deployed. I'm wondering if we need some larger values for the timeouts.
Comment 27•12 years ago
If this is happening often enough, could we capture a tcpdump of it from the master side? Most systems use TCP offload these days, so the ACK would come from the NIC, or at least require very little interaction from the slave kernel.
Assignee
Comment 28•12 years ago
These masters are mostly KVM guests as well. I think that means the kernel itself is also subject to being CPU-starved, especially if other guests on the same host are very busy.
Anything in particular we should be looking for in the tcpdump? FIN/RST packets?
Flags: needinfo?(catlee)
Comment 29•12 years ago
All of the TCP envelopes in the tcpdump, really: do we see keepalives, what kind of latency is there on ACKs, do those stop at some point, etc. Basically, if you can capture the last hour or so of a connection that eventually fails, that'd be best.
Assignee
Comment 30•12 years ago
Caught some that were disconnected as part of a reconfig. They're in root@bm35:/root/tcpdump-screen.log
I'm not sure these are the types of disconnects we're looking for though.
Comment 31•12 years ago
OK, here's the capture of a failure on talos-r3-xp-034 (10.12.50.142). Note
that communication here is within the VLAN, so no firewalls are involved.
> 21:59:48.624287 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3637152:3638612, ack 188153, win 65335, length 1460
> 21:59:48.624438 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3638612:3640072, ack 188153, win 65335, length 1460
> 21:59:48.624631 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3640072:3641532, ack 188153, win 65335, length 1460
> a 21:59:48.660293 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3641532, win 5840, length 0
> 21:59:48.660609 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3641532:3642992, ack 188153, win 65335, length 1460
> 21:59:48.660717 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3642992:3644452, ack 188153, win 65335, length 1460
> 21:59:48.661246 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3644452:3645912, ack 188153, win 65335, length 1460
> 21:59:48.661258 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3645912:3647372, ack 188153, win 65335, length 1460
> b 21:59:48.705296 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> c 21:59:48.957470 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:48.957484 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:49.613745 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:49.613767 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:50.816921 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:50.816954 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:53.223199 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:53.223232 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:58.035811 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:58.035829 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 22:00:07.660991 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 22:00:07.661025 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> d 22:00:23.925080 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 59860, length 0
> e 22:00:24.036399 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:24.360296 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:25.008314 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:26.304386 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:28.896300 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> f 22:01:05.185304 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 65535, length 16
> 22:01:46.657318 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 65535, length 16
> g 22:01:46.657995 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [R], seq 188643936, win 0, length 0
> h 22:02:46.842839 IP 10.12.50.142.1053 > 10.12.49.16.9201: Flags [S], seq 2514350768, win 65535, options [mss 1460,nop,nop,sackOK], length 0
At (a), the slave is pushing data to the master, and the master is ack'ing it.
The master's receive window is getting small, probably because the process is
not reading from the socket. At (b) that window shrinks to zero, but all
transmitted data is ack'd. At (c), the slave sends a window probe. The timing
is a bit odd - usually probes aren't sent until 5s after the window closes.
The slave continues to send probes at intervals of 0.3, 0.6, 1.2, 2.5, 5, and 9
seconds, until the master's window opens up at (d).
At this point, the master begins sending 16 bytes of payload (e), and
retransmits those six times. At (f), the master's window opens up a bit more -
another read from the socket by the process. Given the timings of the earlier
window probes, the next probe should have arrived sometime around 22:00:26, if
the slave hadn't received any packets from the master.
Finally, at (g), the slave sends a RST, basically saying "what the heck are you
talking about". The sequence number in that packet is unrelated to anything
above, suggesting that the slave has no memory of the earlier connection. The
slave immediately reconnects to the master at (h).
There are no keepalive packets here (those look like re-transmits of the last
ack'd byte).
My best guess as to what's happening here is that the slave rebooted
unexpectedly.
Assignee
Comment 32•12 years ago
Hmmm... it's odd that it reconnects almost exactly a minute later. I don't think the slaves can reboot that quickly. Maybe just the TCP connection died, and the buildslave retried a minute later.
In any case, the log for this event on the slave is gone.
How do we get some more insight into what is going on in the final moments of a job like this?
Comment 33•12 years ago
The slaves reboot pretty quickly - something around 45s. So it's not unreasonable.
And if the TCP connection had died but the kernel stayed up, it would still have a record of the sequence number, so it's far more likely that the kernel also restarted.
Updated•12 years ago
Product: mozilla.org → Release Engineering
Updated•7 years ago
Component: General Automation → General