Closed
Bug 1425438
Opened 7 years ago
Closed 7 years ago
Idle time inaccurate when (and after) computer sleeps/hibernates/is not being watched
Categories
(Taskcluster :: Workers, defect)
Taskcluster
Workers
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Unassigned)
References
Details
This is a weird one - see the following log:
2017/12/14 22:04:07 No task claimed. Idle for 2h48m58.685070362s.
2017/12/14 22:05:08 No task claimed. Idle for 2h49m59.59994974s.
2017/12/14 22:06:09 No task claimed. Idle for 2h51m1.175527692s.
2017/12/14 22:07:27 Error: (Intermittent) HTTP response code 503
HTTP/1.1 503 Service Unavailable
Content-Length: 506
Cache-Control: no-cache, no-store
Connection: keep-alive
Content-Type: text/html; charset=utf-8
Date: Thu, 14 Dec 2017 21:07:26 GMT
Server: Cowboy
<!DOCTYPE html>
<html>
<head>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta charset="utf-8">
<title>Application Error</title>
<style media="screen">
html,body,iframe {
margin: 0;
padding: 0;
}
html,body {
height: 100%;
overflow: hidden;
}
iframe {
width: 100%;
height: 100%;
border: 0;
}
</style>
</head>
<body>
<iframe src="//www.herokucdn.com/error-pages/application-error.html"></iframe>
</body>
</html>
2017/12/14 22:07:48 No task claimed. Idle for 2h52m39.646402239s.
2017/12/14 23:34:33 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55564->54.225.234.26:443: read: operation timed out
2017/12/15 00:28:06 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55618->50.16.212.167:443: read: operation timed out
2017/12/15 01:38:44 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55677->54.221.201.38:443: read: operation timed out
2017/12/15 01:39:09 No task claimed. Idle for 3h7m32.254364563s.
2017/12/15 02:27:58 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55770->50.16.212.167:443: read: operation timed out
2017/12/15 02:33:56 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55852->54.221.201.38:443: read: connection reset by peer
2017/12/15 02:34:17 No task claimed. Idle for 3h13m36.563701508s.
2017/12/15 03:20:12 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55865->54.225.234.26:443: read: operation timed out
2017/12/15 04:11:12 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55914->50.16.212.167:443: read: operation timed out
2017/12/15 04:11:34 No task claimed. Idle for 3h23m8.189155353s.
2017/12/15 04:59:48 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56005->54.225.234.26:443: read: operation timed out
2017/12/15 05:00:09 No task claimed. Idle for 3h28m3.931194413s.
2017/12/15 05:56:25 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56091->54.225.234.26:443: read: operation timed out
2017/12/15 06:47:29 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56188->50.16.212.167:443: read: operation timed out
2017/12/15 06:59:24 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: dial tcp 54.221.201.38:443: connect: network is down
2017/12/15 06:59:49 No task claimed. Idle for 3h37m43.412501247s.
2017/12/15 08:01:41 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56244->54.225.234.26:443: read: operation timed out
2017/12/15 08:02:06 No task claimed. Idle for 3h43m19.58027416s.
2017/12/15 08:03:28 No task claimed. Idle for 3h44m28.14050157s.
2017/12/15 08:04:30 No task claimed. Idle for 3h45m29.91415128s.
2017/12/15 08:05:31 No task claimed. Idle for 3h46m30.986977671s.
2017/12/15 08:06:33 No task claimed. Idle for 3h47m32.733614223s.
No task was claimed during this period, so the idle times are inaccurate. Reverse engineering the last active time, we get:
2017/12/14 22:04:07 No task claimed. Idle for 2h48m58.685070362s. => idle since 19:15:08
2017/12/14 22:05:08 No task claimed. Idle for 2h49m59.59994974s. => idle since 19:15:08
2017/12/14 22:06:09 No task claimed. Idle for 2h51m1.175527692s. => idle since 19:15:08
2017/12/14 22:07:48 No task claimed. Idle for 2h52m39.646402239s. => idle since 19:15:08
2017/12/15 01:39:09 No task claimed. Idle for 3h7m32.254364563s. => idle since 22:31:37
2017/12/15 02:34:17 No task claimed. Idle for 3h13m36.563701508s. => idle since 23:20:40
2017/12/15 04:11:34 No task claimed. Idle for 3h23m8.189155353s. => idle since 00:48:26
2017/12/15 05:00:09 No task claimed. Idle for 3h28m3.931194413s. => idle since 01:32:05
2017/12/15 06:59:49 No task claimed. Idle for 3h37m43.412501247s. => idle since 03:22:06
2017/12/15 08:02:06 No task claimed. Idle for 3h43m19.58027416s. => idle since 04:18:46
2017/12/15 08:03:28 No task claimed. Idle for 3h44m28.14050157s. => idle since 04:19:00
2017/12/15 08:04:30 No task claimed. Idle for 3h45m29.91415128s. => idle since 04:19:00
2017/12/15 08:05:31 No task claimed. Idle for 3h46m30.986977671s. => idle since 04:19:00
2017/12/15 08:06:33 No task claimed. Idle for 3h47m32.733614223s. => idle since 04:19:00
During the period where the computer was possibly in some kind of hibernation mode, the idle times seem to get reduced. When the computer is fully awake around 08:00, the idle times don't recover, but remain relative to the previous calculation.
This is strange, because idle times are calculated *absolutely* rather than being cumulatively added. There is an absolute timestamp stored of when the worker was last active[1], and at each logging entry, the duration between the current time and the last active time is calculated[2]. The only time the last active timestamp is updated, is when a new task is claimed[1].
This is a mystery to me at the moment. Even if there were some very slow threads during the hibernation time, I would expect that things would recover when the machine went out of hibernation mode. The fact the timestamps are accurate after the machine wakes, makes me think it isn't a bug in go's interpretation of the current time.
One troubleshooting option I have is to output the last active timestamp in the log line too, and to log every time it is set or updated. This might help diagnose the problem.
[1] https://github.com/taskcluster/generic-worker/blob/95e83aa1f0c7d8aeb6e905fcfb9b8492e468b816/main.go#L608
[2] https://github.com/taskcluster/generic-worker/blob/95e83aa1f0c7d8aeb6e905fcfb9b8492e468b816/main.go#L615
Comment 1•7 years ago
|
||
Commit pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/897af5679d299859bfdeb6c8126be3d2e87c4654
Bug 1425438 - extra debugging info
Reporter | ||
Comment 2•7 years ago
|
||
So, having left this running, I now have the data I was after. Here is an example:
2017/12/18 03:14:49 No task claimed. Idle for 6h9m25.196602544s (since 2017-12-15 21:54:16.806064 +0100 CET m=+0.015021164).
Clearly the duration between "2017-12-15 21:54:16.806064 +0100 CET m=+0.015021164" and "2017/12/18 03:14:49" is more than 6h9m25.196602544s.
This logging output comes from link in comment 1 above.
Comment 3•7 years ago
|
||
Commit pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/ecc115fd90d9984320428797e6b4223283b41225
Bug 1425438 - even more logging to see why idle time calculation might be incorrect
Reporter | ||
Comment 4•7 years ago
|
||
OK this is looking more and more like a go bug:
2017/12/19 13:43:49 No task claimed at 2017-12-19 13:43:49.983474 +0100 CET m=+2339.061368752. Idle for 23m40.900620113s (since 2017-12-19 12:14:48.854239 +0100 CET m=+918.160748639).
which comes essentially from:
> now := time.Now()
> idleTime := now.Sub(lastActive)
> log.Printf("No task claimed at %v. Idle for %v (since %v).", now, idleTime, lastActive)
=>
lastActive = 2017-12-19 12:14:48.854239 +0100 CET m=+918.160748639
now = 2017-12-19 13:43:49.983474 +0100 CET m=+2339.061368752
idleTime = 23m40.900620113s
But idleTime should be the duration between the two timestamps, i.e. around 1h29m1s.
I'll see if I can create a small program to reproduce issue, and submit a golang bug report.
Reporter | ||
Comment 5•7 years ago
|
||
Comment 6•7 years ago
|
||
Commits pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/569e52e9fcf31463dac4e69ec669450e47b7f122
Revert "Bug 1425438 - even more logging to see why idle time calculation might be incorrect"
This reverts commit ecc115fd90d9984320428797e6b4223283b41225.
https://github.com/taskcluster/generic-worker/commit/7dfbec08660832e7cbb5eab4c3cace702456b2ca
Revert "Bug 1425438 - extra debugging info"
This reverts commit 897af5679d299859bfdeb6c8126be3d2e87c4654.
https://github.com/taskcluster/generic-worker/commit/e7aada1a7843b53c93a6568af33e9c9c386f3f6e
Bug 1425438 - use wall time durations rather than monotonic clock durations in time arithmetic
Reporter | ||
Comment 7•7 years ago
|
||
Fixed in generic-worker 10.4.1.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•6 years ago
|
Component: Generic-Worker → Workers
You need to log in
before you can comment on or make changes to this bug.
Description
•