Closed Bug 1425438 Opened 7 years ago Closed 7 years ago

Idle time inaccurate when (and after) computer sleeps/hibernates/is not being watched

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Unassigned)

References

Details

This is a weird one - see the following log: 2017/12/14 22:04:07 No task claimed. Idle for 2h48m58.685070362s. 2017/12/14 22:05:08 No task claimed. Idle for 2h49m59.59994974s. 2017/12/14 22:06:09 No task claimed. Idle for 2h51m1.175527692s. 2017/12/14 22:07:27 Error: (Intermittent) HTTP response code 503 HTTP/1.1 503 Service Unavailable Content-Length: 506 Cache-Control: no-cache, no-store Connection: keep-alive Content-Type: text/html; charset=utf-8 Date: Thu, 14 Dec 2017 21:07:26 GMT Server: Cowboy <!DOCTYPE html> <html> <head> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta charset="utf-8"> <title>Application Error</title> <style media="screen"> html,body,iframe { margin: 0; padding: 0; } html,body { height: 100%; overflow: hidden; } iframe { width: 100%; height: 100%; border: 0; } </style> </head> <body> <iframe src="//www.herokucdn.com/error-pages/application-error.html"></iframe> </body> </html> 2017/12/14 22:07:48 No task claimed. Idle for 2h52m39.646402239s. 2017/12/14 23:34:33 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55564->54.225.234.26:443: read: operation timed out 2017/12/15 00:28:06 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55618->50.16.212.167:443: read: operation timed out 2017/12/15 01:38:44 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55677->54.221.201.38:443: read: operation timed out 2017/12/15 01:39:09 No task claimed. Idle for 3h7m32.254364563s. 2017/12/15 02:27:58 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55770->50.16.212.167:443: read: operation timed out 2017/12/15 02:33:56 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55852->54.221.201.38:443: read: connection reset by peer 2017/12/15 02:34:17 No task claimed. Idle for 3h13m36.563701508s. 2017/12/15 03:20:12 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55865->54.225.234.26:443: read: operation timed out 2017/12/15 04:11:12 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55914->50.16.212.167:443: read: operation timed out 2017/12/15 04:11:34 No task claimed. Idle for 3h23m8.189155353s. 2017/12/15 04:59:48 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56005->54.225.234.26:443: read: operation timed out 2017/12/15 05:00:09 No task claimed. Idle for 3h28m3.931194413s. 2017/12/15 05:56:25 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56091->54.225.234.26:443: read: operation timed out 2017/12/15 06:47:29 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56188->50.16.212.167:443: read: operation timed out 2017/12/15 06:59:24 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: dial tcp 54.221.201.38:443: connect: network is down 2017/12/15 06:59:49 No task claimed. Idle for 3h37m43.412501247s. 2017/12/15 08:01:41 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56244->54.225.234.26:443: read: operation timed out 2017/12/15 08:02:06 No task claimed. Idle for 3h43m19.58027416s. 2017/12/15 08:03:28 No task claimed. Idle for 3h44m28.14050157s. 2017/12/15 08:04:30 No task claimed. Idle for 3h45m29.91415128s. 2017/12/15 08:05:31 No task claimed. Idle for 3h46m30.986977671s. 2017/12/15 08:06:33 No task claimed. Idle for 3h47m32.733614223s. No task was claimed during this period, so the idle times are inaccurate. Reverse engineering the last active time, we get: 2017/12/14 22:04:07 No task claimed. Idle for 2h48m58.685070362s. => idle since 19:15:08 2017/12/14 22:05:08 No task claimed. Idle for 2h49m59.59994974s. => idle since 19:15:08 2017/12/14 22:06:09 No task claimed. Idle for 2h51m1.175527692s. => idle since 19:15:08 2017/12/14 22:07:48 No task claimed. Idle for 2h52m39.646402239s. => idle since 19:15:08 2017/12/15 01:39:09 No task claimed. Idle for 3h7m32.254364563s. => idle since 22:31:37 2017/12/15 02:34:17 No task claimed. Idle for 3h13m36.563701508s. => idle since 23:20:40 2017/12/15 04:11:34 No task claimed. Idle for 3h23m8.189155353s. => idle since 00:48:26 2017/12/15 05:00:09 No task claimed. Idle for 3h28m3.931194413s. => idle since 01:32:05 2017/12/15 06:59:49 No task claimed. Idle for 3h37m43.412501247s. => idle since 03:22:06 2017/12/15 08:02:06 No task claimed. Idle for 3h43m19.58027416s. => idle since 04:18:46 2017/12/15 08:03:28 No task claimed. Idle for 3h44m28.14050157s. => idle since 04:19:00 2017/12/15 08:04:30 No task claimed. Idle for 3h45m29.91415128s. => idle since 04:19:00 2017/12/15 08:05:31 No task claimed. Idle for 3h46m30.986977671s. => idle since 04:19:00 2017/12/15 08:06:33 No task claimed. Idle for 3h47m32.733614223s. => idle since 04:19:00 During the period where the computer was possibly in some kind of hibernation mode, the idle times seem to get reduced. When the computer is fully awake around 08:00, the idle times don't recover, but remain relative to the previous calculation. This is strange, because idle times are calculated *absolutely* rather than being cumulatively added. There is an absolute timestamp stored of when the worker was last active[1], and at each logging entry, the duration between the current time and the last active time is calculated[2]. The only time the last active timestamp is updated, is when a new task is claimed[1]. This is a mystery to me at the moment. Even if there were some very slow threads during the hibernation time, I would expect that things would recover when the machine went out of hibernation mode. The fact the timestamps are accurate after the machine wakes, makes me think it isn't a bug in go's interpretation of the current time. One troubleshooting option I have is to output the last active timestamp in the log line too, and to log every time it is set or updated. This might help diagnose the problem. [1] https://github.com/taskcluster/generic-worker/blob/95e83aa1f0c7d8aeb6e905fcfb9b8492e468b816/main.go#L608 [2] https://github.com/taskcluster/generic-worker/blob/95e83aa1f0c7d8aeb6e905fcfb9b8492e468b816/main.go#L615
So, having left this running, I now have the data I was after. Here is an example: 2017/12/18 03:14:49 No task claimed. Idle for 6h9m25.196602544s (since 2017-12-15 21:54:16.806064 +0100 CET m=+0.015021164). Clearly the duration between "2017-12-15 21:54:16.806064 +0100 CET m=+0.015021164" and "2017/12/18 03:14:49" is more than 6h9m25.196602544s. This logging output comes from link in comment 1 above.
Commit pushed to master at https://github.com/taskcluster/generic-worker https://github.com/taskcluster/generic-worker/commit/ecc115fd90d9984320428797e6b4223283b41225 Bug 1425438 - even more logging to see why idle time calculation might be incorrect
OK this is looking more and more like a go bug: 2017/12/19 13:43:49 No task claimed at 2017-12-19 13:43:49.983474 +0100 CET m=+2339.061368752. Idle for 23m40.900620113s (since 2017-12-19 12:14:48.854239 +0100 CET m=+918.160748639). which comes essentially from: > now := time.Now() > idleTime := now.Sub(lastActive) > log.Printf("No task claimed at %v. Idle for %v (since %v).", now, idleTime, lastActive) => lastActive = 2017-12-19 12:14:48.854239 +0100 CET m=+918.160748639 now = 2017-12-19 13:43:49.983474 +0100 CET m=+2339.061368752 idleTime = 23m40.900620113s But idleTime should be the duration between the two timestamps, i.e. around 1h29m1s. I'll see if I can create a small program to reproduce issue, and submit a golang bug report.
Commits pushed to master at https://github.com/taskcluster/generic-worker https://github.com/taskcluster/generic-worker/commit/569e52e9fcf31463dac4e69ec669450e47b7f122 Revert "Bug 1425438 - even more logging to see why idle time calculation might be incorrect" This reverts commit ecc115fd90d9984320428797e6b4223283b41225. https://github.com/taskcluster/generic-worker/commit/7dfbec08660832e7cbb5eab4c3cace702456b2ca Revert "Bug 1425438 - extra debugging info" This reverts commit 897af5679d299859bfdeb6c8126be3d2e87c4654. https://github.com/taskcluster/generic-worker/commit/e7aada1a7843b53c93a6568af33e9c9c386f3f6e Bug 1425438 - use wall time durations rather than monotonic clock durations in time arithmetic
See Also: → 1424986
Fixed in generic-worker 10.4.1.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers
You need to log in before you can comment on or make changes to this bug.