Closed Bug 1473990 Opened 7 years ago Closed 7 years ago

Persistent workers (expiration not expected)

Categories

(Taskcluster :: Services, enhancement)


Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: dhouse, Assigned: jhford)

Details

Is there a way to configure the releng-hardware provisioner workers so that none of them ever expire unless we specifically remove them? We have workers that stop taking work (broken for some reason) and are then expired/disappear, but we never expect our workers to go away on their own just because they are not taking work. If we must set an expiration, then instead of setting it to some far-off date, can we get alerts (through taskcluster-notify, maybe to email/irc?) when a worker expires?
Hi Dave, I'm not sure what you mean by workers expiring. Workers don't currently have a concept of expiration themselves. Do you mean their credentials expiring? Workers are also free to show up and go away on their own schedule, as long as they have the appropriate taskcluster credentials to claim work.
Flags: needinfo?(dhouse)
Hi John. I've included an example of what I mean below. What is the taskcluster way of keeping worker state? The workers in the releng-hardware provisioner will not physically go away unless we manually remove them, so we want to keep the history of tasks and see the worker through the api even when a worker has not claimed work for N days.

------

We have workers that "disappear" or "expire" if they do not run a task in 24 hours. When this happens, the worker explorer does not find the worker, and direct queries to the api say the worker is not found. Here is an example for a worker that (from the logs) last completed a task 24+6.5 hours before this query:

```
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-186

Error ResourceNotFound
Worker with workerId t-yosemite-r7-186, workerGroup mdc2, worker-type gecko-t-osx-1010
and provisionerId releng-hardware not found. Are you sure it was created?

{
  "method": "getWorker",
  "params": {
    "provisionerId": "releng-hardware",
    "workerType": "gecko-t-osx-1010",
    "workerGroup": "mdc2",
    "workerId": "t-yosemite-r7-186"
  },
  "payload": {},
  "time": "2018-07-13T15:05:07.841Z"
}
```

```
--2018-07-13 08:09:34--  https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc2/t-yosemite-r7-186
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving queue.taskcluster.net (queue.taskcluster.net)... 23.23.82.166, 54.235.174.77, 54.225.200.206
Connecting to queue.taskcluster.net (queue.taskcluster.net)|23.23.82.166|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-07-13 08:09:35 ERROR 404: Not Found.
```

The last task completed with an error (timeout) the day before:

```
Jul 12 00:57:56 t-yosemite-r7-186.test.releng.mdc2.mozilla.com generic-worker[317]: 2018/07/12 00:57:56 Executing command 0: ["python2.7" "-u" "mozharness/scripts/desktop_unittest.py" "--cfg" "mozharness/configs/unittests/mac_unittest.py" "--mochitest-suite=browser-chrome-chunked" "--e10s" "--installer-url" "https://queue.taskcluster.net/v1/task/XH0GFYJMR32xTBQ8I9s42A/artifacts/public/build/target.dmg" "--test-packages-url" "https://queue.taskcluster.net/v1/task/XH0GFYJMR32xTBQ8I9s42A/artifacts/public/build/target.test_packages.json" "--download-symbols" "true" "--mochitest-suite=browser-chrome-chunked" "--e10s" "--total-chunk=7" "--this-chunk=2"]
[...]
Jul 12 01:25:55 t-yosemite-r7-186 com.apple.xpc.launchd[1] (com.apple.xpc.launchd.domain.pid.ScreenSaverEngine.1039): Path not allowed in target domain: type = uid, path = /System/Library/Frameworks/AppKit.framework/Versions/C/XPCServices/SandboxedServiceRunner.xpc/Contents/MacOS/SandboxedServiceRunner error = 1: Operation not permitted, origin = /System/Library/Frameworks/ScreenSaver.framework/Versions/A/Resources/ScreenSaverEngine.app
[...]
01:12:04 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
01:12:04 ERROR - Force-terminating active process(es).
01:12:04 INFO - Determining child pids from psutil...
01:12:04 INFO - []
01:12:04 INFO - Found child pids: set([])
01:12:04 INFO - Killing process: 883
01:12:04 INFO - TEST-INFO | started process screencapture
01:28:44 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1531382273/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1531382273/build/tests/mochitest/runtests.py', '--total-chunks', '7', '--this-chunk', '2', '--appname=/Users/cltbld/tasks/task_1531382273/build/application/Firefox NightlyDebug.app/Contents/MacOS/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=/Users/cltbld/tasks/task_1531382273/build/symbols', '--certificate-path=tests/certs', '--quiet', '--log-raw=/Users/cltbld/tasks/task_1531382273/build/blobber_upload_dir/browser-chrome-chunked_raw.log', '--log-errorsummary=/Users/cltbld/tasks/task_1531382273/build/blobber_upload_dir/browser-chrome-chunked_errorsummary.log', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=/Users/cltbld/tasks/task_1531382273/build', '--log-raw=-', '--flavor=browser', '--chunk-by-runtime']
01:28:44 ERROR - timed out after 1000 seconds of no output
01:28:44 ERROR - Return code: -15
01:28:44 ERROR - No suite end message was emitted by this harness.
[...]
01:28:45 WARNING - returning nonzero exit status 2
```
Flags: needinfo?(dhouse)
Ahh, in the provisioner explorer. That's because the information exposed in that tool is based on what the queue sees. I cannot remember the defined period for how long we store this state off hand, but we should send out messages over rabbitmq when tasks complete. Listening to the RabbitMQ messages would provide the information needed and could be stored as long as needed. Does that sound like a reasonable approach?
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #3)
> Ahh, in the provisioner explorer. That's because the information exposed in
> that tool is based on what the queue sees. I cannot remember the defined
> period for how long we store this state off hand, but we should send out
> messages over rabbitmq when tasks complete. Listening to the RabbitMQ
> messages would provide the information needed and could be stored as long as
> needed. Does that sound like a reasonable approach?

That sounds possible. So we'd need to stand something up to listen for messages. And we would watch for the worker expired messages?
(In reply to Dave House [:dhouse] from comment #4)
> (In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #3)
> > Ahh, in the provisioner explorer. That's because the information exposed in
> > that tool is based on what the queue sees. I cannot remember the defined
> > period for how long we store this state off hand, but we should send out
> > messages over rabbitmq when tasks complete. Listening to the RabbitMQ
> > messages would provide the information needed and could be stored as long as
> > needed. Does that sound like a reasonable approach?
>
> That sounds possible. So we'd need to stand something up to listen for
> messages. And we would watch for the worker expired messages?

Hi John, is anything currently listening to the RabbitMQ messages for other notifications? Alternatively, is there a database or other store of the time-series data (that we can query or set alerts on), or expiration log messages (to alert on in papertrail)?
Flags: needinfo?(jhford)
(In reply to Dave House [:dhouse] from comment #4)
> That sounds possible. So we'd need to stand something up to listen for
> messages. And we would watch for the worker expired messages?

Yes, you'd need to run something which listens to the right exchange. The node.js client (github.com/taskcluster/taskcluster-client) has a listener component already built in, I believe, which would make this not too difficult to implement. Let me know if I can help get something like that started. There's not really a worker-expired message here; what you'd listen for is the different task-completion messages, logging whenever a relevant worker id is the one which completed a task. Basically you'd be building a smaller, task-specific version of what the provisioner explorer does.

(In reply to Dave House [:dhouse] from comment #5)
> Alternatively, is there a database/other of the time-series data (that we
> can query or set alerts on) or expiration log messages (to alert off in
> papertrail)?

I'm not aware of any other data source which would provide enough information here.
Flags: needinfo?(jhford)
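The bookkeeping side of the approach described above (consume task-completion messages, remember when each hardware worker was last active, flag workers idle past a threshold) could be sketched roughly as follows. This is a hypothetical illustration, not Taskcluster code: it assumes the messages have already been received by a Pulse/RabbitMQ listener, and the `WorkerTracker` class and the `workerGroup`/`workerId` payload field names are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical sketch: track the last-active time of each hardware worker
# from task-completion messages, and report workers idle past a threshold.
class WorkerTracker:
    def __init__(self, idle_threshold=timedelta(hours=24)):
        self.idle_threshold = idle_threshold
        self.last_active = {}  # (workerGroup, workerId) -> datetime of last task

    def on_task_completed(self, payload, now=None):
        """Record that a worker just completed (or failed) a task.

        `payload` stands in for a task-completion message body; the
        workerGroup/workerId field names are assumed for illustration.
        """
        key = (payload["workerGroup"], payload["workerId"])
        self.last_active[key] = now or datetime.utcnow()

    def idle_workers(self, now=None):
        """Return workers with no completed task within the idle threshold."""
        now = now or datetime.utcnow()
        return [key for key, seen in self.last_active.items()
                if now - seen > self.idle_threshold]
```

An alerting job could then call `idle_workers()` periodically and notify (e.g. via email/irc) instead of letting a broken worker silently drop out of view.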
There's no action to take here in the queue as best as I can tell. I'm going to mark this bug as resolved, but please feel free to file new bugs if there are any changes to the Queue which would make this easier to implement.
Assignee: nobody → jhford
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #7)
> There's no action to take here in the queue as best as I can tell. I'm
> going to mark this bug as resolved, but please feel free to file new bugs if
> there's any changes to the Queue which would make things easier to implement
> this.

Thank you. I'm planning to track messages and state as you described in comment #6. I have been querying the queue for worker tasks to find the time-last-active, but listening for messages will reduce my network traffic.
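For comparison, the polling approach mentioned above can be sketched like this: build the queue's getWorker URL (the same endpoint shape as the 404 example earlier in this bug) and treat a 404 as the worker having expired from the queue's view. The helper function names here are hypothetical.

```python
import urllib.error
import urllib.request

QUEUE_ROOT = "https://queue.taskcluster.net/v1"

def get_worker_url(provisioner_id, worker_type, worker_group, worker_id):
    # Same endpoint shape as the 404 example earlier in this bug.
    return (f"{QUEUE_ROOT}/provisioners/{provisioner_id}"
            f"/worker-types/{worker_type}/workers/{worker_group}/{worker_id}")

def worker_known_to_queue(provisioner_id, worker_type, worker_group, worker_id):
    """Return True if the queue still knows this worker, False on a 404."""
    url = get_worker_url(provisioner_id, worker_type, worker_group, worker_id)
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # worker has expired from the queue's view
        raise
```

Polling this per worker is simple but scales with fleet size, which is why the message-listening approach generates less traffic.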
Status: RESOLVED → VERIFIED
Component: Queue → Services