Closed
Bug 1364955
Opened 9 years ago
Closed 8 years ago
Implement tooling to monitor queues for Taskcluster jobs on on-premise hardware
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P5)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: kmoir, Assigned: aselagea)
We need tooling to monitor the Taskcluster queues for our on-premise hardware pools for Mac, Linux, and Windows.
Notes on the existing code that Greg wrote:
So the prototype is here:
https://github.com/gregarndt/taskcluster-task-analysis
It basically just listens for task events and stuffs them in a DB. There are quite a few things that need to be improved here, especially around the DB. There are no indices and I’m running on a basic Postgres Heroku plan (limited to 10 million rows). Also, because I’m only running with one dyno, there are times when it gets a bit behind with processing Pulse messages. It also means that if there are Pulse issues, you would not have machine information.
Here is the table schema:
https://github.com/gregarndt/taskcluster-task-analysis/blob/master/postgres/create_table.sql#L1-L25
With this table, a script could easily iterate through a list of worker_ids and see whether each one has claimed a task within a certain time period, or whether a particular worker_id has a high failure rate.
A warning: this thing is very basic, was not meant for real-time production use, and needs some love, but it works.
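As a rough illustration of the failure-rate check described above, here is a minimal Python sketch. It is not taken from the prototype: it assumes the task rows have already been fetched from the database as dictionaries, and the keys `worker_id` and `state`, the state names in `FAILURE_STATES`, and both function names are hypothetical and would need to be matched to the real table schema.

```python
from collections import Counter

# Hypothetical state names treated as failures; adjust to the real data.
FAILURE_STATES = {"failed", "exception"}


def failure_rates(rows, min_tasks=5):
    """Return {worker_id: fraction of failed tasks} for workers with enough tasks.

    rows is an iterable of dicts with the (assumed) keys "worker_id" and "state".
    Workers with fewer than min_tasks rows are skipped to avoid noisy rates.
    """
    total = Counter()
    failed = Counter()
    for row in rows:
        worker = row["worker_id"]
        total[worker] += 1
        if row["state"] in FAILURE_STATES:
            failed[worker] += 1
    return {w: failed[w] / n for w, n in total.items() if n >= min_tasks}


def flag_failing_workers(rows, threshold=0.5, min_tasks=5):
    """Return sorted worker_ids whose failure rate is at or above threshold."""
    rates = failure_rates(rows, min_tasks)
    return sorted(w for w, rate in rates.items() if rate >= threshold)
```

The `min_tasks` floor is a design choice rather than a requirement: without it, a worker that ran a single failing task would be reported with a 100% failure rate.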
---
We had a meeting last Thursday with Alin and Greg, and we agreed that the actions going forward would be:
1) Greg to write instructions on how to install and run locally
2) Alin to run app
3) Decide on criteria for monitoring machines for health
4) Implement alerting and notification based on these criteria
----
Instructions from Greg for installing the app are as follows:
I have added a readme to the taskcluster-task-analysis repo:
https://github.com/gregarndt/taskcluster-task-analysis/blob/master/README.md
Along with a sample vagrant file for setting up an ubuntu 14.04 environment:
https://github.com/gregarndt/taskcluster-task-analysis/blob/master/Vagrantfile
You can basically take out the script parts of that file and run them on any Ubuntu machine or VM. It sets up some dependencies, including docker, node, and yarn.
The readme contains instructions for running the sample Postgres docker container (remember: don’t use in production), along with running the app with the “test” profile which will make use of the local docker container.
Monitoring for rogue machines can look like many things; the simplest is probably:
https://gist.github.com/gregarndt/eab4b1354c7443904264f47134cb5cfb
Which outputs:
set(['t-yosemite-r7-0041', 't-yosemite-r7-0040', 't-yosemite-r7-0042'])
t-yosemite-r7-0045 wasn’t reported because it completed a couple of tasks in the last 5 hours, so it’s OK.
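The idea behind that check can be sketched in a few lines of Python. This is not the code from the gist: the function name `idle_workers`, its parameters, and the claim times in the example are all made up for illustration, and it assumes the most recent claim time per worker has already been pulled from the database.

```python
from datetime import datetime, timedelta, timezone


def idle_workers(expected_workers, last_claimed, now, window=timedelta(hours=5)):
    """Return the set of workers with no claim inside the window ending at now.

    expected_workers: iterable of worker ids that should be taking work.
    last_claimed: dict mapping worker id to the datetime of its latest claim;
                  workers missing from the dict are treated as idle.
    """
    cutoff = now - window
    idle = set()
    for worker in expected_workers:
        claimed_at = last_claimed.get(worker)
        if claimed_at is None or claimed_at < cutoff:
            idle.add(worker)
    return idle


# Example with made-up claim times; the worker names come from the output above.
now = datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc)
pool = [
    "t-yosemite-r7-0040",
    "t-yosemite-r7-0041",
    "t-yosemite-r7-0042",
    "t-yosemite-r7-0045",
]
last_claimed = {
    "t-yosemite-r7-0040": now - timedelta(hours=9),  # last claim too old
    "t-yosemite-r7-0045": now - timedelta(hours=1),  # recently active
}
idle = idle_workers(pool, last_claimed, now)
```

Here `idle` would contain the three workers that have no claim in the last 5 hours, while t-yosemite-r7-0045 is left out because its latest claim falls inside the window.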
Note that the harder part here is telling whether something is not picking up work because of bad luck (a slow time of day, or other machines are just faster at picking up work) or because the worker crashed or the machine went down.
I hope this helps
---
Updated•9 years ago (Reporter)
Assignee: nobody → aselagea
Comment 1•9 years ago (Assignee)
Thanks for the above description, Kim!
So I went ahead and tried to run the application on my local Ubuntu VM. I managed to install the prerequisites, started the PostgreSQL container and then used yarn to install the application dependencies.
However, I don't have Pulse credentials at the moment. I found this wiki page [1] and then created an account under PulseGuardian. Since I haven't received any access token, I attempted to run the test configuration by specifying those credentials
=> got some TC data, along with an error like:
"Stack: TypeError: Cannot read property 'GITHUB_BASE_REPO_URL' of undefined"
I didn't have any contact with Pulse before, but I suppose those are not the credentials required to run the test configuration.
@Greg: could u please provide some guidance here?
Thanks!
[1] https://wiki.mozilla.org/Auto-tools/Projects/Pulse
Flags: needinfo?(garndt)
Comment 2•9 years ago (Assignee)
Also, it seems I created a queue, "queue/aselagea/taskcluster-task-analysis", which got deleted after reaching the maximum number of unread messages overnight.
"Your queue "queue/aselagea/taskcluster-task-analysis" on exchange "could not be determined" has been
deleted after exceeding the maximum number of unread messages. Upon deletion
there were 20298 messages in the queue, out of a maximum 20000 messages.
Make sure your clients are running correctly and are cleaning up unused
durable queues."
Comment 3•9 years ago
Pulseguardian monitors queues to ensure they do not grow unbounded. Unbounded growth has consequences for our RabbitMQ cluster and can bring down all systems that rely upon it.
Are you sure that whatever is listening to that queue is properly draining it, and that this was just an instance of it not being able to drain the queue quickly enough? You can monitor queue size within the Pulseguardian frontend.
There are times (especially during a bunch of large pushes) when the queue can grow beyond 20k messages. What I have done for my own queues is ask in #pulse (mcote or camd in particular) to mark my durable queue as unbounded, but understand that this comes with a risk: if the queue is not cleared out fast enough, it can grow and cause stability problems within the cluster.
Also, if you are doing this with taskcluster-task-analysis, I would ensure you have the latest copy of it since I removed some indexes that I was creating on the task table. This app is extremely insert heavy so there is a cost to have a lot of indexes in place. You might need to drop some of the indexes on the tasks table if you created that table with an out-of-date copy.
Flags: needinfo?(garndt)
Comment 4•9 years ago
> I didn't have any contact with Pulse before, but I suppose those are not the
> credentials required to run the test configuration.
> @Greg: could u please provide some guidance here?
> Thanks!
>
> [1] https://wiki.mozilla.org/Auto-tools/Projects/Pulse
Once you log in to the Pulseguardian site with your LDAP credentials, you can select "My Pulse Users" and click the plus icon to create a new Pulse user. I suggest making credentials specific to that app that are not used elsewhere. On that page you configure a username and password that will be used in the config for the app.
Comment 5•8 years ago (Assignee)
I've been caught up with other tasks lately, but I'd like to move on with the work here too.
So I created a new pulse user and managed to run the test configuration for the app. However, I'm still getting an error similar to the one below, suggesting a missing 'GITHUB_BASE_REPO_URL' property. I spent some time debugging this, but I don't know what's really needed.
Stack: TypeError: Cannot read property 'GITHUB_BASE_REPO_URL' of undefined
at buildGithubSourceInformation (/home/aselagea/work/github/taskcluster-task-analysis/build/webpack:/src/task.js:30:7)
at determineSourceInformation (/home/aselagea/work/github/taskcluster-task-analysis/build/webpack:/src/task.js:22:12)
at new Task (/home/aselagea/work/github/taskcluster-task-analysis/build/webpack:/src/task.js:79:19)
at Handler._callee4$ (/home/aselagea/work/github/taskcluster-task-analysis/build/webpack:/src/handler.js:71:16)
at tryCatch (/home/aselagea/work/github/taskcluster-task-analysis/build/webpack:/~/neutrino-preset-node/~/regenerator-runtime/runtime.js:63:15)
at Generator.invoke [as _invoke] (/home/aselagea/work/github/taskcluster-task-analysis/build/webpack:/~/neutrino-preset-node/~/regenerator-runtime/runtime.js:337:1)
at Generator.prototype.(anonymous function) [as next] (/home/aselagea/work/github/taskcluster-task-analysis/build/webpack:/~/neutrino-preset-node/~/regenerator-runtime/runtime.js:96:1)
at step (/home/aselagea/work/github/taskcluster-task-analysis/build/webpack:/~/neutrino-preset-node/~/babel-runtime/helpers/asyncToGenerator.js:17:1)
at /home/aselagea/work/github/taskcluster-task-analysis/build/webpack:/~/neutrino-preset-node/~/babel-runtime/helpers/asyncToGenerator.js:28:1
at process._tickCallback (internal/process/next_tick.js:109:7)
Error caught when processing message.
@Greg: I wonder if you experienced the same issue at some point and know how to fix it.
Flags: needinfo?(garndt)
Comment 6•8 years ago
I believe this commit should fix it:
https://github.com/gregarndt/taskcluster-task-analysis/commit/d9a6a8d078113b4fbce350c89245f4f395731076
Also, as a side note, you probably do not need this app anymore. Kim has been working on integrating some stuff into slavehealth that will help us in the short term.
Flags: needinfo?(garndt)
Comment 7•8 years ago
Note explaining the priority level: P5 doesn't mean we've lowered the priority; quite the contrary. We're aligning these levels to the buildduty quarterly deliverables, where P1-P3 are taken by our daily waterline KTLO operational tasks.
Priority: -- → P5
Comment 8•8 years ago (Assignee)
Hassan Ali has been working on a dashboard that lists a lot of useful information about the TC workertypes: https://tools.taskcluster.net/provisioners/
So I guess this one is no longer needed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Updated•8 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard