1373938 - Submit worker type pending count data to statsum and/or ActiveData

Reporter

Description

•

7 years ago

It appears that release engineering has graphs in grafana to show historical data about pending counts. The taskcluster team has graphs about pending wait times within signalfx, but those might not be the numbers that are helpful to releng, and they are protected behind a login. I'm not sure what mechanism acquires that data from buildbot and sends it to graphite, but perhaps we can adapt it to periodically look at the pending counts for the worker types that releng is concerned with and stuff those in there too. I'm happy to help in any way that I can.

Chris AtLee [:catlee]

Comment 1

•

7 years ago

We're currently using hostedgraphite for this kind of metric, e.g. https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/pending

I believe the metric name for this would be $prefix.releng.pending.$poolname

Does signalfx have no public access for dashboards?
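For illustration, a sample under that naming scheme could be formatted for Graphite's plaintext protocol (`<path> <value> <timestamp>`) roughly as below. This is a minimal sketch, not the code that feeds the dashboard: the function name `graphite_line` and the choice to replace dots inside the pool name are assumptions.

```python
import time


def graphite_line(prefix, pool_name, pending, timestamp=None):
    """Format one sample as a Graphite plaintext line: '<path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    # Graphite uses dots as path separators, so a dot inside the pool name
    # would create an extra path level. Replacing it is an assumption here.
    safe_pool = pool_name.replace(".", "_")
    return f"{prefix}.releng.pending.{safe_pool} {int(pending)} {int(timestamp)}\n"
```

The resulting line would then be written to whatever Graphite endpoint the account uses.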

Priority: -- → P2

Greg Arndt [:garndt]

Reporter

Comment 2

•

7 years ago

(In reply to Chris AtLee [:catlee] from comment #1)
> We're currently using hostedgraphite for this kind of metric, e.g.
> https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/pending
>
> I believe the metric name for this would be
> $prefix.releng.pending.$poolname

I came across this dashboard, but could not find out exactly what is stuffing the data into graphite. I think this is for AWS [1], but I didn't find where we're monitoring the same for physical machines. I think whatever is doing the polling for pending can just be adjusted to also do it for the worker types that releng is concerned with. We provide an endpoint that can report pending counts for a given provisionerId/workerType [2].

> Does signalfx have no public access for dashboards?

They do not. I'm reaching out again to see where it's at on their roadmap. However, we are currently not tracking pending counts in signalfx. We track pending wait times. It would not be impossible to track pending counts, but we have found that wait times are much more useful for knowing when something is going wrong and people are waiting too long for results.

This is the dashboard, btw: https://app.signalfx.com/#/dashboard/Cp7oeIXAYDI . I can get you access if you don't have it already. (Note: osx testers are not on there yet. There is a follow-up bug to add them, but you can build a custom graph showing them.)

[1] https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py
[2] https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#pendingTasks
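As a rough sketch of what querying the pending-count endpoint [2] could look like from a polling script: the base URL, the `/pending/<provisionerId>/<workerType>` path, and the `pendingTasks` response field are assumptions taken from the queue API reference and should be checked against it. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL of the queue service; adjust for the actual deployment.
QUEUE_BASE_URL = "https://queue.taskcluster.net/v1"


def pending_url(provisioner_id, worker_type, base_url=QUEUE_BASE_URL):
    """Build the URL for the pending-count lookup of one worker type."""
    return (
        f"{base_url}/pending/"
        f"{quote(provisioner_id, safe='')}/{quote(worker_type, safe='')}"
    )


def parse_pending(body):
    """Extract the pending count from the JSON response body (str or bytes)."""
    # Assumption: the response carries the count in a 'pendingTasks' field.
    return int(json.loads(body)["pendingTasks"])


def fetch_pending(provisioner_id, worker_type, opener=urlopen):
    """Fetch the pending count; 'opener' is injectable so it can be faked in tests."""
    with opener(pending_url(provisioner_id, worker_type)) as response:
        return parse_pending(response.read())
```

Keeping URL construction and response parsing separate from the network call makes the script easy to test without hitting the queue.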

Pete Moore [:pmoore][:pete]

Comment 3

•

7 years ago

Catlee, is RelEng capturing this data? Dustin, do we have any work planned for recording historical pending counts for all provisioner/workerType combinations somewhere? This feels generic enough that RelEng shouldn't have to build a custom solution.

Flags: needinfo?(dustin)

Flags: needinfo?(catlee)

Dustin J. Mitchell [:dustin] (he/him)

Comment 4

•

7 years ago

We record this sort of information in signalfx. If we don't have pending counts, we could certainly add that. It has the issues described above regarding public access.

I just chatted with gps and he mentioned that ActiveData can slurp up this sort of information. We could probably modify tc-lib-monitor to send data to ActiveData in addition to SignalFx (and later only to ActiveData).
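The "send to both, later only one" idea can be sketched as a monitor that fans each measurement out to several sinks. This is only an illustration of the shape, not tc-lib-monitor's implementation: `FanOutMonitor` and the sink callables are hypothetical, and real sinks would wrap the SignalFx and ActiveData clients.

```python
class FanOutMonitor:
    """Forward each measurement to every configured sink (hypothetical helper)."""

    def __init__(self, *sinks):
        # Each sink is a callable taking (key, value).
        self.sinks = list(sinks)

    def measure(self, key, value):
        """Send one measurement to all sinks; return the errors that occurred."""
        errors = []
        for sink in self.sinks:
            try:
                sink(key, value)
            except Exception as error:
                # One failing backend should not stop delivery to the others.
                errors.append(error)
        return errors
```

Dropping a backend later then amounts to constructing the monitor with one sink fewer.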

Flags: needinfo?(dustin)

Pete Moore [:pmoore][:pete]

Updated

•

7 years ago

Component: Platform Support → Queue

Product: Release Engineering → Taskcluster

QA Contact: catlee

Summary: Pending counts for gecko-t-osx-1010 are not recorded in graphite/grafana → Submit worker type pending count data to statsum and/or ActiveData

Chris AtLee [:catlee]

Comment 5

•

7 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #3)
> Catlee, is RelEng capturing this data?

No, I don't think we capture any non-buildbot pending counts.

Flags: needinfo?(catlee)

Pete Moore [:pmoore][:pete]

Comment 6

•

7 years ago

Coop, maybe we should consider this for 2018 Q1 or Q2?

Jonas, is this a lot of work to implement? I'm guessing the easiest is for the queue to publish this data, but it could also be published by an external service that routinely queries the queue for pending count data. Depends how monolithic we want to make the queue, I guess. ;)

In general, we should probably form some kind of working group that looks at operational concerns and the data we provide, composed of people from TaskCluster, RelEng, our Cloud Operations team(s)?, Build Duty, Sheriffs, ... to make sure we have a shared view on what data we want to publish, how we make it available, how we support the services that provide access to the data, etc. Historical capture of pending counts across workerTypes is probably just a starting point. Until now, we've mostly just published data we thought might be interesting, but haven't really had an operations-driven project plan.

Pete

Flags: needinfo?(jopsen)

Flags: needinfo?(coop)

Jonas Finnemann Jensen (:jonasfj)

Comment 7

•

7 years ago

This would be fairly simple to implement separately from the queue, or as part of the queue in a background process. But it would probably take a dedicated background process.
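A dedicated background process of this kind could be as small as the loop sketched below. All names are hypothetical: `fetch_pending` stands for whatever looks up the count for one provisionerId/workerType, and `emit` for whatever publishes it (statsum, graphite, ActiveData, ...).

```python
import time


def poll_once(worker_types, fetch_pending, emit):
    """Look up and publish the pending count for each (provisionerId, workerType)."""
    results = {}
    for provisioner_id, worker_type in worker_types:
        try:
            count = fetch_pending(provisioner_id, worker_type)
        except Exception as error:
            # Keep polling the remaining worker types if one lookup fails.
            print(f"skipping {provisioner_id}/{worker_type}: {error}")
            continue
        emit(provisioner_id, worker_type, count)
        results[(provisioner_id, worker_type)] = count
    return results


def run_forever(worker_types, fetch_pending, emit,
                interval_seconds=60, sleep=time.sleep):
    """Repeat poll_once at a fixed interval; 'sleep' is injectable for testing."""
    while True:
        poll_once(worker_types, fetch_pending, emit)
        sleep(interval_seconds)
```

Whether this loop lives inside the queue or in an external service, the polling logic stays the same; only where `fetch_pending` gets its data changes.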

Flags: needinfo?(jopsen)

Jonas Finnemann Jensen (:jonasfj)

Comment 8

•

7 years ago

Not saying that it's something we should do in Q1/Q2, but if we do build the queue on postgres, we'll get the option of running some long-running analytics on that. At least that's the thinking dustin and bstack have.

Chris Cooper [:coop] (he/him)

Comment 9

•

7 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> Maybe we should consider this for 2018 Q1 or Q2?

Yes, I would love to start collecting this data someplace useful. Let's decide where/how in January.

Flags: needinfo?(coop)

Pete Moore [:pmoore][:pete]

Updated

•

7 years ago

No longer blocks: 1372229

Chris Cooper [:coop] (he/him)

Comment 10

•

6 years ago

We're interested in dashboards again. This data may end up getting piped to a statuspage.io instance.

John Ford [:jhford] CET/CEST Berlin Time

Assignee

Comment 11

•

6 years ago

Attached file PR — Details

We're already doing the polling for this in the provisioner. This simple patch adds a monitor.measure call to that polling, so that the pending task count is made available per worker type.

Attachment #8991826 - Flags: review?(bstack)

[github robot]

Comment 12

•

6 years ago

Commit pushed to master at https://github.com/taskcluster/taskcluster-lib-api

https://github.com/taskcluster/taskcluster-lib-api/commit/470df11b81455d9320f35bcce0e2c4be2de1c3e4
Merge pull request #108 from taskcluster/bug1373938

Bug 1437461 - Use taskcluster-lib-artifact-go for uploading/downloading artifacts

Brian Stack [:bstack]

Updated

•

6 years ago

Attachment #8991826 - Flags: review?(bstack) → review+

John Ford [:jhford] CET/CEST Berlin Time

Assignee

Comment 13

•

6 years ago

I've landed this patch, and it will be part of the next deployment.

QA Contact: jhford

John Ford [:jhford] CET/CEST Berlin Time

Assignee

Updated

•

6 years ago

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

John Ford [:jhford] CET/CEST Berlin Time

Assignee

Updated

•

6 years ago

Assignee: nobody → jhford

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: Queue → Services