Closed Bug 1373938 Opened 3 years ago Closed 2 years ago

Submit worker type pending count data to statsum and/or ActiveData

Categories

(Taskcluster :: Services, enhancement, P2)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: jhford)

Details

Attachments

(1 file)

PR
55 bytes, text/x-github-pull-request
bstack
: review+
It appears that release engineering has graphs in grafana to show historical data about pending counts.

The taskcluster team has graphs of pending wait times in signalfx, but those might not be the numbers that are helpful to releng, and they are protected behind a login.

I'm not sure what mechanism acquires that data from buildbot and sends it to graphite, but perhaps we can adapt it to periodically look at the pending counts for the worker types that releng is concerned with and stuff those in there as well. I'm happy to help in any way that I can.
We're currently using hostedgraphite for this kind of metric. e.g.

https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/pending

I believe the metric name for this would be $prefix.releng.pending.$poolname

Does signalfx have no public access for dashboards?
Priority: -- → P2
(In reply to Chris AtLee [:catlee] from comment #1)
> We're currently using hostedgraphite for this kind of metric. e.g.
> 
> https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/
> grafana/dashboard/db/pending
> 
> I believe the metric name for this would be
> $prefix.releng.pending.$poolname

I came across this dashboard, but could not find out exactly what is stuffing the data into graphite.  I think this is for AWS [1] but I didn't find where we're monitoring the same for physical machines.  I think whatever is doing the polling for pending can just be adjusted to also do it for worker types that releng is concerned with.  We provide an endpoint that can report pending counts for a given provisionerId/workerType [2].


> 
> Does signalfx have no public access for dashboards?

They do not.  I'm reaching out again to see where it's at on their roadmap. However, we are currently not tracking pending counts in signalfx.  We track pending wait times.  Not that it would be impossible to track pending counts, but we have found that wait times are much more useful for knowing when something is going wrong and people are waiting too long for results.  

This is the dashboard, btw: https://app.signalfx.com/#/dashboard/Cp7oeIXAYDI. I can get you access if you don't have it already. (Note: osx testers are not on there yet. There is a follow-up bug to add them, but you can build a custom graph showing them.)


[1] https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py
[2] https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#pendingTasks
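For illustration, here is a minimal sketch of what such a poller could do with the pendingTasks endpoint [2]: fetch the count for one provisionerId/workerType and format it as a Graphite plaintext line using the metric naming suggested in comment 1. The queue root URL, the `/pending/<provisionerId>/<workerType>` path, and the `pendingTasks` response field are assumptions based on the linked API reference, and the helper names are hypothetical. This is not the script that currently feeds the dashboard.

```python
import json
import time
import urllib.request

# Assumption: root URL and route follow the pendingTasks reference in [2];
# adjust both for the deployment being queried.
QUEUE_ROOT = "https://queue.taskcluster.net/v1"


def pending_url(provisioner_id, worker_type, root=QUEUE_ROOT):
    """Build the (assumed) pendingTasks URL for one provisionerId/workerType."""
    return f"{root}/pending/{provisioner_id}/{worker_type}"


def fetch_pending_count(provisioner_id, worker_type, opener=urllib.request.urlopen):
    """Return the pending count; `opener` is injectable so it can be faked in tests."""
    with opener(pending_url(provisioner_id, worker_type)) as resp:
        body = json.load(resp)
    # Assumption: the response carries the count in a `pendingTasks` field.
    return int(body["pendingTasks"])


def graphite_line(prefix, pool_name, count, timestamp=None):
    """Format one Graphite plaintext-protocol line: '<path> <value> <epoch>'."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{prefix}.releng.pending.{pool_name} {count} {ts}\n"
```

A caller would pass the result of `fetch_pending_count(...)` to `graphite_line(...)` and write the line to whatever Graphite endpoint the existing poller already uses. Sending the data is left out here because the thread does not say how that is done today.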
Catlee, is RelEng capturing this data?

Dustin, do we have any work planned for recording historical pending counts for all provisioner/workerType combinations somewhere? This feels generic enough that RelEng shouldn't have to build a custom solution.
Flags: needinfo?(dustin)
Flags: needinfo?(catlee)
We record this sort of information in signalfx.  If we don't have pending counts, we could certainly add that.  It has the issues described above regarding public access.

I just chatted with gps and he mentioned that ActiveData can slurp up this sort of information. We could probably modify tc-lib-monitor to send data to ActiveData in addition to SignalFx (and later only to ActiveData).
Flags: needinfo?(dustin)
Component: Platform Support → Queue
Product: Release Engineering → Taskcluster
QA Contact: catlee
Summary: Pending counts for gecko-t-osx-1010 are not recorded in graphite/grafana → Submit worker type pending count data to statsum and/or ActiveData
(In reply to Pete Moore [:pmoore][:pete] from comment #3)
> Catlee, is RelEng capturing this data?

No, I don't think we capture any non-buildbot pending counts.
Flags: needinfo?(catlee)
Coop,

Maybe we should consider this for 2018 Q1 or Q2?

Jonas, is this a lot of work to implement? I'm guessing the easiest is for the queue to publish this data, but it could also be published by an external service that routinely queries the queue for pending count data. Depends how monolithic we want to make the queue, I guess. ;)

In general, we should probably form some kind of working group that looks at operational concerns and the data we provide, composed of people from TaskCluster, RelEng, our Cloud Operations team(s), Build Duty, Sheriffs, and so on, to make sure we have a shared view on what data we want to publish, how we make it available, how we support the services that provide access to the data, etc. Historical capture of pending counts across workerTypes is probably just a starting point. Until now, we've mostly just published data we thought might be interesting, but haven't really had an operations-driven project plan.

Pete
Flags: needinfo?(jopsen)
Flags: needinfo?(coop)
This would be fairly simple to implement separately from the queue, or as part of the queue in a background process. Either way, it would probably take a dedicated background process.
Flags: needinfo?(jopsen)
Not saying that it's something we should do in Q1/Q2, but if we do the queue using postgres, we'll get the option of running some long-running analytics on that. At least that's the thinking dustin and bstack have.
(In reply to Pete Moore [:pmoore][:pete] from comment #6) 
> Maybe we should consider this for 2018 Q1 or Q2?

Yes, I would love to start collecting this data someplace useful. Let's decide where/how in January.
Flags: needinfo?(coop)
No longer blocks: 1372229
We're interested in dashboards again. This data may end up getting piped to a statuspage.io instance.
Attached file PR
We're already doing the polling for this in the provisioner. This simple patch adds a monitor.measure call to that polling, so that the pending task count is made available per worker type.
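The patch itself is in the attached PR and is not reproduced here. As a rough, language-neutral illustration of the shape described above, the sketch below records one measurement per worker type during a single polling pass. It is written in Python rather than the provisioner's own language, and `RecordingMonitor`, `record_pending`, and the `<workerType>.pending-tasks` metric name are all hypothetical stand-ins, not the provisioner's or tc-lib-monitor's actual API.

```python
class RecordingMonitor:
    """Hypothetical stand-in for a monitoring client; keeps measurements in memory."""

    def __init__(self):
        self.measurements = []

    def measure(self, name, value):
        # A real client would forward this to the metrics backend.
        self.measurements.append((name, value))


def record_pending(monitor, worker_types, get_pending):
    """Run one polling pass: look up each worker type's pending count and measure it.

    `get_pending` is any callable mapping a worker type to its pending count,
    standing in for the lookup the polling loop already performs.
    """
    counts = {}
    for worker_type in worker_types:
        count = get_pending(worker_type)
        # Hypothetical metric name; the real name is defined in the PR.
        monitor.measure(f"{worker_type}.pending-tasks", count)
        counts[worker_type] = count
    return counts
```

Because the measurement piggybacks on a lookup that already happens, no extra requests to the queue are needed.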
Attachment #8991826 - Flags: review?(bstack)
Commit pushed to master at https://github.com/taskcluster/taskcluster-lib-api

https://github.com/taskcluster/taskcluster-lib-api/commit/470df11b81455d9320f35bcce0e2c4be2de1c3e4
Merge pull request #108 from taskcluster/bug1373938

Bug 1437461 - Use taskcluster-lib-artifact-go for uploading/downloading artifacts
Attachment #8991826 - Flags: review?(bstack) → review+
I've landed this patch, and it will be a part of the next deployment
QA Contact: jhford
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Assignee: nobody → jhford
Component: Queue → Services