Closed Bug 1136537 Opened 9 years ago Closed 9 years ago

gecko + metrics: Integrated metrics based on branch/symbol/platform/etc..

Categories

(Taskcluster :: General, defect)

Hardware: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlal, Assigned: pmoore)

Details

Continuing in the tradition of using routes to extend the capabilities of our tasks (as the tc-treeherder routes do: http://mozilla-taskcluster.readthedocs.org/en/latest/treeherder/), we should have a service which listens for tasks entering a completed/failed/exception state and collects metrics of various kinds.

The initial metrics I care about are:

  - How long did I wait for this job to start, measured from the time I pushed to the branch?
    This combines data fetched from the pushlog with the taskcluster task-level metrics.

  - How long did the task spend between entering the pending state and being completed (i.e. how long did we wait for a machine after we decided to run it)?

  - If a build failed, why? (This is a harder one; starting with just build failures is a good start, and we can investigate from there.)

These metrics should be broken down by:

 - platform (values are stored in task.extra)
 - symbol (in task.extra)
 - repository (this is currently an env var GECKO_HEAD_REPOSITORY)
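
For concreteness, here is a rough sketch in Go (the language of the PoC later in this bug) of how those two durations might be computed once a pushlog timestamp and a task's run timestamps are in hand, broken down by the dimensions above. All type and field names are illustrative assumptions, not a proposed schema.

package main

import (
  "fmt"
  "time"
)

// TaskRun holds the timestamps we expect to see on a queue status run object.
type TaskRun struct {
  Scheduled time.Time // task became pending
  Started   time.Time // a worker claimed the task
  Resolved  time.Time // task reached completed/failed/exception
}

// Metric is one data point, broken down by the dimensions listed above.
type Metric struct {
  Platform, Symbol, Repository string
  WaitFromPush                 time.Duration // push -> task started
  PendingTime                  time.Duration // pending -> resolved
}

func buildMetric(pushTime time.Time, run TaskRun, platform, symbol, repo string) Metric {
  return Metric{
    Platform:     platform,
    Symbol:       symbol,
    Repository:   repo,
    WaitFromPush: run.Started.Sub(pushTime),
    PendingTime:  run.Resolved.Sub(run.Scheduled),
  }
}

func main() {
  push := time.Date(2015, 2, 24, 10, 0, 0, 0, time.UTC)
  run := TaskRun{
    Scheduled: push.Add(2 * time.Minute),
    Started:   push.Add(10 * time.Minute),
    Resolved:  push.Add(40 * time.Minute),
  }
  fmt.Printf("%+v\n", buildMetric(push, run, "linux", "B", "mozilla-central"))
}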

This bug should be used for initial proof-of-concept testing and to start gathering early data. Ideally we publish this data into influxdb (this becomes way more powerful once influx 0.9 is released).
Pete, can you take an initial crack at this?
Flags: needinfo?(pmoore)
(This got a bit long, so I made most of it into bullet points, enjoy)

I would suggest metrics be broken down by `task.tags`, which is a mapping from strings to strings and can trivially be put into influxdb.
Example:
task = {
  payload: {...},
  extra: {...},
  tags: {
    platform:  "linux",
    symbol:    "B",
    ...
  },
  ...
}

Then the influxdb entries would have key-values like:
{
  // Names here are strictly illustrative
  timeToResolution: ..., // time it took to resolve the task (from unscheduled -> completed)
  ... // other durations of interest

  // tags for filtering
  "tag:platform": "linux",
  "tag:symbol": "B"

  // meta data
  provisionerId:  ...,
  workerType:     ...,
  workerGroup:    ...,
  workerId:       ...
}

Pros:
 - completely generic
 - only depends on task-definition and task status structure
 - task status structure is present in pulse messages.
(`task.tags` was designed for this kind of filtering on tasks)
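
A minimal Go sketch of that tag flattening, with purely illustrative field names:

package main

import (
  "fmt"
  "time"
)

// influxPoint builds the key-values for one influxdb entry from task.tags plus
// whatever durations were measured. Names are illustrative only.
func influxPoint(tags map[string]string, timeToResolution time.Duration) map[string]interface{} {
  point := map[string]interface{}{
    "timeToResolution": timeToResolution.Seconds(),
  }
  // Tasks without tags simply get no "tag:..." properties, but are still aggregated.
  for name, value := range tags {
    point["tag:"+name] = value
  }
  return point
}

func main() {
  fmt.Println(influxPoint(map[string]string{"platform": "linux", "symbol": "B"}, 25*time.Minute))
}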

Given that mozilla-taskcluster already has the pushlog and knows when it was pushed, it might be
easier if mozilla-taskcluster just added a property:
  task.extra.stats.triggeredAt = <date-time of push>
(Again naming is illustrative)

Pros:
 - No need to fetch from pushlog
 - Other components (like a github hook handler) can use `triggeredAt` too.
(`triggeredAt` is the time when action that triggered tasks happened, like a push)

No special routes needed:
 - tasks without tags just don't get any "tag:..." properties (they are still aggregated)
 - tasks without (or with an invalid) task.extra.stats.triggeredAt property just don't get a duration
   that measures triggeredToCreation (or whatever we use the property for) - see the sketch below.
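
A small Go sketch of how a collector could treat `task.extra.stats.triggeredAt`; the property name is the suggestion above, everything else is made up for illustration:

package main

import (
  "fmt"
  "time"
)

// triggeredToCreation returns the duration from the triggering event (e.g. a push)
// to task creation, and false if the property is absent or unparsable.
func triggeredToCreation(extra map[string]interface{}, created time.Time) (time.Duration, bool) {
  stats, ok := extra["stats"].(map[string]interface{})
  if !ok {
    return 0, false
  }
  raw, ok := stats["triggeredAt"].(string)
  if !ok {
    return 0, false
  }
  triggered, err := time.Parse(time.RFC3339, raw)
  if err != nil {
    return 0, false // invalid value: skip this duration, keep everything else
  }
  return created.Sub(triggered), true
}

func main() {
  extra := map[string]interface{}{
    "stats": map[string]interface{}{"triggeredAt": "2015-02-24T10:00:00Z"},
  }
  created := time.Date(2015, 2, 24, 10, 3, 0, 0, time.UTC)
  if d, ok := triggeredToCreation(extra, created); ok {
    fmt.Println("triggeredToCreation:", d)
  }
}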

Note:
 - one influxdb entry per task or per run (or both in different series).
 - IMO: explicit > implicit
 - Making stats (or any component) depend on ENV vars is bad (and not portable to other workers)
 - Filtering by task.tags is natural (that's what it's for)
 - Makes sense that `task.extra.stats.triggeredAt` carries a special value
   (that's what task.extra is for)
I know the bug says gecko-specific, but I don't think gecko needs anything we can't do in a generic fashion by adding special properties to `task.extra.stats`. Most of these are probably reusable by other
components too.

Note: if you want stats for gecko tasks only, just add a tag `gecko: 'true'` to your tasks and filter
by it in grafana.
This all sounds sensible, thanks for the pointers! I appreciate the detail.

Will post back with updates/questions/progress etc.

Thanks.
Assignee: nobody → pmoore
Flags: needinfo?(pmoore)
These pointers are great, but I want to emphasize this is a hack for getting stats from _existing_ data, and things like fetching data from the pushlog are going to be ugly without normalizing some of this. We have plenty of data in graphite at a task + worker type level, so I am looking for something specific here (that I don't already have).
I've written a Go program which spawns a goroutine per hg repository (watching the push logs), plus a goroutine monitoring a pulse queue I have created to listen to the 5 taskcluster pulse exchanges that advise of finished tasks and finished task graphs in their various states.

See https://github.com/petemoore/tc-gecko-metrics-go for implementation specifics.

The data collection is backed by BoltDB for reliable and safe (ACID) data storage, such that crashes/exceptions should cause no data loss.

As hg pushes come in, the details of repo/commit/push time are stored in a dedicated bucket. Concurrently, the queues are monitored, and when tasks complete, the taskcluster queue API is called to get the payload/task information; this data is also persisted in a dedicated bucket.

Whenever a gecko task comes in with the repo env vars (comment 0), a lookup from task graph id to hg repo/changeset is stored, so that any other tasks with the same task graph id can be updated with the hg metadata. This update occurs when a task graph completes, since at that point we know no more tasks should be coming in for the given task graph id. Completed task graphs are known about from the scheduler-events AMQP API. Once the task graph closes, the various buckets are combined to provide a uniform view of the task information together with the hg repo details and push time, and BoltDB is updated accordingly.
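
For illustration, here is a minimal sketch of the bucket handling described above, against the BoltDB API; the bucket name and key are made up, the real logic lives in the repo linked above.

package main

import (
  "log"

  "github.com/boltdb/bolt"
)

func main() {
  db, err := bolt.Open("internal-data.db", 0600, nil)
  if err != nil {
    log.Fatal(err)
  }
  defer db.Close()

  // Store a task graph id -> hg repo/changeset lookup as the first gecko task arrives.
  err = db.Update(func(tx *bolt.Tx) error {
    b, err := tx.CreateBucketIfNotExists([]byte("taskgraph-to-hg"))
    if err != nil {
      return err
    }
    return b.Put([]byte("example-task-graph-id"), []byte("mozilla-central:abcdef123456"))
  })
  if err != nil {
    log.Fatal(err)
  }

  // Later, when the task graph completes, resolve the lookup to denormalise the data.
  err = db.View(func(tx *bolt.Tx) error {
    if b := tx.Bucket([]byte("taskgraph-to-hg")); b != nil {
      log.Printf("hg details: %s", b.Get([]byte("example-task-graph-id")))
    }
    return nil
  })
  if err != nil {
    log.Fatal(err)
  }
}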

At the moment I have not implemented data cleanup, but it would be quite possible, e.g. by removing lookup data after the task graph completes and the denormalised JSON has been generated. However, if data is purged while it is still present in the hg push logs, it would come back, so cleanup is not entirely trivial (though still quite solvable). Since this is a short-term PoC, I have avoided getting too bogged down in such optimisations.

I am currently reading up on heroku to get it deployed on the personal heroku account I have set up. If there is a corporate account I can use, I would be happy to use that instead.

Locally I experimented with influxdb, but hit several HTTP 500s, even when switching to previous release tags - I need to investigate the cause. However, I am already persisting the data, so it should be quite trivial to push it to an influxdb instance once I have the heroku deployment.
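
For reference, writing a point to an influxdb 0.9+ instance over its HTTP line-protocol write endpoint could look roughly like this; the database name, measurement and field are assumptions, not what I have deployed.

package main

import (
  "fmt"
  "log"
  "net/http"
  "strings"
  "time"
)

func main() {
  // One point in line protocol: measurement, tags, fields, nanosecond timestamp.
  line := fmt.Sprintf("task_durations,platform=linux,symbol=B timeToResolution=%d %d",
    1500, time.Now().UnixNano())
  resp, err := http.Post(
    "http://localhost:8086/write?db=gecko_metrics",
    "text/plain",
    strings.NewReader(line),
  )
  if err != nil {
    log.Fatal(err)
  }
  defer resp.Body.Close()
  log.Println("influxdb write status:", resp.Status)
}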

I've not heavily documented the code since this is a PoC under development, I'll look to do that once the deployment is done.

If you want to have a play, the Go program has two run modes: data collection (run) and data output (show):

$ tc-gecko-metrics-go 
Usage:
    tc-gecko-metrics-go run -o OUTPUT-METRICS -i INTERNAL-DATA -r HG-REPOS-JSON-FILE
    tc-gecko-metrics-go show -i MAPPING-FILE
    tc-gecko-metrics-go --help

The options -i and -o represent file locations for the two BoltDB databases which will be created the first time you run the command; choose anything you like. On future runs they will get updated.
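
For example (file names are arbitrary; the repos JSON file lists the hg repositories to watch):

$ tc-gecko-metrics-go run -o metrics.db -i internal.db -r repos.json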
I deployed to heroku today, but so far have not been successful in setting up an InfluxDB environment to push the data to.

I also tried registering for a free trial of a hosted InfluxDB service, but that isn't working either: I can't log in at http://104.236.160.115:8083/, and the generated DNS name for my trial app doesn't seem to work (at least my DNS can't resolve it yet).

That said, I noticed we have InfluxDB connection strings in taskcluster apps hosted on Heroku, so I'll have a chat with James about this; maybe we already have an existing InfluxDB instance/cluster I can connect to.

I've set up auto-deployment from github to heroku, and I've frozen all dependencies with godep. I am using the https://github.com/kr/heroku-buildpack-go buildpack.

I haven't connected a logging service - will also discuss with James later today.
I think this is dead work now, given Edgar's work - should I close this?
Flags: needinfo?(jopsen)
Flags: needinfo?(echen)
(In reply to Pete Moore [:pmoore][:pete] from comment #10)
> I think this is a dead work now based on Edgar's work - should I close this?

I guess you needinfo'd the wrong "Edgar Chen" :)
Flags: needinfo?(echen) → needinfo?(eggylv999)
There is a stats collection system now in place which sends to InfluxDB. A couple of the tags requested in the original report are still missing, but it should be pretty easy to add them if needed.
Flags: needinfo?(eggylv999)
(In reply to Edgar Chen from comment #12)
> There is a stats collection system now in place which sends to InfluxDB. A
> couple of the tags requested in the original report are still missing, but
> it should be pretty easy to add them if needed.

Thanks Edgar! It sounds like it makes sense to close this now then. Can you provide details of how the new stats can be viewed?

Thanks,
Pete
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(eggylv999)
Resolution: --- → FIXED
@pmoore, Take a look at:
https://github.com/taskcluster/taskcluster-stats-collector/blob/master/lib/series.js
and
http://grafana.taskcluster.net/
(login with gmail and use your mozilla account)

There is already a dashboard for per workerType pending time:
https://grafana.taskcluster.net/dashboard/db/task-pending-time-templated

Feel free to duplicate, hack, and make your own :)
Flags: needinfo?(jopsen)
Flags: needinfo?(eggylv999)
Component: TaskCluster → General
Product: Testing → Taskcluster
Target Milestone: --- → mozilla41
Version: unspecified → Trunk
Resetting Version and Target Milestone that accidentally got changed...
Target Milestone: mozilla41 → ---
Version: Trunk → unspecified