Closed Bug 1071281 Opened 10 years ago Closed 10 years ago

Add metrics (via statsd) to mozharness

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlal, Assigned: mshal)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1971] )

Attachments

(5 files, 1 obsolete file)

I was talking with mshal/jlund today about adding more metrics to mozharness scripts... The obvious wins here are being able to measure end-to-end time as well as gathering granular statistics and timing data (for example: how much time do we spend on source checkout).

We have a few options to choose from, metrics-wise:

  - influxdb
  - statsd

My personal suggestion would be to research influxdb but go ahead and implement/ship statsd: it is offered directly from hosted Graphite and is very easy to query via Grafana.

There are a number of technical considerations:

 - how do we name the stats [1]

 - how do we inject credentials/hostname for statsd/influxdb (alternatively we may host our own and do host validation?)

 - should we use tcp or udp (udp is likely the better choice here, as we don't strictly need every stat to arrive)

 - use a library for mozharness or implement our own udp packet handling [2]


1. http://matt.aimonetti.net/posts/2013/06/26/practical-guide-to-graphite-monitoring/
2. https://pypi.python.org/pypi/python-statsd

The end goal here is to have a buildbot-independent way for us to report statistics on CI... This is particularly important to me so I can compare buildbot times to taskcluster times (and vice versa), and just for general optimization (are we spending all our time in CI overhead? on clobbering? etc...)
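For reference, the raw udp handling mentioned in [2] amounts to very little code. A minimal fire-and-forget sketch (the hostname and metric name below are made up for illustration; 8125 is statsd's conventional port):

import socket

def send_stat(name, value, stat_type="ms",
              host="statsd.example.com", port=8125):
    # statsd's wire format is "<name>:<value>|<type>"; because this is
    # udp, a dropped packet loses one data point instead of stalling CI
    payload = ("%s:%s|%s" % (name, value, stat_type)).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()

# e.g. report a 94.2s source checkout, in milliseconds:
send_stat("mozharness.checkout_sources.time", 94200)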
As mentioned in the mozharness meeting today, data coming from mozsystemresourcemonitor (in mozharness as ResourceMonitorMixin, http://hg.mozilla.org/build/mozharness/annotate/1fc39400ae85/mozharness/base/python.py#l427) might be an interesting data source.
Assignee: nobody → mshal
Attached patch influx-mixin.patch (obsolete) — Splinter Review
This is an early-stage WIP for adding influxdb stats as a mozharness mixin. Two big things are still missing at this point: grabbing the url to post to from a config file, and the actual stats to log (right now it just logs the action name and runtime, as a quick test).

At this point I'm looking for some feedback on whether or not this is an appropriate way to implement stats tracking in mozharness, and if we can use an existing config file to store the url info.
Attachment #8501206 - Flags: feedback?(jlund)
Comment on attachment 8501206 [details] [diff] [review]
influx-mixin.patch

Review of attachment 8501206 [details] [diff] [review]:
-----------------------------------------------------------------

looks really good. I think this will be very useful.

I wonder if we should take this and ResourceMonitoringMixin and move it to its own file: e.g. mozharness/base/stats.py. python.py to me feels like it should be specific to python stuff (virtualenvs, etc). I see an argument for it remaining here though.

::: mozharness/base/python.py
@@ +545,5 @@
> +    """TODO
> +
> +    TODO
> +    """
> +    def __init__(self, *args, **kwargs):

we discussed on irc an issue where self.log is not instantiated yet, and most likely neither is self.config, by the time this is called. that's because the mixin's __init__ runs before we get to BaseScript.__init__, having jumped ahead of it in the MRO via super.

as it's a mixin, I think it is better to just put this init logic in a @PreScriptRun listener (see the sketch below).
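A minimal sketch of that shape, using mozharness's PreScriptRun decorator (the mixin and method names here are illustrative, not the ones from the patch):

from mozharness.base.script import PreScriptRun

class StatsMixin(object):
    @PreScriptRun
    def stats_init(self):
        # @PreScriptRun listeners fire after BaseScript.__init__ has
        # completed, so self.config and self.log are safe to use here,
        # unlike in a mixin __init__ reached mid-MRO
        self.stats_url = self.config.get("stats_url")
        self.info("reporting stats to %s" % self.stats_url)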

@@ +557,5 @@
> +            import requests
> +
> +            self._post = requests.post
> +            # TODO: Get url from config somewhere?
> +            self._posturl = 'http://192.168.88.124:8086/db/db1/series?u=root&p=root'

for buildbot, we store a file called oauth.txt on slaves during a job and then delete it at the end. its contents come from our buildbot master machines (master/BuildSlaves.py) and are not recorded in any of our repos.

you can see how balrog avails of it in mozharness:
http://mxr.mozilla.org/build/source/mozharness/scripts/b2g_build.py#160
http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/updates/balrog.py#14

if you grep for 'oauth.txt', you can see we use it for blobber creds too.

this might be a good place to store that url and then parse oauth.txt.

We can chat offline about getting this stats url added to that. The other option is to distribute it in a file via puppet and hiera across our machines: https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Secrets
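A rough sketch of the oauth.txt approach, assuming the file holds simple "key = 'value'" lines (the helper is hypothetical; the key name is the 'influxdb_credentials' this bug settles on later):

def read_oauth_value(path, key):
    # returns the value for "key = '...'" in an oauth.txt-style file,
    # or None if the file or key is missing (stats then stay disabled)
    try:
        with open(path) as f:
            for line in f:
                name, _, value = line.partition("=")
                if name.strip() == key:
                    return value.strip().strip("'\"")
    except IOError:
        pass
    return None

posturl = read_oauth_value("oauth.txt", "influxdb_credentials")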
Attachment #8501206 - Flags: feedback?(jlund) → feedback+
(In reply to Jordan Lund (:jlund) from comment #3)
> I wonder if we should take this and ResourceMonitoringMixin and move it to
> its own file: e.g. mozharness/base/stats.py. python.py to me feels like it
> should be specific to python stuff (virtualenvs, etc). I see an argument for
> it remaining here though.

I've left it where it is for now, though I don't have a strong argument to keep it there. Mostly I put it there because that's where ResourceMonitoringMixin is :). I can try moving them if you'd prefer.

> we discussed on irc an issue where self.log is not instantiated yet and most
> likely neither is self.config by the time this is called. that is because it
> is being called before we get to BaseScript.__init__ as it has jumped the
> MRO in super.
> 
> as it's a mixin, I think it is better to just put this init logic in a
> @PreScriptRun listener.

That works much better!

> @@ +557,5 @@
> > +            import requests
> > +
> > +            self._post = requests.post
> > +            # TODO: Get url from config somewhere?
> > +            self._posturl = 'http://192.168.88.124:8086/db/db1/series?u=root&p=root'
> 
> for buildbot, we store a file called oauth.txt on slaves during a job and
> then delete it at the end. its contents comes from our buildbot master
> machines (master/BuildSlaves.py) and is not recorded in any of our repos.
> 
> you can see how balrog avails of it in mozharness:
> http://mxr.mozilla.org/build/source/mozharness/scripts/b2g_build.py#160
> http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/updates/
> balrog.py#14
> 
> if you grep for 'oauth.txt', you can see we use it for blobber creds too.
> 
> this might be a good place to store that url and then parse oauth.txt.

Ok, we now get the URL by grabbing 'influxdb_credentials' from oauth.txt
Attached patch bug1071281Splinter Review
Attachment #8501206 - Attachment is obsolete: true
Attachment #8505094 - Flags: review?(jlund)
Comment on attachment 8505094 [details] [diff] [review]
bug1071281

Review of attachment 8505094 [details] [diff] [review]:
-----------------------------------------------------------------

this is really neat. I think there are some properties that are specific to b2g_build that shouldn't be in mozharness/base/python.py. Going even higher level than b2g_build, all of the buildbot_{properties,config} mentions are specific to Mozilla; mozharness/base/* is meant to be really generic (as opposed to mozharness/mozilla/*). I'd like to make this easily usable in all our scripts and remove any mention of specific ones going forward (I had a quick idea of how to do this with some rough pseudocode below)

I don't want to block you from progressing so I'll r+ if you would rather land and iterate. Once this does land on default, we should test this with a bunch of jobs (namely b2g builders) on cypress

::: mozharness/base/python.py
@@ +586,5 @@
> +
> +        self.start_time = time.time()
> +
> +    @PostScriptAction
> +    def influxdb_recording_post_action(self, action, success=None):

I don't think 'success' is used

@@ +609,5 @@
> +                c.get('branch'),
> +                c.get('slavename'),
> +                p.get('gecko_revision'),
> +                p.get('gaia_revision'),
> +                self.query_buildid(),

some of these, namely the last three, feel like they are specific to b2g and/or builds.

maybe we should expose these props via a hook method: a default empty implementation in this class, overridden in whichever script uses the mixin.

# in this class
def query_custom_influx_data(self):
    return []

# in b2g_build
def query_custom_influx_data(self):
    """returns a list of (column, point) pairs"""
    return [
        ('gecko_revision', p.get("gecko_revision")),
        # etc.
    ]

then in influxdb_recording_post_action we extend 'points' and 'columns' with query_custom_influx_data()
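Roughly like this sketch, where 'columns' and 'points' are the row being assembled in the quoted diff:

custom = self.query_custom_influx_data()
columns.extend(col for col, _ in custom)
points.extend(point for _, point in custom)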

@@ +626,5 @@
> +                "buildid",
> +            ],
> +        }])
> +
> +    def record_influx_stat(self, json_data):

I assume if we don't have a value for say gecko_revision or buildid, influx accepts None as data points?
Attachment #8505094 - Flags: review?(jlund) → review+
I'm having trouble getting this to record stats from a build machine. dustin, do we block outgoing traffic somewhere? The connection just times out (outgoing TCP connection to influxdb.com:8086)
Flags: needinfo?(dustin)
Yes, we'll need an explicit flow in place for that.  What are the source hosts?  Does influxdb.com's IP stay stable?
The IP for goldiewilson-onepointtwentyone-1.c.influxdb.com is 104.131.136.49, so that's where we'll send traffic to on port 8086. The source hosts are our builders (both in-house and AWS). I suspect we may want to include testers as well in the future.

I don't know how stable the IP is. Even if it's static, it may change in the future if we change/upgrade our influxdb account or service. How difficult is it to change the flow? If we change IPs it wouldn't stop us from building - the worst that would happen is we'd lose a few stats. Not a big deal overall.
Yeah, that's a digitalocean IP, and even the hostname looks pretty transient.  Changing the flow isn't too hard, but if the connection times out then that adds to the build time and might still cause a tree closure, right?
(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> Yeah, that's a digitalocean IP, and even the hostname looks pretty
> transient.  Changing the flow isn't too hard, but if the connection times
> out then that adds to the build time and might still cause a tree closure,
> right?

The timeout is set pretty low (currently 5s, but we may need to adjust it). If the stats connection times out it doesn't affect the result of the build - it just disables the stats collection for that job. So we'd hit one 5s timeout and then the rest of the build would continue without trying to keep connecting for each stat point. It's not intended to be a bulletproof get-all-the-stats-no-matter-what implementation - we want to get the stats that are available quickly if possible, and ignore them if not.
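The failure handling in record_influx_stat (named in the patch above) looks roughly like this sketch (attribute names are illustrative):

import json
import requests

def record_influx_stat(self, json_data):
    if not self.recording:
        return
    try:
        requests.post(self.posturl, data=json.dumps(json_data), timeout=5)
    except Exception:
        # any failure, the 5s timeout included, disables recording for
        # the rest of the job; the build result itself is untouched
        self.warning("disabling stats recording")
        self.recording = False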
OK, sounds good. So let's see how long that IP stays put.
Flags: needinfo?(dustin)
Michael, the flow is in, so you should have access now.  Let me know if not?
(In reply to Jordan Lund (:jlund) from comment #6)
> ::: mozharness/base/python.py
> @@ +586,5 @@
> > +
> > +        self.start_time = time.time()
> > +
> > +    @PostScriptAction
> > +    def influxdb_recording_post_action(self, action, success=None):
> 
> I don't think 'success' is used

This mixin doesn't use it, but the argument is required since script.py calls the post-action listeners with:

method(action, success=success and self.return_code == 0)

If I remove the argument, it fails with:

17:27:35    ERROR - Exception during post-action for checkout-sources: Traceback (most recent call last):
17:27:35    ERROR -   File "/home/worker/mozharness/mozharness/base/script.py", line 1222, in run_action
17:27:35    ERROR -     method(action, success=success and self.return_code == 0)
17:27:35    ERROR - TypeError: influxdb_recording_post_action() got an unexpected keyword argument 'success'

> @@ +609,5 @@
> > +                c.get('branch'),
> > +                c.get('slavename'),
> > +                p.get('gecko_revision'),
> > +                p.get('gaia_revision'),
> > +                self.query_buildid(),
> 
> some of these, namely the last three, feel like they are specific to b2g
> and/or builds.
> 
> maybe we should have a list of props in a hook method here with a default
> empty implementation in this class but we implement it in the respective
> script we are using this with.

This sounds like a good idea to me. We'll be wanting slightly different stats for taskcluster builds anyway, so I'll have to implement something like this to handle that.

> I assume if we don't have a value for say gecko_revision or buildid, influx
> accepts None as data points?

Yep, None is fine. They just show up as empty in influxdb
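For reference, the payload being built is the influxdb 0.8 series-write format; roughly this sketch (column names follow the quoted diff, values are illustrative, with the slavename taken from the staging log below):

import json

data = [{
    "name": "mozharness",
    "columns": ["action", "runtime", "branch", "slavename",
                "gecko_revision", "gaia_revision", "buildid"],
    "points": [["checkout-sources", 94.2, "mozilla-central",
                "b-linux64-hp-0022", None, None, None]],
}]
# this JSON body is what record_influx_stat() POSTs to /db/<db>/series
print(json.dumps(data))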
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> Michael, the flow is in, so you should have access now.  Let me know if not?

I posted in the other bug too, but traffic is getting through now! Thanks for the quick response.
Attached patch virtualenvSplinter Review
This is a follow-up patch to fix issues I found while testing on my dev-master (not just in a docker container for TC). Specifically:

1) Use the VirtualenvMixin to get the 'requests' module
2) query_buildid doesn't work for some actions, so wrap it in a try/except
3) if stats collection fails for any reason (including a timeout exception), disable it

The buildid is empty for the following actions: clobber, checkout_sources, get-globs, update-source-manifest. jlund, is there some other way I can get a unique identifier that will be the same across all actions for an individual build? I was hoping to have something to query against to get all the results for an individual build (doesn't necessarily have to be the buildid specifically).
Attachment #8511178 - Flags: review?(jlund)
Comment on attachment 8511178 [details] [diff] [review]
virtualenv

Review of attachment 8511178 [details] [diff] [review]:
-----------------------------------------------------------------

nice, lgtm

::: mozharness/base/python.py
@@ +565,5 @@
>  
>          try:
> +            site_packages_path = self.query_python_site_packages_path()
> +            if site_packages_path not in sys.path:
> +                sys.path.append(site_packages_path)

hmm, so ResourceMonitor's technique is to use: activate_virtualenv http://mxr.mozilla.org/build/source/mozharness/mozharness/base/python.py#414

did that not work for you? I suppose this is ok. I wonder if we could have conflicts if we added this path to an interpreter of a different version.

@@ +606,5 @@
> +
> +        try:
> +            buildid = self.query_buildid()
> +        except:
> +            buildid = None

WRT you buildid question.

It looks like query_buildid() in b2g_build.py only derives the buildid from the app.ini file, which is generated after the build step. However it is actually buildbot that sets it explicitly via the MOZ_BUILD_DATE[1] env var.

you can see that MOZ_BUILD_DATE is the same buildid value as the one generated from query_buildid.

I don't *think* there can be a situation where buildbot doesn't have a buildid generated by the time the build starts, but I assume MOZ_BUILD_DATE is only set conditionally so that if we don't pass it, the build step will just create one on the fly.

That said, you can use buildbot_config['properties']['buildid'], and even if it is not the same as query_buildid's value, it will at least be unique :).

buildid is simply a timestamp[2]

But where this might fall short is when you use taskcluster and buildbot_config no longer exists.

option 1:

Does taskcluster have an equivalent? Maybe you can use buildid from buildbot and something else from taskcluster? That way you can still search for job with a known id from tbpl/treeherder. If the stars aligned and you had an identifier clash between the two, who cares? :) you mentioned it's not the individual job but the averages of them all anyway

option 2:

create your own identifier based on time. This will make it hard to identify the individual job stat point. Not sure if you care about that though

[1] http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/building/buildb2gbase.py#254
[2] http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/building/buildbase.py#646
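For context, the buildid in [2] is just a local timestamp in a fixed 14-digit format, so option 2's roll-your-own identifier could be as simple as this sketch:

import time

# produces e.g. "20141028075549" -- the same format as the buildbot
# "buildid" property quoted later in this bug
buildid = time.strftime("%Y%m%d%H%M%S")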
Attachment #8511178 - Flags: review?(jlund) → review+
(In reply to Jordan Lund (:jlund) from comment #17)
> ::: mozharness/base/python.py
> @@ +565,5 @@
> >  
> >          try:
> > +            site_packages_path = self.query_python_site_packages_path()
> > +            if site_packages_path not in sys.path:
> > +                sys.path.append(site_packages_path)
> 
> hmm, so ResourceMonitor's technique is to use: activate_virtualenv
> http://mxr.mozilla.org/build/source/mozharness/mozharness/base/python.py#414
> 
> did that not work for you? I suppose this is ok. I wonder if we could have
> conflicts if we added this path to an interpreter that was a different
> version.

I was using vcs_sync.py as a reference, since it also uses the requests module. I tried to change it to use the ResourceMonitor technique, but by default virtualenv seems to be set up to install things into venv/lib/python2.7/site-packages, while buildbot runs mozharness with python2.6, so it looks in venv/lib/python2.6/site-packages. I was eventually able to get it to work by updating buildbot-configs to specify 'mozharness_python': '/tools/buildbot/bin/python' for the emulator config. Is it worth going and adding mozharness_python to all those configs? Or should buildbot be using 2.7 by default for mozharness? I'm not sure if it makes sense to block this bug on a full 2.7 solution, since it seems like it might be a can of worms.

> It looks like query_buildid() in b2g_build.py only derives the buildid from
> the app.ini file and that is generated after the build step. However it is
> actually buildbot that will set that explicitly via the MOZ_BUILD_DATE[1]
> env var.

In that link it looks like mozharness is setting MOZ_BUILD_DATE in the environment, and pulling buildid from the buildbot config. When I run an emulator build, there is no buildid field in the buildbot config properties :/. Maybe we'll need to add this somewhere? I can probably do that as a followup - in the meantime it just means it is harder to find all the stats for a single build, which may not be something we want to do often anyway.

> option 1:
> 
> Does taskcluster have an equivalent? Maybe you can use buildid from buildbot
> and something else from taskcluster? That way you can still search for job
> with a known id from tbpl/treeherder. If the stars aligned and you had an
> identifier clash between the two, who cares? :) you mentioned it's not the
> individual job but the averages of them all anyway

I'm expecting to use different stat details for taskcluster anyway. Off-hand I'm not sure if it has buildid or not - but various details like those buildbot properties will be different. When I go to add that, I'll likely be moving things around as you suggested in #c6.

> create your own identifier based on time. This will make it hard to identify
> the individual job stat point. Not sure if you care about that though

Hmm, this might be easiest. Though I was hoping for something that could be tied in to a build property of some sort. I guess if I just display it in the log we at least have some record of what it is.
Oh, maybe the no-buildid thing is just when using my dev master? I can see that a b2g_mozilla-central_emulator_dep build on mozilla-central gets a buildid:

08:06:24     INFO - Using buildbot properties:
...
08:06:24     INFO -         "buildid": "20141028075549",

And it has a revision too, which is helpful since that is also blank for the "clobber" action. Here's all I get in my dev master:

12:59:54     INFO -     "properties": {
12:59:54     INFO -         "base_mirror_urls": null, 
12:59:54     INFO -         "upload_ssh_user": "ffxbld", 
12:59:54     INFO -         "product": "b2g", 
12:59:54     INFO -         "basedir": "/builds/slave/b2g_m-cen_emu_dep-000000000000", 
12:59:54     INFO -         "script_repo_revision": "default", 
12:59:54     INFO -         "repository": "", 
12:59:54     INFO -         "buildername": "b2g_mozilla-central_emulator_dep", 
12:59:54     INFO -         "base_bundle_urls": [
12:59:54     INFO -             "http://dev-stage01.build.mozilla.org/pub/mozilla.org/firefox/bundles"
12:59:54     INFO -         ], 
12:59:54     INFO -         "mock_target": null, 
12:59:54     INFO -         "project": "", 
12:59:54     INFO -         "platform": "emulator", 
12:59:54     INFO -         "upload_ssh_key": "ffxbld_dsa", 
12:59:54     INFO -         "master": "http://dev-master1.srv.releng.scl3.mozilla.com:8064/", 
12:59:54     INFO -         "branch": "mozilla-central", 
12:59:54     INFO -         "hgurl": "https://hg.mozilla.org/", 
12:59:54     INFO -         "repo_path": "users/mshal_mozilla.com/mozilla-central", 
12:59:54     INFO -         "buildnumber": 41, 
12:59:54     INFO -         "revision": null, 
12:59:54     INFO -         "slavename": "b-linux64-hp-0022", 
12:59:54     INFO -         "tooltool_url_list": [
12:59:54     INFO -             "http://runtime-binaries.pvt.build.mozilla.org/tooltool"
12:59:54     INFO -         ], 
12:59:54     INFO -         "upload_ssh_server": "dev-stage01.srv.releng.scl3.mozilla.com"
12:59:54     INFO -     }, 

Maybe I just need to set them manually to something in order to test it?
Attached patch buildidSplinter Review
Ok, so if I explicitly set the revision and buildid properties before doing a Force Build, then I can use this patch to get them out, and it works for every build step. That gets me the data I want :)

Let me know if you want me to use a ResourceMonitoringMixin-style virtualenv or not. If so, I think we'll need to patch buildbot-configs to set the mozharness_python to a 2.7 version (or is there a better way?). If not, I'll squash all the patches in this bug before landing.
Attachment #8513097 - Flags: review?(jlund)
Comment on attachment 8513097 [details] [diff] [review]
buildid

Review of attachment 8513097 [details] [diff] [review]:
-----------------------------------------------------------------

thanks for trying out the resource monitoring suggestion. I do think we should be using the same version of python for every mozharness call in production. But you're right, that's out of scope and may be a bag of snakes.

compound the patches together but let's test on cypress :)

::: mozharness/base/python.py
@@ +616,2 @@
>                  p.get('gaia_revision'),
> +                c.get('buildid'),

we have logic in buildbot factories to create a buildid if one hasn't been generated by the time the job starts. I am not sure how we wouldn't have a buildid at this point, or whether that logic is even needed.

I suppose it is a question of what is more important to you:

1) having a buildid every time but sometimes it may be generated at this point in the patch (and therefore unrelated to the *real* buildid)
2) every time we have a buildid it will for sure match that of the *real* build

does that make sense? this patch is taking the number 2 approach

one other thing that may be worth mentioning is that unless we set buildid/revision in staging (as you did) we will be submitting to influx without those points. Speaking of which, do we even want to let staging builds do influx stat pushes?
Attachment #8513097 - Flags: review?(jlund) → review+
(In reply to Jordan Lund (:jlund) from comment #21)
> ::: mozharness/base/python.py
> @@ +616,2 @@
> >                  p.get('gaia_revision'),
> > +                c.get('buildid'),
> 
> we have logic in buildbot factories to create a buildid if it is not
> generated by the time the job starts. I am not sure how we wouldn't have a
> buildid at this point or if that said logic is needed.

I'm not sure if I totally follow the buildbot logic, but I guess it gets skipped when doing a "Force Build" in staging? I definitely don't get a buildid property unless I explicitly set it, though it is present in builds on tbpl.

> I suppose it is a question of what is more important to you:
> 
> 1) having a buildid every time but sometimes it may be generated at this
> point in the patch (and therefore unrelated to the *real* buildid)
> 2) every time we have a buildid it will for sure match that of the *real*
> build
> 
> does that make sense? this patch is taking the number 2 approach

I think so - this seems to work well enough for now in that I get the same buildid for all steps, rather than a missing buildid for the first few steps. One thing catlee mentioned is that buildid isn't necessarily unique anyway, so we may want to add the builduid (note: with "u") to the stats as well.

> one other thing that may be worth mentioning is that unless we set
> buildid/revision in staging (as you did) we will be submitting to unflux
> without those points. Speaking of which, do we even want to let staging
> builds do influx stat pushes?

The stats will be disabled by default in staging unless you add an influxdb_credentials line to your BuildSlaves.py, so I don't think we need to worry too much about empty fields for staging stats. If it's something you want to test in staging, you'll be pointing the influxdb_credentials to a test database anyway.

It is definitely a bit frustrating that firing off builds with "Force Build" seems to be subtly different from a scheduled build, since it makes it even harder to test accurately :/
Attached patch puppet-influxdbSplinter Review
Patch for BuildSlaves.py to set influxdb_credentials. The secret part should already be in hiera, but please double-check me! mozharness uses oauth.txt on the slave to determine influxdb_credentials, which is the URL to post stats to.
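For illustration, the BuildSlaves.py entry would have roughly this shape (hostname and credentials here are made up; the real value lives in hiera, never in a repo):

# staging example -- point influxdb_credentials at a throwaway test db;
# the URL format mirrors the test URL from the original WIP patch
influxdb_credentials = 'http://influxdb.example.com:8086/db/testdb/series?u=stageuser&p=stagepass'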
Attachment #8514391 - Flags: review?(rail)
Also add influxdb_credentials as an example in buildbot-configs
Attachment #8514392 - Flags: review?(rail)
Attachment #8514392 - Flags: review?(rail) → review+
Attachment #8514391 - Flags: review?(rail) → review+
Comment on attachment 8514391 [details] [diff] [review]
puppet-influxdb

https://hg.mozilla.org/build/puppet/rev/8fbb25418d9c

And merged to production.
Attachment #8514391 - Flags: checked-in+
++ I just want to say thank you to everyone :)

I created this dashboard https://stats.taskcluster.net/grafana/#/dashboard/db/developer-branch-b2g-builds?panelId=1&fullscreen&edit (you need to log in with Persona using a @mozilla.com email) which shows the time spent in various mozharness steps per buildername.
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1971]
I think we can mark this fixed. Any other work can be done in followups.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED