Closed Bug 1286358 Opened 8 years ago Closed 7 years ago

Buildbot to consume data from Treeherder's SETA

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: kmoir, Assigned: kmoir)

References

Details

Attachments

(3 files)

Instead of the flaky service on ouija.allizom.org, here are the URLs.
armenzg> here's the flask app for the Heroku version: https://github.com/mozilla/ouija/blob/master/src/server.py
This is the list of jobs that should be run: http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1
Today we consume a list of jobs that should not be run and create a dict from it in buildbot to skip jobs in the scheduler.
Assignee: nobody → kmoir
Blocks: 1176784
jmaher: what do you think is needed before we can start consuming the data from the Heroku instance?
While this hasn't been a top priority for me, I will say that the Heroku instance isn't running all the services that we run. I believe this is because we would need to pay for larger databases and more computing power. A good next step is to figure out what we need and how to pay for it, then get things truly running in parallel. MikeLing, can you give us details of what we are running now on Heroku and what we could do if we paid for the service?
Flags: needinfo?(sabergeass)
(In reply to Joel Maher ( :jmaher ) from comment #2)
> MikeLing, can you give us details of what we are running now on Heroku and
> what we could do if we paid for the service?
Sure! For now, I can make sure that 'seta' (not ouija) runs on Heroku just like it does on the alertmanager server. But because of the database size limit on Heroku, we still need to use part of the data from the alertmanager server right now [1]. For the same reason, we can only support querying four or five days of data on Heroku. Once we pay for the service, we could support querying data for every day and would no longer need to retrieve data from the alertmanager server. Furthermore, we could write a script that automatically runs failures.py and updated.py to get the latest data and make our results more reliable :)
[1] https://github.com/MikeLing/ouija/blob/ouija-rewrite/tools/failures.py#L34
Flags: needinfo?(sabergeass)
I'm an admin and I can upgrade this. Is it the add on that needs upgrading or the dyno? If the add on, are these the instructions? [1] If so, would you like me to go ahead and upgrade? or do you want me to wait? Can we also rename the app to "seta"? Instead of seta-dev.
[1] https://devcenter.heroku.com/articles/upgrading-heroku-postgres-databases
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #4)
> Is it the add on that needs upgrading or the dyno?
I'm not sure about this. For now, I think we only need to upgrade the database size. But it would be great to have more dynos (I don't know what to do with them for now).
> If the add on, are these the instructions? [1]
Yeah, I think so.
> If so, would you like me to go ahead and upgrade? or do you want me to wait?
>
> Can we also rename the app to "seta"? Instead of seta-dev.
My opinion is: please go ahead and upgrade it, so we can do more things with it :) And I'm totally OK with either seta or seta-dev. Thank you!
BTW, I don't have enough authority to use 'fork' to add a staging server for seta [1].
[1] https://devcenter.heroku.com/articles/multiple-environments#starting-from-an-existing-app
Depends on: 1292560
I've upgraded the DB. I've also upgraded the dyno so we can have metrics. I've also forked seta-dev to seta. I've pointed seta to armenzg/ouija instead of your repo (I will eventually point it to mozilla/ouija when I have permission). I've also created a seta pipeline and added the seta app as production, with seta-dev as staging. PRs will eventually autodeploy new versions of SETA to be tested against. You can visit it here: https://seta.herokuapp.com
Summary: see if we can consume data from heroku app w seta data → Consume data from the Heroku app seta-dev
The seta app now deploys automatically from mozilla/ouija:master. seta-dev is based on mikeling's repo (manual deployments).
armenzg: what is the status of the new seta deployment after the gsoc term has completed? I notice https://seta.herokuapp.com doesn't seem to work anymore
Flags: needinfo?(armenzg)
We have seta and seta-dev (staging server) on Heroku. There are a few things required to get this done:
1) migrate old data
2) create an endpoint for taskcluster to use
3) land in-tree code for taskcluster
4) when taskcluster is a-ok, migrate buildbot to new server
Right now we are working on 1 and 2 via pull requests/issues (https://github.com/mozilla/ouija/pulls). I have personally been working on #1, and it required cleaning up a lot of data and fixing a major loss of data from ~6 weeks ago, when a Treeherder API changed and we stopped getting important data for SETA. The data is fine now; I am testing data migration, and hopefully early next week we can call #1 done. Regarding #2, there is much discussion on the PRs for the endpoint, including example code. Possibly when #1 is done we can all focus more heavily on #2 and resolve it the week after next. That would mean SETA + taskcluster would be running by the end of the month, and a couple of weeks later we could look at migrating off of buildbot.
Flags: needinfo?(armenzg)
Armen, is this the new endpoint we need to consume? I saw your note to the releng list and looked at the pull request. http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1&date=2016-12-21
Flags: needinfo?(armenzg)
Depends on: 1306709
Flags: needinfo?(armenzg)
Summary: Consume data from the Heroku app seta-dev → Consume data from Treeherder's SETA
I also see it being called with &inactive=1. Do you ever call it without &inactive=1? I want to know when the API is called in different ways. https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla-tests/config_seta.py#75 https://dxr.mozilla.org/build-central/source/tools/buildfarm/maintenance/update_seta.py#45
No, I don't call it without inactive. I just want the list of tests to run at a less frequent interval and construct the dictionary of values for skipconfig from that data
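A minimal sketch of that step, turning the SETA response into a skip dictionary. It assumes the payload is shaped like {"jobtypes": {"<date>": [builder names]}}, as in the sample quoted later in this bug. The function name, the second sample builder and the (skipcount, skiptimeout) values are hypothetical and are not taken from buildbot-configs. It is written for a current Python 3, not the 2.7.3 on the masters.

```python
import json

# Hypothetical default: a (skipcount, skiptimeout) pair for low value jobs.
DEFAULT_SKIP = (7, 3600)


def build_skipconfig(payload, branch, skip=DEFAULT_SKIP):
    """Map each low value builder on `branch` to a (skipcount, skiptimeout) tuple."""
    skipconfig = {}
    for builders in payload.get("jobtypes", {}).values():
        for name in builders:
            # Builder names embed the branch as a whole word,
            # e.g. "... mozilla-inbound talos ...".
            if " %s " % branch in " %s " % name:
                skipconfig[name] = skip
    return skipconfig


# The first builder name is quoted in this bug; the second is made up.
sample = json.loads(
    '{"jobtypes": {"2017-01-11": ['
    '"Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound talos chromez-e10s", '
    '"Rev7 MacOSX Yosemite 10.10.5 autoland talos chromez-e10s"]}}'
)
print(build_skipconfig(sample, "mozilla-inbound"))
```

Filtering on the branch name inside the builder string is one way to cope with an endpoint that returns builders for more than one branch.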
I will tackle this in the first week of January. Is there another reviewer available while kmoir is on PTO?
Assignee: kmoir → armenzg
Blocks: 1325404
I've landed on master what I believe are sufficient changes to switch Buildbot over. I will review this bug and the output from both systems to understand whether there are still any issues.
At the beginning of this bug we wanted to switch over from:
http://alertmanager.allizom.org/data/setadetails
to:
http://seta.herokuapp.com/data/setadetails
For now we have:
https://treeherder.allizom.org/api/project/mozilla-inbound/seta/job-priorities/
but the final URL will be (we're waiting for a deployment this week):
https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/
From my understanding of this bug, we focused on making SETA work for TaskCluster first. We also reached a good point with MikeLing's work on Buildbot support; however, we decided to wait a bit longer and consume from Treeherder directly. That way we only switch over once rather than twice. Thanks MikeLing for your hard work and for making my work easier!
As kmoir describes:
> Today we consume a list of jobs that should *not* be run and create a dict of that in buildbot
> to skip jobs in the scheduler.
Changes since the original implementation:
* We don't need to specify a date when calling the API (&date=2017-01-09)
* &inactive=1 is no more. We now use priority=5 (it means low value jobs)
* We now use &build_system_type={buildbot,taskcluster} instead of &buildbot=1
Each endpoint seems to return a different number of builders:
* Alertmanager - 846 builders [1]
* Seta-dev - 1014 builders [2]
* Seta - 1199 builders [3]
* TH's API - 1249 builders [4]
In any case, the new TH SETA API seems to show reasonable values for priority=1 [5]: 29 builders. This includes 28 talos builders from preseed.json + 1 builder from analyzing failures [6].
We're going to have to wait a few days until the SETA changes make it into production. Will needs to deploy some major changes and do some manual DB work this week. I would like to see the number of builders we get with priority=1 once we have a lot more failures-fixed-by-commit data.
I will still prepare the Buildbot patches for what I believe is the current set of required changes.
jmaher, kmoir: could you please review this comment and see if it makes sense to you?
[1] http://alertmanager.allizom.org/data/setadetails/?date=2017-01-09&buildbot=1&branch=mozilla-inbound&active=0
[2] http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1&priority=5
[3] http://seta.herokuapp.com/data/setadetails/?buildbot=1&priority=5
[4] https://treeherder.allizom.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
[5] https://treeherder.allizom.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=1&format=json
[6] https://treeherder.allizom.org/api/seta/failures-fixed-by-commit/?format=json
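The parameter changes listed in the comment above can be sketched as a small URL builder. This is only an illustration: the function name is hypothetical, the base URL is the production one mentioned above, and the query parameters are the ones visible in the footnote URLs. It targets a current Python 3.

```python
from urllib.parse import urlencode

BASE = "https://treeherder.mozilla.org/api/project/%s/seta/job-priorities/"


def job_priorities_url(branch, build_system_type="buildbot", priority=5):
    """Build a job-priorities URL; priority=5 means low value jobs."""
    if build_system_type not in ("buildbot", "taskcluster"):
        raise ValueError("unknown build system: %r" % build_system_type)
    query = urlencode([
        ("build_system_type", build_system_type),
        ("priority", priority),
        ("format", "json"),
    ])
    return (BASE % branch) + "?" + query


print(job_priorities_url("mozilla-inbound"))
```

Note there is no date parameter and no &inactive=1 or &buildbot=1 any more; priority and build_system_type replace them.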
Summary: Consume data from Treeherder's SETA → Buildbot to consume data from Treeherder's SETA
That's all it will take. I will have to wait until the Treeherder changes from 'master' make it into 'production'.
kmoir, jmaher: do you have any comments wrt comment 17?
Flags: needinfo?(kmoir)
Flags: needinfo?(jmaher)
No, I'm just writing a patch so we can consume the new endpoint. The data is via https instead of http so I'm cleaning up the buildbot code to consume that.
Flags: needinfo?(kmoir)
One major difference between the old SETA and the new one is that the new one doesn't consider jobs whose fixed-by-commit tag is an empty field. This can definitely affect which builders are determined to be low value jobs.
https://treeherder.allizom.org/api/seta/failures-fixed-by-commit/
vs
alertmanager.allizom.org/data/seta/?startDate=2017-01-01&endDate=2017-01-09 (I've been waiting a while, but I'm sure this is the endpoint)
I am concerned with the 1 job to run via analysis- is this with 90 days of history, or with what is done on staging/locally? We should have a representative amount of builders as p1 vs p5. I fully understand there are differences between alertmanager, heroku and treeherder. Some of this is because we are slightly changing the sources; for example, I believe alertmanager is dealing with desktop-test/android-test and not the new taskcluster names, so treeherder is the winner in that case.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #22)
> I am concerned with the 1 job to run via analysis- is this with 90 days of
> history, or with what is done on staging/locally? We should have a
> representative amount of builders as p1 vs p5.
>
This is staging failures data (3 revisions):
https://treeherder.allizom.org/api/seta/failures-fixed-by-commit/?format=json
I want to see the production output since staging adds no real value.
FYI, the history is 4 months, since Treeherder expires data after 4 months.
So I have been testing patches to consume the new data. As I mentioned to Armen yesterday in IRC, I ran into some problems since the SETA data is now provided via https instead of the previous http. I wrote a script like this to test:
    import httplib
    import json

    host = "treeherder.allizom.org"
    path = "/api/project/mozilla-inbound/seta/job-priorities/"
    try:
        port = 443
        conn = httplib.HTTPSConnection(host, port)
        conn.request("GET", path)
        r1 = conn.getresponse()
        if r1.status == 200:
            data = json.loads(r1.read())
            print data
    except ValueError, e:
        # report which endpoint returned unparseable JSON
        print("JSON parsing error %s%s: %s" % (host, path, str(e)))
which works with a more current version of python, but not the version used to run buildbot (2.7.3). With python 2.7.3 we get the error
ssl.SSLError: [Errno 1] _ssl.c:504: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error
Investigation has revealed that this is because the SNI libraries were only backported to python versions newer than 2.7.3. One suggestion is to use requests, but this is not on the masters either, so I'm continuing to investigate. We don't really want to upgrade python on all the buildbot masters now, given that their demise is imminent later this year.
If we switch to BBB, would we get SETA support via TaskCluster? If so, we could wait until we run everything via TaskCluster/BBB and get SETA that way. Alternatively, could we make a call to curl or wget?
In any case, here's our current comparison of builder lists between systems:
* low value jobs TH - 124 builders [1]
* high value jobs TH - 1154 builders [2]
* low value jobs Allizom - 846 builders [3]
* high value jobs Allizom - 62 builders [4]
It is rather disappointing to see those numbers; maybe I'm being too harsh on myself, but I would have expected the two systems to be at least somewhat close to each other. I know that on Ouija we consider jobs marked fixed by commit with a blank string, whereas on TH we don't.
If we add the builders from both endpoints we get:
* TH 1278 builders
* Allizom 968 builders
If bug 1330354 were fixed, we could at least try to play with production data to determine if there are any issues in the logic.
[1] https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
[2] https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=1&format=json
[3] http://alertmanager.allizom.org/data/setadetails/?date=2017-01-11&buildbot=1&branch=mozilla-inbound&active=0
[4] http://alertmanager.allizom.org/data/setadetails/?date=2017-01-11&buildbot=1&branch=mozilla-inbound&active=1
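The curl idea could look roughly like the sketch below, which shells out to the system curl so that the TLS handshake is done by curl rather than by Python's ssl module. Both function names are hypothetical. Whether the curl installed on the masters handles SNI is an assumption that would need checking. The sketch is written for a current Python 3.

```python
import json
import subprocess


def curl_command(url, timeout=30):
    """Return the argv for a quiet curl call that fails on HTTP errors."""
    return ["curl", "--silent", "--show-error", "--fail",
            "--max-time", str(timeout), url]


def fetch_json_with_curl(url, timeout=30):
    """Fetch `url` with the system curl and decode the body as JSON."""
    # check_output raises CalledProcessError if curl exits non-zero,
    # which --fail makes happen on HTTP errors such as 404 or 500.
    body = subprocess.check_output(curl_command(url, timeout))
    return json.loads(body.decode("utf-8"))
```

Keeping the command construction separate from the call makes it easy to log or test the exact argv without touching the network.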
If I knew we were going to be running everything via BBB in <4 weeks, I would say "yes, let's just wait for BBB and do SETA on everything there". Unfortunately I think we are not going to be doing that. Odd that we seem to have flipped low value/high value. When running 'failures.py', how many 'fixed by commit' regressions are we working with in TH? In allizom, we have 628 failures over 90 days.
So, the SSL issue isn't a problem anymore: I switched to the treeherder url and it works. Probably something to do with a self-signed cert on allizom or something. In any case, I found another problem. The data here
https://treeherder.mozilla.org/api/project/autoland/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
specifies mozilla-inbound in the branch, like this, even though the url points to autoland:
"jobtypes":{"2017-01-11":["Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound talos chromez-e10s"
Is this expected? The same goes for
https://treeherder.mozilla.org/api/project/graphics/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
I assume that the data in all three links could be different, thus my scripts parse on the branch name.
Flags: needinfo?(armenzg)
Thanks Kim! I will look into it.
(In reply to Joel Maher ( :jmaher) from comment #26)
> Odd that we seem to have flipped low value/high value. When running
> 'failures.py', how many 'fixed by commit' regressions are we working with in
> TH? In allizom, we have 628 failures over 90 days.
I can't tell because of bug 1330354. Enough for the MySQL operations to take over 49 seconds.
Flags: needinfo?(armenzg)
kmoir: I'm returning the bug to you as I will be gone after today. If you find any more issues please chat with rwood/jmaher.
Assignee: armenzg → kmoir
Depends on: 1330652
Attached file compare_end_points.py
Here's a script I started in order to compare endpoints. A feature I wanted to add is to sort the data before writing to disk. This would allow using diffing tools like vimdiff or diff. Good luck with the bug!
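The sorting feature mentioned above might be sketched like this. It is not the attached compare_end_points.py; the function names are hypothetical, and it assumes both endpoints return a payload shaped like {"jobtypes": {"<date>": [builder names]}}, which has not been verified for every endpoint in this bug.

```python
def builder_names(payload):
    """Flatten a {"jobtypes": {date: [names]}} payload into a sorted, de-duplicated list."""
    names = set()
    for builders in payload.get("jobtypes", {}).values():
        names.update(builders)
    return sorted(names)


def write_sorted(payload, path):
    """Write one builder per line, sorted, so diff or vimdiff output is stable."""
    with open(path, "w") as handle:
        for name in builder_names(payload):
            handle.write(name + "\n")


def compare(old_payload, new_payload):
    """Return (only_in_old, only_in_new) as sorted lists of builder names."""
    old = set(builder_names(old_payload))
    new = set(builder_names(new_payload))
    return sorted(old - new), sorted(new - old)
```

Writing one sorted name per line keeps line-based diff tools useful, and the set difference gives the same answer without leaving Python.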
Attachment #8826329 - Attachment mime type: text/x-python-script → text/plain
I am looking at the differences in the data and trying to figure out how we get data for fixed_by_commit in Treeherder. I believe we have an issue here:
https://github.com/mozilla/treeherder/blob/edc3d7ad112c7c60e341fab7c1485c0e41408036/treeherder/etl/seta.py#L95
I have a query that I believe replicates the fixed_by_commit logic:
https://sql.telemetry.mozilla.org/queries/2517
The concern here is that the option_collection_hash might be incorrect. What I see is that we are splitting the name on -{option}, where option is one of [opt, debug, pgo, asan], whereas we should be doing:
    if option in ['pgo', 'asan']:
        option = 'opt'
    job_type_name.split('{option}-'.format(option=option))[-1]
I think that would solve things, but I would need to debug it a bit more. :rwood, this is probably the next level of debugging to do.
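To make the proposal in the comment above concrete, here is a small standalone sketch. It is not the Treeherder code: the function name and the job type names are hypothetical, and it assumes, as the comment implies, that jobs for pgo and asan builds carry an "opt-" segment in their names.

```python
def strip_platform_prefix(job_type_name, option):
    """Return the part of the job type name after the '<option>-' marker."""
    # Assumption from the comment above: pgo and asan jobs are named with 'opt'.
    if option in ("pgo", "asan"):
        option = "opt"
    return job_type_name.split("{option}-".format(option=option))[-1]


# Hypothetical job type names, for illustration only.
print(strip_platform_prefix("test-linux64/opt-mochitest-1", "pgo"))
print(strip_platform_prefix("test-linux64/debug-xpcshell-2", "debug"))
```

Without the pgo/asan mapping, splitting the first name on "pgo-" would find no match and return the whole string unchanged, which is the kind of mismatch the comment is worried about.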
Flags: needinfo?(rwood)
Flags: needinfo?(rwood)
Hmm, seems I'm having SSL errors again when trying to fetch the url from treeherder.mozilla.org using the version of python that buildbot uses.
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    conn.request("GET", path)
  File "/tools/python27/lib/python2.7/httplib.py", line 958, in request
    self._send_request(method, url, body, headers)
  File "/tools/python27/lib/python2.7/httplib.py", line 992, in _send_request
    self.endheaders(body)
  File "/tools/python27/lib/python2.7/httplib.py", line 954, in endheaders
    self._send_output(message_body)
  File "/tools/python27/lib/python2.7/httplib.py", line 814, in _send_output
    self.send(msg)
  File "/tools/python27/lib/python2.7/httplib.py", line 776, in send
    self.connect()
  File "/tools/python27/lib/python2.7/httplib.py", line 1161, in connect
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)
  File "/tools/python27/lib/python2.7/ssl.py", line 381, in wrap_socket
    ciphers=ciphers)
  File "/tools/python27/lib/python2.7/ssl.py", line 143, in __init__
    self.do_handshake()
  File "/tools/python27/lib/python2.7/ssl.py", line 305, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [Errno 1] _ssl.c:504: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error
I was thinking of using the non-buildbot version of python that is already installed to run a cron job that fetches the json, then changing the buildbot configs to parse the local copy upon reconfig.
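A rough sketch of that split, under stated assumptions: the cron side (run by the newer Python) writes the JSON atomically so a reconfig never reads a half-written file, and the config side only reads the local copy. Both function names are hypothetical, and the download itself is left out; the cron script would pass whatever it fetched to save_atomically. The save side uses os.replace, which needs Python 3.3 or newer.

```python
import json
import os
import tempfile


def save_atomically(payload, path):
    """Write JSON to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle, sort_keys=True)
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except Exception:
        os.unlink(tmp_path)
        raise


def load_local_copy(path, default=None):
    """Read the cached JSON; fall back to `default` if it is missing or corrupt."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (IOError, ValueError):
        return default
```

Falling back to a default on the read side means a failed or missing fetch degrades to "skip nothing" (or to whatever default the config chooses) instead of breaking the reconfig.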
I don't think this bug is relevant anymore since we can see the end of the road for tc migration and seta only runs on certain trunk branches.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Could someone file the decom bug for Heroku seta and make it depend on the necessary bb->tc bugs?
I think we will need to decom alertmanager's SETA or remove the logic from Buildbot consuming from it. We deleted the SETA Heroku apps not long ago. TC uses Treeherder's SETA. Please correct me if needed.
Flags: needinfo?(jmaher)
Armen, that is all correct
Flags: needinfo?(jmaher)
we can do the work in bug 1383863
That sounds great - thank you!
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard