Buildbot to consume data from Treeherder's SETA

RESOLVED WONTFIX

Status

Infrastructure & Operations
CIDuty
RESOLVED WONTFIX
2 years ago
2 months ago

People

(Reporter: kmoir, Assigned: kmoir)

Tracking

Details

Attachments

(3 attachments)

(Assignee)

Description

2 years ago
Buildbot should consume SETA data from the Heroku app instead of the flaky service on ouija.allizom.org.

here is the url
armenzg> here's the flask app for the Heroku version https://github.com/mozilla/ouija/blob/master/src/server.py

this is the list of jobs that should be run
http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1

Today we consume a list of jobs that should not be run and create a dict from it in buildbot to skip jobs in the scheduler.
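As a rough illustration of the "list of jobs to skip becomes a dict" idea, here is a minimal Python 3 sketch. The function names, the `(skip_count, skip_timeout)` tuple shape, and the numbers in the usage note are all hypothetical; this is not the actual buildbot scheduler code.

```python
def build_skip_config(low_value_builders, skip_count, skip_timeout):
    """Map each low-value builder name to its (skip_count, skip_timeout) pair.

    Hypothetical shape: skip the job until skip_count pushes have gone by
    or skip_timeout seconds have elapsed since it last ran.
    """
    return {name: (skip_count, skip_timeout) for name in low_value_builders}


def should_skip(builder, pushes_since_run, seconds_since_run, skip_config):
    """Return True if the scheduler should skip this builder for now."""
    if builder not in skip_config:
        # Builders that are not in the skip list always run.
        return False
    skip_count, skip_timeout = skip_config[builder]
    return pushes_since_run < skip_count and seconds_since_run < skip_timeout
```

For example, with `build_skip_config(names, 7, 3600)` a listed builder would be skipped until either 7 pushes or one hour have passed, while unlisted builders are never skipped.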
(Assignee)

Updated

2 years ago
Assignee: nobody → kmoir
(Assignee)

Updated

2 years ago
Blocks: 1176784

Comment 1

2 years ago
jmaher: what do you think is needed before we can start consuming the data from the Heroku instance?

Comment 2

While this hasn't been a top priority for me, I will say that the Heroku instance isn't running all the services that we run. I believe that is because we would need to pay for larger databases and more computing power. A good next step is to figure out what we need and how to pay for it, then get things truly running in parallel.

MikeLing, can you give us details of what we are running now on Heroku and what we could do if we paid for the service?
Flags: needinfo?(sabergeass)

Comment 3

2 years ago
(In reply to Joel Maher ( :jmaher ) from comment #2)
> MikeLing, can you give us details of what we are running now on Heroku and
> what we could do if we paid for the service?

Sure! For now, I can make sure 'seta' (not ouija) runs on Heroku just like it does on the alertmanager server. However, because of the database size limit on Heroku, we still need to use part of the data from the alertmanager server right now [1]. For the same reason, we can only support querying four or five days of data on Heroku.

So, once we pay for the service, we could support querying data for each day and would no longer need to retrieve data from the alertmanager server. Furthermore, we could write a script that automatically runs failure.py and updated.py to get the latest data and make our results more reliable :)


[1]https://github.com/MikeLing/ouija/blob/ouija-rewrite/tools/failures.py#L34
Flags: needinfo?(sabergeass)

Comment 4

2 years ago
I'm an admin and I can upgrade this.

Is it the add on that needs upgrading or the dyno?
If the add on, are these the instructions? [1]
If so, would you like me to go ahead and upgrade? or do you want me to wait?

Can we also rename the app to "seta"? Instead of seta-dev.

[1] https://devcenter.heroku.com/articles/upgrading-heroku-postgres-databases

Comment 5

2 years ago
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #4)


> Is it the add on that needs upgrading or the dyno?
I'm not sure about this. For now, I think we only need to upgrade the database size. But it would be great to have more dynos (I don't know what to do with them for now).

> If the add on, are these the instructions? [1]

Yeah, I think so.

> If so, would you like me to go ahead and upgrade? or do you want me to wait?
> 
> Can we also rename the app to "seta"? Instead of seta-dev.

Please go ahead and upgrade it, so we can do more things with it :) And I'm totally OK with either seta or seta-dev. Thank you!

BTW, I don't have sufficient permissions to use 'fork' to add a staging server for seta [1]

[1]https://devcenter.heroku.com/articles/multiple-environments#starting-from-an-existing-app

Updated

2 years ago
Depends on: 1292560

Comment 6

2 years ago
I've upgraded the DB.
I've also upgraded the dyno so we can have metrics.

I've also forked seta-dev to seta. I've pointed seta to armenzg/ouija instead of your repo (I will eventually point it to mozilla/ouija when I have permission).

I've also created a seta pipeline, I've added the seta app as production while seta-dev is staging.
PRs will eventually autodeploy new versions of SETA to be tested against.

You can visit it here:
https://seta.herokuapp.com
Summary: see if we can consume data from heroku app w seta data → Consume data from the Heroku app seta-dev

Comment 7

2 years ago
The seta app now deploys automatically from mozilla/ouija:master.

seta-dev is based on mikeling's repo (manual deployments).

Updated

2 years ago
Duplicate of this bug: 1253020
(Assignee)

Comment 9

2 years ago
armenzg: what is the status of the new seta deployment after the gsoc term has completed?  I notice https://seta.herokuapp.com doesn't seem to work anymore
Flags: needinfo?(armenzg)

Comment 10

We have seta and seta-dev (staging server) on Heroku.  There are a few things required to get this done:
1) migrate old data
2) create an endpoint for taskcluster to use
3) land in-tree code for taskcluster
4) when taskcluster is a-ok, migrate buildbot to new server

Right now we are working on 1 and 2 via pull requests/issues (https://github.com/mozilla/ouija/pulls ).

I have personally been working on #1 and it required cleaning up a lot of data and fixing a major loss of data from ~6 weeks ago when a treeherder API changed and we stopped getting important data for SETA.  The data is fine now, I am testing data migration and hopefully early next week we can call #1 done.

Regarding #2, there is much discussion on the PRs for the endpoint, including example code. Possibly when #1 is done we can all focus more heavily on #2 and resolve that the week after next. That would mean that SETA + taskcluster would be running by the end of the month, and a couple of weeks later we could look at migrating off of buildbot.
Flags: needinfo?(armenzg)
(Assignee)

Comment 11

2 years ago
Armen, is this the new endpoint we need to consume?  I saw your note to the releng list and looked at the pull request.

http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1&date=2016-12-21
Flags: needinfo?(armenzg)

Updated

2 years ago
Depends on: 1306709
Flags: needinfo?(armenzg)
Summary: Consume data from the Heroku app seta-dev → Consume data from Treeherder's SETA

Comment 12

2 years ago
I also see it being called with &inactive=1.
Do you ever call it without &inactive=1?

I want to know when the API is called in different ways.

https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla-tests/config_seta.py#75
https://dxr.mozilla.org/build-central/source/tools/buildfarm/maintenance/update_seta.py#45
(Assignee)

Comment 13

2 years ago
No, I don't call it without inactive. I just want the list of tests to run at a less frequent interval, and I construct the dictionary of values for skipconfig from that data.

Comment 14

2 years ago
I will tackle this on the first week of January.

Any other reviewer while kmoir is on PTO?
Assignee: kmoir → armenzg

Updated

2 years ago
Blocks: 1325404

Comment 15

2 years ago
I've landed on master what I believe are the sufficient changes to switch Buildbot over.
I will review this bug and the output from both systems to understand if there are still any issues.

At the beginning of this bug we wanted to switch over from:
http://alertmanager.allizom.org/data/setadetails
to:
http://seta.herokuapp.com/data/setadetails
for now we have:
https://treeherder.allizom.org/api/project/mozilla-inbound/seta/job-priorities/
but the final url will be (we're waiting for a deployment this week):
https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/

From my understanding of this bug we focused on making SETA work for TaskCluster first.
We also reached a good point on MikeLing's work for Buildbot support, however, we decided to wait a bit longer to consume from Treeherder directly. By doing this, we would only switch over once rather than twice.
Thanks MikeLing for your hard work and for making my work easier!

As kmoir describes:
> Today we consume a list of jobs that should *not* be run and create a dict of that in buildbot
> to skip jobs in the scheduler.

Changes since the original implementation:
* We don't need to specify a date when calling the API (&date=2017-01-09)
* &inactive=1 is no more. We now use priority=5 (it means low value jobs)
* We now use &build_system_type={buildbot,taskcluster} instead of &buildbot=1
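To make the new query parameters concrete, here is a small Python 3 sketch that assembles a job-priorities URL from them. The helper name `job_priorities_url` and its defaults are made up for illustration; the path and parameter names are the ones used in the links in this bug.

```python
from urllib.parse import urlencode

TREEHERDER_HOST = "https://treeherder.mozilla.org"


def job_priorities_url(project, build_system_type, priority,
                       host=TREEHERDER_HOST):
    """Build the SETA job-priorities URL for one project.

    build_system_type is "buildbot" or "taskcluster"; priority=5 selects
    the low value jobs and priority=1 the high value ones.
    """
    query = urlencode({
        "build_system_type": build_system_type,
        "priority": priority,
        "format": "json",
    })
    return "{}/api/project/{}/seta/job-priorities/?{}".format(
        host, project, query)
```

Calling `job_priorities_url("mozilla-inbound", "buildbot", 5)` yields the same URL shape as the Treeherder links referenced in this bug, with no date parameter.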

Each endpoint seems to return different values:
* Alertmanager - 846 builders [1]
* Seta-dev - 1014 builders [2]
* Seta - 1199 builders [3]
* TH's API - 1249 builders [4]

In any case, the new TH SETA API seems to show reasonable values for priority=1 [5]: 29 builders. This includes 28 talos builders from preseed.json + 1 builder from analyzing failures [6].

We're going to have to wait a few days until the SETA changes make it into production.
Will needs to deploy some major changes and do some manual DB work this week.
I would like to see the number of builders we will get with priority=1 once we have a lot more failures-fixed-by-commit data.

I will still prepare the Buildbot patches for what I believe are the current set of required changes.

jmaher, kmoir: could you please review this comment and see if it makes sense to you?

[1] http://alertmanager.allizom.org/data/setadetails/?date=2017-01-09&buildbot=1&branch=mozilla-inbound&active=0
[2] http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1&priority=5
[3] http://seta.herokuapp.com/data/setadetails/?buildbot=1&priority=5
[4] https://treeherder.allizom.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
[5] https://treeherder.allizom.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=1&format=json
[6] https://treeherder.allizom.org/api/seta/failures-fixed-by-commit/?format=json
Summary: Consume data from Treeherder's SETA → Buildbot to consume data from Treeherder's SETA
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 18

2 years ago
That's all it will take.
I will have to wait until the Treeherder changes from 'master' make it into 'production'.

Comment 19

2 years ago
kmoir, jmaher: do you have any comments wrt comment 17?
Flags: needinfo?(kmoir)
Flags: needinfo?(jmaher)
(Assignee)

Comment 20

2 years ago
No, I'm just writing a patch so we can consume the new endpoint.  The data is via https instead of http so I'm cleaning up the buildbot code to consume that.
Flags: needinfo?(kmoir)

Comment 21

2 years ago
One major difference between the old SETA and the new one is that the new one does not consider "fixed by commit" jobs whose fixed-by field is empty.
This can definitely affect which builders are classified as low value jobs.

https://treeherder.allizom.org/api/seta/failures-fixed-by-commit/
vs
alertmanager.allizom.org/data/seta/?startDate=2017-01-01&endDate=2017-01-09 (I've been waiting a while for it to load, but I'm sure this is the endpoint)

Comment 22

I am concerned with the 1 job to run via analysis. Is this with 90 days of history, or with what is done on staging/locally?  We should have a representative amount of builders as p1 vs p5.

I fully understand there are differences between alertmanager, Heroku, and Treeherder. Some of this is because we are slightly changing the sources. For example, I believe alertmanager is dealing with desktop-test/android-test and not the new taskcluster names, so Treeherder is the winner in that case.
Flags: needinfo?(jmaher)

Comment 23

2 years ago
(In reply to Joel Maher ( :jmaher) from comment #22)
> I am concerned with the 1 job to run via analysis- is this with 90 days of
> history, or with what is done on staging/locally?  We should have a
> representative amount of builders as p1 vs p5.
> 
This is staging failures data (3 revisions):
https://treeherder.allizom.org/api/seta/failures-fixed-by-commit/?format=json

I want to see the production output since staging adds no real value.

FYI 4 months. Treeherder expires data after 4 months.
(Assignee)

Comment 24

2 years ago
So I have been testing patches to consume the new data.  As I mentioned to Armen yesterday in irc, I ran into some problems since the seta data is now provided via https vs the previous http.

I wrote a script like this to test:
import httplib  # Python 2 module (renamed http.client in Python 3)
import json

host = "treeherder.allizom.org"
path = "/api/project/mozilla-inbound/seta/job-priorities/"

try:
    # Open an HTTPS connection on the default port and fetch the JSON
    conn = httplib.HTTPSConnection(host, 443)
    conn.request("GET", path)
    r1 = conn.getresponse()
    if r1.status == 200:
        data = json.loads(r1.read())
        print(data)
except ValueError as e:
    # json.loads raises ValueError when the body is not valid JSON
    print("JSON parsing error https://%s%s: %s" % (host, path, str(e)))


which works with a more current version of python, but not the version used to run buildbot (2.7.3).  

With python 2.7.3 we get the error
ssl.SSLError: [Errno 1] _ssl.c:504: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error

Investigation has revealed that this is because SNI support was only backported to Python 2.7 releases newer than 2.7.3 (it arrived in 2.7.9).
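One quick way to check whether a given interpreter can send SNI at all is to look at the `ssl.HAS_SNI` flag. The sketch below is written for Python 3 but only uses `getattr`, so it should also run on old 2.7 builds, where the attribute is simply absent. The helper names are made up for illustration.

```python
import ssl
import sys


def sni_supported():
    """Return True if this interpreter's ssl module reports SNI support.

    Old builds without the SNI backport have no ssl.HAS_SNI attribute,
    so a missing attribute is treated as "not supported".
    """
    return bool(getattr(ssl, "HAS_SNI", False))


def describe_tls_support():
    """Summarise the interpreter version, OpenSSL build, and SNI support."""
    details = sys.version_info[:3] + (ssl.OPENSSL_VERSION, sni_supported())
    return "Python %d.%d.%d, %s, SNI supported: %s" % details


if __name__ == "__main__":
    print(describe_tls_support())
```

Running this with both the buildbot Python and the newer system Python on a master would show whether the handshake failure lines up with missing SNI support.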

One suggestion is to use requests but this is not on the masters either

So I'm continuing to investigate.  We don't really want to upgrade python on all the buildbot masters now given that their demise is imminent later this year

Comment 25

2 years ago
If we switch to BBB, would we get SETA support via TaskCluster?
If so, we could wait until we run everything via TaskCluster/BBB and get SETA via that way.

Alternatively, could we make a call to curl or wget?

In any case, here's our current comparison of builder lists between systems:
* low value jobs TH       - 124 builders [1]
* high value jobs TH      - 1154 builders [2]
* low value jobs Allizom  - 846 builders [3]
* high value jobs Allizom - 62 builders [4]

It is rather disappointing to see those numbers; maybe I'm being too harsh on myself but I would have expected at least some closeness to each other.
I know that on Ouija we consider jobs marked with a blank "fixed by commit" string, whereas on TH we don't.

If we add the builders from both endpoints we get:
* TH 1278 builders
* Allizom 968 builders

If bug 1330354 was fixed we could at least try to play with production data to determine if there are any issues in the logic.

[1] https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
[2] https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=1&format=json
[3] http://alertmanager.allizom.org/data/setadetails/?date=2017-01-11&buildbot=1&branch=mozilla-inbound&active=0
[4] http://alertmanager.allizom.org/data/setadetails/?date=2017-01-11&buildbot=1&branch=mozilla-inbound&active=1

Comment 26

If I knew we were going to be running everything via BBB in <4 weeks, I would say "yes, let's just wait for BBB and do SETA on everything there".  Unfortunately I think we are not going to be doing that.

Odd that we seem to have flipped low value/high value.  When running 'failures.py', how many 'fixed by commit' regressions are we working with in TH?  In allizom, we have 628 failures over 90 days.
(Assignee)

Comment 27

2 years ago
So, the SSL issue isn't a problem anymore: I switched to the treeherder.mozilla.org URL and it works.  It probably had something to do with a self-signed cert on allizom or something.  In any case, I found another problem.

the data here
https://treeherder.mozilla.org/api/project/autoland/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json

specifies mozilla-inbound in the branch like this when the url points to autoland
"jobtypes":{"2017-01-11":["Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound talos chromez-e10s"

Is this expected? 

same for https://treeherder.mozilla.org/api/project/graphics/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json

I assume that the data in all three links could be different, so my scripts parse on the branch name.
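For illustration, here is a minimal Python 3 sketch of parsing on the branch name, assuming the response has the `{"jobtypes": {date: [builder names]}}` shape shown above. The function names are hypothetical and this is not the actual buildbot script.

```python
def builders_from_payload(payload):
    """Flatten the builder names out of a job-priorities response.

    Assumes the {"jobtypes": {"<date>": [<builder name>, ...]}} shape.
    """
    builders = []
    for names in payload.get("jobtypes", {}).values():
        builders.extend(names)
    return builders


def builders_for_branch(builders, branch):
    """Keep only builders whose name contains the branch as a whole word."""
    return [name for name in builders if branch in name.split()]
```

With data like the autoland response above, `builders_for_branch(builders, "autoland")` would come back empty because every name says mozilla-inbound, which is exactly the mismatch being asked about.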
Flags: needinfo?(armenzg)

Comment 28

2 years ago
Thanks Kim! I will look into it.

(In reply to Joel Maher ( :jmaher) from comment #26)
> Odd that we seem to have flipped low value/high value.  When running
> 'failures.py', how many 'fixed by commit' regressions are we working with in
> TH?  In allizom, we have 628 failures over 90 days.

I can't tell because of bug 1330354. Enough for the MySQL operations to take over 49 seconds.
Flags: needinfo?(armenzg)

Comment 30

2 years ago
kmoir: I'm returning the bug to you as I will be gone after today.
If you find any more issues please chat with rwood/jmaher.
Assignee: armenzg → kmoir
Depends on: 1330652

Comment 31

2 years ago
Created attachment 8826329 [details]
compare_end_points.py

Here's a script I started in order to compare endpoints.

A feature I wanted to add is to sort the data before writing to disk.
This would allow using diffing tools like vimdiff or diff.
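A minimal Python 3 sketch of that sorting idea follows. It is not the attached compare_end_points.py, and the function names are made up: the builder names are de-duplicated and sorted before being written, so the files from two endpoints line up in diff or vimdiff.

```python
import difflib


def sorted_lines(builders):
    """Return the de-duplicated builder names, sorted, one per line."""
    return "".join(name + "\n" for name in sorted(set(builders)))


def write_sorted(builders, path):
    """Write the sorted builder list to disk for use with diff/vimdiff."""
    with open(path, "w") as handle:
        handle.write(sorted_lines(builders))


def diff_builders(left, right, left_label="left", right_label="right"):
    """Return a unified diff (list of lines) between two builder lists."""
    return list(difflib.unified_diff(
        sorted(set(left)), sorted(set(right)),
        fromfile=left_label, tofile=right_label, lineterm=""))
```

`diff_builders` is a convenience for doing the comparison in-process; writing both lists with `write_sorted` and running an external diff tool gives the same information.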

Good luck with the bug!

Updated

2 years ago
Attachment #8826329 - Attachment mime type: text/x-python-script → text/plain

Comment 32

I am looking at the differences in data and trying to figure out how we get data for fixed_by_commit in treeherder.  I believe we have an issue here:
https://github.com/mozilla/treeherder/blob/edc3d7ad112c7c60e341fab7c1485c0e41408036/treeherder/etl/seta.py#L95

I have a query that I believe replicates the fixed_by_commit logic:
https://sql.telemetry.mozilla.org/queries/2517

The concern here is that the option_collection_hash might be incorrect.  What I see is that we are splitting the name on -{option}, where option is one of [opt, debug, pgo, asan], whereas we should be doing:
if option in ['pgo', 'asan']:
    option = 'opt'

job_type_name.split('{option}-'.format(option=option))[-1]

I think that would solve things, but I would need to debug it a bit more.

:rwood, this is probably the next level of debugging to do.
Flags: needinfo?(rwood)

Updated

2 years ago
Flags: needinfo?(rwood)
(Assignee)

Comment 33

2 years ago
Hmm, it seems I'm having SSL errors again when trying to fetch the URL from treeherder.mozilla.org using the version of Python that buildbot uses.

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    conn.request("GET", path)
  File "/tools/python27/lib/python2.7/httplib.py", line 958, in request
    self._send_request(method, url, body, headers)
  File "/tools/python27/lib/python2.7/httplib.py", line 992, in _send_request
    self.endheaders(body)
  File "/tools/python27/lib/python2.7/httplib.py", line 954, in endheaders
    self._send_output(message_body)
  File "/tools/python27/lib/python2.7/httplib.py", line 814, in _send_output
    self.send(msg)
  File "/tools/python27/lib/python2.7/httplib.py", line 776, in send
    self.connect()
  File "/tools/python27/lib/python2.7/httplib.py", line 1161, in connect
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)
  File "/tools/python27/lib/python2.7/ssl.py", line 381, in wrap_socket
    ciphers=ciphers)
  File "/tools/python27/lib/python2.7/ssl.py", line 143, in __init__
    self.do_handshake()
  File "/tools/python27/lib/python2.7/ssl.py", line 305, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [Errno 1] _ssl.c:504: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error
(Assignee)

Comment 34

a year ago
I was thinking of using the non-buildbot version of Python already installed to run a cron job that fetches the JSON, then changing the buildbot configs to parse the local copy upon reconfig.
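A minimal Python 3 sketch of the local-copy half of that idea is below. The function names are hypothetical, and the fetch itself is left out. The cron job would call `save_snapshot` with the downloaded data, and the config side would call `load_snapshot` at reconfig time.

```python
import json
import os
import tempfile


def save_snapshot(payload, path):
    """Write the payload as JSON, replacing the old copy atomically.

    Writing to a temp file in the same directory and renaming it means a
    reconfig never sees a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, path)
    except Exception:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path, default=None):
    """Read the local copy; fall back to default if missing or invalid."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (IOError, ValueError):
        return default
```

Falling back to a default keeps a reconfig from failing outright when the cron job has not run yet or left bad data behind; whether an empty skip list is the right fallback is a separate decision.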
(Assignee)

Comment 35

a year ago
I don't think this bug is relevant anymore since we can see the end of the road for tc migration and seta only runs on certain trunk branches.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → WONTFIX

Comment 36

Could someone file the decom bug for Heroku seta and make it depend on the necessary bb->tc bugs?

Comment 37

a year ago
I think we will need to decom alertmanager's SETA or remove the logic from Buildbot consuming from it.
We deleted the SETA Heroku apps not long ago.
TC uses Treeherder's SETA.

Please correct me if needed.
Flags: needinfo?(jmaher)

Comment 38

Armen, that is all correct
Flags: needinfo?(jmaher)
That sounds great - thank you!
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations