Pregenerate a cache of inbound build information and have mozregression use it

RESOLVED INCOMPLETE

Status

Testing
mozregression
3 years ago
3 years ago

People

(Reporter: wlach, Assigned: kapy)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 3 obsolete attachments)

49 bytes, text/x-github-pull-request
parkouss
: feedback-
Right now we download inbound information "on demand" inside mozregression itself from the inbound ftp site:

https://github.com/mozilla/mozregression/blob/master/mozregression/inboundfinder.py#L87

This is basically doing a parallel crawl of the inbound ftp site. This is really slow, and error prone.

Instead, let's create some kind of per-platform/branch json index of inbound build dates and locations using a python script (basically just following the procedure outlined in the code snippet above) that we can retrieve quickly from a static ftp site.

Steps I'd probably take writing this:

1. Extract the above logic into a separate script which prints out inbound information given a revision range.
2. Enhance the logic to output a json file, and be able to select arbitrary platform/branch combinations.
3. Allow fetching the entire inbound archive into a single file (likely to take a while), as well as updating an existing file with new information (and taking out old information which has expired)
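
The update part of step 3 (merging fresh entries into an existing file and dropping expired ones) could look roughly like the sketch below. Everything named here is an assumption for illustration: the index layout (a dict keyed by changeset holding a `timestamp` and a `url`), the function names, and the 180-day default are not anything mozregression defines.

```python
import json
import time

SECONDS_PER_DAY = 24 * 60 * 60


def update_index(existing, new_entries, max_age_days=180, now=None):
    """Merge new build entries into an index and drop expired ones.

    Both mappings go from changeset to {"timestamp": int, "url": str}.
    This layout is hypothetical.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    merged = dict(existing)
    merged.update(new_entries)  # newer information wins on conflicts
    return {
        changeset: info
        for changeset, info in merged.items()
        if info["timestamp"] >= cutoff
    }


def dump_index(index):
    """Serialize the index with sorted keys so successive dumps diff cleanly."""
    return json.dumps(index, indent=2, sort_keys=True)
```

Passing `now` explicitly keeps the expiry logic testable without depending on the wall clock.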

Once this is working to our satisfaction, we can probably have this run periodically, upload the information to S3, and have mozregression use it for faster and more robust inbound (or other branch) regression finding!
Why don't we extend this idea to nightly information too? Nightly builds may be invalid too, and we could make a uniform way to query build information.
Would it be possible to write a script that acts on the local filesystem? If we could run it locally on

 - http://ftp.mozilla.org (for nightly)
 - http://inbound-archive.pub.build.mozilla.org (for inbound)

It would be fast and less error prone. We could generate a json file (or multiple files, or even some webserver) containing build info every day with a cron job, and mozregression would just have to fetch it.
(In reply to Julien Pagès from comment #1)
> Why don't we extend this idea to nightly information too ? Nightlies builds
> may be invalid too, and we could make a uniform way to query builds
> information.

Yes, good idea.

(In reply to Julien Pagès from comment #2)
> Would it be possible to write a script that acts on local filesystem ? If we
> could run it locally in
> 
>  - http://ftp.mozilla.org (for nightly)
>  - http://inbound-archive.pub.build.mozilla.org (for inbound)
> 
> It would be fast and less error prone. We could generate a json file (or
> multiple files, or even some webserver) containing builds infos every day
> with a cron and mozregression would have just to fetch this.

Yes, I think ultimately this is what we should do. Let's get a prototype going first and then if it works well we should be able to translate this easily into something that we can get releng to run on the build servers. :) I hear rumours that they're thinking of developing a "build information API", something like this would be a great start.
Great, a local script will work really well I'm sure; I'm thinking that this prototype should not be a real mozregression patch - it should be a separate tool.

Kapil, are you still interested in working on this ? Otherwise, I would be really happy to work on it myself. :)
Flags: needinfo?(kpsingh201091)
Maybe we could implement this in another way:

 - one server to keep build info, and dispatch it as json data on requests
 - tools to get build data locally, and send this data to the server
 - mozregression will do the build info requests, then the bisection/testing work

This would involve more work, but it would also be more reusable; for example, adding new data sources or reflecting tree changes would be easier. We would also have to define a good api / data exchange format.

William, what do you think ?
(In reply to Julien Pagès from comment #4)
> Great, a local script will work really well I'm sure; I'm thinking that this
> prototype must not be a real mozregression patch - this must be a separate
> tool.
> 
> Kapil, are you still interested in working on this ? Otherwise, I would be
> really happy to work on it myself. :)

Kapil is actually still working on this. :) I was helping him with some of the initial steps today. But it might be helpful for him to have another mentor? We're starting off with just writing a simple script which just creates an index of build information by crawling the nightly archive. If you have some suggestions on where to go after we've got that working, that would be cool. :)

(In reply to Julien Pagès from comment #5)
> Maybe we could implement this in another way:
> 
>  - one server to keep build info, and dispatch it as json data on requests
>  - tools to get build data locally, and send this data to the server
>  - mozregression will do the build info requests, then the bisection/testing
> work

I think the first is not necessary -- we can just store the json statically. But yes, generally the separation of concerns sounds like a good idea!
Flags: needinfo?(kpsingh201091)
(In reply to William Lachance (:wlach) from comment #6)
> Kapil is actually still working on this. :) I was helping him with some of
> the initial steps today. But it might be helpful for him to have another
> mentor? We're starting off with just writing a simple script which just
> creates an index of build information by crawling the nightly archive. If
> you have some suggestions on where to go after we've got that working, that
> would be cool. :)

Great! I would be happy to help if needed. Well currently I can't think of
something more specific for the script - what you describe is what we need. :)
(Assignee)

Comment 8

3 years ago
Thanks guys for helping me with this bug. :)
Julien, can you tell me your irc nickname so that I can get help from you also.
Sure, I use parkouss as nickname.

Comment 10

3 years ago
I can't seem to find this script:
https://github.com/mozilla/mozregression/blob/master/mozregression/inboundfinder.py#L87

In general, IIUC, this info could be retrieved from buildapi or the treeherder apis, however, I don't know what your code is looking for.

I'm considering writing a library to help in some other matters if you would like to add to it:
https://etherpad.mozilla.org/query-ci
The http search logic is now in https://github.com/mozilla/mozregression/blob/master/mozregression/build_data.py, and configured with https://github.com/mozilla/mozregression/blob/master/mozregression/fetch_configs.py.

We are looking for build dirs - inbound and nightlies, with information like:

 - repository
 - changeset
 - date of the build
 - application
 - architecture

when available. The purpose is to get information for a range (dates or changesets) of builds, and download some of them to test them (binary search to find a regression).
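
To make the fields above concrete, here is a minimal sketch of a record type and a date-range query. The class name `BuildInfo` and both helper functions are hypothetical, not existing mozregression code; they only illustrate how a range of builds could be selected and how a binary search would pick the next build to test.

```python
import bisect
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BuildInfo:
    repository: str
    changeset: str
    date: datetime
    application: str
    architecture: str


def builds_in_range(builds, start, end):
    """Return the builds whose date lies in [start, end], oldest first."""
    ordered = sorted(builds, key=lambda build: build.date)
    dates = [build.date for build in ordered]
    low = bisect.bisect_left(dates, start)
    high = bisect.bisect_right(dates, end)
    return ordered[low:high]


def middle_build(builds):
    """Pick the build a binary search would test next."""
    return builds[len(builds) // 2]
```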
(Assignee)

Comment 12

3 years ago
Created attachment 8553792 [details]
prefetch.py

By taking ideas from both of you, Will and Julien, I have built a script that extracts nightly info. I will build the script for inbound info as well, but first tell me what improvements should be made to this script; I want to know if I'm going in the right direction.
Hi Kapil,

firstly thanks for this first step!

There are some issues though. I think that the main problem is the approach to get the build information.

As you can see, build folders contain information for each os and bits.
(for example http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2014/11/2014-11-02-03-02-02-comm-central/)

Currently the algorithm in prefetch.py would go into each build dir once for linux 32, once for linux 64, once for macosx, etc.

I think a better approach would be to visit each build folder for a given date only once and get all the information for the different os/bits from it. This would reduce the number of requests needed. Do not forget that the script ultimately needs to get build info for each os/bits combination and write all this in a json file.
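
A rough sketch of that single-pass idea: list a build folder once, then match every file name against all os/bits patterns. The patterns and the function name below are assumptions made up for illustration; the real ones would come from fetch_configs.

```python
import re

# Hypothetical file name patterns, one per (os, bits) combination.
PLATFORM_PATTERNS = {
    ("linux", 32): re.compile(r"\.linux-i686\.tar\.bz2$"),
    ("linux", 64): re.compile(r"\.linux-x86_64\.tar\.bz2$"),
    ("mac", 64): re.compile(r"\.mac\.dmg$"),
    ("win", 32): re.compile(r"\.win32\.zip$"),
}


def classify_folder(filenames):
    """Map each (os, bits) pair to its build file, reading the listing once."""
    found = {}
    for name in filenames:
        for platform, pattern in PLATFORM_PATTERNS.items():
            if pattern.search(name):
                found[platform] = name
                break  # a file belongs to at most one platform
    return found
```

With this shape, the number of http requests depends on the number of build folders only, not on the number of platforms.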

As you can see, mozregression was not designed for the approach I describe - it was designed to only get information for one app, running under a certain os with a certain arch - for example a user looking for firefox, on linux 64 bits. Now the script needs to get everything in one go - all apps, all oses, and all bits available. (let's forget about other repos for now).

I think that NightlyUrlBuilder can be reused here, as it provides the build folder url. But the BuildFolderInfoFetcher and maybe fetch_configs need to be adapted to allow retrieval of all the data with only one request that lists the build folder - and maybe one other that reads one txt build file.

I think you can copy paste code from mozregression in your script if required - at the end, all the code currently used to retrieve info from build folders will go in these scripts, so it won't be a copy paste.

Also, as a general python coding rule, I think it is safer to not write code outside of functions/classes. You can put this code in a main() function, and use the 'if __name__ == "__main__"' idiom.
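
For reference, the idiom looks like this. It is a generic sketch; the message printed by `main()` is a placeholder and not what prefetch.py does.

```python
import sys


def main(argv=None):
    """Entry point: nothing in this module runs at import time."""
    args = sys.argv[1:] if argv is None else argv
    print("would prefetch build info for: %s" % ", ".join(args))
    return 0


if __name__ == "__main__":
    main()
```

Accepting `argv` as a parameter also makes the function easy to call from a unit test.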

I will NI William to be sure he agrees with what I wrote here.

Thanks again Kapil. :) Do not hesitate to contact me if you have some questions.
Flags: needinfo?(wlachance)
Yup, all this makes sense to me. :)
Flags: needinfo?(wlachance)
(Assignee)

Comment 15

3 years ago
Created attachment 8554382 [details]
prefetch.py

Hello guyz, take a look at the new file, I have made all the changes that Julien said. I have modified fetch_configs.py file and BuildFolderInfoFetcher class according to the needs. Take a look at the file and tell me the next steps.

Thanks,
Kapil
Attachment #8553792 - Attachment is obsolete: true
Hello Kapil,

Yup, I have not tested it yet but this seems more like what we want. :)

I think that we should create one json file per application though. On the mozregression side, we will require data for one application at a time - a user wants to bisect for only one app, and looking at another one must be done in another mozregression run - so this will save some bandwidth.

The main problem now is that we need the script to be able to get all build data for an app - ie without specifying a good date, or even a bad one. (these terms do not make sense here by the way). So we may have to find the first nightly build date for each app, and use it as a lower date - upper date may just be now by default. This is a task that can be done by looking at the ftp.mozilla.org server. For firefox for example I would do the following:

1. Go to http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly : this is the root dir for all firefox nightlies
2. in this folder, find the oldest year : http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2004
3. Now look for the oldest month:  http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2004/02
4. Finally, look for the oldest build dir. It seems that 2004-02-10 is the oldest build available for firefox nightlies - I would say that we can hardcode this value, but a better way would be to script the logic to find the build dirs for firefox automatically, find them all (knowing that they are all under a pattern http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/YEAR/MONTH) and write all this sorted in json.
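
The four steps above can be sketched as a small traversal. `list_dir` is a hypothetical callable that returns the entry names found at a url; a real one would fetch and parse the directory listing. It is injected here so the logic can be tried without any network access.

```python
def find_oldest_build_dir(list_dir, root):
    """Walk root/YEAR/MONTH and return the url of the oldest build dir.

    Returns None when no build dir is found at all.
    """
    # Keep only four-digit entries so that non-year folders are skipped.
    years = sorted(n for n in list_dir(root) if n.isdigit() and len(n) == 4)
    for year in years:
        year_url = "%s/%s" % (root, year)
        months = sorted(n for n in list_dir(year_url) if n.isdigit())
        for month in months:
            month_url = "%s/%s" % (year_url, month)
            # Build dir names start with the date, so they sort by age.
            build_dirs = sorted(list_dir(month_url))
            if build_dirs:
                return "%s/%s" % (month_url, build_dirs[0])
    return None
```

Dropping the early `return` and collecting every build dir instead would give the full sorted list mentioned in step 4.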

This is somewhat what the current BuildFolderInfoFetcher is doing. BUT it is doing this for a known date - instead we should traverse the nightly tree and automatically find the available build dirs up to now.

I hope this is clear. I know that it is a certain amount of work - and somewhat hard to test, because traversing all these urls must take quite a LONG time. :) Also, we have to keep in mind that this script may one day no longer use urls, but traverse local files to find build dirs (see comment 3), but I suppose that won't require big changes to this script.

Thanks Kapil, you are on the right track for this (really not simple) bug!

@Will
Do you agree with this, and particularly the one json per application ? I am thinking that we could name them nightly_{repo}_{app}.json (and nightly_{app}.json when it is the main repo, or maybe nightly_main_{app}.json).
Or something else if repo or app may include underscore in names (I wanted to use - instead but I forgot about fennec-2.3 app name). Maybe dots ?
This is open to discussion or other ideas for sure.
Flags: needinfo?(wlachance)
(Assignee)

Comment 17

3 years ago
Created attachment 8554528 [details]
prefetch.py

Hey Will and Julien, check out this code, now it produces separate json files for each app and I have also added a method that updates good_date to the oldest date and bad_date to the current date.
Attachment #8554382 - Attachment is obsolete: true
Attachment #8554528 - Flags: review?(wlachance)

Comment 18

3 years ago
If it helps, all of the metadata that you want lives in here:
http://builddata.pub.build.mozilla.org/builddata/buildjson

This week I will have a small release that finds data in there if you would like to have a look at it.
(Assignee)

Comment 19

3 years ago
(In reply to Armen Zambrano - Automation & Tools Engineer (:armenzg) from comment #18)
> If it helps, all of the metadata that you want lives in here:
> http://builddata.pub.build.mozilla.org/builddata/buildjson
> 
> This week I will have a small release that finds data in there if you would
> like to have a look at it.

Thanks Armen, but I can't understand all the entries in the json files.

Comment 20

3 years ago
Once I have something released using it, it might help understand it. It took me a bit to figure it out.
It has all information of all scheduled jobs in the continuous integration.
(In reply to Armen Zambrano - Automation & Tools Engineer (:armenzg) from comment #20)
> Once I have something released using it, it might help understand it. It
> took me a bit to figure it out.
> It has all information of all scheduled jobs in the continuous integration.

This is interesting, and I didn't know about it, but I'm not really sure if it helps us in this particular case. We can get all that we need out of the build folders, as far as I know.
(In reply to Julien Pagès from comment #16)

> @Will
> Do you agree with this, and particularly the one json per application ? I am
> thinking that we could name them nightly_{repo}_{app}.json (and
> nightly_{app}.json when it is the main repo, or maybe
> nightly_main_{app}.json).
> Or something else if repo or app may include underscore in names (I wanted
> to use - instead but I forgot about fennec-2.3 app name). Maybe dots ?
> This is open to discussion or other ideas for sure.

Yes, something like this makes sense to me. Partitioning the data into separate files should make things slightly easier to test as well as saving bandwidth.
Flags: needinfo?(wlachance)
Comment on attachment 8554528 [details]
prefetch.py

Will give a more detailed review later, but one thing that immediately jumps out to me is that the classes should be modified inline, rather than being copied into the script. If changing them involves updating the way they're called elsewhere in mozregression, that's ok. :) (hint: use "git grep <name>" to find all uses of a class or api method in the source)
Once the script works, there will be no use for these classes inside mozregression. Unless I missed something. :) I thought that we were starting to build a new tool here to feed mozregression - i.e. with its own github repo later. Is that wrong?
(In reply to Julien Pagès from comment #24)
> Once the script will works, there will be no use for these classes inside
> mozregression. Unless I missed something. :) I thought that we were starting
> to build here a new tool to feed mozregression - ie with a its own github
> repo later. Is that wrong ?

I think it'd be easiest to keep the code for the tool inside the mozregression repository, since it's pretty closely tied to it (and it's quite possible that people might want to work on both at the same time). 

Once this tool is ready (and mozregression modified to use the cached data) we can probably scrap the codepath inside mozregression to download this data (and just have it pull down the json). I thought it might be less confusing while we're working on things to only have one copy of the code though. If I'm missing something and this makes things really difficult let me know. :)
(In reply to William Lachance (:wlach) from comment #25)
> (In reply to Julien Pagès from comment #24)
 
> Once this tool is ready (and mozregression modified to use the cached data)
> we can probably scrap the codepath inside mozregression to download this
> data (and just have it pull down the json). I thought it might be less
> confusing while we're working on things to only have one copy of the code
> though. If I'm missing something and this makes things really difficult let
> me know. :)

Ok after chatting with Kapil I am now convinced that it's easier to just copy & paste the code when writing the script, as it will all be going away soon anyway. So disregard that last remark. I still think it makes sense to keep everything in one repository though.
Comment on attachment 8554528 [details]
prefetch.py

Hey Kapil, a few things before you submit your next patch:

1. In some ways (e.g. indentation) this doesn't follow the a-team python style (http://ateam-bootcamp.readthedocs.org/en/latest/reference/python-style.html). Could you run http://flake8.readthedocs.org/en/latest/ against this?
2. We should add this to the repository as a script you can run. You can follow the model used by the main mozregression script, which is to define an entry point linked to from setup.py:

https://github.com/mozilla/mozregression/blob/master/setup.py#L25
https://github.com/mozilla/mozregression/blob/master/mozregression/main.py#L207

Once you've done that, you should be able to commit and make a pull request and then link to this bug. This is outlined here: http://ateam-bootcamp.readthedocs.org/en/latest/guide/development_process.html#git-and-github

I'm going to cancel this review for now. Looking forward to your next patch. :)
Attachment #8554528 - Flags: review?(wlachance)
(Assignee)

Comment 28

3 years ago
Created attachment 8557176 [details] [review]
Pregenerate build info cache

Hey guyz, take a look at my pull request.

Thanks,
Kapil
Attachment #8554528 - Attachment is obsolete: true
Attachment #8557176 - Flags: review?
Attachment #8557176 - Flags: feedback?(wlachance)
Attachment #8557176 - Flags: feedback?(j.parkouss)
Comment on attachment 8557176 [details] [review]
Pregenerate build info cache

Hi Kapil, thanks for this! Unfortunately there are some things that I find not optimal yet - hence the f- (do not be mad about this by the way, that's normal :))

Also, this decreases our unit test coverage a lot! Well, I suppose this is normal, and we can move the unit tests of build_data and fetch_configs later, when we remove those modules - I will ask Will about this, but I'm thinking that we could 'hide' the coverage of this file as long as it is not really used.

My biggest concern here is the fetch_config logic, I think. The way it was designed (query for one app, one os, one bits option) does not mean that you have to use it this way. I am thinking that you could build a dict of all regexes for an app, and try each of them on the build dirs to find build info. Something like:

import re

from mozregression import fetch_configs

# for each application: os -> list of available bits
app_data = {
    'firefox': {
        'linux': [32, 64],
        'mac': [64],
        'win': [32, 64],
    },
    'thunderbird': {
        # ...
    },
    # ...
}

# then you can build regexes:
app_regexes = {}
for app, data in app_data.items():
    config = fetch_configs.create_config(app)
    regexes = []
    for os, all_bits in data.items():
        for bits in all_bits:
            # one compiled regex per os/bits combination
            regexes.append(re.compile(config.build_regex(os, bits)))
    app_regexes[app] = regexes

This also means that you don't have to copy paste the fetch_configs classes, just use them (I now think it is better, because we have already tested them and we know they work).

I hope you get the idea: with these regexes, we can traverse the build trees (nightly or inbound), then in each build folder try every regex on each link - each match will be build data.

I hope I made myself clear! This is quite long; :)
Attachment #8557176 - Flags: feedback?(j.parkouss) → feedback-
Comment on attachment 8557176 [details] [review]
Pregenerate build info cache

I think in addition to what Julien said, we should probably try to eliminate the use of "good_date" / "bad_date" nomenclature in the build data classes while we're writing this. I think I'd rather we just write the build data classes to straightforwardly crawl the archive within a date range and extract the build data.
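
A crawl over a plain date range, without any good/bad vocabulary, might be sketched as below. The root url and the build dir naming follow the nightly layout discussed earlier in this bug; the function names are hypothetical.

```python
from datetime import date

# Root of the YEAR/MONTH layout mentioned earlier in this bug.
NIGHTLY_ROOT = "http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly"


def month_urls(start, end, root=NIGHTLY_ROOT):
    """Yield one YEAR/MONTH listing url per month between start and end."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield "%s/%04d/%02d/" % (root, year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def in_range(build_dir_name, start, end):
    """Check a build dir name such as 2012-01-16-03-11-00-mozilla-central."""
    year, month, day = (int(part) for part in build_dir_name.split("-")[:3])
    return start <= date(year, month, day) <= end
```

A crawler would fetch each url from `month_urls()` and keep the listed build dirs for which `in_range()` is true.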

So excited to see this progressing! Looking forward to seeing more soon. :)
Attachment #8557176 - Flags: review?
Attachment #8557176 - Flags: feedback?(wlachance)
So, we discussed some things with William, that I am going to report here. It's about the required steps needed for this bug.

First, I want to say (again) that this is not an easy bug to work on. This will require quite a lot of work (source code, reviews, discussions). But that's an excellent bug to understand a large part of mozregression internals! Kapil, if you persist on this you will learn a lot about mozregression internals and be able to tackle most other bugs easily! It would be awesome for us to have a new person who has this kind of knowledge! :)

Well, let's see what we discussed now.

Overview of what we need to do
==============================

1. write a prefetch script. this is what we are doing now, and probably the hardest part.
2. rewrite BuildData classes to be able to use the json data generated with 1.
3. (This can - may/must ? - be done with each step) rewrite/add unit tests for the changed or added code.

We will focus on one step at a time - personally I don't mind if tests are written at the end, but I do mind that they are written at some point.

Json data files url
===================

json data files will be kept on a web server (we don't know which one currently). We need to add a command line flag to specify which one. I suppose we could use localhost for now as a default (for testing) and/or maybe we could handle the file:// url scheme to be able to take those files from a local folder (also mainly for testing purposes).
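
As a rough sketch of that flag: `urllib.request.urlopen` from the standard library handles both http(s):// and file:// urls, so one loader can serve both cases. The flag name `--data-url`, the localhost default and the function names are assumptions for illustration only.

```python
import argparse
import json
from urllib.request import urlopen


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # --data-url is a hypothetical flag name; the default targets local testing.
    parser.add_argument("--data-url", default="http://localhost:8000/")
    return parser.parse_args(argv)


def load_build_data(base_url, filename):
    """Fetch and decode one json data file from an http(s):// or file:// base."""
    url = base_url.rstrip("/") + "/" + filename
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

For testing, pointing the flag at a local folder (for example a `file:///...` url) avoids the need for any web server.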

Json data files naming convention
=================================

We also discussed a naming convention for the json data files (used in 1 and 2). We decided to use '{build_type}-{application}-{repo}.json':

- {build_type} will be nightly or inbound
- {application} is the app name
- {repo} is the repo name (or what we called inbound branch for inbound). I don't think we discussed repo yet. Well as an example:
 - in this nightly url http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/01/2012-01-16-03-11-00-mozilla-central/firefox-12.0a1.en-US.linux-x86_64.tar.bz2, the repo is "mozilla-central": the last part of the build dir name (after the date).
 - in this inbound url http://inbound-archive.pub.build.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1415213096/firefox-36.0a1.en-US.linux-x86_64.tar.bz2, the repo is "mozilla-inbound" (this is the first part of the build folder name here)

We will urlencode '-' in file name parts. For example, the repo fx-team will end up in a real file name of 'nightly-firefox-fx%2Dteam.json', as %2D is the url encoding for '-' (character code 45 in decimal, 2D in hex).
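
A minimal sketch of this convention follows. Note that '-' is character 45 in decimal, which is 2D in hex, so its percent-encoding is %2D. Also, `urllib.parse.quote()` leaves '-' untouched because it is an unreserved character, hence the explicit replacement. The function names are hypothetical.

```python
from urllib.parse import unquote


def encode_part(part):
    """Percent-encode '-' so it cannot be confused with the separator.

    '%' is escaped first so that decoding stays unambiguous.
    """
    return part.replace("%", "%25").replace("-", "%2D")


def data_file_name(build_type, application, repo):
    """Build '{build_type}-{application}-{repo}.json' with encoded parts."""
    parts = (build_type, application, repo)
    return "-".join(encode_part(part) for part in parts) + ".json"


def split_data_file_name(name):
    """Recover (build_type, application, repo) from a data file name."""
    stem = name[: -len(".json")]
    return tuple(unquote(part) for part in stem.split("-"))
```

Because every literal '-' left in the file name is a separator, splitting on '-' and decoding each part recovers the original values, including app names such as fennec-2.3.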

Well, that's enough for now. ;)
So, after some discussions, I think that we can forget about nightlies for this bug (Kapil agrees on this too).
For now, and maybe for good.

The real problem here is inbound, which is slow and error prone. So the real improvement will be for inbound only.

Why I think nightlies are not needed:
 - this works well already
 - there are a LOT of builds. This means a LOT of url crawling (as we don't do that locally for now), and this will take a lot of time. Plus updating the code is a big amount of work.

On the other hand, for inbound there are only 6 months of data available, and just a few branches are interesting for us (seems to be only mozilla-inbound, b2g-inbound and fx-team). But even if we want every branch, we still only need 6 months of data, so not *much*.

Kapil is focusing on inbound for now.

Will, what do you think ? Maybe we can forget about nightlies here ?
Flags: needinfo?(wlachance)
(In reply to Julien Pagès from comment #32)
> Kapil is focusing on inbound for now.
> 
> Will, what do you think ? Maybe we can forget about nightlies here ?

I'm for anything that could reduce the scope of this bug a bit! I agree that inbound is by far the biggest pain point.
Flags: needinfo?(wlachance)
Similar but not identical tickets include Bug 1124378 and Bug 1131909.
Hey, so after chatting with some of the releng folks; I think we might want to just cancel this in favor of just working on bug 1132151. It looks like what they've built with taskcluster is pretty much exactly what we want here, once they've added some of the indexing code that we need. 

Kapil, Julien: please have a look and let me know if you have any questions.

Sorry if it seems like we did some unnecessary work here. Perhaps one consolation is that we now all have a good understanding of how the branch, platform, dates, and revisions interact, so writing the code using taskcluster should hopefully be relatively straightforward. :) I'd of course be happy to continue mentoring things there.
So this seems to be clear now - we are going to use s3! Not sure how exactly yet, but we will. :)

Maybe we should close this bug in favor of bug 1132151 ?
Flags: needinfo?(wlachance)
Yeah, let's mark this as incomplete and move on to newer things.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Flags: needinfo?(wlachance)
Resolution: --- → INCOMPLETE