Closed Bug 863785 Opened 11 years ago Closed 10 years ago

Publish machine readable build steps

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: catlee)

Details

Each build in automation is composed of many steps. e.g. set up the environment, run program X, etc. There is a lot of interesting info that can be gleaned from having a clear picture of what's going on in each of these steps. For example, when a build becomes faster or slower over time, we'd like to know which step(s) are responsible for that change.

Currently, there is no easy way to obtain a short summary of the steps that went into a particular build. You have to download the raw logs for each build and then parse the steps out of the logs. Even then, the strings in the logs sometimes vary between builders.

It's my understanding that the raw times and status of each build step is stored in a database somewhere. Catlee was even kind enough to provide me a CSV dump containing this information. Thanks, Catlee!

I think this data should be exposed on an ongoing basis in a similar fashion to the build JSON (http://builddata.pub.build.mozilla.org/buildjson/).

Priority is likely low. But as soon as it's available, that opens the door for better, more introspective dashboards and other coolness.
Product: mozilla.org → Release Engineering
Taras: I'm not sure if this is still worth pushing for.

Having this data would make it much, much easier to measure and analyze the impact of various job steps over time.

For example, bug 950850 has identified the "automation overhead" as part of executing Windows build jobs on try. Currently, this involves downloading a log file and parsing the results. Doing this on a mass scale is time and bandwidth prohibitive (hundreds of gigabytes of logs need to be processed). And, even when you parse the logs, the "sections" in the buildbot logs look something like:

========= Started 'python c:/builds/moz2_slave/try-w32-0000000000000000000000/tools/buildfarm/utils/retry.py ...' (results: 0, elapsed: 9 mins, 41 secs) (at 2013-12-17 17:52:00.699462) =========

You can parse this into the step "python c:/builds/moz2_slave/try-w32-0000000000000000000000/tools/buildfarm/utils/retry.py ...", but that's practically useless. You really want to know this corresponds to the "version_control_sync" step.

The buildbot database internally has concise, stable, and mostly descriptive keys that correspond to each step. I'd like to see this data published in a machine readable format so we can use this data to better understand and optimize the automation. If nothing else, this data would facilitate more easily measuring the impact of various changes going into automation, allowing us to make intelligent, data-driven decisions.

The wins should be clear. I just don't know if this is still relevant given buildbot overhaul talks.
I'm not sure what changes we're going to make to buildbot in the near future but I actually think that it should be very practical to download the logs, parse the interesting bits of information out of them, and find the parts that take the longest for now, even if this data is not available in a better form.

(Note that there are other advantages to having this data but as far as optimizing what happens on our build infra goes, I think we can survive the way things are now.)
I started a side-project a few months back that mass downloaded all buildbot metadata (buildjson and the raw logs) and attempted to make sense of it all. Code at https://github.com/indygreg/mozilla-build-analyzer.

It took dozens of hours and hundreds of gigabytes to download and store the logs. It took many hours of CPU time to parse them. That log scanning needed to be redone every time I updated the parser or changed the db schema to store new data. Even when you parsed the log, the step names were not very convenient for mass analysis. You can see this for yourself.

Download this Python file: https://raw.github.com/indygreg/mozilla-build-analyzer/master/mozbuilddata/logparser/jobparser.py

Then execute it, passing the URL of a build log. e.g.

$ python jobparser.py http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/gszorc@mozilla.com-a3bd2f3eaa15/try-win32/try-win32-bm56-try1-build627.txt.gz
                                set props: basedir       0      0%
                               set props: hashType       0      0%
                               set props: builddir       0      0%
                               clobber build tools      14      0%
                                 clone build tools      14      0%
                               set props: toolsdir       0      0%
                            checking clobber times       1      0%
              set props: purge_actual purge_target       1      0%
                                      set_buildids       0      0%
                               set props: toolsdir       0      0%
                                set props: basedir       0      0%
                                 remove mozharness       5      0%
                               checkout mozharness       4      0%
                 updating mozharness to production       1      0%
                                delete old package       0      0%
                    downloading to buildprops.json       0      0%
                                        rm -rf ...    1129      16%
python c:/builds/moz2_slave/try-w32-00000000000000     381      5%
                           set props: got_revision       0      0%
                               set props: comments       0      0%
                                     got mozconfig       0      0%
                                    cat .mozconfig       0      0%
                                           compile    2357      34%
                                  remove old nonce       0      0%
                              downloading to token       0      0%
python c:/builds/moz2_slave/try-w32-00000000000000     245      4%
python c:/builds/moz2_slave/try-w32-00000000000000     101      1%
python c:/builds/moz2_slave/try-w32-00000000000000     122      2%
                                     find filepath       0      0%
                        set props: packageFilename       0      0%
                            set props: packageSize       0      0%
                            set props: packageHash       0      0%
                               set props: filepath       0      0%
python c:/builds/moz2_slave/try-w32-00000000000000      85      1%
                                     find filepath       0      0%
                      set props: installerFilename       0      0%
                          set props: installerSize       0      0%
                          set props: installerHash       0      0%
                               set props: filepath       0      0%
                                set props: buildid       0      0%
                             set props: appVersion       0      0%
                                set props: appName       0      0%
                                    set props: who       0      0%
set props: symbolsUrl packageUrl testsUrl jsshellU      32      0%
                                        sendchange       5      0%
                                        sendchange       0      0%
python c:/builds/moz2_slave/try-w32-00000000000000      32      0%
python c:/builds/moz2_slave/try-w32-00000000000000      45      1%
python c:/builds/moz2_slave/try-w32-00000000000000      30      0%
python c:/builds/moz2_slave/try-w32-00000000000000      91      1%
python c:/builds/moz2_slave/try-w32-00000000000000      98      1%
                               check test complete    2140      31%
                        maybe rebooting slave lost       1      0%

A lot of that output is useful. But the "python c:/builds/moz2_slave/try-w32-00000000000000" (many of them taking up non-trivial amounts of time) are completely useless. The buildbot database has intelligent names for these steps.

If we exposed this concise step data in a machine readable format, it would allow anybody to create useful dashboards without spending a prohibitive amount of time to acquire and parse the data.
OK.. I was mostly talking about doing this for new build jobs that are coming in since I think that data is more actionable than the historical data, but if you really want the historical data then I agree that parsing out all of the logs that we have is not the best way to go forward.
I think changing buildbot to emit the machine friendly step name instead of the human description going forward might make sense. That still requires you to download whole build logs to parse, which is inefficient.

If we set up a mechanism to dump the database contents, we can perform a one-time backfill of the historical data and we can start asking questions about unexpected trends.
Log parsing is going to be more helpful long-term. As we move more into mozharness, there are fewer discreet steps in buildbot to measure...perhaps mozharness needs to generate some structured logs for timing of different actions - but that's another bug.

The log generation happens here:
http://hg.mozilla.org/build/buildbotcustom/file/42bf30ebc1d0/bin/log_uploader.py#l153

I suspect we could change the format here to list the step name rather than description and it would be far more meaningful. cf. http://hg.mozilla.org/build/buildbot/file/ec4a62e2ab83/master/buildbot/status/builder.py#l860
I did try and export build steps as json a few months back, but I OOMed the machine and didn't revisit.

If we want to export the step data we need a different output format and/or a different exporter.
I patched mozharness in bug 859573 to make it trivial to capture per-step/action data in a generic way. The resource monitor hooks into this and reports on per-action time and resource usage. The existing buildbot limitation influenced my design in that bug :)

e.g. http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64/1387378199/mozilla-central_ubuntu64_vm_test-xpcshell-bm54-tests1-linux-build211.txt.gz

07:56:36     INFO - Total resource usage - Wall time: 999s; CPU: 100.0%; Read bytes: 10424320; Write bytes: 4539351040; Read time: 12228; Write time: 3690032
07:56:36     INFO - pull - Wall time: 13s; CPU: 98.0%; Read bytes: 167936; Write bytes: 3100672; Read time: 248; Write time: 2940
07:56:36     INFO - install - Wall time: 17s; CPU: 100.0%; Read bytes: 0; Write bytes: 26296320; Read time: 0; Write time: 160916
07:56:36     INFO - run-tests - Wall time: 969s; CPU: 100.0%; Read bytes: 9363456; Write bytes: 4501245952; Read time: 11028; Write time: 3474564

(BTW 100% CPU is not accurate - bug 893391.)

The only thing we need is machine readability. That's what bug 893388 is for. That bug became much easier now that blobber is deployed. I wish I had time to work on it!

I agree mozharness is the future. But improving build jobs seems to be a hot focus now and moving build jobs to mozharness hasn't finished yet. Not sure where that leaves us with priorities.
(In reply to Chris AtLee [:catlee] from comment #7)
> I did try and export build steps as json a few months back, but I OOMed the
> machine and didn't revisit.
> 
> If we want to export the step data we need a different output format and/or
> a different exporter.

CSV or some other newline-delimited format (even JSON objects/arrays on separate lines). I'd shy away from monolithic JSON because it isn't inherently stream oriented (or at least most parsers aren't defined that way).
I wrote a CSV exporter in reportor. These should be generated daily and be available here:
https://secure.pub.build.mozilla.org/builddata/reports/reportor/daily/build_steps/steps.csv.gz
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.