Closed Bug 1212500 Opened 9 years ago Closed 9 years ago

Ingest Text Logs for Test Infrastructure Timing

Categories

(Testing Graveyard :: ActiveData, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ekyle, Assigned: ekyle)

References

Details

Let us put the timing data from the text logs into ActiveData. Then we can slice and dice it, and generate statistics. Work is currently being done on `dev` branch https://github.com/klahnakoski/TestLog-ETL/blob/dev/testlog_etl/transforms/pulse_block_to_job_logs.py
I deployed the job timing ETL to ActiveData [1]. There is still some post-deployment monitoring I must do, and I am sure a few more ETL issues will reveal themselves as we process significantly more logs than I can from home. Next step is to grab the "elapsedTime=" lines that Armen pointed out yesterday, and attempt a view-timings-by-pool, to better understand what the complications might be. [1] http://activedata.allizom.org/tools/query.html#query_id=nUt9Vkp5
WRT to "elapsedTime=", we have a hidden cost per Buildot step which currently can *only* be seen by looking at the full log. This was brought up in [1] [1] https://groups.google.com/d/msg/mozilla.release.engineering/wZRfbOwdc54/hqcR5cNuBgAJ ### There are some interesting things in this pre-mh time: elapsedTime=0.301000 ========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at 2015-09-25 07:31:43.620746) ========= Notice that elapsed time is 42s while the command itself thinks it only too 0.3s. This is a symptom of master lag, which we track an approximation of here: https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana2/dashboard/db/buildbot-masters The "lag" is time between a step completing on the worker and the master responding to that. If this gets too high, it means we need to add more masters. ###
Kyle, could you please make the table that shows up on ActiveData something that I can expand? or open on a new window? It's hard to scroll up/down left/right.
Blocks: 1213004
Armen, The plan is a static javascripty page that will make navigation easier. The ActiveData query tool is only for those willing in to suffer. Remember, without a `limit` clause you will only get 10 rows. In the short term, use the COPY button, then copy-and-paste to a spreadsheet.
Gotcha. Why do I only get 6 rows in this query? I should see a lot more. http://activedata.allizom.org/tools/query.html#query_id=Di5asUdy
There's something off in here: http://activedata.allizom.org/tools/query.html#query_id=TVBVHJOx Most steps should not take less than 1 second. How can I round the steps to only have two decimals? Are the following questions answered by using another from? How can I see the actual Mozharness steps? (e.g. clobber) How can I see the total duration of a job? The Windows non-mozharness step are significantly bigger: http://activedata.allizom.org/tools/query.html#query_id=CY40m0pj How can I see the list of distinct build.name which have build.platform == android-api-11?
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #5) > Gotcha. > > Why do I only get 6 rows in this query? I should see a lot more. > http://activedata.allizom.org/tools/query.html#query_id=Di5asUdy I think that combination is not run much, being the ASH branch and all.
I'm also trying to look into how much time is spent for this step: "set props: basedir" Right now I can only see up to "set props" Thanks for pointing the Ash issue.
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #6) > There's something off in here: > http://activedata.allizom.org/tools/query.html#query_id=TVBVHJOx > Most steps should not take less than 1 second. Maybe that is how long they take? The `rm` and `set props` are usually below a second, not sure about the others. To find out, the `run.logurl` will give you the file it came from. (SEE BELOW**) > How can I round the steps to only have two decimals? Copy-and-paste to a spreadsheet. :) Maybe I should have the Javascript output nicer numbers? Who needs more than a few significant digits anyway? > Are the following questions answered by using another from? > How can I see the actual Mozharness steps? (e.g. clobber) The `action.timings.harness.step` will have the mozharness steps. > How can I see the total duration of a job? Of a single job? Then you must isolate a buildid (or builduid), wherever that is in the schema (which I have not documented yet) and `sum` the `builder.duration`. For the average of a job, hmmmm, that may be an oversight: I was planning to get the job numbers from the buildbot json logs and stick them in `start_time`, `end_time` and `duration` of the `jobs` table, but that is not done yet. > The Windows non-mozharness step are significantly bigger: > http://activedata.allizom.org/tools/query.html#query_id=CY40m0pj > > How can I see the list of distinct build.name which have build.platform == > android-api-11? ** Due to a SMALL BUG, I can not seem to pull `build.platform`, which is what you are looking for. I will fix that tomorrow morning.
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #8) > I'm also trying to look into how much time is spent for this step: "set > props: basedir" > Right now I can only see up to "set props" `set props` has a `action.timings.builder.parts` which is an array of all properties found in that line.
Depends on: 1213262
Please let me know when you fix the `build.platform` bug. Cool. I can now separate each "set props" step into their individuality: http://activedata.allizom.org/tools/query.html#query_id=q1bdWjwp The one above shows the averages. If I look at this log [1] there are 14 steps but the report above shows 13 steps. * The following two steps should show as the same one (Started set props: build_url blobber_files) ** See the averages are the same on the report ** Could you please fix it? * Unfortunately we're conflating 'bash -C' and 'rm -rf ...' into 1 step Wrt to the conflating of the 3rd argument and forward. We can only give you differentiation if we change the Buildbot behaviour to include the 3rd argument as part of the output. I think we can add a description to each step on ScriptFactory and that will give us more granularity. Filed as bug 1213262. (In reply to Kyle Lahnakoski [:ekyle] from comment #9) > (In reply to Armen Zambrano Gasparnian [:armenzg] from comment #6) > > Are the following questions answered by using another from? > > How can I see the actual Mozharness steps? (e.g. clobber) > The `action.timings.harness.step` will have the mozharness steps. > For some reason this has mozharness actions ("Android armv7 API 9 try opt test xpcshell-2": http://activedata.allizom.org/tools/query.html#query_id=c34ch3+3 but this does not (Windows XP 32-bit mozilla-central opt test mochitest-1): http://activedata.allizom.org/tools/query.html#query_id=v7+MSts6 > > How can I see the total duration of a job? > Of a single job? Then you must isolate a buildid (or builduid), wherever > that is in the schema (which I have not documented yet) and `sum` the > `builder.duration`. For the average of a job, hmmmm, that may be an > oversight: I was planning to get the job numbers from the buildbot json > logs and stick them in `start_time`, `end_time` and `duration` of the `jobs` > table, but that is not done yet. This seems to tell me that "Android armv7 API 9 try opt test xpcshell-2" on average takes 7 minutes http://activedata.allizom.org/tools/query.html#query_id=PkxyFAgj This tells me that the average of Windows XP 32-bit try opt test mochitest-1 takes 1 minute (that's sounds wrong) http://activedata.allizom.org/tools/query.html#query_id=msicR0VK Can I assume I need to wait for trying this again? [1] wget http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1444383827/mozilla-central_xp-ix_test-mochitest-1-bm109-tests1-windows-build246.txt.gz gzip -d mozilla-central_xp-ix_test-mochitest-1-bm109-tests1-windows-build246.txt.gz grep "== Started" mozilla-central_xp-ix_test-mochitest-1-bm109-tests1-windows-build246.txt | sort [2] http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l4705
Please let me know if you would like me to pursue bug 1213262.
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #11) > Please let me know when you fix the `build.platform` bug. Fixed > Cool. I can now separate each "set props" step into their individuality: > http://activedata.allizom.org/tools/query.html#query_id=q1bdWjwp > > If I look at this log [1] there are 14 steps but the report above shows 13 > steps. This is because some steps are merged, like "rm -rf ...". If you stick to just one run, and show the `jobs.action.timings.order`, you see the 14 steps. The count can also be off because you may be picking up more than one run; combining the steps of multiple runs. > http://activedata.allizom.org/tools/query.html#query_id=Wv6VEPlZ > * The following two steps should show as the same one (Started set props: > build_url blobber_files) > ** See the averages are the same on the report > ** Could you please fix it? This is only one step: `parts` is multi-valued, and can cause double counting if you `groupby` it. It is best you do not use `groupby` on this column. > For some reason this has mozharness actions ("Android armv7 API 9 try opt > test xpcshell-2": > http://activedata.allizom.org/tools/query.html#query_id=c34ch3+3 > > but this does not (Windows XP 32-bit mozilla-central opt test mochitest-1): > http://activedata.allizom.org/tools/query.html#query_id=v7+MSts6 Hmm, maybe this is a bug. Do you have an example of win32 with a harness step? > > > How can I see the total duration of a job? > > Of a single job? Then you must isolate a buildid (or builduid), wherever > > that is in the schema (which I have not documented yet) and `sum` the > > `builder.duration`. For the average of a job, hmmmm, that may be an > > oversight: I was planning to get the job numbers from the buildbot json > > logs and stick them in `start_time`, `end_time` and `duration` of the `jobs` > > table, but that is not done yet. > > This seems to tell me that "Android armv7 API 9 try opt test xpcshell-2" on > average takes 7 minutes > http://activedata.allizom.org/tools/query.html#query_id=PkxyFAgj No, that is the average length of a single buildbot step, which includes the very short 'rm -f ...'. If you want the time of the whole job select `{"from": "jobs"}` will have job-level properties and measures. I will update the ETL to include `action.duration` so the follow query will work > http://activedata.allizom.org/tools/query.html#query_id=dozL4WBX WARNING: What I put in `action.duration` will be shorter than the real time; only the buildbot json logs have the real time.
(In reply to Kyle Lahnakoski [:ekyle] from comment #13) > (In reply to Armen Zambrano Gasparnian [:armenzg] from comment #11) > > Please let me know when you fix the `build.platform` bug. > Fixed Well, maybe not. Still looking...
> > For some reason this has mozharness actions ("Android armv7 API 9 try opt > > test xpcshell-2": > > http://activedata.allizom.org/tools/query.html#query_id=c34ch3+3 > > > > but this does not (Windows XP 32-bit mozilla-central opt test mochitest-1): > > http://activedata.allizom.org/tools/query.html#query_id=v7+MSts6 > > Hmm, maybe this is a bug. Do you have an example of win32 with a harness > step? All of these jobs have Mozharness steps: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=Windows Trying a different query with a different builder gives 140 empty rows: http://activedata.allizom.org/tools/query.html#query_id=op+iWGGo 180 empty rows for the Windows *build* jobs: http://activedata.allizom.org/tools/query.html#query_id=+BVKoK6t It works for Mac: http://activedata.allizom.org/tools/query.html#query_id=OTsU+kIC > > > > How can I see the total duration of a job? > > > Of a single job? Then you must isolate a buildid (or builduid), wherever > > > that is in the schema (which I have not documented yet) and `sum` the > > > `builder.duration`. For the average of a job, hmmmm, that may be an > > > oversight: I was planning to get the job numbers from the buildbot json > > > logs and stick them in `start_time`, `end_time` and `duration` of the `jobs` > > > table, but that is not done yet. > > > > This seems to tell me that "Android armv7 API 9 try opt test xpcshell-2" on > > average takes 7 minutes > > http://activedata.allizom.org/tools/query.html#query_id=PkxyFAgj > > No, that is the average length of a single buildbot step, which includes the > very short 'rm -f ...'. > > If you want the time of the whole job select `{"from": "jobs"}` will have > job-level properties and measures. I will update the ETL to include > `action.duration` so the follow query will work > > > http://activedata.allizom.org/tools/query.html#query_id=dozL4WBX > > WARNING: What I put in `action.duration` will be shorter than the real time; > only the buildbot json logs have the real time. Why is that?
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #15) > All of these jobs have Mozharness steps: > https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter- > searchStr=Windows Thank you for the links. I will use them for debugging, and confirming the results after a retrigger. > > WARNING: What I put in `action.duration` will be shorter than the real time; > > only the buildbot json logs have the real time. > > Why is that? Something that reports time, like the buildbot steps, can not be expected to include the entire time. Immediately after the `======== Finished` line is written, the step is still running; probably not for long, but we can not be sure. We need a 3rd system, the buildbot json logs, that subsumes those steps to be certain we have an upper bound on the total time of the job. That said, the `action.duration` can be used to identify inefficiencies; but we will remain uncertain the system is efficient.
The job timing data was reprocessed over the weekend. The windows jobs are now showing their harness steps. Here is an example using the normalized Buildbot properties: http://activedata.allizom.org/tools/query.html#query_id=W_dMu5o+ Here is an example of average job duration: http://activedata.allizom.org/tools/query.html#query_id=vENRy9c1 Added `builder.elapsedTime` so the we can report the buildbot master lag. Here it is by step for win32: http://activedata.allizom.org/tools/query.html#query_id=AgF8ZYLB
Reviewing `builder.elapseTime`, the results are suspicious. I will look into them deeper, but do not expect any conclusions until tomorrow, at least.
There was a question about build times: They are seriously incomplete. Here is build times by machine, over time. Notice the zeros. http://activedata.allizom.org/tools/query.html#query_id=ucHUSg_h
No one machine has the resources to compare the buildbot json logs [1] to the `jobs` in ActiveData. Even when differences are found (which are considerable, when it comes to the plain builds), we must still parse the buildbot properties to fill in the holes. For this reason I wrote the BuildbotTranslator [2]. It was based on the PulseTranslator, but with the added advantages of * handling the many cases PulseTranslator ignores the pulse message * keeping all build properties, even if not known * sending alerts when buildnames can not be parsed so the code stays up to date This translator will be used by another ETL pipeline to fill an alternate `jobs` table in ActiveData: Which, in turn, will be used to perform a comparison with what's coming from Pulse. [1] http://builddata.pub.build.mozilla.org/builddata/buildjson/ [2] https://github.com/klahnakoski/TestLog-ETL/blob/0c8c6101c20d68880918cda8bf88af44c31a78d1/testlog_etl/imports/buildbot.py
Hi Kyle, There are Mozharness actions in here which are not (e.g. DesktopSingleLocale) - group by win32 platform: http://activedata.allizom.org/tools/query.html#query_id=vJYnZg7H https://dxr.mozilla.org/mozilla-central/search?q=DesktopSingleLocale&redirect=false&case=true
Armen, I see those in the buildbot json logs, but I was not aware those steps were getting through the PulseTranslator: Many of the pulse messages concerning "builds" are ignored by the pulse translator, some do get through, and that is what you are seeing here. I put "build" in quotes because buildbot is doing a lot more than just building and testing, or so it seems from the data [1] I am solving this problem by adding a few properties (not in ActiveData yet) that you can filter by * action.build - true if this is a "build" action * action.test - true if this is a test * action.talos - true if this includes a talos test [1] https://github.com/klahnakoski/TestLog-ETL/blob/03f77d0d35a1dec4c73f845f246ded5c05248a32/testlog_etl/imports/buildbot.py#L323
Thanks for that! let me know when it is available. Why is this query not sorting? http://activedata.allizom.org/tools/query.html#query_id=7+cUFtPn ekyle, I think I'm going to be writing some client that will be able to use mozci to fetch data from active data and display it somehow. For now I will start tinkering in ipython notebook and see where I get. I found that trying to build a single query to answer all that I want is rather hard but instead I should focus on building something with many small focused queries.
How can I post to active data and get a json reply? Are there available APIs I can hit?
Yes, ActiveData is best interacted using a few queries, rather than one: The query language is not powerful enough to do the more interesting (useful) analysis. Yes, if you open the debugger (f12) you will see the network requests/responses, which are all in JSON. You will also see that the endpoint is http://activedata.allizom.org/query . There are no APIs, but hopefully you do not need one; the `requests` module does a lot: `requests.get(json={"from":"jobs"}).json()`
Here is an example showing "typical" steps for win32, and their timings. An inspection of the page source will show what's involved: A two step process; collecting the common steps, then getting their duration. My next step is to parameterize this for all platforms and test suites, and also show a timing timeline from the past few weeks. Caveats: The code that looks at the "next line", and provides a better description of the buildbot steps has not been deployed yet: When this happens, the steps will have finer detail. The code that consumes the buildbot json logs has not been deployed yet either: When done, the overall timing will be more accurate. [1] http://activedata.allizom.org/tools/timing.html
I get: populous[0] is undefined File timing.html, line 60, in __main File thread.js?1445883578654, line 242, in Thread_prototype_resume File thread.js?1445883578654, line 220, in Thread_prototype_resume/retval File Rest.js?1445883578654, line 93, in Rest.send/request.onreadystatechange
This error will occur when there has been no data over the past two days. This makes sense because I was pushing "improvements" to the code this weekend. I will update this bug when I have he problem fixed.
Hi Kyle, Is there a way to determine how much buildbot lag is introduced for Windows tests? The current backlog of jobs are around 4,000 to 8,000 jobs pending for each pool [1] On a given day, we can run around 28,000 jobs in total for all three Windows pools [2]. Is there a way that we could know what is the percentage wasted every day due to the Buildbot lags? I assume we only include builders with "Win" and "test" or "talos" in their name. [1] http://builddata.pub.build.mozilla.org/reports/pending/pending.html [2] https://groups.google.com/forum/#!topic/mozilla.dev.tree-alerts/7lKN1nm3_2M
Adding catlee as he is very interested on this topic (see comment 29).
Depends on: 1220236
No longer blocks: 1213004
Armen, The windows machines appear to have parsing problems (probably due to \r): I will be looking into this parsing problem next. Until then, a parameterized version of the BuildbotTimings dashboard is up [1]. I put it into its own repo [2] to emphasize it is not ActiveData, just a consumer. There are some unused time-range variables, and they are meant to control the timing chart, which has not been added yet. [1] http://people.mozilla.org/~klahnakoski/MoBuildbotTimings/#sampleInterval=week&sampleMax=2015-10-31&sampleMin=2015-06-21&pool=b-2008-ix [2] https://github.com/mozilla/MoBuildbotTimings
...and It only looks at the past 2 days (as per the title), and still requires plenty of CSS love.
This is pretty incredible! Looking forward to use the date picker or seeing the Buildbot lags (I see the code got r+'ed)
Hi Kyle, is bug 1213262 a real blocker for this? I think I added the dependency early on and I don't know if it is in anyway blocking you.
The issue with buildbot timings [1] was lack of data: I am still manually triggering the buildbot JSON ingestion, which is always a bit late. I set the averages to weekly averages until I automate the whole pipeline. [1] http://people.mozilla.org/~klahnakoski/MoBuildbotTimings/#sampleInterval=week&sampleMax=2015-11-07&sampleMin=2015-06-28&product=firefox
The Buildbot test timings page [1] has been updated with charts over time. Right now it only looks for the "run-tests" step: There are probably other important steps (like in builds) that can be added, so this page can be used on more than just tests. [1] http://people.mozilla.org/~klahnakoski/MoBuildbotTimings/
This is done. Reopen if you disagree.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.