Closed
Bug 1212500
Opened 9 years ago
Closed 9 years ago
Ingest Text Logs for Test Infrastructure Timing
Categories
(Testing Graveyard :: ActiveData, defect)
Testing Graveyard
ActiveData
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ekyle, Assigned: ekyle)
References
Details
Let us put the timing data from the text logs into ActiveData. Then we can slice and dice it, and generate statistics.
Work is currently being done on `dev` branch
https://github.com/klahnakoski/TestLog-ETL/blob/dev/testlog_etl/transforms/pulse_block_to_job_logs.py
Assignee | ||
Comment 1•9 years ago
|
||
I deployed the job timing ETL to ActiveData [1]. There is still some post-deployment monitoring I must do, and I am sure a few more ETL issues will reveal themselves as we process significantly more logs than I can from home.
Next step is to grab the "elapsedTime=" lines that Armen pointed out yesterday, and attempt a view-timings-by-pool, to better understand what the complications might be.
[1] http://activedata.allizom.org/tools/query.html#query_id=nUt9Vkp5
Comment 2•9 years ago
|
||
WRT to "elapsedTime=", we have a hidden cost per Buildot step which currently can *only* be seen by looking at the full log.
This was brought up in [1]
[1] https://groups.google.com/d/msg/mozilla.release.engineering/wZRfbOwdc54/hqcR5cNuBgAJ
###
There are some interesting things in this pre-mh time:
elapsedTime=0.301000
========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at 2015-09-25 07:31:43.620746) =========
Notice that elapsed time is 42s while the command itself thinks it only too
0.3s. This is a symptom of master lag, which we track an approximation of
here:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana2/dashboard/db/buildbot-masters
The "lag" is time between a step completing on the worker and the master
responding to that. If this gets too high, it means we need to add more
masters.
###
Comment 3•9 years ago
|
||
Kyle, could you please make the table that shows up on ActiveData something that I can expand? or open on a new window?
It's hard to scroll up/down left/right.
Assignee | ||
Comment 4•9 years ago
|
||
Armen,
The plan is a static javascripty page that will make navigation easier. The ActiveData query tool is only for those willing in to suffer. Remember, without a `limit` clause you will only get 10 rows.
In the short term, use the COPY button, then copy-and-paste to a spreadsheet.
Comment 5•9 years ago
|
||
Gotcha.
Why do I only get 6 rows in this query? I should see a lot more.
http://activedata.allizom.org/tools/query.html#query_id=Di5asUdy
Comment 6•9 years ago
|
||
There's something off in here:
http://activedata.allizom.org/tools/query.html#query_id=TVBVHJOx
Most steps should not take less than 1 second.
How can I round the steps to only have two decimals?
Are the following questions answered by using another from?
How can I see the actual Mozharness steps? (e.g. clobber)
How can I see the total duration of a job?
The Windows non-mozharness step are significantly bigger:
http://activedata.allizom.org/tools/query.html#query_id=CY40m0pj
How can I see the list of distinct build.name which have build.platform == android-api-11?
Assignee | ||
Comment 7•9 years ago
|
||
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #5)
> Gotcha.
>
> Why do I only get 6 rows in this query? I should see a lot more.
> http://activedata.allizom.org/tools/query.html#query_id=Di5asUdy
I think that combination is not run much, being the ASH branch and all.
Comment 8•9 years ago
|
||
I'm also trying to look into how much time is spent for this step: "set props: basedir"
Right now I can only see up to "set props"
Thanks for pointing the Ash issue.
Assignee | ||
Comment 9•9 years ago
|
||
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #6)
> There's something off in here:
> http://activedata.allizom.org/tools/query.html#query_id=TVBVHJOx
> Most steps should not take less than 1 second.
Maybe that is how long they take? The `rm` and `set props` are usually below a second, not sure about the others. To find out, the `run.logurl` will give you the file it came from. (SEE BELOW**)
> How can I round the steps to only have two decimals?
Copy-and-paste to a spreadsheet. :) Maybe I should have the Javascript output nicer numbers? Who needs more than a few significant digits anyway?
> Are the following questions answered by using another from?
> How can I see the actual Mozharness steps? (e.g. clobber)
The `action.timings.harness.step` will have the mozharness steps.
> How can I see the total duration of a job?
Of a single job? Then you must isolate a buildid (or builduid), wherever that is in the schema (which I have not documented yet) and `sum` the `builder.duration`. For the average of a job, hmmmm, that may be an oversight: I was planning to get the job numbers from the buildbot json logs and stick them in `start_time`, `end_time` and `duration` of the `jobs` table, but that is not done yet.
> The Windows non-mozharness step are significantly bigger:
> http://activedata.allizom.org/tools/query.html#query_id=CY40m0pj
>
> How can I see the list of distinct build.name which have build.platform ==
> android-api-11?
** Due to a SMALL BUG, I can not seem to pull `build.platform`, which is what you are looking for. I will fix that tomorrow morning.
Assignee | ||
Comment 10•9 years ago
|
||
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #8)
> I'm also trying to look into how much time is spent for this step: "set
> props: basedir"
> Right now I can only see up to "set props"
`set props` has a `action.timings.builder.parts` which is an array of all properties found in that line.
Comment 11•9 years ago
|
||
Please let me know when you fix the `build.platform` bug.
Cool. I can now separate each "set props" step into their individuality:
http://activedata.allizom.org/tools/query.html#query_id=q1bdWjwp
The one above shows the averages.
If I look at this log [1] there are 14 steps but the report above shows 13 steps.
* The following two steps should show as the same one (Started set props: build_url blobber_files)
** See the averages are the same on the report
** Could you please fix it?
* Unfortunately we're conflating 'bash -C' and 'rm -rf ...' into 1 step
Wrt to the conflating of the 3rd argument and forward.
We can only give you differentiation if we change the Buildbot behaviour to include the 3rd argument as part of the output.
I think we can add a description to each step on ScriptFactory and that will give us more granularity.
Filed as bug 1213262.
(In reply to Kyle Lahnakoski [:ekyle] from comment #9)
> (In reply to Armen Zambrano Gasparnian [:armenzg] from comment #6)
> > Are the following questions answered by using another from?
> > How can I see the actual Mozharness steps? (e.g. clobber)
> The `action.timings.harness.step` will have the mozharness steps.
>
For some reason this has mozharness actions ("Android armv7 API 9 try opt test xpcshell-2":
http://activedata.allizom.org/tools/query.html#query_id=c34ch3+3
but this does not (Windows XP 32-bit mozilla-central opt test mochitest-1):
http://activedata.allizom.org/tools/query.html#query_id=v7+MSts6
> > How can I see the total duration of a job?
> Of a single job? Then you must isolate a buildid (or builduid), wherever
> that is in the schema (which I have not documented yet) and `sum` the
> `builder.duration`. For the average of a job, hmmmm, that may be an
> oversight: I was planning to get the job numbers from the buildbot json
> logs and stick them in `start_time`, `end_time` and `duration` of the `jobs`
> table, but that is not done yet.
This seems to tell me that "Android armv7 API 9 try opt test xpcshell-2" on average takes 7 minutes
http://activedata.allizom.org/tools/query.html#query_id=PkxyFAgj
This tells me that the average of Windows XP 32-bit try opt test mochitest-1 takes 1 minute (that's sounds wrong)
http://activedata.allizom.org/tools/query.html#query_id=msicR0VK
Can I assume I need to wait for trying this again?
[1]
wget http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1444383827/mozilla-central_xp-ix_test-mochitest-1-bm109-tests1-windows-build246.txt.gz
gzip -d mozilla-central_xp-ix_test-mochitest-1-bm109-tests1-windows-build246.txt.gz
grep "== Started" mozilla-central_xp-ix_test-mochitest-1-bm109-tests1-windows-build246.txt | sort
[2] http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l4705
Comment 12•9 years ago
|
||
Please let me know if you would like me to pursue bug 1213262.
Assignee | ||
Comment 13•9 years ago
|
||
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #11)
> Please let me know when you fix the `build.platform` bug.
Fixed
> Cool. I can now separate each "set props" step into their individuality:
> http://activedata.allizom.org/tools/query.html#query_id=q1bdWjwp
>
> If I look at this log [1] there are 14 steps but the report above shows 13
> steps.
This is because some steps are merged, like "rm -rf ...". If you stick to just one run, and show the `jobs.action.timings.order`, you see the 14 steps. The count can also be off because you may be picking up more than one run; combining the steps of multiple runs.
> http://activedata.allizom.org/tools/query.html#query_id=Wv6VEPlZ
> * The following two steps should show as the same one (Started set props:
> build_url blobber_files)
> ** See the averages are the same on the report
> ** Could you please fix it?
This is only one step: `parts` is multi-valued, and can cause double counting if you `groupby` it. It is best you do not use `groupby` on this column.
> For some reason this has mozharness actions ("Android armv7 API 9 try opt
> test xpcshell-2":
> http://activedata.allizom.org/tools/query.html#query_id=c34ch3+3
>
> but this does not (Windows XP 32-bit mozilla-central opt test mochitest-1):
> http://activedata.allizom.org/tools/query.html#query_id=v7+MSts6
Hmm, maybe this is a bug. Do you have an example of win32 with a harness step?
> > > How can I see the total duration of a job?
> > Of a single job? Then you must isolate a buildid (or builduid), wherever
> > that is in the schema (which I have not documented yet) and `sum` the
> > `builder.duration`. For the average of a job, hmmmm, that may be an
> > oversight: I was planning to get the job numbers from the buildbot json
> > logs and stick them in `start_time`, `end_time` and `duration` of the `jobs`
> > table, but that is not done yet.
>
> This seems to tell me that "Android armv7 API 9 try opt test xpcshell-2" on
> average takes 7 minutes
> http://activedata.allizom.org/tools/query.html#query_id=PkxyFAgj
No, that is the average length of a single buildbot step, which includes the very short 'rm -f ...'.
If you want the time of the whole job select `{"from": "jobs"}` will have job-level properties and measures. I will update the ETL to include `action.duration` so the follow query will work
> http://activedata.allizom.org/tools/query.html#query_id=dozL4WBX
WARNING: What I put in `action.duration` will be shorter than the real time; only the buildbot json logs have the real time.
Assignee | ||
Comment 14•9 years ago
|
||
(In reply to Kyle Lahnakoski [:ekyle] from comment #13)
> (In reply to Armen Zambrano Gasparnian [:armenzg] from comment #11)
> > Please let me know when you fix the `build.platform` bug.
> Fixed
Well, maybe not. Still looking...
Comment 15•9 years ago
|
||
> > For some reason this has mozharness actions ("Android armv7 API 9 try opt
> > test xpcshell-2":
> > http://activedata.allizom.org/tools/query.html#query_id=c34ch3+3
> >
> > but this does not (Windows XP 32-bit mozilla-central opt test mochitest-1):
> > http://activedata.allizom.org/tools/query.html#query_id=v7+MSts6
>
> Hmm, maybe this is a bug. Do you have an example of win32 with a harness
> step?
All of these jobs have Mozharness steps:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=Windows
Trying a different query with a different builder gives 140 empty rows:
http://activedata.allizom.org/tools/query.html#query_id=op+iWGGo
180 empty rows for the Windows *build* jobs:
http://activedata.allizom.org/tools/query.html#query_id=+BVKoK6t
It works for Mac:
http://activedata.allizom.org/tools/query.html#query_id=OTsU+kIC
> > > > How can I see the total duration of a job?
> > > Of a single job? Then you must isolate a buildid (or builduid), wherever
> > > that is in the schema (which I have not documented yet) and `sum` the
> > > `builder.duration`. For the average of a job, hmmmm, that may be an
> > > oversight: I was planning to get the job numbers from the buildbot json
> > > logs and stick them in `start_time`, `end_time` and `duration` of the `jobs`
> > > table, but that is not done yet.
> >
> > This seems to tell me that "Android armv7 API 9 try opt test xpcshell-2" on
> > average takes 7 minutes
> > http://activedata.allizom.org/tools/query.html#query_id=PkxyFAgj
>
> No, that is the average length of a single buildbot step, which includes the
> very short 'rm -f ...'.
>
> If you want the time of the whole job select `{"from": "jobs"}` will have
> job-level properties and measures. I will update the ETL to include
> `action.duration` so the follow query will work
>
> > http://activedata.allizom.org/tools/query.html#query_id=dozL4WBX
>
> WARNING: What I put in `action.duration` will be shorter than the real time;
> only the buildbot json logs have the real time.
Why is that?
Assignee | ||
Comment 16•9 years ago
|
||
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #15)
> All of these jobs have Mozharness steps:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-
> searchStr=Windows
Thank you for the links. I will use them for debugging, and confirming the results after a retrigger.
> > WARNING: What I put in `action.duration` will be shorter than the real time;
> > only the buildbot json logs have the real time.
>
> Why is that?
Something that reports time, like the buildbot steps, can not be expected to include the entire time. Immediately after the `======== Finished` line is written, the step is still running; probably not for long, but we can not be sure. We need a 3rd system, the buildbot json logs, that subsumes those steps to be certain we have an upper bound on the total time of the job. That said, the `action.duration` can be used to identify inefficiencies; but we will remain uncertain the system is efficient.
Assignee | ||
Comment 17•9 years ago
|
||
The job timing data was reprocessed over the weekend.
The windows jobs are now showing their harness steps. Here is an example using the normalized Buildbot properties:
http://activedata.allizom.org/tools/query.html#query_id=W_dMu5o+
Here is an example of average job duration:
http://activedata.allizom.org/tools/query.html#query_id=vENRy9c1
Added `builder.elapsedTime` so the we can report the buildbot master lag. Here it is by step for win32:
http://activedata.allizom.org/tools/query.html#query_id=AgF8ZYLB
Assignee | ||
Comment 18•9 years ago
|
||
Reviewing `builder.elapseTime`, the results are suspicious. I will look into them deeper, but do not expect any conclusions until tomorrow, at least.
Assignee | ||
Comment 19•9 years ago
|
||
There was a question about build times: They are seriously incomplete. Here is build times by machine, over time. Notice the zeros.
http://activedata.allizom.org/tools/query.html#query_id=ucHUSg_h
Assignee | ||
Comment 20•9 years ago
|
||
No one machine has the resources to compare the buildbot json logs [1] to the `jobs` in ActiveData. Even when differences are found (which are considerable, when it comes to the plain builds), we must still parse the buildbot properties to fill in the holes.
For this reason I wrote the BuildbotTranslator [2]. It was based on the PulseTranslator, but with the added advantages of
* handling the many cases PulseTranslator ignores the pulse message
* keeping all build properties, even if not known
* sending alerts when buildnames can not be parsed so the code stays up to date
This translator will be used by another ETL pipeline to fill an alternate `jobs` table in ActiveData: Which, in turn, will be used to perform a comparison with what's coming from Pulse.
[1] http://builddata.pub.build.mozilla.org/builddata/buildjson/
[2] https://github.com/klahnakoski/TestLog-ETL/blob/0c8c6101c20d68880918cda8bf88af44c31a78d1/testlog_etl/imports/buildbot.py
Comment 21•9 years ago
|
||
Hi Kyle,
There are Mozharness actions in here which are not (e.g. DesktopSingleLocale) - group by win32 platform:
http://activedata.allizom.org/tools/query.html#query_id=vJYnZg7H
https://dxr.mozilla.org/mozilla-central/search?q=DesktopSingleLocale&redirect=false&case=true
Assignee | ||
Comment 22•9 years ago
|
||
Armen,
I see those in the buildbot json logs, but I was not aware those steps were getting through the PulseTranslator: Many of the pulse messages concerning "builds" are ignored by the pulse translator, some do get through, and that is what you are seeing here. I put "build" in quotes because buildbot is doing a lot more than just building and testing, or so it seems from the data [1]
I am solving this problem by adding a few properties (not in ActiveData yet) that you can filter by
* action.build - true if this is a "build" action
* action.test - true if this is a test
* action.talos - true if this includes a talos test
[1] https://github.com/klahnakoski/TestLog-ETL/blob/03f77d0d35a1dec4c73f845f246ded5c05248a32/testlog_etl/imports/buildbot.py#L323
Comment 23•9 years ago
|
||
Thanks for that! let me know when it is available.
Why is this query not sorting?
http://activedata.allizom.org/tools/query.html#query_id=7+cUFtPn
ekyle, I think I'm going to be writing some client that will be able to use mozci to fetch data from active data and display it somehow.
For now I will start tinkering in ipython notebook and see where I get.
I found that trying to build a single query to answer all that I want is rather hard but instead I should focus on building something with many small focused queries.
Comment 24•9 years ago
|
||
How can I post to active data and get a json reply?
Are there available APIs I can hit?
Assignee | ||
Comment 25•9 years ago
|
||
Yes, ActiveData is best interacted using a few queries, rather than one: The query language is not powerful enough to do the more interesting (useful) analysis.
Yes, if you open the debugger (f12) you will see the network requests/responses, which are all in JSON. You will also see that the endpoint is http://activedata.allizom.org/query . There are no APIs, but hopefully you do not need one; the `requests` module does a lot: `requests.get(json={"from":"jobs"}).json()`
Assignee | ||
Comment 26•9 years ago
|
||
Here is an example showing "typical" steps for win32, and their timings. An inspection of the page source will show what's involved: A two step process; collecting the common steps, then getting their duration. My next step is to parameterize this for all platforms and test suites, and also show a timing timeline from the past few weeks.
Caveats: The code that looks at the "next line", and provides a better description of the buildbot steps has not been deployed yet: When this happens, the steps will have finer detail. The code that consumes the buildbot json logs has not been deployed yet either: When done, the overall timing will be more accurate.
[1] http://activedata.allizom.org/tools/timing.html
Comment 27•9 years ago
|
||
I get:
populous[0] is undefined
File timing.html, line 60, in __main
File thread.js?1445883578654, line 242, in Thread_prototype_resume
File thread.js?1445883578654, line 220, in Thread_prototype_resume/retval
File Rest.js?1445883578654, line 93, in Rest.send/request.onreadystatechange
Assignee | ||
Comment 28•9 years ago
|
||
This error will occur when there has been no data over the past two days. This makes sense because I was pushing "improvements" to the code this weekend. I will update this bug when I have he problem fixed.
Comment 29•9 years ago
|
||
Hi Kyle,
Is there a way to determine how much buildbot lag is introduced for Windows tests?
The current backlog of jobs are around 4,000 to 8,000 jobs pending for each pool [1]
On a given day, we can run around 28,000 jobs in total for all three Windows pools [2].
Is there a way that we could know what is the percentage wasted every day due to the Buildbot lags?
I assume we only include builders with "Win" and "test" or "talos" in their name.
[1] http://builddata.pub.build.mozilla.org/reports/pending/pending.html
[2] https://groups.google.com/forum/#!topic/mozilla.dev.tree-alerts/7lKN1nm3_2M
Comment 30•9 years ago
|
||
Adding catlee as he is very interested on this topic (see comment 29).
Assignee | ||
Comment 31•9 years ago
|
||
Armen,
The windows machines appear to have parsing problems (probably due to \r): I will be looking into this parsing problem next.
Until then, a parameterized version of the BuildbotTimings dashboard is up [1]. I put it into its own repo [2] to emphasize it is not ActiveData, just a consumer. There are some unused time-range variables, and they are meant to control the timing chart, which has not been added yet.
[1] http://people.mozilla.org/~klahnakoski/MoBuildbotTimings/#sampleInterval=week&sampleMax=2015-10-31&sampleMin=2015-06-21&pool=b-2008-ix
[2] https://github.com/mozilla/MoBuildbotTimings
Assignee | ||
Comment 32•9 years ago
|
||
...and
It only looks at the past 2 days (as per the title), and still requires plenty of CSS love.
Comment 33•9 years ago
|
||
This is pretty incredible!
Looking forward to use the date picker or seeing the Buildbot lags (I see the code got r+'ed)
Comment 34•9 years ago
|
||
Hi Kyle, is bug 1213262 a real blocker for this? I think I added the dependency early on and I don't know if it is in anyway blocking you.
Assignee | ||
Comment 35•9 years ago
|
||
The issue with buildbot timings [1] was lack of data: I am still manually triggering the buildbot JSON ingestion, which is always a bit late. I set the averages to weekly averages until I automate the whole pipeline.
[1] http://people.mozilla.org/~klahnakoski/MoBuildbotTimings/#sampleInterval=week&sampleMax=2015-11-07&sampleMin=2015-06-28&product=firefox
Assignee | ||
Comment 36•9 years ago
|
||
The Buildbot test timings page [1] has been updated with charts over time. Right now it only looks for the "run-tests" step: There are probably other important steps (like in builds) that can be added, so this page can be used on more than just tests.
[1] http://people.mozilla.org/~klahnakoski/MoBuildbotTimings/
Assignee | ||
Comment 37•9 years ago
|
||
This is done. Reopen if you disagree.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•3 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•