Closed Bug 387174 (talos-generalize) Opened 17 years ago Closed 16 years ago

Create a generalized framework for running additional talos tests

Categories: Release Engineering :: General
Type: defect
Priority: Not set
Severity: normal

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: zach; Assigned: zach

In a recent conversation with schrep, the need for a more generalized framework for running new types of tests in Talos came up. If we're going to add, among others, tdhtml, a javascript benchmark suite, and tab switching performance tests, it doesn't make much sense to essentially duplicate test execution and result parsing code for each of these new suites.

As such, I propose a generalized 'generic_test.py' that these new tests can hook into. This component will take care of launching the browser against a specified test file and parsing the test results from a standardized result format like that currently used for tp. This component will also abstract away platform-specific differences, so that implementing a new test can generally be done in one file, rather than having _mac and _linux variants. While some more complex tests will still need to control things manually (e.g. tp), we should be able to generalize out the most common case: launch the browser against this directory of test files, collect the results, and send them to the graph server.
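To make that common case concrete, here is a minimal sketch of what such a generic runner could look like; the class name, arguments, and defaults are illustrative assumptions, not the actual generic_test.py:

# Illustrative sketch only; names and defaults are assumptions, not talos code.
import subprocess

class GenericTest:
    """Launch the browser against one test URL and collect its report."""

    def __init__(self, browser_path, test_url, timeout=300):
        self.browser_path = browser_path  # path to the browser binary
        self.test_url = test_url          # test page or directory index
        self.timeout = timeout            # seconds before the run is abandoned

    def run(self):
        # Launch the browser, capture stdout, and return it so a shared parser
        # can digest the standardized result format (see the next comment).
        proc = subprocess.run([self.browser_path, self.test_url],
                              capture_output=True, text=True,
                              timeout=self.timeout)
        return proc.stdout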
Here's a proposal for a generalized test result format that talos can use and report to the graph server. I've based this on the existing format used by the pageload tester, and simpler tests could choose not to report any stats that are inapplicable. Comments are appreciated.

General test result format:

__start_report:testSuiteName
__summarystats:number
test:testname:median,mean,min,max,stddev
test:testname:median,mean,min,max,stddev
...
__end_report

Failures would be denoted:

__FAILfailureInformationHere__FAIL

Any unavailable stats get 'NaN' in place of a number. Summary stats is the top-level stat that gets reported to Tinderbox (usually an average or a sum of the various subtests). This is basically a superset of what the tp tests are reporting now; currently tp just reports averages for each test. Presumably, we'd want to have a javascript library that tests can use to do the calculations and generate the report.
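On the talos side, a minimal sketch of a consumer for this format, assuming the field order median,mean,min,max,stddev and the NaN convention described above; the function name and error handling are illustrative:

def parse_report(text):
    """Parse the proposed __start_report/__end_report plain-text format."""
    suite, summary, tests, failures = None, None, {}, []
    for line in text.splitlines():
        if line.startswith("__start_report:"):
            suite = line.split(":", 1)[1]
        elif line.startswith("__summarystats:"):
            summary = float(line.split(":", 1)[1])
        elif line.startswith("__FAIL"):
            # Failure details are wrapped as __FAIL...__FAIL
            failures.append(line[len("__FAIL"):-len("__FAIL")])
        elif line.startswith("test:"):
            _, name, stats = line.split(":", 2)
            # Unavailable stats are reported as NaN; float() accepts "NaN".
            median, mean, mn, mx, stddev = (float(s) for s in stats.split(","))
            tests[name] = {"median": median, "mean": mean, "min": mn,
                           "max": mx, "stddev": stddev}
    return suite, summary, tests, failures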
The simple json format that i've been using is:

[
  { name: "foo.bar.com", value: 123.4, stddev: 2.5 },
  { name: "...", value: ..., stddev: ... },
  ...
]

with json it's easy to add new members to each element without breaking existing parsers, so it's possible to add min/max, as well as even adding a more complex data member, e.g.:

{ name: "foo.bar.com", value: 123.4, stddev: 2.5,
  data: { foo: 123, bar: 123, ... } }

Then we can make generic tools for tests, and any tools that want to examine detailed data for each test can do so by just knowing more about the json output for that test.
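As a small hedged illustration of the "new members don't break existing parsers" point, a consumer that only touches the fields it knows about keeps working when entries grow extra members (names follow the example above; keys are quoted here so the snippet is strict JSON):

import json

# Entries may carry extra members (stddev, nested data, ...); a consumer that
# only reads the fields it knows about is unaffected by them.
report = json.loads('[{"name": "foo.bar.com", "value": 123.4, "stddev": 2.5,'
                    ' "data": {"foo": 123, "bar": 123}}]')
for entry in report:
    print(entry["name"], entry["value"], entry.get("stddev"))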
Any reason that you've picked out stddev to be part of the standard format? Would there be any value in also having max/min/median/etc.?
(In reply to comment #2)
> The simple json format that i've been using is:
>
> [
>   { name: "foo.bar.com", value: 123.4, stddev: 2.5 },
>   { name: "...", value: ..., stddev: ... },
>   ...
> ]

I should note (don't know if it matters here or not, it might sometime) that according to the JSON RFC, the names in an object must actually be strings, not sequences of character literals or numbers:

[ { "name": "foo.bar.com", "value": 123.4, "stddev": 2.5 } ]

Again, I don't know whether you care in this instance, but even if you don't, you might in others. I don't know whether or not JSON parsers usually implement this extension of allowing non-string names.
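For what it's worth, a quick sketch of how one strict parser behaves (Python's json module here; the exact error text will vary by parser):

import json

# Quoted names are valid JSON and parse fine...
json.loads('[{"name": "foo.bar.com", "value": 123.4, "stddev": 2.5}]')

# ...while unquoted names are rejected by strict parsers.
try:
    json.loads('[{name: "foo.bar.com", value: 123.4}]')
except ValueError as err:
    print("rejected:", err)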
(In reply to comment #3)
> Any reason that you've picked out stddev to be part of the standard format?
> Would there be any value in also having max/min/median/etc.?

There might be, but I'd make all of them optional. (Also, they really only have meaning for each specific test, but have no value for the composite score -- e.g. a "max" for Tp doesn't make sense.)

However, maybe it would be better to make stddev, max, min, median, etc. all optional, and have the tests just output (with quoted names!):

[ { "name": "foo", "value": 123.4, "values": [ 120.0, 125.0, 122.2, 129.2 ] } ]

with the only required fields being "name" and "value". "values" should provide the raw values for a multi-run sample for that test, and then each app can just calculate min/max/median/stddev/whatever else they want. Doing those calculations is easy enough that I don't see any value in putting them in this output.
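A small sketch of the "each app just calculates what it wants" idea, using Python's statistics module on the raw values array (entry contents copied from the example above):

import statistics

entry = {"name": "foo", "value": 123.4,
         "values": [120.0, 125.0, 122.2, 129.2]}

vals = entry["values"]
derived = {
    "min": min(vals),
    "max": max(vals),
    "median": statistics.median(vals),
    "stddev": statistics.stdev(vals),   # sample standard deviation
}
print(derived)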
So alice and I talked about this some, and we came up with a suggested format; the required values look like:

{ type: 'testtype',
  data: [ { name: 'somename', values: [ 100, 200, 300 ] }, ... ]
}

For example, the Tp test could generate something like:

{ type: 'tp',
  data: [
    { name: "foo", values: [ 123, 123, 123 ], value: 123, avg: 123, max: 200, stddev: 2.5 },
    { name: "foo", values: [ 123, 123, 123 ], value: 123, avg: 123, max: 200, stddev: 2.5 },
    ...
  ]
}

where the "value", "avg", "max", etc. members are extra data that won't be used by the graph server (as it should recompute that data itself), but are there for anyone who wants to use the JSON data directly.

Some graph types, such as the mem/cpu usage graph, would have an added required 'interval' member to specify the amount of time between each sample:

{ type: 'resourceUsage',
  interval: 100,
  data: [
    { name: "memory", values: [ 123, 123, 123 ] },
    { name: "cpu", values: [ 123, 123, 123 ] },
  ]
}

and a simple test like Ts would look like:

{ type: 'ts',
  interval: 100,
  data: [ { name: "startup", values: [ 100, 200, 300 ] } ]
}

I've been thinking about a better schema for the graph stuff (since it wasn't really ever designed to do anything with per-run data other than a single point), and came up with something like this:

build_info:      build_id, machine, branch, description
test_info:       test_id, type, name, description
run_info:        run_id, build_id, test_id, time, run_value
run_values:      run_id, name, index, value
run_annotations: run_id, anno_type, annotation

Each run corresponds to a specific test on a particular build; each test can produce one or more values. The raw data from the test could be stored as an annotation with anno_type 'json'; user-entered annotations can be stored with a different anno_type. Each test could be frozen; if a test changes, a new test_id is allocated for it (e.g. if the pageset is changed).

The run_values table would store all the data (the resulting named indexed arrays) from each test run, and the annotations could be used to store precomputed data (such as the final "Tp" number) so that it can be queried quickly. For example, querying the Tp graph for a particular machine/branch would mean getting all annotations of anno_type 'tp_avg' from every run for every build_id where the machine and branch match the requested data, and where the run test_id is the 'tp' test id.

It's a bit more work on the DB side, but it can store all the data for each test without having to ship blobs back and forth. Asking for the data for a particular run would give you back a json blob that includes all the data, plus all the annotations.
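To make the proposed schema and the Tp query concrete, here is a sketch using SQLite from Python; the column types, the rename of "index" to "idx" (to dodge the SQL keyword), and the placeholder machine/branch values are assumptions layered on top of what the comment specifies:

# Sketch only: table and column names follow the comment above; everything
# else (types, sqlite3, placeholder values) is illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE build_info (build_id INTEGER PRIMARY KEY, machine TEXT,
                         branch TEXT, description TEXT);
CREATE TABLE test_info  (test_id INTEGER PRIMARY KEY, type TEXT,
                         name TEXT, description TEXT);
CREATE TABLE run_info   (run_id INTEGER PRIMARY KEY, build_id INTEGER,
                         test_id INTEGER, time INTEGER, run_value REAL);
CREATE TABLE run_values (run_id INTEGER, name TEXT, idx INTEGER, value REAL);
CREATE TABLE run_annotations (run_id INTEGER, anno_type TEXT, annotation TEXT);
""")

# "Querying the Tp graph for a particular machine/branch": fetch the 'tp_avg'
# annotation from every run of the 'tp' test on builds matching machine/branch.
rows = db.execute("""
    SELECT b.build_id, a.annotation
      FROM run_annotations a
      JOIN run_info   r ON r.run_id   = a.run_id
      JOIN build_info b ON b.build_id = r.build_id
      JOIN test_info  t ON t.test_id  = r.test_id
     WHERE a.anno_type = 'tp_avg' AND t.type = 'tp'
       AND b.machine = ? AND b.branch = ?
""", ("some-machine", "some-branch")).fetchall()  # placeholder values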
Is there an accepted way of incorporating error codes into this scheme? Currently, if there is a timeout on any given page in the page set, tp fails and informs the talos framework of what page caused the timeout. I wouldn't want to lose that functionality.
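Purely as an illustration (not an agreed format), the timeout case could ride along in the JSON report by marking the run as failed and naming the offending page:

# Hypothetical failure record; the "failed", "reason" and "page" members are
# not part of any format proposed above.
failure_report = {
    "type": "tp",
    "failed": True,
    "reason": "timeout",
    "page": "example.page.html",   # the page that hit the timeout
    "data": [],
}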
We've incorporated new tests into talos in a modularized way. Adding new tests is now mostly trivial as long as they use the basic tinderbox format as a means of passing test results. Other issues can be filed as new bugs.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Mass move of Core:Testing bugs to mozilla.org:Release Engineering:Talos. Filter on RelEngTalosMassMove to ignore.
Component: Testing → Release Engineering: Talos
Product: Core → mozilla.org
QA Contact: testing → release
Version: unspecified → other
Component: Release Engineering: Talos → Release Engineering
Product: mozilla.org → Release Engineering