Closed Bug 1423267 Opened 8 years ago Closed 8 years ago

Add motionmark benchmark results for nightly/inbound/autoland/try

Categories

(Testing :: Talos, enhancement)


Tracking

(firefox60 fixed)

RESOLVED FIXED
mozilla60

People

(Reporter: sphilp, Assigned: jmaher)

References

Details

(Whiteboard: [PI:February])

Attachments

(1 file)

As part of the WebRender project, the team would like to have results from the MotionMark benchmark to check progress and regressions as WebRender moves along in development. All major platforms if possible. The prefs necessary to enable it are: gfx.webrender.enabled, gfx.webrender.blob-images, image.mem.shared, and (Linux only) layers.acceleration.force-enabled.
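As a sketch of how the prefs listed above might be collected for a harness profile (the dictionary structure and the `as_user_js` helper are illustrative, not actual Talos code):

```python
# Prefs from the comment above, expressed as simple dictionaries.
# (Sketch only: how the Talos harness actually consumes prefs is not shown in this bug.)
WEBRENDER_PREFS = {
    "gfx.webrender.enabled": True,
    "gfx.webrender.blob-images": True,
    "image.mem.shared": True,
}

# Linux only, per the comment above.
LINUX_ONLY_PREFS = {
    "layers.acceleration.force-enabled": True,
}

def as_user_js(prefs):
    """Render a prefs dict as user.js-style lines."""
    lines = []
    for name, value in sorted(prefs.items()):
        js_value = "true" if value is True else "false" if value is False else repr(value)
        lines.append('user_pref("%s", %s);' % (name, js_value))
    return "\n".join(lines)

print(as_user_js(WEBRENDER_PREFS))
```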
We will need a description here: https://wiki.mozilla.org/Buildbot/Talos/Tests. That description needs:
* a development owner to contact for questions <- this is not something that can be determined from the MotionMark source code
* what we are measuring
* a summary of the calculation and an example of the numbers

Also, does this duplicate any work we have already done or that is in progress? For example, there are some new tests for OMTP coming online in bug 1419306. We also have tcanvasmark and glterrain - I am not sure if any of those tests could be retired :)

Does this test require -qr builds or any other special builds? Also, are there any issues with adding MotionMark as third_party code to the mozilla-central tree?
Whiteboard: [PI:December]
We will probably want to do the equivalent of http://browserbench.org/MotionMark/developer.html with certain preferences modified from the default. We get more consistent results that way. Glenn can help with some instructions as to what the optimal setup may be.
Flags: needinfo?(gwatson)
The settings we'd ideally want for an initial implementation would be (based on the options available at the http://browserbench.org/MotionMark/developer.html page mentioned above):
* Run all tests in the Animometer and HTML Suite groups.
* Test length: 15 seconds.
* Complexity: "Keep at a fixed complexity".
* Other settings as default.

We'll need to work out good complexity values for each of those tests. This will depend on the hardware the tests are being run on.
Flags: needinfo?(gwatson)
Right now our hardware uses NVIDIA graphics: https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation. This will change in Q1 for Linux/Windows to new hardware with an Intel chipset. I have not seen us adjust parameters of Talos tests based on hardware, only because we are comparing against the previous revision. Luckily anyone can edit this, as it will be living in-tree.
Blocks: 1425845
Whiteboard: [PI:December] → [PI:January]
Assignee: nobody → jmaher
Depends on: 1428435
I see that when I change complexity from "ramp" (the default) to "fixed" (as suggested in this bug), I get no results (a lot of 0.0 or NaN). :gw, how important is that setting?
Flags: needinfo?(gwatson)
When I use 'ramp' on my local machine, I get very unstable results - the reported value is often different by an order of magnitude between runs, which is of course not ideal for benchmarking. Ramp modifies the complexity of the benchmark dynamically, depending on how it thinks the browser is performing.

I have an idea why you might be seeing invalid results in Fixed mode, although it's just a guess. When you use Fixed mode in a web browser, you get a series of text input boxes, one for each test. Each contains the complexity to run that test at, and the number is stored between runs. Perhaps in the CI context we're running in, those complexity values are not initialized, and that may be why you're seeing invalid results? If that's the case, I suspect it's probably possible to specify a complexity value for each test via query parameters in the URL.
Flags: needinfo?(gwatson)
I am stuck trying to get it to run outside of CI - this is just loading the file locally and running it in Nightly. Using ramp it all works; using fixed it fails. :gw, can you own driving this to make sure that we can run it? I run it via: file:///C:/Users/elvis/mozilla-inbound/third_party/webkit/PerformanceTests/MotionMark/developer.html

Without proper specs for how to run this, I cannot move forward - so this is not a priority for me until I can run it locally first. As a note, I get the same results in both my attempt at CI and the method above.
Flags: needinfo?(gwatson)
What complexity value are you setting in the fixed test mode? For example, on the page you linked to:
* Click on "Keep at a fixed complexity".
* Click on the "Animometer" option to open that test suite.
* Click on the tickbox for "Multiply" to enable running that test.
* What is the currently set complexity value for that test (there will be a number entry field next to each of those tests)?
* Do you get valid results if you set that complexity value to ~500 (anywhere from 100 - 5000 might be reasonable for that test, depending on the test hardware)?
Flags: needinfo?(gwatson)
The instructions I had were:
* Run all tests in the Animometer and HTML Suite groups.
* Test length: 15 seconds.
* Complexity: "Keep at a fixed complexity".
* Other settings as default.

I clicked the checkbox to run all the tests in Animometer and HTML, then set complexity to 'fixed' (there is one option). If there are other options, please specify them (for example, Multiply defaults to '44'). Going back and forth for each subtest seems like a lot of randomization - can you give the specific requirements needed to run the benchmark to get value for your use case, and I can do that?
Sorry, I should have been a bit clearer above that we need to tweak the complexity values for each test. I can't give you exact values right now, because the "correct" value to use depends on the hardware being used as the benchmark runner. We should only need to do this once (well, each time we change the underlying hardware that the benchmark will be running on).

As a rough guide, we want to choose complexity numbers for each test such that the test runs at approximately 30 FPS. The reason for this is that we can't typically measure above 60 FPS (due to the way vertical sync works). If we tune each test on the benchmark hardware to run at ~30 FPS, we should be able to clearly see any major regressions or improvements in each benchmark.

Does that help? Feel free to ping me on IRC (gw) or we can set up a video call to discuss further, if that's easier.
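The tuning procedure described above (pick a complexity, measure FPS, adjust toward ~30 FPS) could be automated roughly as below. This is only a sketch: `measure_fps` is a hypothetical stand-in for actually running one MotionMark test at a fixed complexity, and the toy model at the end exists purely to demonstrate the search.

```python
def tune_complexity(measure_fps, target_fps=30.0, lo=1, hi=10000, iterations=12):
    """Binary-search for a complexity value whose measured FPS is near target_fps.

    measure_fps(complexity) -> average FPS; higher complexity means lower FPS.
    Sketch of the manual procedure described in this bug, not real harness code.
    """
    for _ in range(iterations):
        mid = (lo + hi) // 2
        fps = measure_fps(mid)
        if fps > target_fps:
            lo = mid + 1   # running too fast: raise complexity
        else:
            hi = mid       # too slow (or on target): lower/keep complexity
    return hi

# Toy model for illustration only: FPS falls off inversely with complexity,
# crossing 30 FPS at complexity 200.
model = lambda c: 60.0 * 200.0 / (200.0 + c)
print(tune_complexity(model))
```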
Got it - we are switching hardware for our CI machines, ideally next week for Linux and a few weeks later for Windows. We do have different hardware for OSX, though - I am not sure how to differentiate this. Is there a preferred method for determining the complexity values - maybe a debug mode? I really don't know what the benchmark does or how to determine them; when I run it locally there is usually a blank white screen.
Is the (new) hardware for each of the platforms reasonably comparable in terms of GPU and CPU? If so, we may be able to find a complexity number that is good enough to share between platforms.

The way to determine it is a bit manual. What I do is:
* Select one of the tests (e.g. the Multiply test).
* Set to fixed complexity, with an arbitrary starting complexity value for that test (e.g. 500).
* Run the test.
* At the end of the test, the report screen (MotionMark Score) will include a table that lists the average FPS. For example:

Test Name | Time Complexity | FPS
Multiply  | 100.00 ± 0.00%  | 20.25 ± 18.19%

In this test case, I've set the complexity to 100, and the result was an average FPS of 20.25. So I'd then re-run the test with a lower complexity, in order to find a complexity that gives me ~30 FPS as a result.

It's concerning that you're seeing a white screen, though. I wonder if there is something going wrong with the machine you're running it on? For example, in the Multiply test you should see a black screen with a number of rotating alpha-blended border corners. Is that what you see?
Our new machines are much different hardware than the existing ones - possibly it is best to wait a bit until that is deployed. As for the blank screen, this is at the end of a test - there is an error in the console when I don't use 'ramp', and I haven't been able to figure it out. Is there a different way to run this?
Odd - I have run the Multiply tests dozens of times locally, and it ends in a white screen with no console errors :( Possibly this benchmark isn't ready for prime time? I have tried to read a bit more on it to see if there are setup steps I need to do in order to make it work - unfortunately I didn't come up with anything. I do see the rotating items, but they disappear into a white screen after a short while, and that is all I see.
Depends on: 1429597
Waiting until the new machines are available sounds like a plan. That's really strange - I've been using this test suite on and off for a year or so, and have never seen that problem. Do you see the same issue running the Apple-hosted version at http://browserbench.org/MotionMark/developer.html? Could it possibly be an addon-related problem or anything like that?
ok, I get results in the same browser session but from http://browserbench.org - I wonder if this is file:// access vs http:// access, let me try
ok, http works vs file
OK, great. I know nothing about the build / test / benchmark process. Is it easy enough to serve those files locally over HTTP during the benchmarking process?
Depends on: 1431408
Yeah, we always run via HTTP, so this should work. I will profile on the specific machines and get settings in place - probably land this sooner and then trust the numbers when we get the new hardware. This will be for:
* linux64
* osx10.10
* windows10x64
* 32-bit Firefox on windows10x64

If there are any of the above platforms we shouldn't be running on, now would be a good time to speak up :)
I don't know much about osx versions, but the above sounds good to me. Thanks!
Running on loaners, we are getting the values for ~30 FPS. There is an option to run at a fixed FPS, and this gives us the time complexity values we are seeking. The question I have is: why don't we normally run at fixed FPS instead of hardcoding time complexity values?
Flags: needinfo?(gwatson)
For the new moonshot machines I have:

Animometer:
* Multiply: 391
* Canvas Arcs: 1287
* Leaves: 550
* Paths: 4070
* Canvas Lines: 4692
* Focus: 44
* Images: 293
* Design: 60
* Suits: 210

HTML suite:
* CSS bouncing circles: 322
* CSS bouncing clipped rects: 520
* CSS bouncing gradient circles: 402
* CSS bouncing blend circles: 171
* CSS bouncing filter circles: 189
* CSS bouncing SVG images: 329
* CSS bouncing tagged images: 255
* Leaves 2.0: 262
* Focus 2.0: 15
* DOM particles, SVG masks: 390
* Composited transforms: 400

This is from a Linux OS. I don't have a Windows OS, but I understand this is hardware specific, not necessarily OS specific.
On the new OSX machines (loaner t-yosemite-r7-472.test.releng.mdc1.mozilla.com), default options except:
* Test length: 15 seconds
* Maintain target FPS
* Target frame rate: 30 FPS

Animometer suite:
* Multiply: 193.6
* Canvas Arcs: 575.5
* Leaves: 271.48
* Paths: 2024.17
* Canvas Lines: 10932.56
* Focus: 32.56
* Images: 188.6
* Design: 17.72
* Suits: 145.85

HTML suite:
* CSS bouncing circles: 217.93
* CSS bouncing clipped rects: 75.28
* CSS bouncing gradient circles: 97.32
* CSS bouncing blend circles: 254.56
* CSS bouncing filter circles: 188.98
* CSS bouncing SVG images: 391.87
* CSS bouncing tagged images: 350.81
* Leaves 2.0: 191.41
* Focus 2.0: 18.00
* DOM particles, SVG masks: 54.13
* Composited transforms: 74.88
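Since the tuned values above end up keyed by platform, a minimal sketch of the kind of lookup a per-platform manifest enables (the dict structure and helper are illustrative; values are a subset of the two comments above, with the OSX numbers rounded down as the eventual manifest patch does):

```python
# A subset of the tuned complexity values from the two comments above,
# keyed by platform. (Structure is illustrative only; the real data lives in
# testing/talos/talos/tests/motionmark/*.manifest.json.)
ANIMOMETER_COMPLEXITY = {
    "Multiply":    {"linux": 391,  "osx": 193},
    "CanvasArcs":  {"linux": 1287, "osx": 575},
    "Leaves":      {"linux": 550,  "osx": 271},
    "Paths":       {"linux": 4070, "osx": 2024},
    "CanvasLines": {"linux": 4692, "osx": 10932},
}

def complexity_for(test, platform):
    """Look up the tuned complexity for one subtest on one platform."""
    return ANIMOMETER_COMPLEXITY[test][platform]

print(complexity_for("Multiply", "osx"))
```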
I'm not sure which fixed FPS option you're meaning? I only see "Keep at fixed complexity", "Maintain target FPS" and "Ramp". Do you mean the "Maintain target FPS" option? If that's the case - whenever I've used that locally, I get very unstable results (sometimes an order of magnitude difference, on the same hardware / build). I'm not sure why that is - I think Gecko does something that's tripping up the test suite code that tries to maintain a frame rate. Or are you referring to a different option that I missed?
Flags: needinfo?(gwatson)
This is the "Maintain target FPS" option. I manually ran a bunch of scenarios and found that "Maintain target FPS" ended up with similar numbers - while running manually, the numbers often fluctuated +-2% from the target. Lastly, this test takes a long time. I prefer to get many cycles to ensure consistency, but in this case 5 cycles takes 40+ minutes to run. Is there a way to make this faster, or do we think fewer cycles will be OK?
Flags: needinfo?(gwatson)
Looking at this, I do 5 runs and take the average. Over 5 runs most of the results vary widely, and typically I would collect more replicates to get a better sample. Unfortunately, with this run time that isn't ideal. Possibly we split this into 2 tests - Animometer and HTMLsuite? If I do that, we could get more replicates and ideally keep the runtime down.

Here is a log file from a try run [1]: https://public-artifacts.taskcluster.net/AADMh_05Sa-Omw-MWi6ROg/0/public/logs/live_backing.log

In the log file you can see thousands of messages:
17:00:09 INFO - PID 20172 | [GFX1-]: Failed buffer for 0, 0, 80, 80

My understanding of the options we are sending to MotionMark is that test-interval=15 means each test will run for 15 seconds - could we reduce that value so we get more replicates? Or maybe increase it if that is more stable? :gw, I could use your expertise here.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=c1929db0f04c48aab0c07c0c864954860bdc7c74&filter-searchStr=speedometer
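The runtime concern above is easy to see with back-of-the-envelope arithmetic: 20 subtests at test-interval=15 across 5 cycles is 25 minutes of pure test time, before any browser startup or pageload overhead (which is consistent with the 40+ minutes observed). A sketch:

```python
# Rough runtime estimate for the full MotionMark run described in this bug.
SUBTESTS = 20          # Animometer + HTML suite subtests
TEST_INTERVAL_S = 15   # seconds per subtest (the test-interval option)
CYCLES = 5             # replicates per run

total_s = SUBTESTS * TEST_INTERVAL_S * CYCLES
print(total_s / 60.0)  # -> 25.0 minutes of pure test time, before overhead
```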
Changing the test interval to something smaller (perhaps 7?) should be fine, I think. Splitting the tests up into smaller runs is totally fine too - they can even be split down to the single-test level (e.g. Animometer/Multiply) if that makes it easier to get more replicates.

I'm a bit confused by your comments above (or I am misinterpreting them). In comment 25, you said that you get similar numbers (+-2%) from the target number. But then in comment 26, you said you get widely varying numbers? Or are those comments referring to different things? Are you saying that the overall final result varies a lot, but the individual test numbers are relatively stable?

If it helps at all, I really don't mind whether we run in Fixed Complexity or Maintain Target FPS mode, as long as we're getting stable numbers. I suspected that would be easier in Fixed Complexity mode, but if you're finding otherwise then Maintain Target FPS mode could be fine.
Flags: needinfo?(gwatson)
I get similar complexity numbers via manual tuning or target FPS. When I run the test 5 times, I get a range of numbers for each test - typically +- a few percent. Overall, the noise for some subtests over time is pretty high. I really don't know what the complexity number is or how it relates to things - I just take a set of results and look for consistency. Let me try a value of '7' and also break the test up a bit more - good next steps.
OK, test-interval=7 doesn't change the runtime. I think we need to either:
1) live with noise/fewer replicates
2) accept the large CPU load
3) run fewer subtests

I don't think we can accept the large CPU load (i.e. 90 minutes per config/push). If there are any subtests we can remove, that will help - but I think living with fewer replicates will be the end solution.
Whiteboard: [PI:January] → [PI:February]
The last part I am stuck on here is what we want to report. To be honest, the data that shows up when the test finishes running is confusing to me. :gw, do you have any experience/advice on what number we want to report as the "benchmark score"?
Flags: needinfo?(gwatson)
Is it possible to report the result number (which I think is the average FPS in the config you're using) for each of the sub-tests? Or does it just need to be one number that we report?
Flags: needinfo?(gwatson)
There are 20 subtests which we are running - that will be a large volume to track. Ideally there is a single score when running each of the benchmark's two suites. We would report the one number as a score and the 20 subtests as well, but the alerts we get automatically will be based on the score. I can take the geometric mean of the 20 subtests; that is what we do for many other tests.
Each of the sub-tests checks a very different area of gfx performance, so ideally we'd want to know if any one of those values regresses. I'm not sure that taking the mean would pick this up, since (I suspect) it's quite likely we could regress one test quite badly while having minimal effect on the other tests.

Hmm, this is tricky. Perhaps we try what you suggested with the single mean value above, and see how it goes? I'm guessing that we'll be able to manually graph each of the 20 subtest results over time, is that right? If that's the case, perhaps for an initial setup we don't need to worry about any automated alerts, and I can just monitor those graphs manually until we get a better feel for how to detect regressions?

Apologies if this makes no sense, I don't know much about the mozilla performance infrastructure!
Yeah - right now we have each data point reporting 10x per push, so adding 1 data point is reasonable to sheriff; adding 20 data points becomes harder to sheriff. Of course we wouldn't get a large volume of alerts for each of the 20. The devtools team has a custom dashboard for their subtests - we catch about 2/3 of the regressions with the summary score, and the dashboard of the subtests helps them see slight differences over time and find a few other regressions. Let me work on a geometric mean (which I think the benchmark computes internally, based on the code) and see if I can get a reasonable score produced.
First attempt here - this seems to work well and have little noise. Running on try/central and marking as tier-2 because of the long runtime - we can take the hit of manual bisection when regressions do happen. Possibly we can consider increasing the frequency on something like autoland or inbound once we determine the extra CPU time we have on our Linux and OSX pools.
Attachment #8949145 - Flags: review?(rwood)
As a note, this isn't on Windows; we would need to add that once Windows is on TaskCluster.
Comment on attachment 8949145 [details] [diff] [review] add motionmark to try/central Review of attachment 8949145 [details] [diff] [review]: ----------------------------------------------------------------- I think it's a good solution re: using manifest.json for tuning parameters. Just a couple of suggestions and nits but otherwise looks great, r+ ::: testing/talos/talos/test.py @@ +865,5 @@ > + unit = 'score' > + > + > +@register_test() > +class motionmark_htmlsuite(PageloaderTest): Maybe add a 'PageloaderWebkit' test class or something? As these settings for these two additions are the same as speedometer and stylebench (except the manifest of course) ::: testing/talos/talos/tests/motionmark/animometer.manifest.json @@ +1,1 @@ > +{"Animometer": {"Multiply": {"win": 391, "linux": 391, "osx": 193}, In the future will we need to specify other tuning values in here besides complexity? If so, I'd add: {"Animometer": {"Complexity": {"Multiply..." Here so that it gives the option of adding other tuning values in the future also nit: blank space end of line @@ +5,5 @@ > + "CanvasLines": {"win": 4692, "linux": 4692, "osx": 10932}, > + "Focus": {"win": 44, "linux": 44, "osx": 32}, > + "Images": {"win": 293, "linux": 293, "osx": 188}, > + "Design": {"win": 60, "linux": 60, "osx": 17}, > + "Suits": {"win": 210, "linux": 210, "osx": 145} nit: alignment of the above test blocks should be shifted back under 'Multiply' @@ +7,5 @@ > + "Images": {"win": 293, "linux": 293, "osx": 188}, > + "Design": {"win": 60, "linux": 60, "osx": 17}, > + "Suits": {"win": 210, "linux": 210, "osx": 145} > + }, > + "HTMLsuite": {"CSSbouncingcircles": {"win": 322, "linux": 322, "osx": 218}, This suite shouldn't be in this .json ::: testing/talos/talos/tests/motionmark/htmlsuite.manifest.json @@ +1,1 @@ > +{"Animometer": {"Multiply": {"win": 391, "linux": 391, "osx": 193}, This suite shouldn't be here
Attachment #8949145 - Flags: review?(rwood) → review+
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/be0297d57ec6 Add motionmark benchmark to try, mozilla-central. r=rwood
Blocks: 1436818
Blocks: 1436819
Blocks: 1436825
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Blocks: 1445952