Closed Bug 1535031 Opened 9 months ago Closed 5 months ago

3.95 - 3.97% raptor-motionmark-animometer-firefox (linux64, linux64-pgo-qr) regression on push d8098dde10d83f0f9b37bda5ffd37a42317d15b3

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

x86_64
Linux
defect

Tracking


VERIFIED FIXED
mozilla69
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- wontfix
firefox66 --- unaffected
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- fixed

People

(Reporter: igoldan, Assigned: iain)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression)

Attachments

(2 files)

Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=8620e56a5f3858391b42f51684ff735699895049&tochange=d8098dde10d83f0f9b37bda5ffd37a42317d15b3

Since you are the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

4% raptor-motionmark-animometer-firefox linux64-pgo-qr opt 49.79 -> 47.81
4% raptor-motionmark-animometer-firefox linux64 pgo 42.91 -> 41.21

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=19871

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/Performance_sheriffing/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling

Component: General → JavaScript Engine: JIT
Product: Testing → Core
Flags: needinfo?(whawkins)

This is noted. I am going to look into this as soon as possible. I will update this bug as needed!

Will

Flags: needinfo?(whawkins)
Flags: needinfo?(igoldan)

(In reply to Will Hawkins from comment #2)

> I am sorry to have to ask this question, but these results seem inconsistent:
>
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=autoland&originalRevision=583048926c121b876326a536f44e14034035372c&newProject=autoland&newRevision=95274fd66e5bea637cc26d7047aef7f96d5e19a7&framework=10
>
> shows a performance decrease, but when I click through to
> https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=583048926c121b876326a536f44e14034035372c&newProject=autoland&newRevision=95274fd66e5bea637cc26d7047aef7f96d5e19a7&originalSignature=1801528&newSignature=1801528&framework=10
>
> I only see positive improvements. I'm not sure where to look to debug, in this case.
>
> :gw could there be a problem with the integration of Motionmark Animometer? From the logs, I see all subtests have lowerIsBetter: true, while the overall average has lowerIsBetter: false.
>
> I apologize if I am reading these results incorrectly.

No need to worry about that. Feedback like this is always welcomed.

Flags: needinfo?(igoldan) → needinfo?(gwatson)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #3)

> (In reply to Will Hawkins from comment #2)
>
> > I am sorry to have to ask this question, but these results seem inconsistent:
> >
> > https://treeherder.mozilla.org/perf.html#/compare?originalProject=autoland&originalRevision=583048926c121b876326a536f44e14034035372c&newProject=autoland&newRevision=95274fd66e5bea637cc26d7047aef7f96d5e19a7&framework=10
> >
> > shows a performance decrease, but when I click through to
> > https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=583048926c121b876326a536f44e14034035372c&newProject=autoland&newRevision=95274fd66e5bea637cc26d7047aef7f96d5e19a7&originalSignature=1801528&newSignature=1801528&framework=10
> >
> > I only see positive improvements. I'm not sure where to look to debug, in this case.
> >
> > :gw could there be a problem with the integration of Motionmark Animometer? From the logs, I see all subtests have lowerIsBetter: true, while the overall average has lowerIsBetter: false.
> >
> > I apologize if I am reading these results incorrectly.
>
> No need to worry about that. Feedback like this is always welcomed.

Thanks! I am looking forward to getting to the bottom of this!

I discussed this with Will on IRC.

Although I'm not 100% confident, I believe that the overall number is being reported correctly, but that the sub-test results are being incorrectly reported (lower vs. higher).

Flags: needinfo?(gwatson)

(In reply to Glenn Watson [:gw] from comment #5)

> I discussed this with Will on IRC.
>
> Although I'm not 100% confident, I believe that the overall number is being reported correctly, but that the sub-test results are being incorrectly reported (lower vs. higher).

Should I file a bug for that, or have you already done so?

Flags: needinfo?(gwatson)

I haven't opened a bug for this.

Flags: needinfo?(gwatson)
See Also: → 1538201

Will, have you managed to figure out what caused the regression? We've yet to fix this test issue, but I think it can be circumvented.

Flags: needinfo?(whawkins)

I know what is causing the issue but I've yet to deliver a solution. Sorry! I am investigating, though, and hope to have it resolved as quickly as possible.

Thanks!
Will

Flags: needinfo?(whawkins)
Priority: -- → P1

Hi Will, is there any update on this? (Also, is this something you are working on? It's still unassigned. :)

Flags: needinfo?(whawkins)

Per the regression bug it sounds like this affects 67.

At this point in the cycle, this is not actionable for 67, so wontfix.

Assignee: nobody → whawkins
Flags: needinfo?(whawkins)

Too late to fix in 68 but we could still take a patch for 69/70.

Flags: needinfo?(whawkins)
Attached file cacheir20141.json

It looks like this has been stalled for a while, so I took a quick look. The specific subtest of motionmark-animometer that regressed was Images (graph here). I ran the subtest using this handy link, dumped the cache IR logs (which I've attached), and looked for anything interesting using Tom's CacheIR analyzer.

What I found was a lot of SetElem failures on lines 138-141 of image-data.js, where we try to access a negative index of a Uint8ClampedArray. The problem appears to be that the motionmark code is bad, and "getRandomNeighboringPixelIndex" occasionally returns a negative index. Here's a cutdown version:

var pixelStride = 4;
var rowStride = 200;
var weightNegativeThreshold = 0.04;
var weightPositiveThreshold = 0.96;

// When Math.random() lands below weightNegativeThreshold, the floored
// quotient is -1, so the offset can pull the index below zero.
function getRandomNeighboringPixelIndex(pixelIdx, pixelArrayLength) {
    var xOffset = Math.floor((Math.random() - weightNegativeThreshold) / (weightPositiveThreshold - weightNegativeThreshold));
    var yOffset = Math.floor((Math.random() - weightNegativeThreshold) / (weightPositiveThreshold - weightNegativeThreshold));
    return (pixelIdx + pixelStride * xOffset + rowStride * yOffset) % pixelArrayLength;
}

// Keep generating indices until we hit a negative one.
outer: while (true) {
    for (var j = 0; j < 2500; j += pixelStride) {
        var index = getRandomNeighboringPixelIndex(j, 2500);
        if (index < 0) {
            print("Generated negative index: j=" + j + " index=" + index);
            break outer;
        }
    }
}
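
For reference (this snippet is mine, not part of the motionmark source): the index goes negative because Math.floor of the small negative quotient is -1, and JavaScript's % operator keeps the sign of the dividend rather than wrapping:

// With Math.random() just below weightNegativeThreshold (0.04):
Math.floor((0.01 - 0.04) / (0.96 - 0.04));    // -1
// For a pixel near the start of the buffer, the resulting index is negative,
// and the remainder does not wrap it back into range:
(0 + 4 * -1 + 200 * -1) % 2500;               // -204, not 2296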

In addition to pessimizing on non-integer indices, the original patch also added negative-index pessimization to typed arrays. (Previously, we only applied the negative-index pessimization in GetElemTryDense.)

It looks like these particular negative indices show up rarely enough that we're better off doing the optimization and accepting occasional bailouts, whereas in the testcase that motivated the original patch, it was the other way around. Unfortunately, this all boils down to guessing about what sorts of bad code we are more likely to see on the web. I'm not sure what the best answer is here: ideally, we'd like to know what proportion of indices were bad, and only pessimize if it exceeded some threshold, but we'd need to use more bits to track that. In this case, we are apparently getting almost 50 failures, which seems like a lot.

Maybe in IonBuilder we can compare the number of failures to the total hit-count, and only pessimize if the ratio is too high?

Thanks for investigating this, Iain. Your analysis seems entirely consistent with a) the way I remember writing the patch and b) the way I remember the regression looking. I am happy to update the patch in whatever way you think is best. Let me know if you'd like to take the ratio approach or not and I can start looking into an implementation.

Thanks again for investigating!
Will

Flags: needinfo?(whawkins)

I can think of two ways to address this problem.

  1. Tweak the existing approach to do a decent job on known benchmarks. Specifically, get rid of the negative bailout for typed arrays, because it is causing this regression, but leave in the non-integer bailout, because it improved the code in bug 1526315. The advantage of this approach is that it takes very little developer time.

  2. Completely overhaul this code. This has a number of steps:
    a) Combine "hasSeenNegativeIndexGetElement" and "hasSeenNonIntegerIndex" into a single query, along the lines of "hasTooManyInvalidIndices".
    b) Rewrite the query to check not just the flags, but also the ratio between the entry count of the fallback stub (see here) and the sum of the entry counts of the other stubs in the chain (which would need code similar to this). If the flags are set, but the ratio of failures to successes is low enough, then "hasTooManyInvalidIndices" should return false. (A rough JavaScript sketch of this check follows the microbenchmark below.)
    c) Calibrate what "low enough" means by testing. I think you should be able to get a good estimate by testing to see how big the array in your microbenchmark for bug 1526315 needs to be before your patch doesn't improve performance:

function f() {
    var ta = new Int32Array(100000);
    var xs = [0, 1, 2, 59853.5]; // How many integral entries do we need in this array before the cost of bailing out on the last entry is worth it?
    var r = 0;
    for (var i = 0; i < 100000; ++i) {
        r += ta[xs[i & 3]];
    }
    return r;
}
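
Returning to step (b): purely as an illustration (this is not SpiderMonkey code; the function name, its inputs, and the 10% threshold are all made up for this sketch), the intended check could look something like this:

// Illustrative model only: pessimize only when the fallback (failure) path
// accounts for a large enough share of all executions at this site.
function hasTooManyInvalidIndices(fallbackEntryCount, stubEntryCounts, threshold) {
    var successCount = stubEntryCounts.reduce(function (sum, n) { return sum + n; }, 0);
    var total = successCount + fallbackEntryCount;
    if (total === 0) {
        return false; // No data yet; stay optimistic.
    }
    return fallbackEntryCount / total > threshold;
}

// With numbers like the ones seen here (~50 failures against many thousands of
// successful hits), a 10% threshold would not pessimize:
print(hasTooManyInvalidIndices(50, [40000, 2000], 0.1)); // false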

The advantage of the second option is that it hopefully gives half-decent performance on a broader range of bad code. The disadvantages are that it adds complexity and takes time to implement, and I'm not sure that this weird corner case is the best use of our developer time.

Do we have any evidence of this being a real problem on the web?

Note: Option 1 is a trivial patch, so I ran a try build. As expected, it improves motionmark-animometer by 4% (and the Images subtest specifically by almost 50%).

Since we don't really know at this point how often this appears in the real world, I'd say we go with (1) for the time being and revisit (2) if we find that it's a problem. Since you already have the patch for that, would you be willing to submit it?

Flags: needinfo?(iireland)

I spent a while thinking through the right thing to do for this patch, and eventually circled back to the first choice. Documenting my reasoning for posterity:

This patch is getting rid of the "no fast path if we've seen a negative index" behaviour for TypedArray, but not for regular dense elements. In either case, it's pretty easy to accidentally access a negative index (as happened in this case). However, because negative indices always return |undefined| for a TypedArray, it's very unlikely that anybody is accessing negative indices on purpose. On plain objects, negative indices can be accessed like any other property, so seeing a negative index is a stronger signal that we can expect frequent negative indices in the future.
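
To illustrate the distinction (a quick shell example, not part of the patch):

var ta = new Uint8ClampedArray(4);
ta[-1] = 7;     // Silently ignored: -1 is not a valid typed-array index.
print(ta[-1]);  // undefined

var obj = {};
obj[-1] = 7;    // Stored as an ordinary property named "-1".
print(obj[-1]); // 7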

We are keeping the "no fast path if we've seen a non-integral index" behaviour for both, because sporadic non-integral indices are much less common: if you see one non-integral index, it's pretty likely that another will follow shortly.

In the long run, we would like a more robust solution. I lean towards "generate the slow path if we have had to invalidate the script because of too many bailouts here". But that's more work than this particular bug justifies.

Flags: needinfo?(iireland)
Assignee: whawkins → iireland

Thanks for doing this, Iain. I'll work with you on doing option 2. I think it will help me learn some more about the code. If that's ok with you, of course.

Pushed by iireland@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9340f815a2d1
Don't force slow path for negative indices in typed arrays r=tcampbell
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69

== Change summary for alert #21749 (as of Fri, 05 Jul 2019 18:49:40 GMT) ==

Improvements:

5% raptor-motionmark-animometer-firefox linux64-shippable-qr opt 47.79 -> 50.15
4% raptor-motionmark-animometer-firefox linux64-shippable opt 41.00 -> 42.69

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=21749

Status: RESOLVED → VERIFIED