Closed Bug 1015576 Opened 6 years ago Closed 6 years ago
23% android 4 tcheck2 regression around May 21st on inbound (Fx32)
This is a bug where the overall pattern of noise has shifted; the magic of Datazilla pointed this out to me. Unfortunately I am not sure exactly where this regression started. As this test is so noisy, we could probably do 10 retriggers on a range of 20 commits to narrow it down.

Here is a graph server view: http://graphs.mozilla.org/graph.html#tests=[[201,201,29],[201,63,29]]&sel=1400322345088,1400927145088&displayrange=7&datatype=running

I have an alert for this on b2g-inbound from Datazilla which shows this backout as the point of trouble: https://tbpl.mozilla.org/?tree=B2g-Inbound&rev=331b5c31b19c&jobname=Android%204.0%20Panda%20b2g-inbound%20talos%20remote-trobocheck2 -- it may be +/- a few changesets. The Datazilla view for b2g-inbound is: https://datazilla.mozilla.org/?start=1400306602&stop=1400924341&product=Fennec&repository=B2G-Inbound&os=Android&os_version=4.0.4&test=tcheck2&graph_search=331b5c31b19c,9aa3042c0023&tr_id=5646310&graph=tcheck2&x86=false&x86_64=false&project=talos

Looking at Datazilla for inbound, I see this: https://datazilla.mozilla.org/?start=1400306602&stop=1400924341&product=Fennec&repository=Mozilla-Inbound&os=Android&os_version=4.0.4&test=tcheck2&graph_search=0d8592079466&tr_id=5608302&graph=tcheck2&x86=false&x86_64=false&project=talos

The point that stands out to me is: https://hg.mozilla.org/integration/mozilla-inbound/rev/0d8592079466

Again, given the noise, +/- a few changesets around that would be the place to start.
I have done a lot of retriggers: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&startdate=2014-05-21&enddate=2014-05-22&jobname=Android%204.0%20Panda%20mozilla-inbound%20talos%20remote-trobocheck2

It seems that where we cross into a new high for tcheck2 (>7.9), we have this changeset: https://hg.mozilla.org/integration/mozilla-inbound/rev/25763b7fe938 from this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=877072

Since the next changeset is a unittest-only change and it also shows >7.9 values for tcheck2, that adds to the evidence that this is the root cause.

Here is more information about the test: https://wiki.mozilla.org/Buildbot/Talos/Tests#Robocop

Here is the source code that runs tcheck2: http://dxr.mozilla.org/mozilla-central/source/mobile/android/base/tests/testCheck2.java

Basically we load the cnn.com index page and scroll on it, looking for checkerboarding. Gabor, you are the patch author: do you have any thoughts on this?
Component: Talos → DOM: Core & HTML
Product: Testing → Core
Blake, how fast is Preferences::GetBool("dom.webcomponents.enabled")? Do you think that if UpdateImport is called too often it can cause this regression? If it's considered a slow operation, I could call it later, only right before it would actually create an import (most UpdateImport calls return early).
Flags: needinfo?(gkrizsanits) → needinfo?(mrbkap)
Joel, is there a way to test whether the applied fix addresses the regression? If that is not the issue, then I don't know what else it could be; the rest of the patch is essentially inactive unless someone turns on web components and starts importing documents, which I highly doubt...
The easiest way is to push to try with this trychooser syntax:

try: -b o -p android -u none -t remote-trobocheck2

Once it runs, retrigger it 5 or 6 times and verify that the reported numbers (as seen on TBPL when clicking on the test and viewing it in the info pane) are <8.0.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #2) > Blake, how fast is Preferences::GetBool("dom.webcomponents.enabled")? Do you > think if the UpdateImport is called too often it can cause this regression? > If it's considered to be a slow operation I could call it later, only before > it would actually create an import (most UpdateImports returns earlier) I'd expect it to be pretty fast -- it's only a hash lookup and there aren't any locks or anything else like that.
Joel, if I'm looking at the right numbers: with this much noise, this is a completely unreliable way of measuring anything unless we run it a few hundred times. The patch _might_ have made it better, but statistically speaking it's very hard to tell. What do you think?
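To illustrate Gabor's point: with only a handful of runs, a shift has to be large relative to the run-to-run noise before it stands out. A minimal stdlib-only sketch of that comparison; the sample values are made up for illustration, not real tcheck2 numbers:

```python
import statistics

def shift_detectable(baseline, candidate, z=2.0):
    """Crude Welch-style check: is the difference in means larger than
    z standard errors of the difference?  (stdlib only, no SciPy.)"""
    m1, m2 = statistics.mean(baseline), statistics.mean(candidate)
    se = (statistics.variance(baseline) / len(baseline)
          + statistics.variance(candidate) / len(candidate)) ** 0.5
    return abs(m1 - m2) > z * se

# Hypothetical tcheck2-like run times in seconds (invented for illustration).
before = [7.1, 7.4, 6.9, 7.3, 7.2, 7.0]   # tight noise around 7.15
after  = [7.9, 8.1, 7.6, 8.0, 7.8, 8.2]   # clear ~0.8 s shift
print(shift_detectable(before, after))     # -> True

# When the noise is as wide as the shift, six runs cannot tell them apart.
noisy_before = [5.0, 8.0, 6.0, 7.5, 5.5, 7.0]
noisy_after  = [6.0, 7.8, 5.5, 8.0, 6.5, 7.2]
print(shift_detectable(noisy_before, noisy_after))   # -> False
```

With ~6 retriggers this only resolves shifts a few times larger than the per-run noise, which is why a small regression in a noisy test is so hard to confirm.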
(In reply to Blake Kaplan (:mrbkap) from comment #7)
> I'd expect it to be pretty fast -- it's only a hash lookup and there aren't
> any locks or anything else like that.

That was my initial assumption as well, but these are persistent flags, so there might be some file I/O involved at some point, which can slow things down, especially on a mobile device...
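For illustration only: the deferral Gabor describes amounts to paying the flag lookup once instead of on every UpdateImport call. A toy Python sketch of that caching idea; `read_pref`, `PREFS`, and the call counter are hypothetical stand-ins, not Gecko APIs:

```python
import functools

PREFS = {"dom.webcomponents.enabled": False}   # hypothetical pref store
CALLS = {"count": 0}                           # counts raw lookups

def read_pref(name):
    # Stand-in for Preferences::GetBool -- cheap if it is a hash lookup,
    # costly if it ever touches storage on a mobile device.
    CALLS["count"] += 1
    return PREFS.get(name, False)

@functools.lru_cache(maxsize=None)
def webcomponents_enabled():
    # Pay the lookup once, however often the hot path asks.
    return read_pref("dom.webcomponents.enabled")

for _ in range(1000):          # e.g. many UpdateImport calls
    webcomponents_enabled()
print(CALLS["count"])          # -> 1
```

The trade-off is staleness: a cached value will not see a pref flipped at runtime, which is why Gecko pairs caches like this with change observers.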
I'm looking at this graph: http://graphs.mozilla.org/graph.html#tests=[[201,201,29],[201,63,29]]&sel=none&displayrange=90&datatype=running

I agree that the noise level in this test is very high (so it is probably worth trying to improve regardless of anything else, even if before April 8th the noise was much higher still). Nevertheless, it looks to me like there is indeed a noticeable upward shift of both the noise level and the average value somewhere around May 21st. It _may_ be similar to the noise level before May 4th, but that is very hard to tell.

As a result, it looks very hard to pinpoint the exact revision at which it started, and I wouldn't trust myself to do it with a high certainty level, but Joel seems to have a reasonable degree of certainty in his evaluation (though of course, nothing is 100%).

This is a tough call, especially because you guys don't see anything obvious in this patch which could have caused it, and also because I'm not familiar enough with this test to evaluate its output. Maybe we should get some info from whoever maintains robocop and see how they feel about this shift? Maybe we'll get a better understanding of what it could mean. Or vladan might be able to help here as well.
I did some additional runs on the try push: https://tbpl.mozilla.org/?tree=Try&rev=29eaf4c09dc4

The values we see on mozilla-inbound today include a lot of high 7's and low 8's; this patch produces a lot of low 7's! This is good: it appears the try server push has brought the numbers almost back to normal.

Noisy tests are a fact of life, and detecting a shift in them is hard. The noisier a test, the more data points are needed.
Comment on attachment 8430043 [details] [diff] [review] regression fix. v1

Review of attachment 8430043 [details] [diff] [review]:
-----------------------------------------------------------------

I guess we cannot lose anything by pushing this to m-c; better safe than sorry... As for improvement, I think a good first step would be an option to automatically do, say, 10-20 runs without human interaction :) An extra flag, maybe?
Attachment #8430043 - Flags: review?(mrbkap)
That can be scripted. The noisier the test, or the smaller the regression, the more retriggers are needed to add certainty. In fact, you can already go to self-serve and 'rebuild' the job X times.
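That relationship can be made concrete with a back-of-the-envelope estimate: the number of retriggers needed grows with the square of the noise-to-shift ratio. A rough sketch; the 2-sigma criterion and the example numbers are illustrative assumptions, not Talos policy:

```python
import math

def runs_needed(noise_sd, shift, z=2.0):
    """Rough per-push run count to resolve a mean shift of `shift`
    against run-to-run noise `noise_sd` at ~z-sigma confidence:
    n ~= 2 * (z * sd / shift)**2."""
    return math.ceil(2 * (z * noise_sd / shift) ** 2)

# A shift as large as the noise: a handful of retriggers suffices.
print(runs_needed(0.5, 0.5))   # -> 8
# A shift five times smaller than the noise: hundreds of runs.
print(runs_needed(0.5, 0.1))   # -> 200
```

The quadratic blow-up is why halving a test's noise is worth far more than doubling the retrigger count.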
I looked at the graphs; the test does seem way too noisy, but in my opinion there is a real regression. However, I don't know anything about Android tests. Maybe gbrown has a more informed opinion or ideas for reducing the noisiness of this test.
Flags: needinfo?(vdjeric) → needinfo?(gbrown)
Attachment #8430043 - Flags: review?(mrbkap) → review+
(In reply to Vladan Djeric (:vladan) from comment #14)
> I looked at the graphs, the test does seem way too noisy and imho I think
> there is a real regression.
> However I don't know anything about Android tests. Maybe gbrown has a more
> informed opinion or ideas for reducing the noisiness of this test

tcheck2 was less noisy when it was first introduced, and you can see that it was less noisy about a year ago. As I recall, there were some changes to robocop's pageload which introduced a regression plus noise -- sometimes the page had not finished loading when the test started scrolling and measuring. But that was fixed, and I cannot find any sign of that problem now. I don't know where the noise is coming from today.

:kats -- I think you are the real expert on tcheck2. Do you know why it is noisy?
Flags: needinfo?(gbrown) → needinfo?(bugmail.mozilla)
To be honest, I don't know why the test is noisy. That's something we should look into.

As for the regression on May 21, the most likely culprit is actually my patches on bug 1001438. They introduce a slight change to the displayport calculation on Fennec, and that might cause slightly more drawing to happen in some cases, which would affect this test. I have an idea for how to fix that, but I haven't tried it yet.

Since there's already a patch landed for this bug, I think we should wait and see if it fixes the regression. If it doesn't, we should file a follow-up bug to continue the investigation (and I can check whether bug 1001438 was the culprit, and put together a fix if needed). Regardless, we should also file a bug to investigate the noise level in this test and bring it down.
Assignee: nobody → gkrizsanits
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Can you tell if this fixed the regression? If not we should file a follow-up bug.
No, I cannot tell. Looking at inbound, I can believe that the June 2nd check-in improved results, but I do not think the regression has been fully reversed. :jmaher -- any thoughts?
Flags: needinfo?(gbrown) → needinfo?(jmaher)
I see a decline in the numbers, but we still have a larger range of noise (not 5-7, but 5-8 now). At least we are keeping most of the numbers <8 instead of <9 :) While this isn't fully fixed, I would bet it is halfway there, a big step in the right direction!