Bug 1210465 (Closed) - Opened 7 years ago, Closed 7 years ago

Issues with test_basic_pan.html

Categories

Core :: Panning and Zooming, defect
Platform: ARM, Gonk (Firefox OS)
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
firefox44 --- affected

People

(Reporter: kats, Assigned: kats)


Investigation bug for the various intermittent failures with test_basic_pan.html, which seem to have increased recently. I suspect several of them share the same underlying cause.
Try push with some logging enabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e35818ef4683. My plan is to retrigger test_basic_pan until it hits one or more of these failures and use the logging to find out roughly what's going on, and dig in from there.
If I understand what's intended, the mach command to push to try should be an effective workaround for bug 1183144: running |./mach try --tag apz -p emulator| will run the relevant tests in a single chunk (it will appear as M(1)).
Indeed, the try push I have above seems to provide the results that I need. Removing dependency on that bug.

The failure from bug 1208200 showed up in my retriggers. From the log it looks like the touch events are synthesized just fine, but the main-thread timeout is hit before the div gets layerized, so the touch events end up scrolling the root document's APZ instead. This should be avoidable by increasing the content response timeout. In the log at [1], the timeout is installed at 20:19:29.244 and the response arrives at 20:19:47.784, 18.54 seconds later. That exceeds the 15-second timeout we use in the testing environment, causing the problem. That said, I'm not sure what was happening between 20:19:30.124 and 20:19:40.604: that's a 10-second gap with no output in logcat. It could be that some underlying bug caused the main thread to stop pumping events.

[1] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/6801cf4314f00cbd908cf01f3a4422cf93d4b6aaff8fd676fa785a2b4f4f89b9006b51217c69630ac9b79e1b740564de8592fa80ea0b95878bf7edf5afe8e32b
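One way to get more headroom is to bump the timeout pref in the testing profile. A sketch of such a prefs fragment (apz.content_response_timeout is in milliseconds; the 60-second value here is illustrative, not the value we settled on):

```
// user.js fragment for the testing profile (sketch): give content
// 60 seconds to respond before APZ proceeds without it.
user_pref("apz.content_response_timeout", 60000);
```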
No longer depends on: 1183144
In the log at [2] the same thing happens for the iframe, and there it takes just under 50 seconds for the content response to get back!

[2] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/b977c422e5823bb2f36a3a113ca8d193d79896935fa55317f127992242ddfdbbd063ecf54913a8fdd730a30aea5ff1496b894ca7a1ae4974454ff38fd9f3ab36
Bug 1203901 also showed up in the retriggers, e.g. in the log at [3]. It times out because the touch events are delivered to the root APZ in the root process, so no scroll occurs and presumably no APZ:TransformEnd notification is dispatched in the child process. If we look at the hit-testing tree dump we can see that the child process layers (l=3) haven't shown up in the tree yet; this is evident if you compare it to the equivalent tree dump in the logs [1] or [2] from my previous comments. So there's likely a race where the "onload" script runs before the compositor actually has the layers from the page that was loaded. We can probably fix this by doing a flushApzRepaints or some equivalent wait-for-paint magic before synthesizing the drag.

[3] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/4630d8afe7fe3fb6b013d8a0fd2ecac92b376ee85b89b41e0e9a3380e9b9681b3802170c2ecac94705661b1c24b40070ca96e81b40be5ecdd9ef4ac27d9bb992
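A minimal sketch of that fix, assuming the usual APZ mochitest helpers (flushApzRepaints from apz_test_utils.js and waitForAllPaints from paint_listener.js; synthesizeNativeDrag stands in for whatever drag synthesis the test actually uses):

```js
// Sketch: make sure the page is painted and the compositor's APZ
// hit-testing tree is up to date before synthesizing the drag, so the
// touch events hit the subframe's APZC instead of the root one.
waitForAllPaints(function() {
  flushApzRepaints(function() {
    synthesizeNativeDrag(/* ... drag parameters ... */);
  }, window);
});
```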
I suspect both bug 1208973 and bug 1209444 are due to improper multithreaded access to mQueuedTask in TaskThrottler. Both crashes started after the recent changes to TaskThrottler. There's a log at [4] with such a crash, but it doesn't help much.

[4] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/01abdc9cfd862deb17f186b59763e6e16763976d17d8c8335448b49b63481eb6ac86086e9fea490bf0b086e3bf585852c2f32c143cbee2e16819842d72e262fb
I did another try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a8b377ccfc0 with what I hope is a fix for bug 1203901 and more logging for the other issues.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4)
> In the log at [2] the same thing happens for the iframe, and there it takes
> just under 50 seconds for the content response to get back!

Log at [5] has more logging: it shows that APZ deals with the touchstart at around 18:28:34.691 and the TabChild doesn't see it until 18:29:02.152. That's nearly 28 seconds just to get the event to the child process. It doesn't seem to stall anywhere in particular, so I think this is just SLOW. Admittedly there's a bunch of other stuff happening (the drag generates a lot of touch events), so if this is running on a slow machine that might explain it.

[5] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/5f672c2d6bb4f9204651bdb3ef46468bee3dee02dd77daf224a58cea113e2e2c0af11697a013a5a03d382810feb05aa0700694457b633de44f51dde5d8555f9f
Wow - and I thought 15 seconds was an eternity to begin with!
Yeah :(

The new try push has a few "test timed out" failures as well. For some of them there's a new issue with the drag turning into a long-press; we can fix that by increasing the long-press threshold timeout. And there's still an issue with the layer tree not being correct at the time the touchstart comes in, so I guess my patch didn't really work. It might need an APZ flush as well.
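For the long-press issue, the threshold is also pref-controlled. A sketch of a prefs fragment (the 10-second value is illustrative):

```
// user.js fragment (sketch): push the long-tap threshold out so a slow
// drag on a loaded emulator is not misinterpreted as a long-press.
user_pref("ui.click_hold_context_menus.delay", 10000);
```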
The retriggers on this are showing a couple of failures. One is the same "layer tree not updated" problem. I fixed it for helper_div_pan and helper_iframe_pan but forgot to do it for helper_basic_pan.

The crash from bug 1208973 also showed up, and as I expected the log [6] indicates that it's probably a concurrency error. Specifically, the task 0x46872780 gets run twice in the same throttler, which should never happen. I suspect that the second of those runs ends up sending a corrupted FrameMetrics, which then causes an assertion failure on the child side when it tries to unpack it. Adding some locking around this should fix it.

[6] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/ecf36972bf0a5fddc3432da6e9d3fe799eb019e521df230fceb0133f10e9c10ddfe7f9f2e6dc63b8d88a96eb64330821cfb8f51edec74368a8b4ee9dbe438153
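The locking fix can be sketched as making the "take" of the queued task atomic, so two threads can never both observe and run the same task. This is a hypothetical simplification (ThrottlerSketch and RunDemo are illustrative names, not the real TaskThrottler API):

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical simplification of a throttler holding one pending task.
// The suspected bug is that two threads can both observe a non-null
// mQueuedTask and each run it; guarding the swap with a mutex makes
// the "take" atomic, so the task runs at most once.
class ThrottlerSketch {
public:
  void PostTask(std::function<void()> aTask) {
    std::lock_guard<std::mutex> lock(mMutex);
    mQueuedTask = std::move(aTask);
  }

  // Runs the queued task at most once: the swap-to-empty happens under
  // the lock, so a second caller finds mQueuedTask already empty.
  void TaskComplete() {
    std::function<void()> task;
    {
      std::lock_guard<std::mutex> lock(mMutex);
      task.swap(mQueuedTask);  // take ownership, leaving mQueuedTask empty
    }
    if (task) {
      task();  // run outside the lock to avoid re-entrancy issues
    }
  }

private:
  std::mutex mMutex;
  std::function<void()> mQueuedTask;
};

// Demo: two threads racing on TaskComplete() run the task exactly once.
int RunDemo() {
  ThrottlerSketch throttler;
  std::atomic<int> runs{0};
  throttler.PostTask([&runs] { runs++; });
  std::thread t1([&] { throttler.TaskComplete(); });
  std::thread t2([&] { throttler.TaskComplete(); });
  t1.join();
  t2.join();
  return runs.load();
}
```

With the swap done under the lock, the "task run twice in the same throttler" signature from log [6] can't occur, since the second caller sees an empty slot and does nothing.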
My investigation is complete; I have attached patches to 3 of the 5 dependent bugs. The other two are probably dupes.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED