Update webrender to 914d16f9a2fb8d007509894660bae9c61074ae31 (WR PR #3347)

RESOLVED FIXED in Firefox 65

Status

Type: enhancement
Priority: P3
Severity: normal
Status: RESOLVED FIXED
Opened: 7 months ago
Last modified: 7 months ago

People

(Reporter: kats, Assigned: kamidphish)

Tracking

(Blocks 2 bugs)
Version: 65 Branch
Target Milestone: mozilla65
Points: ---

Firefox Tracking Flags

(firefox65 fixed)

Details

(Whiteboard: [gfx-noted])

Attachments

(2 attachments)

+++ This bug was initially created as a clone of Bug #1509592 +++

I'm filing this as a placeholder bug for the next webrender update. I may be running a cron script [1] that does try pushes with webrender update attempts, so that we can track build/test breakages introduced by webrender on a rolling basis. This bug will hold the try push links as well as dependencies filed for those breakages, so that we have a better idea going into the update of what needs fixing. I might abort the cron job because once things get too far out of sync it's hard to fully automate fixing all the breakages.

When we are ready to actually land the update, we can rename this bug and use it for the update, and then file a new bug for the next "future update".

[1] https://github.com/staktrace/wrupdater/blob/master/try-latest-webrender.sh
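The rolling try pushes described above are driven by a scheduled run of the linked script. A minimal sketch of such a crontab entry (the paths, schedule, and log location here are assumptions, not the actual setup):

```
# Hypothetical crontab entry: once a day, run the wrupdater script, which
# attempts a WebRender update and pushes the result to try.
0 4 * * * $HOME/wrupdater/try-latest-webrender.sh >> $HOME/wrupdater/cron.log 2>&1
```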
There appear to be failures on a particular Linux crashtest that started with servo/webrender#3342. Looks like it might just expose some pre-existing bug with invalidation? But we'll need to resolve it or disable the test before updating to that cset.
Depends on: 1500458

Comment 9

7 months ago
Yup, it does look like it's become a permanent failure, I mistakenly thought we were hitting the referenced intermittent.

I can't reproduce locally unfortunately. I don't see any backtrace in the logs, if I'm looking in the right place?
(In reply to Glenn Watson [:gw] from comment #9)
> Yup, it does look like it's become a permanent failure, I mistakenly thought
> we were hitting the referenced intermittent.
> 
> I can't reproduce locally unfortunately. I don't see any backtrace in the
> logs, if I'm looking in the right place?

It's not actually crashing, it just fails to complete. It was intermittent before (bug 1500458) so most likely it's done running assumption in the test that the WR code path fails to satisfy. The try push in comment 10 has the test disabled, assuming that is green and the failure doesn't just move elsewhere, I'm ok to land with that. I can try to reproduce the failure locally and investigate.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #11)
> so most likely it's done running assumption in the test

That should be "... so most likely there's some timing assumption ..." (silly phone autocorrect)

Also apparently it happens on two different tests (which I didn't notice before) and I only disabled one. I'll disable the other one too and do another try push. So far I haven't been able to reproduce the problem locally but I'll try harder tomorrow.
Also servo/webrender#3346 added a couple of failures on the windows reftests - looks fuzzable, :gw can you confirm that's ok to fuzz?
Yup, these are fine to fuzz - they are also fuzzed on other renderer backends.
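Fuzzing a reftest is done with an annotation in the reftest manifest. A sketch of what such an entry might look like, assuming the standard `fuzzy-if` condition syntax; the test filenames, condition, and pixel/difference bounds here are illustrative, not the actual values landed:

```
# Hypothetical reftest.list entry: tolerate a small per-channel difference
# on a bounded number of pixels, but only on Windows with WebRender.
fuzzy-if(winWidget&&webrender,0-2,0-40) == border-radius-clip.html border-radius-clip-ref.html
```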
Disabling the two crashtests just moved the failure down to some other crashtests. I don't really want to disable a slew of crashtests so I'll try debugging it first. I might end up landing some WR PRs out of order while I try to sort this out.

I still haven't successfully reproduced the failure locally so I'll do try pushes with logging to try and track it down.
Noticed that 972199-1.html seems to trigger a deadlock or infinite loop or something similar. The test does an AdvanceTimeAndRefresh, which triggers a sync IPC to the compositor, and that blocks in FlushRendering(). When the harness kills Firefox, the stacks show that the render thread is busy [1], which is presumably why FlushRendering() doesn't return and the test hangs.

It might be that we're hitting a bug in Mesa which is only triggered on the try servers, which is why I don't see it locally. Certainly the top frames of the renderer stack are in swrast_dri.so.

[1] https://treeherder.mozilla.org/logviewer.html#?job_id=213688710&repo=try&lineNumber=21591
Looks like running just the failing tests in isolation makes the failure go away:

https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=crashtest&selectedJob=213859306&revision=8c491545a63b500c181872baaa000e03059d79be

So that's annoying. It means that some previous test is setting up some state that's triggering the failure. I'll try to trim down the set of tests that need to be run while also trying to reproduce this locally by running the tests inside the same docker image that is used on the try server.
I can reproduce the problem in a docker container, so now I'm trying to bisect and narrow down the minimal set of tests that trigger the problem.
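The bisection over the test list can be sketched generically. In this sketch, `fails(subset)` is a stand-in for "run `subset` followed by the failing test inside the docker image and check whether the hang occurs"; it assumes a single earlier culprit test, which (per the later comments) is optimistic given the intermittent scheduling behaviour:

```python
def find_culprit(tests, fails):
    """Binary-search the tests that run before the failing test to find
    the single earlier test whose presence triggers the failure.

    `fails(subset)` is a hypothetical predicate: run `subset` plus the
    failing test, and report whether the hang occurred. Assumes exactly
    one culprit and a deterministic failure.
    """
    lo, hi = 0, len(tests)
    # Invariant: the culprit lies in tests[lo:hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(tests[lo:mid]):
            hi = mid  # culprit is in the first half
        else:
            lo = mid  # culprit is in the second half
    return tests[lo]
```

Each iteration halves the candidate range, so isolating one test out of N runs takes about log2(N) full harness invocations.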
The bad news is that I'm having trouble getting a minimal set of tests. It seems like when crashtests run, the harness doesn't actually block on the compositor for the most part, so it just loads page after page and the compositor asynchronously goes about rendering things. But if the compositor is too slow then I guess we skip over some of the tests entirely. So this produces some sort of intermittent behaviour. One (or more) of the tests that may or may not get rendered in the compositor seems to get the renderer thread wedged in an infinite loop (or maybe just a really long-running loop) in draw_tile_frame. And then the next test that does any sync IPC to the compositor manifests the failure.

The good news is that I'm fairly sure this isn't an osmesa bug, because I was able to attach gdb to the firefox process while the renderer thread was doing its infinite loop, and I was able to `fin` my way out of the swrast_dri.so stack frames, but couldn't `fin` out of the draw_tile_frame stack frame. It was tricky enough getting gdb working in the docker image, but I'll see if I can do a build of FF inside the image and get some more info, because right now I'm not getting a lot of useful info out of gdb.

For those following along, here are some steps to reproduce what I did: https://gist.github.com/staktrace/a83dd0d66e29f0d049cc6b16d6cf71b2 (note that the "magic" bits like the task-id for the docker image, env vars, and the command to run can all be found on the task details page, e.g. https://tools.taskcluster.net/groups/EfFrMMd8QAyY9bg27ZoXcA/tasks/Sbohi5I9TH2QBypUncosTQ/details for this case)
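The gdb technique described above (attaching to the hung process and using `fin` to pop frames) might look like the following session sketch; the pid, thread number, and frame layout are hypothetical, but the commands are standard gdb:

```
# Hypothetical gdb session against the hung firefox process.
(gdb) attach <firefox-pid>
(gdb) info threads            # locate the Renderer thread
(gdb) thread <N>
(gdb) bt                      # swrast_dri.so frames on top of draw_tile_frame
(gdb) fin                     # finish: returns from the swrast frames...
(gdb) fin                     # ...but never returns out of draw_tile_frame,
                              # which localizes the loop to that frame
```

The key observation is that `fin` completes for the swrast_dri.so frames but hangs for draw_tile_frame, pointing at WR rather than the GL driver.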
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #24)
> The bad news is that I'm having trouble getting a minimal set of tests. It
> seems like when crashtests run, the harness doesn't actually block on the
> compositor for the most part, so it just loads page after page and the
> compositor asynchronously goes about rendering things.

I'm going to try and fix this first since it should be easy and point us more directly to a culprit test.
Depends on: 1510026
The problematic crashtests are actually the large-border-radius-* tests in layout/generic/crashtests. With those disabled the crashtests are green (see last couple of try pushes above, which include a patch to disable those tests). I discussed with :gw on IRC, and I'll land the WR updates with those tests disabled on linux for now, and he'll fix it in upstream WR.
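Disabling a crashtest on one platform is typically an annotation in the crashtests.list manifest. A sketch of what such an entry might look like; the exact condition and filename are assumptions (the real tests are the large-border-radius-* family mentioned above):

```
# Hypothetical crashtests.list entry: skip only on Linux (gtk widget)
# with WebRender enabled.
skip-if(gtkWidget&&webrender) load large-border-radius-1.html
```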
Alias: wr-future-update
Assignee: nobody → dglastonbury
No longer depends on: 1500458, 1510026
Summary: Future webrender update bug → Update webrender to 914d16f9a2fb8d007509894660bae9c61074ae31 (WR PR #3347)

Comment 32

7 months ago
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/51e713e72d92
Update webrender to commit 914d16f9a2fb8d007509894660bae9c61074ae31 (WR PR #3347). r=kats
https://hg.mozilla.org/integration/autoland/rev/9f0228da2763
Re-generate FFI header. r=kats

Comment 33

7 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/51e713e72d92
https://hg.mozilla.org/mozilla-central/rev/9f0228da2763
Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65