Closed Bug 1437215 Opened 2 years ago Closed Last year

Measure WebRender memory usage using AWSY tests

Categories

(Core :: Graphics: WebRender, enhancement, P2)

Tracking

RESOLVED FIXED
mozilla64
Tracking Status
firefox64 --- fixed

People

(Reporter: erahm, Assigned: bc)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P1][needs-investigation])

Attachments

(3 files)

Similar to bug 1378526 where we started tracking stylo memory usage well before landing, it would be good to get measurements in place with WebRender enabled.

We've seen some pretty extreme memory leaks from early adopters recently (bug 1437112, for example), so it seems like we should give this a fairly high priority.
Bob, is this something you could look into?
Flags: needinfo?(bob)
An earlier Linux-only run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5a1cc988139624e85faedb52f09204137b82e760

Linux + Windows (try is backed up at the moment):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3a78f0bd06245ef0655af4d8dea3d1b2357ab06b

This adds awsy to linux64-qr/opt, windows10-64-qr/opt

Do you think this is sufficient?
Flags: needinfo?(bob)
(In reply to Bob Clary [:bc:] from comment #2)
> An earlier Linux only run:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=5a1cc988139624e85faedb52f09204137b82e760
> 

Although the patch you have here looks correct, I'm not sure it's working as intended. Usually when Firefox starts up with WebRender enabled, it spits out a line that looks like this:
  WebRender - OpenGL version new 3.3 (Core Profile) Mesa 17.2.4

I don't see that in the log of the awsy job. Do you know if the Firefox output is being suppressed from the log? If not, it may be detecting the hardware as incapable of running WebRender and falling back to non-WebRender.
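The comments above rely on spotting the WebRender startup line in gecko.log to confirm that the compositor actually came up. A minimal sketch of that check (the helper name and return convention are my own, not part of the patch in this bug):

```python
import re

# Match the line Firefox prints on startup when WebRender initializes,
# e.g. "WebRender - OpenGL version new 3.3 (Core Profile) Mesa 17.2.4".
WEBRENDER_LINE = re.compile(r"WebRender - OpenGL version new ([\d.]+)")

def webrender_gl_version(log_text):
    """Return the reported OpenGL version string if WebRender started,
    or None if the line is absent (WebRender disabled or fallback)."""
    match = WEBRENDER_LINE.search(log_text)
    return match.group(1) if match else None
```

Running this over the gecko.log artifact (rather than live_backing.log, as noted below) is what distinguishes a real WebRender run from a silent fallback.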
I bet I need to specify the virtualization with a gpu. I'll look into that next.
Actually, looking at the gecko log for the first run I do see the WebRender message:
https://public-artifacts.taskcluster.net/Elh8zEJjTji2nACtF4_q1g/0/public/test_info//gecko.log
WebRender - OpenGL version new 3.3 (Core Profile) Mesa 17.2.4

My attempt to use virtual-with-gpu failed to start the test, at least on Linux:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=99a1d7104e24f769783f9534d6dda82609c9961f

I think the original patch was good.
Ah! I didn't realize the gecko log was in a separate artifact; I was looking at the live_backing.log. So yeah, that looks good for Linux at least. On Windows you'll need virtual-with-gpu I think; I had to use that for other Windows QR test jobs.
Great. Thanks kats! I'll follow up after taskcluster gets the windows situation under control.
Assignee: nobody → bob
Status: NEW → ASSIGNED
The first try run finally completed on Windows and it does not show the WebRender message, which probably means virtual-with-gpu is required there. My attempt in the second run to use virtual-with-gpu didn't work. I'll revisit and try to do normal virtual for Linux and virtual-with-gpu for Windows.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7337c28a612bd457753ed5fd130d52d0a9b9397

I am getting an exception in the Windows job which causes it to be retried up to 5 times. Unfortunately, the logs are lost once an exception occurs and I am unable to diagnose. Jonas: This is with Generic Worker, but I've also seen this with Taskcluster Worker, where exceptions cause the logs to be lost. What can I do to diagnose this failure?
Flags: needinfo?(jopsen)
@bc, ideally the reasonResolved should explain why there was an exception, or at least who is to blame.
For example, reasonResolved: 'malformed-payload' is almost always your fault, and the logs should explain why.
And reasonResolved: 'internal-error' is almost always a worker bug :)
(Generally, logs should be available, depending on the exception reason.)

In this case you seem to have:
  reasonResolved: 'claim-expired'
  See: https://tools.taskcluster.net/groups/KPo2saEETOiwWLS8na_68Q/tasks/FDSKyPtORqiMGUmWuodiYw/runs/1
Which implies that the worker just stopped reclaiming the task from the queue.
So the queue assumed the worker was dead and retried the task.

This suggests that your task is able to consistently kill the worker, the ec2 machine, or somehow decouple it from the internet.
If this was docker-worker I would say that's a bug in the worker implementation, because we try to make docker-worker as robust as possible using container isolation (even if it's not always perfect). But this is Windows, and for Windows we know isolation is not perfect. You can probably ask pmoore or grenade to help you debug this, or you can try to split the task until you figure out what part is killing the worker.

For tc-worker I would argue that whether this is a bug or just bad usage depends on the configuration.
For example tc-worker with native-engine can be configured to allow task to run as root, obviously then bad tasks can kill anything. But for tc-worker with qemu-engine (not in production yet) I would consider reliably crashing a serious bug.
For tc-worker with script-engine it really depends on what script you've provided, most likely it's the script killing the worker somehow (or a tc-worker bug).
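The triage rules Jonas lays out above can be sketched as a small helper that maps a task's final reasonResolved to a likely culprit. The status payload shape here is a simplified assumption based on the Taskcluster queue's task-status responses; the blame strings are just paraphrases of the comment, not an official taxonomy:

```python
# Rough blame table paraphrasing the rules in the comment above.
BLAME = {
    "malformed-payload": "task author; the logs should explain why",
    "internal-error": "worker bug",
    "claim-expired": "worker stopped reclaiming the task (worker/host likely died)",
}

def triage(status):
    """Inspect the last run of a task-status payload and return
    (reasonResolved, likely culprit)."""
    last_run = status["status"]["runs"][-1]
    reason = last_run.get("reasonResolved", "unknown")
    return reason, BLAME.get(reason, "unclear; inspect the run manually")
```

For the run linked above, a 'claim-expired' resolution would triage to the worker side, which matches the diagnosis that the task is taking the machine offline.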
Flags: needinfo?(jopsen)
Jonas: Thanks.

pmoore, grenade: Can you help me figure this out?

Retriggering the sy job for Windows 10 x64 QuantumRender opt in https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7337c28a612bd457753ed5fd130d52d0a9b9397 will reproduce the problem for you.

Once the job is running, you will be able to watch the live log in progress, but once the problem appears, the log disappears and there is nothing to help debug the problem.
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
On Windows 10 GPU instances, some tests cause the instance to go into an impaired state. EC2 support tell us that the network card goes offline. My suspicion is that the GPU and the network interface are sharing some hardware or address resource and that some GPU functions trigger the state (we only see this occur when specific GPU tests run).

Unfortunately, while the instances are in this state, they lose comms with the world, so we get no output from either the task logs or the aggregated system event logs, which compounds the problem of debugging this issue. We run a hacky external script that reboots instances which it finds to be in this state, and this shows up as a blue exception in Treeherder.

The issue is tracked in bug 1372172. It's been a problem for many months and we have exhausted many resolution paths already. There is ongoing work that may get us a resolution, but I don't have any time estimates for this.
Flags: needinfo?(rthijssen)
Linux x86_64 QR only; leaving Windows 10 until bug 1372172 is resolved. There do not appear to be any macOS test types for QR.
Attachment #8952298 - Flags: review?(erahm)
try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=70cccc9a818df4ef31ba76b6a79d865c6302164e

Note: try is now closed due to a failure to start Windows builds.
Flags: needinfo?(pmoore)
Attachment #8952298 - Flags: review?(erahm) → review?(jmaher)
Comment on attachment 8952298 [details] [diff] [review]
bug-1437215-awsy-webrender.patch

Review of attachment 8952298 [details] [diff] [review]:
-----------------------------------------------------------------

A few nits; happy to r+ with a bit more context.

::: taskcluster/ci/test/awsy.yml
@@ +23,5 @@
>              default: built-projects
> +    virtualization:
> +        by-test-platform:
> +            windows10-64-qr/.*: virtual-with-gpu
> +            default: virtual

Right now we won't be running this on Win10. I am OK leaving this in, but it seems more reasonable to add it in the patch where we turn it on, even if that is just for try server.
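For readers unfamiliar with the `by-test-platform` construct in the quoted YAML: each non-`default` key is treated as a pattern matched against the test platform name, with `default` as the fallback. A simplified sketch of that resolution (the real taskcluster transform logic lives in the build system and handles more cases than this):

```python
import re

def resolve_by_test_platform(mapping, platform):
    """Pick the value whose regex key fully matches the platform name;
    fall back to the 'default' entry. Simplified illustration."""
    for pattern, value in mapping.items():
        if pattern != "default" and re.fullmatch(pattern, platform):
            return value
    return mapping["default"]

# The mapping from the patch under review:
virtualization = {
    "windows10-64-qr/.*": "virtual-with-gpu",
    "default": "virtual",
}
```

So `windows10-64-qr/opt` would get `virtual-with-gpu` while `linux64-qr/opt` falls through to plain `virtual`, which is exactly the split under discussion.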

::: testing/mozharness/scripts/awsy_script.py
@@ +47,5 @@
> +        [["--enable-webrender"],
> +         {"action": "store_true",
> +          "dest": "enable_webrender",
> +          "default": False,
> +          "help": "Tries to enable the WebRender compositor.",

I don't see any code using this; is there a magic transform somewhere that already exists?
Attachment #8952298 - Flags: review?(jmaher) → review-
Removed the windows virtualization.
Added support for --enable-webrender to mach command.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1f92fd164185bb708d67a531e226e347643414e7

Note that gecko.log will show a "WebRender - OpenGL version" line when WebRender is enabled.
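The actual mozharness/mach wiring for `--enable-webrender` is in the attached patch and not reproduced in this thread. As a hedged sketch of how such a flag is commonly plumbed through a harness, one approach is to inject the `MOZ_WEBRENDER=1` environment variable into the browser's environment (the function name and structure here are illustrative, not the patch's code):

```python
import os

def build_firefox_env(enable_webrender, base_env=None):
    """Illustrative only: build the environment the harness launches
    Firefox with, forcing WebRender on via MOZ_WEBRENDER=1 when the
    --enable-webrender flag was passed."""
    env = dict(base_env if base_env is not None else os.environ)
    if enable_webrender:
        env["MOZ_WEBRENDER"] = "1"
    return env
```

Whatever the exact mechanism, the gecko.log check mentioned above remains the ground truth for whether WebRender actually initialized.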
Attachment #8952616 - Flags: review?(jmaher)
Comment on attachment 8952616 [details] [diff] [review]
bug-1437215-awsy-webrender.patch

Review of attachment 8952616 [details] [diff] [review]:
-----------------------------------------------------------------

looks good
Attachment #8952616 - Flags: review?(jmaher) → review+
leave open for windows support.
Whiteboard: [MemShrink] → [MemShrink][leave-open]
Pushed by bclary@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1162b0f53b85
Measure WebRender memory usage on Linux x86_64 using AWSY tests, r=jmaher
Whiteboard: [MemShrink][leave-open] → [MemShrink:P1][leave-open]
For the record the windows version is still failing the same way (job turns blue on TreeHerder, logs disappear): https://treeherder.mozilla.org/#/jobs?repo=try&revision=92d4aa37f122a37f318883dd3becc07c66045626
Kartikaya, you could work with a one-click loaner to maybe find out why something breaks the job. See here for details:

https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance#For_generic-worker_10.5.0_onwards
This shouldn't stop WR going to Beta, but I'd like to understand the effort to get this working (measuring the memory) on Windows before we get to Beta.
Priority: P1 → P2
Whiteboard: [MemShrink:P1][leave-open] → [MemShrink:P1][leave-open][needs-investigation]
(In reply to Maire Reavy [:mreavy] Plz needinfo from comment #24)
> This shouldn't stop WR going to Beta, but I'd like to understand the effort
> to get this working (measuring the memory) on Windows before we get to Beta.

Are we going to have a separate WebRender build or are we just mass enabling on Beta? The former would be okay, the latter will break all memory measurements on windows which should be a blocker.
Flags: needinfo?(mreavy)
With gfx.webrender.all.qualified;true (instead of gfx.webrender.all;true), WebRender will only be enabled for modern Nvidia GPUs on Win10 machines not running on battery. (bug 1475355, bug 1478150)
(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #26)
> With gfx.webrender.all.qualified;true (instead of gfx.webrender.all;true)
> WebRender will be only enabled for modern Nvidia on a Win10 without battery.
> (bug 1475355, bug 1478150)

Sorry I should have specified more clearly, I want to make sure that we don't break memory tests in automation. It sounds like we're just selectively enabling and that probably doesn't include automation?
We currently run WebRender automated tests in parallel to our existing ones.  We'll do the same thing when WR goes to Beta.
Flags: needinfo?(mreavy)
Latest status: https://treeherder.mozilla.org/#/jobs?repo=try&bugfiler=&group_state=expanded&revision=b72f111a67c1188a06046fa2773a82fb6f1c8044

The base test works fine on Windows, but the real AWSY test still makes jobs go blue. Waiting on the same set of releng bugs that bug 1424755 is waiting on.
Comment on attachment 9013466 [details]
Bug 1437215 - Run AWSY on windows10-qr builds. r=kats

Kartikaya Gupta (email:kats@mozilla.com) has approved the revision.
Attachment #9013466 - Flags: review+
Keywords: leave-open
Whiteboard: [MemShrink:P1][leave-open][needs-investigation] → [MemShrink:P1][needs-investigation]
Once Joel's patch lands this bug can be closed, so dropping the leave-open flag.
Keywords: leave-open
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f44fb4c2bbde
Run AWSY on windows10-qr builds. r=kats
https://hg.mozilla.org/mozilla-central/rev/f44fb4c2bbde
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
If we care enough about AWS costs we can try switching the awsy tests back to virtual-with-gpu instead of hardware, now that the new AMIs are deployed.