Closed Bug 1437215 Opened 2 years ago Closed Last year

Measure WebRender memory usage using AWSY tests

Categories

(Core :: Graphics: WebRender, enhancement, P2)

Tracking

RESOLVED FIXED
mozilla64
Tracking Status
firefox64 --- fixed

People

(Reporter: erahm, Assigned: bc)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P1][needs-investigation])

Attachments

(3 files)

Similar to bug 1378526 where we started tracking stylo memory usage well before landing, it would be good to get measurements in place with WebRender enabled.

We've seen some pretty extreme memory leaks from early adopters recently (bug 1437112, for example), so it seems like we should give this a fairly high priority.
Bob, is this something you could look into?
Flags: needinfo?(bob)
An earlier Linux-only run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5a1cc988139624e85faedb52f09204137b82e760

Linux + Windows (try is backed up at the moment):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3a78f0bd06245ef0655af4d8dea3d1b2357ab06b

This adds awsy to linux64-qr/opt, windows10-64-qr/opt

Do you think this is sufficient?
Flags: needinfo?(bob)
(In reply to Bob Clary [:bc:] from comment #2)
> An earlier Linux only run:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=5a1cc988139624e85faedb52f09204137b82e760
> 

Although the patch you have here looks correct, I'm not sure it's working as intended. Usually when Firefox starts up with WebRender enabled, it spits out a line that looks like this:
  WebRender - OpenGL version new 3.3 (Core Profile) Mesa 17.2.4

I don't see that in the log of the awsy job. Do you know if the Firefox output is being suppressed from the log? If not, it may be detecting the hardware as incapable of running WebRender and falling back to non-WebRender.
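The comments above rely on spotting the WebRender startup line in gecko.log to confirm that the compositor actually came up. A minimal sketch of that check (the helper name and return convention are my own, not part of the patch in this bug):

```python
import re

# Match the line Firefox prints on startup when WebRender initializes,
# e.g. "WebRender - OpenGL version new 3.3 (Core Profile) Mesa 17.2.4".
WEBRENDER_LINE = re.compile(r"WebRender - OpenGL version new ([\d.]+)")

def webrender_gl_version(log_text):
    """Return the reported OpenGL version string if WebRender started,
    or None if the line is absent (WebRender disabled or fallback)."""
    match = WEBRENDER_LINE.search(log_text)
    return match.group(1) if match else None
```

Running this over the gecko.log artifact (rather than live_backing.log, as noted below) is what distinguishes a real WebRender run from a silent fallback.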
I bet I need to specify the virtualization with a gpu. I'll look into that next.
Actually, looking at the gecko log for the first run I do see the WebRender message:
https://public-artifacts.taskcluster.net/Elh8zEJjTji2nACtF4_q1g/0/public/test_info//gecko.log
WebRender - OpenGL version new 3.3 (Core Profile) Mesa 17.2.4

My attempt to use virtual-with-gpu failed to start the test, at least on Linux:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=99a1d7104e24f769783f9534d6dda82609c9961f

I think the original patch was good.
Ah! I didn't realize the gecko log was in a separate artifact; I was looking at the live_backing.log. So yeah, that looks good for Linux at least. On Windows you'll need virtual-with-gpu I think; I had to use that for other Windows QR test jobs.
Great. Thanks kats! I'll follow up after taskcluster gets the windows situation under control.
Assignee: nobody → bob
Status: NEW → ASSIGNED
The first try run finally completed on Windows and it does not show the WebRender message, which probably means virtual-with-gpu is required there. My attempt in the second run to use virtual-with-gpu didn't work. I'll revisit and try to do normal virtual for Linux and virtual-with-gpu for Windows.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7337c28a612bd457753ed5fd130d52d0a9b9397

I am getting an exception in the Windows job which causes it to be retried up to 5 times. Unfortunately, the logs are lost once an exception occurs and I am unable to diagnose. Jonas: This is with Generic Worker, but I've also seen this with Taskcluster Worker, where exceptions cause the logs to be lost. What can I do to diagnose this failure?
Flags: needinfo?(jopsen)
@bc, ideally the reasonResolved should explain why there was an exception, or at least who is to blame.
For example, reasonResolved: 'malformed-payload' is almost always your fault, and the logs should explain why.
And reasonResolved: 'internal-error' is almost always a worker bug :)
(Generally, logs should be available, depending on the exception reason.)

In this case you seem to have:
  reasonResolved: 'claim-expired'
  See: https://tools.taskcluster.net/groups/KPo2saEETOiwWLS8na_68Q/tasks/FDSKyPtORqiMGUmWuodiYw/runs/1
Which implies that the worker just stopped reclaiming the task from the queue.
So the queue assumed the worker was dead and retried the task.

This suggests that your task is able to consistently kill the worker, the ec2 machine, or somehow decouple it from the internet.
If this was docker-worker I would say that's a bug in the worker implementation, because we try to make docker-worker as robust as possible using container isolation (even if it's not always perfect). But this is Windows, and for Windows we know isolation is not perfect. You can probably ask pmoore or grenade to help you debug this, or you can try to split the task until you figure out what part is killing the worker.

For tc-worker I would argue that whether this is a bug or just bad usage depends on the configuration.
For example tc-worker with native-engine can be configured to allow task to run as root, obviously then bad tasks can kill anything. But for tc-worker with qemu-engine (not in production yet) I would consider reliably crashing a serious bug.
For tc-worker with script-engine it really depends on what script you've provided, most likely it's the script killing the worker somehow (or a tc-worker bug).
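The triage rules Jonas lays out above can be sketched as a small helper that maps a task's final reasonResolved to a likely culprit. The status payload shape here is a simplified assumption based on the Taskcluster queue's task-status responses; the blame strings are just paraphrases of the comment, not an official taxonomy:

```python
# Rough blame table paraphrasing the rules in the comment above.
BLAME = {
    "malformed-payload": "task author; the logs should explain why",
    "internal-error": "worker bug",
    "claim-expired": "worker stopped reclaiming the task (worker/host likely died)",
}

def triage(status):
    """Inspect the last run of a task-status payload and return
    (reasonResolved, likely culprit)."""
    last_run = status["status"]["runs"][-1]
    reason = last_run.get("reasonResolved", "unknown")
    return reason, BLAME.get(reason, "unclear; inspect the run manually")
```

For the run linked above, a 'claim-expired' resolution would triage to the worker side, which matches the diagnosis that the task is taking the machine offline.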
Flags: needinfo?(jopsen)
Jonas: Thanks.

pmoore, grenade: Can you help me figure this out?

Retriggering the sy job for Windows 10 x64 QuantumRender opt in https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7337c28a612bd457753ed5fd130d52d0a9b9397 will reproduce the problem for you.

Once the job is running, you will be able to watch the live log in progress, but once the problem appears, the log disappears and there is nothing to help debug the problem.
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
On Windows 10 GPU instances, some tests cause the instance to go into an impaired state. EC2 support tell us that the network card goes offline. My suspicion is that the GPU and the network interface are sharing some hardware or address resource and that some GPU functions trigger the state (we only see this occur when specific GPU tests run).

Unfortunately, while the instances are in this state, they lose comms with the world, so we get no output from either the task logs or the aggregated system event logs, which compounds the problem of debugging this issue. We run a hacky external script that reboots instances which it finds to be in this state, and this shows up as a blue exception in Treeherder.

The issue is tracked in bug 1372172. It's been a problem for many months and we have exhausted many resolution paths already. There is ongoing work that may get us a resolution, but I don't have any time estimates for this.
Flags: needinfo?(rthijssen)
Linux x86_64 QR only; leaving Windows 10 until bug 1372172 is resolved. There do not appear to be any macOS test types for QR.
Attachment #8952298 - Flags: review?(erahm)
try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=70cccc9a818df4ef31ba76b6a79d865c6302164e

Note: try is now closed due to a failure to start Windows builds.
Flags: needinfo?(pmoore)
Attachment #8952298 - Flags: review?(erahm) → review?(jmaher)
Comment on attachment 8952298 [details] [diff] [review]
bug-1437215-awsy-webrender.patch

Review of attachment 8952298 [details] [diff] [review]:
-----------------------------------------------------------------

A few nits; happy to r+ with a bit more context.

::: taskcluster/ci/test/awsy.yml
@@ +23,5 @@
>              default: built-projects
> +    virtualization:
> +        by-test-platform:
> +            windows10-64-qr/.*: virtual-with-gpu
> +            default: virtual

Right now we won't be running this on Win10. I am OK leaving this in, but it seems more reasonable to add it in the patch where we turn it on, even if that is just for try server.
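For readers unfamiliar with the `by-test-platform` construct in the quoted YAML: each non-`default` key is treated as a pattern matched against the test platform name, with `default` as the fallback. A simplified sketch of that resolution (the real taskcluster transform logic lives in the build system and handles more cases than this):

```python
import re

def resolve_by_test_platform(mapping, platform):
    """Pick the value whose regex key fully matches the platform name;
    fall back to the 'default' entry. Simplified illustration."""
    for pattern, value in mapping.items():
        if pattern != "default" and re.fullmatch(pattern, platform):
            return value
    return mapping["default"]

# The mapping from the patch under review:
virtualization = {
    "windows10-64-qr/.*": "virtual-with-gpu",
    "default": "virtual",
}
```

So `windows10-64-qr/opt` would get `virtual-with-gpu` while `linux64-qr/opt` falls through to plain `virtual`, which is exactly the split under discussion.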

::: testing/mozharness/scripts/awsy_script.py
@@ +47,5 @@
> +        [["--enable-webrender"],
> +         {"action": "store_true",
> +          "dest": "enable_webrender",
> +          "default": False,
> +          "help": "Tries to enable the WebRender compositor.",

I don't see any code using this; is there a magic transform somewhere that already exists?
Attachment #8952298 - Flags: review?(jmaher) → review-
Removed the windows virtualization.
Added support for --enable-webrender to mach command.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1f92fd164185bb708d67a531e226e347643414e7

Note that gecko.log will show a "WebRender - OpenGL version" line when WebRender is enabled.
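The actual mozharness/mach wiring for `--enable-webrender` is in the attached patch and not reproduced in this thread. As a hedged sketch of how such a flag is commonly plumbed through a harness, one approach is to inject the `MOZ_WEBRENDER=1` environment variable into the browser's environment (the function name and structure here are illustrative, not the patch's code):

```python
import os

def build_firefox_env(enable_webrender, base_env=None):
    """Illustrative only: build the environment the harness launches
    Firefox with, forcing WebRender on via MOZ_WEBRENDER=1 when the
    --enable-webrender flag was passed."""
    env = dict(base_env if base_env is not None else os.environ)
    if enable_webrender:
        env["MOZ_WEBRENDER"] = "1"
    return env
```

Whatever the exact mechanism, the gecko.log check mentioned above remains the ground truth for whether WebRender actually initialized.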
Attachment #8952616 - Flags: review?(jmaher)
Comment on attachment 8952616 [details] [diff] [review]
bug-1437215-awsy-webrender.patch

Review of attachment 8952616 [details] [diff] [review]:
-----------------------------------------------------------------

looks good
Attachment #8952616 - Flags: review?(jmaher) → review+
leave open for windows support.
Whiteboard: [MemShrink] → [MemShrink][leave-open]
Pushed by bclary@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1162b0f53b85
Measure WebRender memory usage on Linux x86_64 using AWSY tests, r=jmaher
Whiteboard: [MemShrink][leave-open] → [MemShrink:P1][leave-open]
For the record the windows version is still failing the same way (job turns blue on TreeHerder, logs disappear): https://treeherder.mozilla.org/#/jobs?repo=try&revision=92d4aa37f122a37f318883dd3becc07c66045626
Kartikaya, you could work with a one-click loaner to maybe find out why something breaks the job. See here for details:

https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance#For_generic-worker_10.5.0_onwards
This shouldn't stop WR going to Beta, but I'd like to understand the effort to get this working (measuring the memory) on Windows before we get to Beta.
Priority: P1 → P2
Whiteboard: [MemShrink:P1][leave-open] → [MemShrink:P1][leave-open][needs-investigation]
(In reply to Maire Reavy [:mreavy] Plz needinfo from comment #24)
> This shouldn't stop WR going to Beta, but I'd like to understand the effort
> to get this working (measuring the memory) on Windows before we get to Beta.

Are we going to have a separate WebRender build or are we just mass enabling on Beta? The former would be okay, the latter will break all memory measurements on windows which should be a blocker.
Flags: needinfo?(mreavy)
With gfx.webrender.all.qualified;true (instead of gfx.webrender.all;true), WebRender will only be enabled for modern Nvidia GPUs on Win10 machines not running on battery. (bug 1475355, bug 1478150)
(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #26)
> With gfx.webrender.all.qualified;true (instead of gfx.webrender.all;true)
> WebRender will be only enabled for modern Nvidia on a Win10 without battery.
> (bug 1475355, bug 1478150)

Sorry I should have specified more clearly, I want to make sure that we don't break memory tests in automation. It sounds like we're just selectively enabling and that probably doesn't include automation?
We currently run WebRender automated tests in parallel to our existing ones.  We'll do the same thing when WR goes to Beta.
Flags: needinfo?(mreavy)
Latest status: https://treeherder.mozilla.org/#/jobs?repo=try&bugfiler=&group_state=expanded&revision=b72f111a67c1188a06046fa2773a82fb6f1c8044

The base test works fine on Windows, but the real AWSY test still makes jobs go blue. Waiting on the same set of releng bugs that bug 1424755 is waiting on.
Comment on attachment 9013466 [details]
Bug 1437215 - Run AWSY on windows10-qr builds. r=kats

Kartikaya Gupta (email:kats@mozilla.com) has approved the revision.
Attachment #9013466 - Flags: review+
Keywords: leave-open
Whiteboard: [MemShrink:P1][leave-open][needs-investigation] → [MemShrink:P1][needs-investigation]
Once Joel's patch lands this bug can be closed, so dropping the leave-open flag.
Keywords: leave-open
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f44fb4c2bbde
Run AWSY on windows10-qr builds. r=kats
https://hg.mozilla.org/mozilla-central/rev/f44fb4c2bbde
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
If we care enough about AWS costs we can try switching the awsy tests back to virtual-with-gpu instead of hardware, now that the new AMIs are deployed.