Closed Bug 1300355 Opened 8 years ago Closed 5 years ago

intermittent blank reftest test or reference screenshots on win7 PGO

Categories

(Core :: Graphics, defect, P3)

defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox51 --- wontfix
firefox52 + wontfix
firefox-esr52 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- wontfix

People

(Reporter: dbaron, Assigned: mattwoodrow)

References

(Depends on 2 open bugs, Blocks 18 open bugs)

Details

(Whiteboard: gfx-noted)

Attachments

(3 files, 3 obsolete files)

We've started seeing intermittent reftest failures with blank reftests (either the test or the reference is blank), sometimes with 1 failure in a run and sometimes three.

They all seem to be on Win7 PGO, and the failures are usually preceded by some sort of GFX error message (although the message varies).
The message seems to be one of:
INFO - [GFX1]: Failed 2 buffer db=00000000 dw=00000000 for 0, 0, 800, 1000 
INFO - [GFX1]: Failed to create DrawTarget, Type: 5 Size: Size(800,1000)
Blocks: 1300361
Blocks: 1300362
Blocks: 1300363
Blocks: 1300566
Blocks: 1300608
Not sure whether the assertion stack will help anyone (the crash stack certainly won't), but bug 1300608 is "Failed to create DrawTarget" in a debug build, and thus an assertion.
Blocks: 1300609
Blocks: 1300627
Blocks: 1300635
Blocks: 1300955
Blocks: 1300325
Blocks: 1300966
Blocks: 1300967
Blocks: 1300968
We had some problems with skia content on Windows due to OOMing during the tests. We weren't GCing enough and that sometimes caused us to be unable to allocate memory to draw content. Terrence, any other ideas here?
Flags: needinfo?(terrence)
See Also: → 1299204, 1292218
Flags: needinfo?(terrence) → needinfo?(continuation)
Is this happening for specific tests? Is there any way to say add some pauses of a few seconds here and there for these tests? Or maybe the directory could be split up, assuming that reftests have run-by-directory, which I'm not sure if they do.
Flags: needinfo?(continuation)
They aren't even chunked, much less run-by-dir, and while there's a little affinity for hitting w3c-css/submitted/variables/ and svg/filters/, it's just a little affinity based on short sample time.
(In reply to Mason Chang [:mchang] from comment #3)
> We had some problems with skia content on Windows due to OOMing during the
> tests. We weren't GCing enough and that sometimes caused us to be unable to
> allocate memory to draw content. Terrence, any other ideas here?

That seems plausible.
Any other ideas for being more aggressive with the gc here? Thanks.
Flags: needinfo?(continuation)
Whiteboard: gfx-noted
See Also: → 1300310
Blocks: 1301447
(In reply to Mason Chang [:mchang] from comment #7)
> Any other ideas for being more aggressive with the gc here? Thanks.

Not really, sorry. Splitting up debug windows reftests might help. They take like 90 minutes. That must really fragment the address space.
Flags: needinfo?(continuation)
Blocks: 1301599
Blocks: 1301600
Blocks: 1301687
Blocks: 1301842
Blocks: 1301843
Blocks: 1301844
Blocks: 1301845
Blocks: 1301846
Blocks: 1301847
Blocks: 1301923
Blocks: 1301924
Blocks: 1301931
Blocks: 1301964
This particular case, or set of cases, is not 90 minute debug runs, it's 25 minute PGO runs.

I've been resisting our knee-jerk calls to split up the suite rather than fix our inability to run it, because we are desperately short on Windows hardware, routinely having 8-24+ hour waittimes for Windows tests on try, but since we run Windows 7 tests on AWS VMs it would be a little less horrible. Looks like when it has load, this pool can take another job within around 2 minutes after finishing one, and the actual test running time of the 25 minute job is 20 minutes, so we'd only be burning an additional 14 minutes of wasted time per reftest run, if we were willing to accept the awkwardness of running one chunk for WinXP and Win8 but three chunks for Win7.
Blocks: 1301990
Blocks: 1301991
Blocks: 1302115
Blocks: 1302116
Blocks: 1302438
So it sounds like we either can do chunking or just GC more often here. Since bug 1299204, we've been doing a full GC sweep every 3000 reftests. Maybe we can bump that up to something like 1000 or 100? Otherwise, are we left only with chunking?

I'm not sure what we can do on the graphics side. If we don't have memory to allocate a texture, then we can't display anything. We already try to clear out caches when we get a low memory notification, but I've been told that it's not really reliable. 

Do we have a preference for chunking versus just GC more often?
Flags: needinfo?(philringnalda)
Flags: needinfo?(continuation)
I'm fine with increasing how frequently we do them or chunking, but I don't think it will help. I assume the basic issue here is that there's a directory that has a ton of graphically intensive tests that all allocate a lot of memory. You need to somehow do a forceGC + forceCC in the middle of this directory.
Flags: needinfo?(continuation)
Chunking won't help because this directory will still all be in the same chunk, and forcing a GC/CC more frequently will only help if one of those happens to fall during this directory (and could easily change later if more tests are added).
Can we split up the directory?
Are there actually large chunks of graphics memory that require a GC to be deleted, rather than being deleted when the presentation is destroyed?  If that's the case, that sounds bad, and probably a regression.
Flags: needinfo?(philringnalda)
(In reply to David Baron :dbaron: ⌚️UTC-7 (busy September 14-25) from comment #14)
> Are there actually large chunks of graphics memory that require a GC to be
> deleted, rather than being deleted when the presentation is destroyed?  If
> that's the case, that sounds bad, and probably a regression.

Nical, do you know? My understanding is that lots of gfx memory is just RefPtr. When do textures get cleaned up?
Flags: needinfo?(nical.bugzilla)
Textures are generally reference counted, and the ones that are directly associated with DOM elements are held alive by the things they represent (images, canvases, video elements, etc.). There's also a lot of texture memory that belongs to the layer tree which should go away when the presentation is destroyed (I think), except if they are held alive by a pool (tiling comes to mind, but it is not used on windows). We have a concept of memory pressure event, I don't remember if we use it at all on desktop, if so can we make sure it is used after every few test? I don't think it will make as much of a difference as running GC+CC but who knows...
Flags: needinfo?(nical.bugzilla)
Blocks: 1306907
Do we actually know what the problem is here? 

Have we run out of available memory (and if so, is it because of gfx)? Or have we just fragmented the heap to the point that the large allocations gfx tries to make fail?

Nick, is it possible to manually trigger an about:memory report (or any memory stats at all) when we fail an allocation during automated testing that goes to somewhere we can access? stdout is the easiest for tbpl log parsing, but I'm sure we could get ask to a file on disk if we asked the right people.

Ideally these would be synchronous so we can know about the state we're in right now, especially since it could change quite quickly under reftest load.
Flags: needinfo?(n.nethercote)
I did a try push with some diagnostics along with my patch for bug 1305326 which seems to make this more frequent.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7b860233b71

When it fails:

[GFX1-]: Failed 2 buffer db=00000000 dw=00000000 for 0, 0, 800, 1000 
Largest contiguous memory region: 5218304
Memory stats: Load: 52, Total physical: 3211976704, available: 1511645184, Total vmem: 2147352576, available: 449433600 available-extended: 0 

Looks like we're trying to allocate a ~3.2mb buffer and the largest contiguous memory chunk is ~5.2mb. Seems like it should have succeeded, but not really surprising that it didn't. Total available memory is fine.

Given that, I think just splitting the suite in half should work fine. There's no obvious huge peak of memory load, so we don't need to worry about splitting a specific directory.
Oh, and the hang happens when the failing allocation is one that reports an error back to javascript. The reftest harness doesn't catch them, so it just stops trying to do things.

If the failing allocation is during drawing then we just get a blank image.

It might be worth trying to profile the allocation and see if we're unnecessarily churning memory more than we need to.

Does anyone if we have tools for doing that?
Blocks: 1305326
> Nick, is it possible to manually trigger an about:memory report (or any
> memory stats at all) when we fail an allocation during automated testing
> that goes to somewhere we can access? stdout is the easiest for tbpl log
> parsing, but I'm sure we could get ask to a file on disk if we asked the
> right people.

You can use the methods in nsIMemoryInfoDumper/nsMemoryInfoDumper to save a memory report to disk. See nsThread::SaveMemoryReportNearOOM() (and the functions it calls) for an example of its use.
Flags: needinfo?(n.nethercote)
(In reply to Matt Woodrow (:mattwoodrow) from comment #17)
> stdout is the easiest for tbpl log parsing, but I'm sure we could get ask to a file on disk if we asked the right people.

In automation, the environment variable MOZ_UPLOAD_DIR is set to a directory you can store files to. There will be a link to the files saved to that directory in Treeherder.
Blocks: 1309753
Blocks: 1311878
Blocks: 1311879
Depends on: 1311937
Depends on: 1312612
Depends on: 1311896
No longer depends on: 1311896, 1311937, 1312612
The blank snapshot happened on Linux PGO too.  Bug 1312965, bug 1313785, bug 1313279.  Is it considered as the same as this?
(In reply to Hiroyuki Ikezoe (:hiro) from comment #22)
> The blank snapshot happened on Linux PGO too.  Bug 1312965, bug 1313785, bug
> 1313279.  Is it considered as the same as this?

Maybe not.
I can not find gfx info log in these three Linux PGO test failures(see comment 1):
INFO - [GFX1]: Failed to create DrawTarget, Type: 5 Size: Size(800,1000)
See Also: → 1314838
Thanks, cjku.  Filed bug 1314838.
Blocks: 1315307
Blocks: 1318364
Blocks: 1319820
Depends on: 1302112
Blocks: 1302112
No longer depends on: 1302112
Can we just set duplicate for those bugs with same cause ? Or are we going to re-visit them one by one in the future ?
No, we can't, not unless someone fixes bug 1299274 first (which would then just have the effect of making this such a topfail that it would more quickly be identified as a "fix it, or lose right to run the tests."
Ah, I see. Thanks for your feedback. I'll keep watching intermittent bugs and set up dependency if they are in same cause.
Blocks: 1321537
It seems Win7 debug R1 will have high intermittent failure rate when reftests are added under layout/reftests/w3c-css/submitted. Both Bug 1311244 comment 43 and Bug 1316482 comment 1300355 are backout due to the symptom similar to this bug. We need help to fix or workaround the issue to prevent it from blocking feature developing and bug fixing.
I wonder if we could export an ability to restart the browser instance while running reftest? Something like a line "need-to-restart-browser" in reftest.list, and the browser could restart at this certain point. In this way, we could force a browser restart before/after running a huge set of tests. Maybe this could be an alternative if doing chunking is not an option at present.

We seem have a '--run-by-dir' option in mochitest (
http://searchfox.org/mozilla-central/rev/225ab0637ed51b8b3f9f4ee2f9c339a37a65b626/testing/mochitest/mochitest_options.py#231), not sure if we could implement this similar function in reftest...
There is recent Bug 1331371 hit the same fail, the only difference is the backend Type.

03:00:37     INFO - Assertion failure: [GFX1]: Failed to create DrawTarget, Type: 3 Size: Size(800,1000), at c:\builds\moz2_slave\autoland-w32-d-000000000000000\build\src\obj-firefox\dist\include\mozilla/gfx/Logging.h:515

I would try to approach on Bug 1331371 to see if it helps to this one.
See Also: → 1331371
Blocks: 1333287
Blocks: 1334451
Blocks: 1332156
Blocks: 1323926
Blocks: 1331464
Blocks: 1332632
Blocks: 1332158
Blocks: 1315767
Blocks: 1331207
Blocks: 1330800
Blocks: 1334390
Blocks: 1331840
Blocks: 1333755
Blocks: 1332157
Blocks: 1315859
Blocks: 1332038
Blocks: 1331644
Blocks: 1333643
Blocks: 1331641
Blocks: 1332846
Blocks: 1334738
Blocks: 1334739
Blocks: 1334740
Blocks: 1334741
(In reply to Vincent Liu[:vliu] from comment #30)
> There is recent Bug 1331371 hit the same fail, the only difference is the
> backend Type.
> 
> 03:00:37     INFO - Assertion failure: [GFX1]: Failed to create DrawTarget,
> Type: 3 Size: Size(800,1000), at
> c:\builds\moz2_slave\autoland-w32-d-000000000000000\build\src\obj-
> firefox\dist\include\mozilla/gfx/Logging.h:515
> 
> I would try to approach on Bug 1331371 to see if it helps to this one.

In recent weeks, I can't reproduce this issue even I tried to set up Win7 vm on local or looked into it on try. Some bugs like bug 1316482 and 1311244 disabled many reftests because of this. Based on this, I tried to enabled them and sent a try. Please see [1] for the try result. 

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d28066d93e0370b4b46fe8c3ee9e49812da58a0f

It seems that even I enabled them, the issue still not happens. From the above results, I will file a bug to enable those reftests in bug 1316482 and 1311244 and observe what we will see next.
(In reply to Vincent Liu[:vliu] from comment #32)
> (In reply to Vincent Liu[:vliu] from comment #30)
> > There is recent Bug 1331371 hit the same fail, the only difference is the
> > backend Type.
> > 
> > 03:00:37     INFO - Assertion failure: [GFX1]: Failed to create DrawTarget,
> > Type: 3 Size: Size(800,1000), at
> > c:\builds\moz2_slave\autoland-w32-d-000000000000000\build\src\obj-
> > firefox\dist\include\mozilla/gfx/Logging.h:515
> > 
> > I would try to approach on Bug 1331371 to see if it helps to this one.
> 
> In recent weeks, I can't reproduce this issue even I tried to set up Win7 vm
> on local or looked into it on try. Some bugs like bug 1316482 and 1311244
> disabled many reftests because of this. Based on this, I tried to enabled
> them and sent a try. Please see [1] for the try result. 
> 
> [1]:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=d28066d93e0370b4b46fe8c3ee9e49812da58a0f
> 
> It seems that even I enabled them, the issue still not happens. From the
> above results, I will file a bug to enable those reftests in bug 1316482 and
> 1311244 and observe what we will see next.

From comment history from 20170127, I found the current status is it is more easier to reproduce it in opt build than in debug build. Since bug 1316482 and bug 1311244 only disable those reftests in debug build, maybe that's the reason why we saw it more easier in opt. Based on this, I will stop filing bug to enable those bugs in debug build until we have clear view.
Hi Phil,

(In reply to Matt Woodrow (:mattwoodrow) from comment #18)
> I did a try push with some diagnostics along with my patch for bug 1305326
> which seems to make this more frequent.
> 

> Given that, I think just splitting the suite in half should work fine.
> There's no obvious huge peak of memory load, so we don't need to worry about
> splitting a specific directory.

From tracking the comment history, I believe increasing the trunk splitting the suite worth to try to see the result. Do you or anyone or mdn link knows how to send patch to try to do this? I would like to do this to see if it can actually solve this issue. Really thanks.
Flags: needinfo?(philringnalda)
Windows, so jobs run by buildbot? My guess would be that you can't, but I wouldn't know, not something that's ever been anything to do with me. You might ask in #ateam during the US daytime.
Flags: needinfo?(philringnalda)
(In reply to Vincent Liu[:vliu] from comment #34)
> Hi Phil,
> 
> (In reply to Matt Woodrow (:mattwoodrow) from comment #18)
> > I did a try push with some diagnostics along with my patch for bug 1305326
> > which seems to make this more frequent.
> > 
> 
> > Given that, I think just splitting the suite in half should work fine.
> > There's no obvious huge peak of memory load, so we don't need to worry about
> > splitting a specific directory.
> 
> From tracking the comment history, I believe increasing the trunk splitting
> the suite worth to try to see the result. Do you or anyone or mdn link knows
> how to send patch to try to do this? I would like to do this to see if it
> can actually solve this issue. Really thanks.

Hi gbrown,

Is it possible I can do this on my own? Thanks.
Flags: needinfo?(gbrown)
You can't test buildbot-config changes on Try. I'd be willing to split the reftest jobs up on Ash and see what happens, though.
Yeah, sorry. Ryan's offer is your best bet.
Flags: needinfo?(gbrown)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #37)
> You can't test buildbot-config changes on Try. I'd be willing to split the
> reftest jobs up on Ash and see what happens, though.

I am not quite clear to know about *Ash* you mentioned. Is it a platform on try that the issue can reproduce? Because I never see this term on trychooser, maybe you can have input about this? Thanks
Flags: needinfo?(ryanvm)
Ash is the project branch we've been using for e10s-related testing. I sync it roughly daily with mozilla-central. You can see the results on Treeherder at https://treeherder.mozilla.org/#/jobs?repo=ash.
Flags: needinfo?(ryanvm)
 (In reply to Ryan VanderMeulen [:RyanVM] from comment #37)
> You can't test buildbot-config changes on Try. I'd be willing to split the
> reftest jobs up on Ash and see what happens, though.

From [1], I didn't see the way like increasing trunk to split the reftest jobs up on Ash. Have you started doing that? Or there was something I missed. Really thanks.

[1]: https://treeherder.mozilla.org/#/jobs?repo=ash
Flags: needinfo?(ryanvm)
The attached buildbot-configs patch, once landed and put into production, will split up the Windows reftests into 2 chunks. I will then do a new push to Ash from mozilla-central so we can see retrigger the runs and see how they do.

Note that Ash has a bit of a different configuration from mozilla-central in that it currently has 4 content processes enabled by default on it. That said, I can see these same failures happening there as well, so I don't think it'll have any meaningful impact on the results.

Jordan, the diff is a bit bigger than I'd hoped because I had to fight with all the logic related to Win7 AWS jobs and ended up moving and refactoring things a bit to make life easier. I've verified that the builder diff is sane, however. It also has the added benefit of turning off a bunch of non-e10s jobs that had been inadvertently enabled on Ash during all of that migration, so that's a nice plus :).
Flags: needinfo?(ryanvm)
Attachment #8835763 - Flags: review?(jlund)
Comment on attachment 8835763 [details] [diff] [review]
refactor Ash test scheduling and split Windows reftests into two chunks

Review of attachment 8835763 [details] [diff] [review]:
-----------------------------------------------------------------

overall looks good. I don't know much about the win7_vm_gfx and win7_vm platforms so I'll trust in you that our infra is setup to handle this. I only have one nit with regards to some of the condition statements in-line below. resetting r? for now until new patch is up or you push back :)

::: mozilla-tests/config.py
@@ +3009,5 @@
> +        if slave_platform not in BRANCHES['ash']['platforms'][platform]:
> +            continue
> +
> +        # Linux jobs are scheduled via in-tree Taskcluster configs, so no need for anything here.
> +        if slave_platform in BRANCHES['ash']['platforms'][platform] and slave_platform in ('yosemite_r7'):

to help with some of the complexity, could we remove every `if slave_platform not in BRANCHES['ash']['platforms'][platform]` from every condition statement? I believe that condition is checked above at the beginning of the loop.

Also, since every single one of these platform blocks are for individual platform, could we say something like: `if slave_platform == "yosemite_r7":` rather than check for presence in a list?
Attachment #8835763 - Flags: review?(jlund) → review-
Yeah, some of that logic was a carryover from another time when I was setting things for multiple platforms in those blocks. This patch makes the changes you requested and still produces a sane builder diff.

And yes, I've made sure to split the win7_ix/win7_vm/win7_vm_gfx jobs up in the same way they're scheduled in production as well.
Attachment #8835763 - Attachment is obsolete: true
Attachment #8835804 - Flags: review?(jlund)
Comment on attachment 8835804 [details] [diff] [review]
refactor Ash test scheduling and split Windows reftests into two chunks

Review of attachment 8835804 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm :)
Attachment #8835804 - Flags: review?(jlund) → review+
(In reply to Ryan VanderMeulen [:RyanVM] from comment #42)
> Created attachment 8835763 [details] [diff] [review]
> refactor Ash test scheduling and split Windows reftests into two chunks
> 

Thanks for your help to figure it out. One question, As I know currently we have two trucks on windows 7 debug build and one for opt build. does your patch work on debug or opt? Until now we have chance to see this fail on both debug and opt, is it possible we can observe it on both? Really thanks.
Flags: needinfo?(ryanvm)
From looked into the recent fails, it all heppens in sumitted/variables/*.html. To see the fails doesn't relative to test flaw itself but possibly the test environment, I tried send a try to remove all reftests in layout/ except sumitted/variables/*.html and reftests in bug 1316482/bug 1311244. It turns out that it got all green after so many jobs retriggered. Please see [1] for reference.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a009919c5d625d2ab6e5c19ae5674d62d27aa5dc
(In reply to Vincent Liu[:vliu] from comment #47)
> Thanks for your help to figure it out. One question, As I know currently we
> have two trucks on windows 7 debug build and one for opt build. does your
> patch work on debug or opt? Until now we have chance to see this fail on
> both debug and opt, is it possible we can observe it on both? Really thanks.

The Ash change went into production yesterday. Unfortunately, it doesn't appear to have helped. Take note of the red Ru1 jobs at the link below:
https://treeherder.mozilla.org/#/jobs?repo=ash&revision=c27dd6952505c5cb2733c0bd077f370f676c3536&filter-searchStr=win%20ref

Looks like adding more chunks isn't going to be a silver bullet here.
Flags: needinfo?(ryanvm)
Blocks: 1338749
Blocks: 1338379
Blocks: 1338751
Blocks: 1338912
Blocks: 1336341
This is basically permafailing on Aurora/Beta (and soon ESR52) at the moment where all Windows opt builds are PGO. Is this anybody's active priority at the moment or do we need to resort to hiding Windows PGO reftests across all trees until someone has more time to investigate? Leaving them in a nearly-permafailing state isn't doing any useful testing for us.
Flags: needinfo?(bugs)
Do we have a regression range for when this started permafailing?  (Speaking of which, do we have a regression range for when it started in the first place?)
Another question is whether this is really a continuation of bug 1167155 and it just started happening in this form after https://hg.mozilla.org/mozilla-central/rev/6f6ae4f2c283 ?
My recollection is that this was tied to the Skia switchover.

Per the email I'm going to be sending to release-drivers in a bit, I think this needs to block the 52 release.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #53)
> Per the email I'm going to be sending to release-drivers in a bit, I think
> this needs to block the 52 release.

hmm, looking at Treeherder it's not quite as permafail as I'd thought it was, though it's still pretty far from great. We seem to hit this failure roughly once per push across one of the Win7 reftest suites.
Another possibly-relevant changeset around when this started is https://hg.mozilla.org/mozilla-central/rev/9bbc7d520ccb from bug 1298345.
(In reply to David Baron :dbaron: ⌚️UTC-8 from comment #52)
> Another question is whether this is really a continuation of bug 1167155 and
> it just started happening in this form after
> https://hg.mozilla.org/mozilla-central/rev/6f6ae4f2c283 ?

I think it is, but it's hard to tell as the logs from that bug have expired.
I had a play with a few ideas to try reduce the memory usage:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c58b2496ad0683d7b8a275a477f870901a93d17c

Didn't do the right magic to get PGO builds unfortunately.

Ryan, how often do we expect it to fail on non-PGO builds?
Flags: needinfo?(ryanvm)
You'd have to look at Treeherder. Also, for PGO on Try:
https://wiki.mozilla.org/ReleaseEngineering/TryChooser#What_if_I_want_PGO_for_my_build
Flags: needinfo?(ryanvm)
There is a try[1] I sent last Thursday also shows OOM happens to case the crash. 

 1. before crash , Largest contiguous memory region remains about : 5238784 (~5.2M)
 2. the log message I added "SkMallocPixelRef::NewUsing size=3200000, addr=00000000" shows that alloc() returns null addr

[1]: https://archive.mozilla.org/pub/firefox/try-builds/vliu@mozilla.com-db7f562916f4bc86770d5ac158c7554720b939d3/try-win32/try_win7_ix_test-reftest-no-accel-bm119-tests1-windows-build1392.txt.gz

Not only this meta, bug 1315037 also has OOM issue. Based on this, is it possible increasing installed memory can improve this?
Flags: needinfo?(ryanvm)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae62e77ea73229e2564bfbea927de783cf7d7f8f

PGO mozconfig on a mozilla-beta tree. Can't reproduce the failure any more.
(In reply to Vincent Liu[:vliu] from comment #60)
 
> Not only this meta, bug 1315037 also has OOM issue. Based on this, is it
> possible increasing installed memory can improve this?

It's more likely that we're running out of address space (3GB) rather than physical memory, since I assume we could use swap if it were the latter.
(In reply to Matt Woodrow (:mattwoodrow) from comment #62)
> (In reply to Vincent Liu[:vliu] from comment #60)
>  
> > Not only this meta, bug 1315037 also has OOM issue. Based on this, is it
> > possible increasing installed memory can improve this?
> 
> It's more likely that we're running out of address space (3GB) rather than
> physical memory, since I assume we could use swap if it were the latter.

The memory layout on 32bit windows should have 2GB of user space usage. I think that's why I always see addr is close to the upper bound of user space. Since available memory is larger enough, increasing memory may not helpful for it. 


23:43:17     INFO -  SkMallocPixelRef::NewUsing size=3200000, addr=7F800000
......
23:43:17     INFO -  SkMallocPixelRef::NewUsing size=3200000, addr=7EC00000
23:43:17     INFO - REFTEST TEST-LOAD | file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/svg/filters/svg-filter-chains/different-StrokePaint-filter-regions-ref.svg | 11992 / 15159 (79%)
23:43:17     INFO -  SkMallocPixelRef::NewUsing size=3200000, addr=00000000
Flags: needinfo?(ryanvm)
(In reply to Matt Woodrow (:mattwoodrow) from comment #61)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=ae62e77ea73229e2564bfbea927de783cf7d7f8f
> 
> PGO mozconfig on a mozilla-beta tree. Can't reproduce the failure any more.

Please tell me you're getting this reviewed somewhere? :)
Flags: needinfo?(matt.woodrow)
Blocks: 1341118
Blocks: 1336845
Blocks: 1341662
Blocks: 1341630
Blocks: 1336565
Blocks: 1341124
Blocks: 1341661
Blocks: 1341632
Blocks: 1341140
Blocks: 1341141
Blocks: 1336883
Blocks: 1338849
Blocks: 1341629
Blocks: 1336347
Blocks: 1339297
Blocks: 1336773
Blocks: 1339296
Blocks: 1332420
Blocks: 1341659
Blocks: 1338041
Blocks: 1341297
Blocks: 1341267
Blocks: 1340492
Blocks: 1340345
Blocks: 1339838
Blocks: 1338712
Blocks: 1337573
Blocks: 1337095
Blocks: 1330518
Blocks: 1326070
Blocks: 1321538
Blocks: 1341965
Assignee: nobody → matt.woodrow
Flags: needinfo?(matt.woodrow)
Attachment #8840669 - Flags: review?(dbaron)
Attachment #8835804 - Attachment is obsolete: true
The canvas cache keeps around copies of reftest results for any urls that are repeated multiple times during the run.

This saves time (don't need to repaint them) at the cost of extra memory usage.

Apparently we really can't afford this extra memory on win7, so this disables it there.

It'll make the reftest run take longer to complete, but shouldn't be as bad as splitting into into multiple runs.
Attachment #8840671 - Flags: review?(jmaher)
Comment on attachment 8840669 [details] [diff] [review]
Part 1: Increase frequency of GC/CC during reftests

>try: -b do -p win32 -u reftest[Windows 7],crashtest[Windows 7] -t none[Windows 7]

I'd suggest a better commit message is the patch description you used in bugzilla, "Increase frequency of GC/CC during reftests".  :-)
Attachment #8840669 - Flags: review?(dbaron) → review+
Comment on attachment 8840671 [details] [diff] [review]
Part 3: Disable reftest canvas cache for win 7.

Review of attachment 8840671 [details] [diff] [review]:
-----------------------------------------------------------------

thanks for the patch and explanation.  As win7 is the only 32 bit we test on, are there no concerns on win8 x64?
Attachment #8840671 - Flags: review?(jmaher) → review+
Comment on attachment 8840670 [details] [diff] [review]
Part 2: Fix issues in reftest harness with canvas cache disabled

Could you explain what this is fixing?  The AddURIUseCount and UpdateCanvasCache calls seem symmetric both before and after the patch.
Flags: needinfo?(matt.woodrow)
I guess it means that will will call ReleaseCanvas() and thus recycle canvases that were drawn when we had prefs set -- but why not do that by just calling ReleaseCanvas directly as an alternative to UpdateCanvasCache?

(When the canvas cache is off, it seems better to maintain the use counts (which are for the cache) the same as we do when the cache is off, or not to bother with them at all -- this seems to maintain them differently, which seems like unnecessary complexity to me.)
Attachment #8840670 - Attachment is obsolete: true
Attachment #8840670 - Flags: review?(dbaron)
Flags: needinfo?(matt.woodrow)
Attachment #8841292 - Flags: review?(dbaron)
Attachment #8841292 - Flags: review?(dbaron) → review+
Can we please get this landed?
Flags: needinfo?(matt.woodrow)
Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a519934c735b
Part 1: Increase frequency of GC/CC during reftests. r=dbaron
https://hg.mozilla.org/integration/mozilla-inbound/rev/3bd400108a21
Part 2: Don't track use counts in the reftest harness when the cache is disabled. r=dbaron
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3437d0e66b7
Part 3: Disable reftest canvas cache for win 7. r=jmaher
Since the root cause here is memory fragmentation, can we try to fix that? I don't have experience with memory fragmentation profiling tools - njn, can massif or DMD be used to provide this info? In particular I'm wondering if putting certain allocations into an arena or something along those lines would help reduce the fragmentation. That would help both these reftests and Firefox performance in general. I recall recently that Bas did some profiling and discovered that not using arenas in some places was impacting our performance due to CPU cache misses.
Flags: needinfo?(n.nethercote)
Hopefully it gets rid of the failures entirely on older branches, it seemed to on my try runs. It's believable that memory usage is worse on Nightly than it is there, but it's also believable that I just didn't do enough reruns to see the failures.

I like Kats' suggestions, applying some memory profiling tools is probably the best way to try figure out what's causing the remaining fragmentation and then we can try figure out a solution.
Flags: needinfo?(matt.woodrow)
I don't know any tools aimed at diagnosing memory fragmentation, sorry. At a minimum you need to know where in memory all the blocks are, which is something memory profilers don't normally show. Beyond that you could take into account lifetimes of all the blocks. It's a hard problem in general.
Flags: needinfo?(n.nethercote)
After bug 1344991 landed, we'll be able to restart Firefox after an crash and continue with the rest of the test. Maybe we can modify it a bit and restart from the intermittent test? But then we need to be able to identify the cause of that kind of failure reliably, and we need to think about how to report that on treeherder.
That might make the consequences of this slightly less awful, but ultimately it won't do anything to reduce the overall failure rate of the test suite. Seems to me that something akin to run-by-dir mode for reftests (closing and restarting the browser between directories like we do for mochitests) is more likely to make a meaningful difference here if we're unable to fix this at the platform level.
++ for a run-by-dir mode, makes chunking more consistent
Blocks: 1349811
Blocks: 1349827
The only reason I haven't hidden them is because they run on a precious and limited sort of AWS instance, and I know from experience that hiding them just means we would leave them running hidden and ignored and consuming limited resources for months longer. But I would completely support either shutting them off, or moving them back onto hardware to see whether we've also broken our ability to run them there.
I don't see any green jobs in the last 4 weeks on m-c, but inbound and autoland have plenty of green and much fewer failures.

We do have limited instances of the gfx machines, are we getting value from these jobs running?  Can we find regressions, or do we just assume they always fail and ignore the fine details?
Looking through recent failures, it looks like this only happens on non-e10s runs. If that's not true, would appreciate seeing a counter-example.

Similarly, I see lots of win7-opt and win7-pgo examples, but no win7-debug. Is that consistently true?

If those observations are correct, do we feel that is consistent with our memory fragmentation theory?
(In reply to Ryan VanderMeulen [:RyanVM] from comment #83)
> Seems to me that something akin to run-by-dir mode for reftests (closing and
> restarting the browser between directories like we do for mochitests) is
> more likely to make a meaningful difference here if we're unable to fix this
> at the platform level.

Examples like 

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win32/1491151822/mozilla-central_win7_vm_gfx_test-reftest-bm139-tests1-windows-build309.txt.gz

make me wonder if that would be effective. The job was only 12% complete when we hit this failure.

On the other hand, the majority (based on a very small sample size!) of such failures seem to happen in the last 25% of the job...so maybe.


win7 opt/pgo reftests run in a single chunk. Has anyone even tried simply running in multiple chunks?
I think 8 chunks would be a good place to start :)  Unfortunately this is still 100% via buildbot, this makes it harder to experiment with on try server.
So I wasn't aware that things had gotten worse -- that information doesn't appear to have previously been in this bug.  We really need a better way to get information about failures that happen across multiple tests into a single bug so that we can understand their severity correctly.  This bug doesn't have any comments from OrangeFactor Robot, and neither do any of the other bugs feeding into it.

Given that it has gotten worse, it would be useful to get a regression range on when it got worse.  But comment 87 is very disturbing -- why do we see different statistics on mozilla-central vs. the integration repositories that lead into it.  We could also use a regression range on when it started.  Getting any regression ranges here probably requires bisection on try given that the starring is probably inaccurate.  But does comment 87 suggest that maybe we won't get correct data on try?

If we do need to disable the reftest suite on Windows, I think we can't land changes to layout.  So if you want to disable the suite, I'd ask that you also add a repository hook on all trees where it's disabled (and integration branches leading to those trees) that disallows pushes with changes in layout/.
Flags: needinfo?(dbaron)
Jet, this probably needs a new assignee.
Blocks: 1348179
Blocks: 1351922
Blocks: 1352904
Blocks: 1350647
Blocks: 1342691
Blocks: 1350658
Blocks: 1344997
Blocks: 1352630
Blocks: 1351144
Blocks: 1352727
Blocks: 1346106
Blocks: 1348091
Blocks: 1351549
Blocks: 1345317
Blocks: 1343160
Blocks: 1348312
Blocks: 1346088
Blocks: 1350750
Blocks: 1346097
Blocks: 1349090
Blocks: 1350710
Blocks: 1352728
Blocks: 1352721
Blocks: 1352725
Blocks: 1351923
Blocks: 1352905
Blocks: 1348094
Blocks: 1350763
Blocks: 1331371
Blocks: 1331519
Blocks: 1330865
Blocks: 1332098
Blocks: 1306514
Blocks: 1307641
Blocks: 1341439
Blocks: 1334896
Blocks: 1338317
Blocks: 1342077
Blocks: 1344228
Blocks: 1346087
Blocks: 1341295
Blocks: 1351956
Blocks: 1352906
Blocks: 1336348
Blocks: 1351532
Blocks: 1338174
Blocks: 1351530
Blocks: 1340042
Blocks: 1351519
Blocks: 1339453
Blocks: 1338839
Blocks: 1342099
Blocks: 1342830
Blocks: 1346176
Blocks: 1333290
Blocks: 1341644
Blocks: 1341664
Blocks: 1342926
Blocks: 1342100
Blocks: 1350498
Blocks: 1352733
Blocks: 1343423
Blocks: 1340427
Blocks: 1350757
Blocks: 1351430
Blocks: 1338904
Blocks: 1336774
Blocks: 1343868
Blocks: 1350505
Blocks: 1352071
Blocks: 1347781
Blocks: 1341663
Blocks: 1351168
Blocks: 1305372
Blocks: 1311341
(In reply to Geoff Brown [:gbrown] from comment #89)
> win7 opt/pgo reftests run in a single chunk. Has anyone even tried simply
> running in multiple chunks?

We did in comment 49. Today we also tried increasing the number of chunks to 4 on Ash. It does in fact appear to help quite a bit, though we still ended up with a couple 800000 pixel failures on the w3c-css directory. It would also add roughly 6-7min *per chunk* of extra overhead to each push as well, so the costs from a resource utilization standpoint would be non-trivial.
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #91)
> If we do need to disable the reftest suite on Windows, I think we can't land
> changes to layout.  So if you want to disable the suite, I'd ask that you
> also add a repository hook on all trees where it's disabled (and integration
> branches leading to those trees) that disallows pushes with changes in
> layout/.

Let's please not disable the reftest suite.

(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #92)
> Jet, this probably needs a new assignee.

Bas: can you help here? We're getting 32-bit Windows surface allocation failures for the canvas2d used for reftest screenshots (reports show SKIA & DIRECT2D1_1 failures.) We've tried various hacks to split up the tests and force GC between runs but it still happens.
Flags: needinfo?(bugs) → needinfo?(bas)
Blocks: 1353265
(In reply to Jet Villegas (:jet) from comment #94)
> (In reply to David Baron :dbaron: ⌚️UTC-7 from comment #91)
> > If we do need to disable the reftest suite on Windows, I think we can't land
> > changes to layout.  So if you want to disable the suite, I'd ask that you
> > also add a repository hook on all trees where it's disabled (and integration
> > branches leading to those trees) that disallows pushes with changes in
> > layout/.
> 
> Let's please not disable the reftest suite.
> 
> (In reply to David Baron :dbaron: ⌚️UTC-7 from comment #92)
> > Jet, this probably needs a new assignee.
> 
> Bas: can you help here? We're getting 32-bit Windows surface allocation
> failures for the canvas2d used for reftest screenshots (reports show SKIA &
> DIRECT2D1_1 failures.) We've tried various hacks to split up the tests and
> force GC between runs but it still happens.

If these failures happen both for Skia and D2D1, and especially if they manifest in 32-bit, it seems likely these are just OOMs? At least when D2D fails we should be reporting the failure codes for the surface creation in the gfx error log.
Flags: needinfo?(bas)
Blocks: 1353761
Whiteboard: gfx-noted → gfx-noted, [Memshrink]
MemShrink review is OK with a periodic browser restart. I'll look into getting that wired up.
Flags: needinfo?(bugs)
Whiteboard: gfx-noted, [Memshrink] → gfx-noted
running the tests one directory at a time is a logical split up, I filed bug 1353461 for that, might not get done in the next week or so though.
Blocks: 1354169
Blocks: 1354285
Blocks: 1354286
Blocks: 1354397
Blocks: 1354398
Blocks: 1354755
Blocks: 1350114
Blocks: 1354825
Blocks: 1355276
Blocks: 1356080
Blocks: 1356081
Blocks: 1356082
Blocks: 1356108
Blocks: 1331344
Blocks: 1357609
Blocks: 1357610
Blocks: 1358006
Blocks: 1359687
Blocks: 1359688
Blocks: 1360086
Blocks: 1361606
Blocks: 1364326
Blocks: 1364449
Blocks: 1365453
FWIW, I've been seeing these fail on win7 opt VM without a GFX failure message, but with:

JavaScript error: chrome://reftest/content/reftest.jsm, line 1748: NS_ERROR_FAILURE:

The line number in reftest.jsm seems a bit bogus, though.

For example:

https://treeherder.mozilla.org/logviewer.html#?job_id=102242688&repo=mozilla-central&lineNumber=61669
If you look at the reftest.jsm file that ends up in the objdir

https://dxr.mozilla.org/mozilla-central/source/obj-x86_64-pc-linux-gnu/_tests/reftest/reftest/chrome/reftest/content/reftest.jsm#1748

that line number points to

  var image1 = gCanvas1.toDataURL();

which seems plausible.
Blocks: 1369962
Blocks: 1370299
Blocks: 1370761
Blocks: 1371413
Blocks: 1371416
Blocks: 1371507
Blocks: 1371128
Blocks: 1371460
Blocks: 1371134
Blocks: 1371415
Blocks: 1371419
Blocks: 1371423
Blocks: 1371517
Blocks: 1371418
Blocks: 1371302
Blocks: 1371182
Blocks: 1371165
Blocks: 1371137
Blocks: 1371135
Blocks: 1377686
Blocks: 1377685
Blocks: 1377624
Blocks: 1377378
Blocks: 1377236
Blocks: 1377107
Blocks: 1377035
Blocks: 1376377
Blocks: 1376374
Blocks: 1376373
Blocks: 1376366
Blocks: 1376365
Blocks: 1374907
Blocks: 1374905
Blocks: 1374529
Blocks: 1374186
Blocks: 1374143
Blocks: 1374130
Blocks: 1369593
Blocks: 1368604
Blocks: 1368586
Blocks: 1367883
Blocks: 1366561
Blocks: 1366560
Blocks: 1366443
Blocks: 1366440
Blocks: 1366433
Blocks: 1366432
Blocks: 1366431
Blocks: 1366430
Blocks: 1377631
Blocks: 1377625
Blocks: 1377623
Blocks: 1377622
Blocks: 1377399
Blocks: 1377235
Blocks: 1377233
Blocks: 1377231
Blocks: 1377038
Blocks: 1377033
Blocks: 1376918
Blocks: 1376661
Blocks: 1376660
Blocks: 1376389
Blocks: 1376387
Blocks: 1376378
Blocks: 1376375
Blocks: 1376362
Blocks: 1375860
Blocks: 1375516
Blocks: 1375065
Blocks: 1375062
Blocks: 1375047
Blocks: 1374903
Blocks: 1374793
Blocks: 1374517
Blocks: 1374418