Closed Bug 1445922 Opened 6 years ago Closed 6 years ago

Intermittent z:\build\build\src\third_party\aom\aom_dsp\simd\v64_intrinsics_c.h(754) : fatal error C1002: compiler is out of heap space in pass 2

Categories

(Core :: Audio/Video: Playback, defect, P5)

Tracking

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: away)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:product])

Attachments

(1 file)

From bug 1312238 comment 38

> 12:22:13     INFO - 
> z:\build\build\src\third_party\aom\aom_dsp\simd\v64_intrinsics_c.h(719) :
> fatal error C1002: compiler is out of heap space in pass 2
> 
> Since this is in AOM code, I suspect it is due to the large function sizes
> seen in bug 1412889. It is fixed upstream but our efforts to update (bug
> 1445683) have hit roadblocks.
> 
> If this is blocking you, we can probably just disable PGO in the affected
> code (AOM won't be hit in a profile anyway).
Oops, I mean bug 1412238 comment 38
Depends on: 1445683
There have been 30 failures in the past 7 days; all of them occurred on windows2012-32 pgo.
Recent log failure: 
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=171998012&lineNumber=37516

Relevant part of the log: 
02:04:15     INFO -  z:\build\build\src\third_party\aom\aom_dsp\simd\v64_intrinsics_c.h(754) : fatal error C1002: compiler is out of heap space in pass 2
02:04:15     INFO -  z:\build\build\src\third_party\aom\aom_dsp\simd\v256_intrinsics_c.h(101) : fatal error C1002: compiler is out of heap space in pass 2
02:04:15     INFO -  z:\build\build\src\third_party\aom\aom_dsp\simd\v64_intrinsics_c.h(719) : fatal error C1002: compiler is out of heap space in pass 2
02:04:15     INFO -  z:\build\build\src\third_party\aom\av1\common\cdef_block_simd.h(252) : fatal error C1002: compiler is out of heap space in pass 2
02:04:15     INFO -  z:\build\build\src\third_party\aom\aom_dsp\simd\v64_intrinsics_c.h(271) : fatal error C1002: compiler is out of heap space in pass 2
02:04:15     INFO -  LINK : fatal error LNK1257: code generation failed
02:04:15     INFO -  z:\build\build\src\third_party\aom\aom_dsp\simd\v128_intrinsics_c.h(83) : fatal error C1002: compiler is out of heap space in pass 2
02:04:15     INFO -  z:/build/build/src/config/rules.mk:679: recipe for target 'xul.dll' failed
02:04:15     INFO -  mozmake.EXE[5]: *** [xul.dll] Error 1257
02:04:15     INFO -  mozmake.EXE[5]: Leaving directory 'z:/build/build/src/obj-firefox/toolkit/library'
02:04:15     INFO -  z:/build/build/src/config/recurse.mk:73: recipe for target 'toolkit/library/target' failed
02:04:15     INFO -  mozmake.EXE[4]: *** [toolkit/library/target] Error 2
02:04:15     INFO -  z:/build/build/src/config/recurse.mk:32: recipe for target 'compile' failed
02:04:15     INFO -  mozmake.EXE[3]: *** [compile] Error 2
02:04:15     INFO -  z:/build/build/src/config/rules.mk:418: recipe for target 'default' failed
02:04:15     INFO -  mozmake.EXE[2]: *** [default] Error 2
02:04:15     INFO -  Makefile:237: recipe for target 'profiledbuild' failed
02:04:15     INFO -  mozmake.EXE[1]: *** [profiledbuild] Error 2
02:04:15     INFO -  client.mk:168: recipe for target 'build' failed
02:04:15     INFO -  mozmake.EXE: *** [build] Error 2
02:04:15     INFO -  125 compiler warnings present.
02:04:15    ERROR - Return code: 2
02:04:15  WARNING - setting return code to 2
02:04:15    FATAL - 'mach build' did not run successfully. Please check log for errors.
02:04:15    FATAL - Running post_fatal callback...
02:04:15    FATAL - Exiting -1
02:04:15     INFO - [mozharness: 2018-04-05 02:04:15.668000Z] Finished build step (failed)
02:04:15     INFO - Running post-run listener: _summarize
02:04:15     INFO - [mozharness: 2018-04-05 02:04:15.668000Z] FxDesktopBuild summary:
02:04:15     INFO - Running post-run listener: copy_logs_to_upload_dir
02:04:15     INFO - Copying logs to upload dir...
02:04:15     INFO - mkdir: z:\build\build\upload\logs
[taskcluster:error]    Exit Code: 4294967295
[taskcluster:error]    User Time: 0s
[taskcluster:error]  Kernel Time: 15.625ms
[taskcluster:error]    Wall Time: 1h4m35.327078s
[taskcluster:error]       Result: FAILED
[taskcluster 2018-04-05T02:04:15.883Z] === Task Finished ===
[taskcluster 2018-04-05T02:04:15.883Z] Task Duration: 1h11m23.565004s
[taskcluster:error] Uploading error artifact public/build from file public/build with message "Could not read directory 'Z:\\task_1522887787\\public\\build'", reason "file-missing-on-worker" and expiry 2019-04-05T00:51:41.733Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1522887787\public\build'
[taskcluster 2018-04-05T02:04:16.735Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-04-05T00:51:41.733Z
[taskcluster 2018-04-05T02:04:18.260Z] Uploading artifact public/chainOfTrust.json.asc from file generic-worker\chainOfTrust.json.asc with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-04-05T00:51:41.733Z
[taskcluster 2018-04-05T02:04:18.977Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/GnRhnrFCSQ6NrgY2l5eB8w/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-04-05T00:51:41.733Z
[taskcluster:error] Task not successful due to following exception(s):
[taskcluster:error] Exception 1)
[taskcluster:error] exit status 4294967295
[taskcluster:error] Exception 2)
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1522887787\public\build'
[taskcluster:error] 
:drno Can you please take a look here?
Flags: needinfo?(drno)
Whiteboard: [stockwell needswork]
(In reply to David Major [:dmajor] from comment #4)
> Alternatively we could cherry pick
> https://aomedia-review.googlesource.com/c/aom/+/39401
> 
> and https://chromium-review.googlesource.com/c/webm/libvpx/+/841103 for good measure

Drno, what do you think?
(In reply to David Major [:dmajor] from comment #8)
> (In reply to David Major [:dmajor] from comment #4)
> > Alternatively we could cherry pick
> > https://aomedia-review.googlesource.com/c/aom/+/39401
> > 
> > and https://chromium-review.googlesource.com/c/webm/libvpx/+/841103 for good measure
> 
> Drno, what do you think?

So the alternatives are:
- either disable PGO on win32 for this code
- wait for the libaom update to land in bug 1445683
- or cherry pick build fixes from upstream

Since bug 1445683 might still be a little way off, I guess we should either disable PGO or cherry-pick. I don't have a preference. David, feel free to choose either one.
Flags: needinfo?(drno)
I tried cherry-picking those two fixes and my try push still hit this intermittent.

Code outside of aom is failing too: 
15:24:55     INFO -  z:\build\build\src\js\src\vm\interpreter.cpp(4339) : fatal error C1002: compiler is out of heap space in pass 2

I'm starting to think this is less about particular functions and more about xul.dll simply growing larger by the day.

Can we add more memory to the win32 pgo builders?
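
To check whether the failures really are spread beyond AOM, the C1002 lines in a build log can be tallied by source path. The following is a minimal sketch, not an existing tool: the function name `tally_c1002`, the regex, and the `marker` parameter are hypothetical, and the sample lines are copied from the logs quoted in this bug.

```python
import re
from collections import Counter

# Matches MSVC lines of the form:
#   <path>(<line>) : fatal error C1002: compiler is out of heap space in pass 2
C1002_RE = re.compile(r"(\S+)\((\d+)\)\s*:\s*fatal error C1002")


def tally_c1002(log_lines, marker="third_party\\aom\\"):
    """Count C1002 errors whose path does or does not contain `marker`."""
    counts = Counter()
    for line in log_lines:
        match = C1002_RE.search(line)
        if match:
            path = match.group(1).lower()
            counts["aom" if marker in path else "other"] += 1
    return counts


sample = [
    r"02:04:15     INFO -  z:\build\build\src\third_party\aom\aom_dsp\simd\v64_intrinsics_c.h(754) : fatal error C1002: compiler is out of heap space in pass 2",
    r"15:24:55     INFO -  z:\build\build\src\js\src\vm\interpreter.cpp(4339) : fatal error C1002: compiler is out of heap space in pass 2",
    r"02:04:15     INFO -  LINK : fatal error LNK1257: code generation failed",
]
print(tally_c1002(sample))
```

A non-zero "other" count across several logs would support the idea that the problem is the overall size of xul.dll rather than any one AOM function.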
Flags: needinfo?(catlee)
See Also: → 1453061
(In reply to David Major [:dmajor] from comment #11)
> Can we add more memory to the win32 pgo builders?

If what I looked up was up-to-date, we're building Windows on c4.4xlarge AWS instances, so the next step up is c4.8xlarge, at slightly more than twice the price because screw you. I don't know how much we're spending on Windows builds, but every number I've ever heard about our AWS spend has turned another chunk of my hair white, so I doubt that would sound like a good investment.

OTOH, because my sheriff colleagues are madmen and madwomen, very often the response to hitting this intermittent is to retrigger 5 builds, thus triggering 5 sets of tests, so it might be a close calculation to determine which would be cheaper, based on how frequently this hits, how frequently we do that, how much we could reduce the frequency of over-retriggering, and how much it would cost us in merged-around bustage to say "please stop retriggering PGO builds more than once" and wind up getting fewer retriggers of permaorange tests as a result.

Or, you know, we could just disable PGO for code that won't wind up being optimized anyway, and wind up spending *less* and failing less.
As a little measure of frequency, of the last 12 Win32 PGO builds to finish on mozilla-inbound, 11 failed this way in AOM code, and 1 hit an infra failure.
Oh, never mind that frequency measure, that's just because

(In reply to David Major [:dmajor] from comment #11)
> I tried cherry-picking those two fixes and my try push still hit this
> intermittent.

Actually, they make this very nearly permanent.
(In reply to Phil Ringnalda (:philor) from comment #13)
Ok, ok -- I wasn't aware of the configuration and pricing situation.
Flags: needinfo?(catlee)
Assignee: nobody → dmajor
Attachment #8967090 - Flags: review?(core-build-config-reviews)
Attachment #8967090 - Flags: review?(core-build-config-reviews) → review+
Keywords: checkin-needed
Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c4e17dd68065
Disable PGO for Win32 libaom due to compiler OOMs. r=froydnj
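
The patch itself is not reproduced in this thread. As a rough sketch only, disabling PGO for a single directory in a moz.build file can look like the fragment below. `NO_PGO` is the directory-level switch in mozbuild; the platform condition shown here and its placement in libaom's moz.build are assumptions, not a copy of the landed change.

```python
# Hypothetical moz.build fragment (evaluated by mozbuild, not standalone Python).
# Assumption: restrict the opt-out to 32-bit Windows, where the OOMs occur.
if CONFIG['OS_TARGET'] == 'WINNT' and CONFIG['CPU_ARCH'] == 'x86':
    # Compiler runs out of heap space during PGO code generation (bug 1445922).
    NO_PGO = True
```

Scoping the opt-out to one directory keeps PGO enabled for the rest of xul.dll.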
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/c4e17dd68065
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Whiteboard: [stockwell needswork] → [stockwell fixed:product]
This improved build times on windows2012-32 pgo.

== Change summary for alert #12663 (as of Wed, 11 Apr 2018 21:37:14 GMT) ==

Improvements:

 11%  build times windows2012-32 pgo taskcluster-c4.4xlarge     4,682.35 -> 4,173.92

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=12663
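
The 11% figure in the alert can be reproduced from the two build-time values it reports. This is a small arithmetic check, with `percent_improvement` being a hypothetical helper name; the inputs are the before and after values from the alert line.

```python
def percent_improvement(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


before, after = 4682.35, 4173.92
print(f"{percent_improvement(before, after):.1f}%")  # prints 10.9%
```

Rounded to a whole number, this matches the 11% shown in the alert.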
See Also: → 1456500