Closed
Bug 1047621
Opened 10 years ago
Closed 10 years ago
There hasn't been a single successful Windows opt/nightly build off a b-2008-sm slave since the uplift
Categories
(Release Engineering :: General, defect)
Tracking
(firefox32 fixed, firefox33 fixed, firefox34 fixed)
RESOLVED
FIXED
People
(Reporter: RyanVM, Assigned: mshal)
References
Details
Attachments
(2 files, 5 obsolete files)
5.49 KB,
patch
|
gps
:
review+
|
Details | Diff | Splinter Review |
5.85 KB,
patch
|
gps
:
review+
ted
:
feedback+
|
Details | Diff | Splinter Review |
Every attempt to build a Windows opt/nightly build on a b-2008-sm slave since the uplift has timed out like the log below:
https://tbpl.mozilla.org/php/getParsedLog.php?id=44531250&tree=Mozilla-Aurora
Beta 32 doesn't appear to have this issue. We don't see this on trunk PGO builds probably because unified builds consume significantly fewer resources and run so much faster.
Can we only use b-2008-ix slaves for non-debug Windows builds on Aurora?
Reporter | ||
Comment 1•10 years ago
|
||
For a quick view of things since the uplift:
https://tbpl.mozilla.org/?tree=Mozilla-Aurora&jobname=WINNT%205.2%20mozilla-aurora%20%28build|nightly%29&fromchange=a104ddcd4cdb&tochange=tip
Assignee | ||
Comment 2•10 years ago
|
||
Callek suspected that bug 1045133 might be in play, but it looks like xul is only linked twice (not a third time like in that bug), so I don't think that is currently the issue. It might still be related, though.
I tried to reproduce this by logging into b-2008-sm-0046 and running the final xul pgo link manually, but I hit some weird linker errors (it couldn't find some symbols that should be present in gkmedias). I think that may have been from some files being in a funky state when buildbot killed the previous link job, though.
I did kick off a new full build manually at the commandline, and this time it linked correctly. However, the final pgo link took approximately 3h:06m, which is just a few minutes above the 3-hour no-output buildbot timeout that we hit. So, I think we just added enough new stuff to be over the limit.
However, there is supposed to be a keep-alive message spit out by link.py every 30 minutes, which I don't actually see displayed while it's running. Maybe if we fix that, that'll buy us enough time to get a build? (At least up to the 5.5 full runtime timeout).
I still don't understand why the sm machine takes over 3 hours to do the pgo link, whereas on an ix machine it appears to take 1.5 - 2hrs. Are they really that different? It looks like they are both 2.4GHz, but different processor families (E21265L vs X3430).
Assignee | ||
Comment 3•10 years ago
|
||
> However, there is supposed to be a keep-alive message spit out by link.py
> every 30 minutes, which I don't actually see displayed while it's running.
> Maybe if we fix that, that'll buy us enough time to get a build? (At least
> up to the 5.5 full runtime timeout).
According to ehsan on IRC, the keep-alive message (from bug 915735) never actually worked in a real build, only when he invoked link.py manually. It turns out that the "Still linking, 30 minutes passed" messages from link.py are getting buffered by expandlibs_exec.py (at least since bug 837665 when expandlibs_exec was switched from subprocess.call() to subprocess.Popen()/communicate()). We should be able to fix this by printing the linker's output as it comes in, which would allow us up to the 5.5 hour max runtime.
Assignee | ||
Comment 4•10 years ago
|
||
Does something like this seem reasonable? Or is there a better way to get unbuffered output from the subprocess? It looks like the .communicate() is still needed to get the return code at least, though I'm not sure if we still need the stdout from that function.
Also as an FYI, since the failure code is determined after the output is displayed, if the linker command fails, the display might be a little misleading. Eg:
Output line 1
Output line 2
Output line 3
Output line 4
Output line 5
Executing: mybuild.sh /home/mshal/mozilla-central-git/obj-x86_64-unknown-linux-gnu/toolkit/library/tmpcAzdov.list
/home/mshal/mozilla-central-git/obj-x86_64-unknown-linux-gnu/toolkit/library/tmpcAzdov.list:
INPUT("../components/perf/PerfMeasurement.o")
(The output lines from the script come before the "Executing: " line, which could be a little confusing. Maybe it could just be re-worded.)
Attachment #8467898 -
Flags: feedback?(mh+mozilla)
Attachment #8467898 -
Flags: feedback?(gps)
Comment 5•10 years ago
|
||
Comment on attachment 8467898 [details] [diff] [review]
expandlibs_stdout.patch
Review of attachment 8467898 [details] [diff] [review]:
-----------------------------------------------------------------
This approach works... on Linux.
Windows has this awful pipe buffering behavior where output isn't emitted until the buffer fills. Even if you are doing proper POSIX flush() and readline(), Windows will gladly hold on to your data for you. You essentially can't do line buffering with subprocess + pipes on Windows. Well, you can, but it requires a lot of extra code to do it right. This extra code is already written and part of mozprocess.
The correct patch is to modify expandlibs_exec.py to use mozprocess.
Attachment #8467898 -
Flags: feedback?(mh+mozilla)
Attachment #8467898 -
Flags: feedback?(gps)
Attachment #8467898 -
Flags: feedback-
Assignee | ||
Comment 6•10 years ago
|
||
That's odd, I tested it on Windows too and it seemed to work as expected (each line was displayed as it was printed by the subprocess). Maybe since I was running through a shell?
In any case, using mozprocess seems nicer because we can send the keep-alive messages straight to stdout, and still buffer the rest of the output for display underneath the command-line in case the command fails (so it is more consistent with the behavior from bug 837665).
Attachment #8467898 -
Attachment is obsolete: true
Attachment #8468001 -
Flags: review?(gps)
Reporter | ||
Comment 7•10 years ago
|
||
Comment 8•10 years ago
|
||
Comment on attachment 8468001 [details] [diff] [review]
expandlibs_stdout.patch
Review of attachment 8468001 [details] [diff] [review]:
-----------------------------------------------------------------
Maybe I was wrong about the exact Windows buffering issue. I know for sure there is a giant red flag around line-buffered I/O on Windows and that "use mozprocess" is the answer. Ted could tell you the exact issue.
::: config/expandlibs_exec.py
@@ +339,5 @@
> + def outputHandler(line):
> + if line.startswith('Still linking'):
> + # Keep-alive messages from build/link.py - print these right
> + # away
> + print line
We should probably throw a sys.stdout.flush() in here, just to be safe.
@@ +354,5 @@
> if proc.returncode and not options.verbose:
> print_command(sys.stderr, args)
> + if output_lines:
> + sys.stderr.write('\n'.join(output_lines) + '\n')
> + sys.stderr.flush()
You /could/ use sys.stderr.writelines(). But we're not dealing with massive buffers or tight loops, so meh.
Attachment #8468001 -
Flags: review?(gps) → review+
Comment 9•10 years ago
|
||
Note there's another bug about this issue, that had a (wrong) patch and surrounding discussion.
Comment 11•10 years ago
|
||
Assignee | ||
Comment 12•10 years ago
|
||
Ahh, should we go the route of moving the 30-minute timer to expandlibs_exec.py? My most recent attempt failed on try. Turns out processhandler.py *also* has a timeout (180 seconds?) which we obviously hit. At least, that's what I think is happening:
Traceback (most recent call last):
File "c:/builds/moz2_slave/try-w32-0000000000000000000000/build/config/expandlibs_exec.py", line 364, in <module>
main()
File "c:/builds/moz2_slave/try-w32-0000000000000000000000/build/config/expandlibs_exec.py", line 354, in main
proc.wait()
File "c:\builds\moz2_slave\try-w32-0000000000000000000000\build\testing\mozbase\mozprocess\mozprocess\processhandler.py", line 824, in wait
self.returncode = self.proc.wait()
File "c:\builds\moz2_slave\try-w32-0000000000000000000000\build\testing\mozbase\mozprocess\mozprocess\processhandler.py", line 162, in wait
self.returncode = self._wait()
File "c:\builds\moz2_slave\try-w32-0000000000000000000000\build\testing\mozbase\mozprocess\mozprocess\processhandler.py", line 462, in _wait
raise OSError(err)
OSError: IO Completion Port failed to signal process shutdown
I was just going to try passing in a higher timeout from expandlibs_exec, though it currently isn't a parameter for mozprocess.
Comment 13•10 years ago
|
||
It's just simpler to copy the 30-minute timer (don't move it, copy it, because expandlibs_exec is probably going to die soonish)
Comment 14•10 years ago
|
||
Actually, there might be something even simpler: invert the wrapping. That is, instead of doing expandlib_exec.py link.py, do link.py expandlib_exec.py. This should only require adding an include of config.mk and wrapping EXPAND_LD instead of LD in toolkit/library.
Comment 15•10 years ago
|
||
Oh, that's a good plan, I like that.
Assignee | ||
Comment 16•10 years ago
|
||
Turns out we want EXPAND_MKSHLIB, not EXPAND_LD. But EXPAND_MKSHLIB uses $@ so it needs lazy evaluation. I ended up going with modifying EXPAND_LIBS_EXEC itself, which seems to be working on a build machine (at least, I saw a "Still linking..." message after 30 mins). I still need to run it through try once that's open again, though.
Also since config.mk ends up defining LIBXUL_LIBRARY, I had to remove it from the _DEPRECATED_VARIABLES so rules.mk wouldn't complain.
Attachment #8468001 -
Attachment is obsolete: true
Attachment #8468781 -
Flags: review?(gps)
Comment 17•10 years ago
|
||
(In reply to Michael Shal [:mshal] from comment #16)
> Also since config.mk ends up defining LIBXUL_LIBRARY, I had to remove it
> from the _DEPRECATED_VARIABLES so rules.mk wouldn't complain.
I'd rather do something like bug 1040649 for config.mk.
Comment 18•10 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #17)
> (In reply to Michael Shal [:mshal] from comment #16)
> > Also since config.mk ends up defining LIBXUL_LIBRARY, I had to remove it
> > from the _DEPRECATED_VARIABLES so rules.mk wouldn't complain.
>
> I'd rather do something like bug 1040649 for config.mk.
Filed bug 1050029. I am testing a patch.
Assignee | ||
Comment 19•10 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #18)
> I'd rather do something like bug 1040649 for config.mk.
>
> Filed bug 1050029. I am testing a patch.
Since we'll presumably need to uplift this to aurora, will adding an extra dependent bug cause unnecessary risk? I'm not too familiar with that process, so I don't know if we'd just be cherry-picking the single patch, or merging from m-c.
For m-c itself I'm fine with not removing LIBXUL_LIBRARY from _DEPRECATED_VARIABLES and being dependent on 1050029 - just wondering how we handle the uplift.
Assignee | ||
Comment 20•10 years ago
|
||
I managed to sneak a pgo try run in today in between the hg.m.o crazyness:
https://tbpl.mozilla.org/?tree=Try&rev=cd041b6be923
Comment 21•10 years ago
|
||
(In reply to Michael Shal [:mshal] from comment #19)
> For m-c itself I'm fine with not removing LIBXUL_LIBRARY from
> _DEPRECATED_VARIABLES and being dependent on 1050029 - just wondering how we
> handle the uplift.
We can uplift both, although uplifting 1050029 requires manual work. It's probably fine to have the version that removes LIBXUL_LIBRARY for uplift.
Comment 22•10 years ago
|
||
> We can uplift both, although uplifting 1050029 requires manual work.
To clarify, the manual work is for the lists of variables, which are likely different on each branch.
Assignee | ||
Comment 23•10 years ago
|
||
The patch in this bug also requires some manual work, since aurora has toolkit/library/build vs m-c's toolkit/library.
Updated•10 years ago
|
Attachment #8468781 -
Flags: review?(gps) → review+
Assignee | ||
Comment 24•10 years ago
|
||
Rebased patch now that it no longer needs to remove LIBXUL_LIBRARY from _DEPRECATED_VARIABLES (see #c17).
try: https://tbpl.mozilla.org/?tree=Try&rev=d02592374be6
Attachment #8468781 -
Attachment is obsolete: true
Attachment #8471042 -
Flags: review+
Assignee | ||
Comment 25•10 years ago
|
||
Comment 26•10 years ago
|
||
Assignee: nobody → mshal
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 27•10 years ago
|
||
I tried uplifting this patch (and the dependent bug 1050029) to aurora, and pushed that to try, but I'm hitting a "Not enough space" when running the LTCG:PGINSTRUMENT link. The clean_old_builds step claims there's 171.51GB disk space free, and bug 582335 seems to suggest this is a problem that should have been fixed in vs2010. I have PGO enabled in mozconfig.common.override, per: https://wiki.mozilla.org/ReleaseEngineering/TryChooser#What_if_I_want_PGO_for_my_build
Here's the try:
https://tbpl.mozilla.org/?tree=Try&rev=1ba27541691c
Curiously, I hit the same errors even if I don't have my patches uplifted (just the mozconfig.common.override change to flip on PGO: https://tbpl.mozilla.org/?tree=Try&rev=a1f71583f9d9), so I'm guessing this is just not the right way to test aurora? Anyone have any suggestions here?
Comment 28•10 years ago
|
||
I disable sccache (comment out http://mxr.mozilla.org/mozilla-aurora/source/browser/config/mozconfigs/win32/nightly#9) to get aurora PGO to run on try.
Comment 29•10 years ago
|
||
It likely means OOM. Which means one more way unified builds is hiding problems, and also that we're screwed.
Comment 30•10 years ago
|
||
Are we not doing unified builds on Aurora?
Reporter | ||
Comment 31•10 years ago
|
||
Correct, unified builds are trunk-only.
Comment 32•10 years ago
|
||
... but that was a try push of an aurora revision, so it /should/ build fine.
Assignee | ||
Comment 33•10 years ago
|
||
Backed out because of bug 1053188: https://hg.mozilla.org/integration/mozilla-inbound/rev/dd3712012832
Assignee | ||
Comment 34•10 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #28)
> I disable sccache (comment out
> http://mxr.mozilla.org/mozilla-aurora/source/browser/config/mozconfigs/win32/
> nightly#9) to get aurora PGO to run on try.
Thanks - that seemed to help. Though I don't understand why sccache would affect linking...
In any case, I'll use that when we get a new patch out.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 35•10 years ago
|
||
This is untested, but I think this approach should work. This replaces the call with expandlibs_exec to call link.py instead, and then makes link.py import expandlibs_exec and act as a very thin wrapper around it. mshal said he'd test this patch.
Updated•10 years ago
|
Assignee: mshal → ted
Status: REOPENED → ASSIGNED
Updated•10 years ago
|
Assignee: ted → mshal
Reporter | ||
Comment 36•10 years ago
|
||
(In reply to Michael Shal [:mshal] from comment #33)
> Backed out because of bug 1053188:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/dd3712012832
Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/dd3712012832
Assignee | ||
Comment 37•10 years ago
|
||
Move build/link.py into config/. Otherwise the next patch imports buildconfig.py from the srcdir, which means it can't find config.status. I'm separating this out to make reviewing the next patch easier.
Attachment #8473142 -
Flags: review?(gps)
Assignee | ||
Comment 38•10 years ago
|
||
Updated version of ted's patch that seems to work.
try with PGO: https://tbpl.mozilla.org/?tree=Try&rev=d1c0c62ed69e
general try without PGO: https://tbpl.mozilla.org/?tree=Try&rev=f9b6c9b71a7e
ted, can you provide feedback on the changes I've made to your patch? I believe this is a rough summary:
- use t[0] to modify t from callback
- use args[1:] to avoid passing link.py into expandlibs_exec
- return the returncode from expandlibs_exec.main()
- include config.mk to override EXPAND_LIBS_EXEC
Attachment #8471042 -
Attachment is obsolete: true
Attachment #8472344 -
Attachment is obsolete: true
Attachment #8473143 -
Flags: review?(gps)
Attachment #8473143 -
Flags: feedback?(ted)
Assignee | ||
Comment 39•10 years ago
|
||
FYI, the linker vsize in the PGO job is 3064848384, which seems to be in line with existing builds (vs 76M :)
Comment 40•10 years ago
|
||
Comment on attachment 8473143 [details] [diff] [review]
0002-Bug-1047621-Have-link.py-import-and-call-expandlibs_.patch
Review of attachment 8473143 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks for taking this and fixing it up!
::: config/link.py
@@ +43,5 @@
> + def callback(proc):
> + t[0] = threading.Thread(target=measure_vsize_threadfunc,
> + args=(proc, output_file))
> + t[0].start()
> + exitcode = expandlibs_exec.main(args[1:], proc_callback=callback)
nit: I might put the args[1:] bit down in the sys.exit line.
Attachment #8473143 -
Flags: feedback?(ted) → feedback+
Updated•10 years ago
|
Attachment #8473142 -
Flags: review?(gps) → review+
Updated•10 years ago
|
Attachment #8473143 -
Flags: review?(gps) → review+
Assignee | ||
Comment 41•10 years ago
|
||
Reporter | ||
Comment 42•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/552170c05e9f
https://hg.mozilla.org/mozilla-central/rev/ff64d9bbc03c
Status: ASSIGNED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 43•10 years ago
|
||
FWIW, these patches will need some rebasing for Aurora uplift.
Reporter | ||
Comment 44•10 years ago
|
||
b-2008-ix slaves are routinely hitting 19800 second timeouts now too:
https://tbpl.mozilla.org/php/getParsedLog.php?id=46140402&tree=Mozilla-Aurora
Assignee | ||
Comment 45•10 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #44)
> b-2008-ix slaves are routinely hitting 19800 second timeouts now too:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=46140402&tree=Mozilla-Aurora
This is a different timeout - it's the 5.5 hour overall buildbot timer. Unfortunately the patches in this bug won't do anything to affect the full runtime (though bug 1050508 will speed things up for sm machines). We'd probably need a separate bug for this, with a simple short-term workaround of just upping the timeout in buildbot.
Assignee | ||
Comment 46•10 years ago
|
||
Though I should note that the second PGO link is taking about an extra hour in that log vs. what I usually see (the ix machines normally have 3 "Still linking, 30 minutes have passed" messages, but this one has 5). So, maybe something else is up.
Assignee | ||
Comment 47•10 years ago
|
||
Comment 48•10 years ago
|
||
(In reply to Michael Shal [:mshal] from comment #46)
> Though I should note that the second PGO link is taking about an extra hour
> in that log vs. what I usually see (the ix machines normally have 3 "Still
> linking, 30 minutes have passed" messages, but this one has 5). So, maybe
> something else is up.
I assume this is entirely due to non-unified builds. Can anyone explain to me why we don't do unified builds on branches? Are we just gluttons for punishment?
Comment 49•10 years ago
|
||
See bug 945496 comment 0.
Comment 50•10 years ago
|
||
I had to track down said dev-platform discussion, I think this is the most relevant bit:
https://lists.mozilla.org/pipermail/dev-platform/2013-December/002393.html
"I believe that we should disable unified
builds on mozilla-aurora and all branches past that point, perhaps
indefinitely. Unified builds don't really buy us that much for code
that gets compiled on those branches anyway, and if developers want to
have them locally then they can always pref it on in their mozconfigs."
Given that that claim is demonstrably false, we could definitely revisit this. We have --disable-unified builds on TBPL (I've been backed out for it!) which should be a safeguard against things breaking in weird ways.
Comment 51•10 years ago
|
||
Except afaik we still have, on unified builds, problems with crash reports and probably some other stuff still not figured.
Comment 52•10 years ago
|
||
Or maybe there's no problem anymore (cf. bug 943695 and the other things mentioned in https://lists.mozilla.org/pipermail/dev-platform/2013-November/002367.html)
Reporter | ||
Comment 53•10 years ago
|
||
We're seeing some green Windows builds now, yay :)
Landing this on beta too as we've started seeing instances there as well now.
https://hg.mozilla.org/releases/mozilla-beta/rev/2a617532286d
https://hg.mozilla.org/releases/mozilla-beta/rev/a09c51fcbd98
https://hg.mozilla.org/releases/mozilla-beta/rev/bd02db1d22d0
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•