Closed Bug 1047621 Opened 10 years ago Closed 10 years ago

There hasn't been a single successful Windows opt/nightly build off a b-2008-sm slave since the uplift

Categories

(Release Engineering :: General, defect)

Platform: x86, Windows 8.1
Type: defect
Priority: Not set
Severity: critical

Tracking

(firefox32 fixed, firefox33 fixed, firefox34 fixed)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: mshal)

References

Details

Attachments

(2 files, 5 obsolete files)

Every attempt to build a Windows opt/nightly build on a b-2008-sm slave since the uplift has timed out like the log below:
https://tbpl.mozilla.org/php/getParsedLog.php?id=44531250&tree=Mozilla-Aurora

Beta 32 doesn't appear to have this issue. We don't see this on trunk PGO builds probably because unified builds consume significantly fewer resources and run so much faster.

Can we only use b-2008-ix slaves for non-debug Windows builds on Aurora?
Depends on: 1048329
Callek suspected that bug 1045133 might be in play, but it looks like xul is only linked twice (not a third time like in that bug), so I don't think that is currently the issue. It might still be related, though.

I tried to reproduce this by logging into b-2008-sm-0046 and running the final xul pgo link manually, but I hit some weird linker errors (it couldn't find some symbols that should be present in gkmedias). I think that may have been from some files being in a funky state when buildbot killed the previous link job, though.

I did kick off a new full build manually at the command line, and this time it linked correctly. However, the final PGO link took approximately 3 hours 6 minutes, which is just a few minutes above the 3-hour no-output buildbot timeout that we hit. So, I think we just added enough new code to push us over the limit.

However, there is supposed to be a keep-alive message spit out by link.py every 30 minutes, which I don't actually see displayed while it's running. Maybe if we fix that, that'll buy us enough time to get a build? (At least up to the 5.5-hour full runtime timeout).
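
For illustration, the keep-alive idea is roughly a background thread that prints a heartbeat while the linker runs, so buildbot's no-output timer keeps getting reset. A heavily simplified sketch (not the actual build/link.py code; names are illustrative, and the 30-minute interval and message text are taken from the logs quoted later in this bug):

import subprocess
import threading

def run_with_heartbeat(cmd, interval=30 * 60):
    # Start the (potentially very slow) linker command.
    proc = subprocess.Popen(cmd)
    done = threading.Event()

    def heartbeat():
        minutes = 0
        # Print a line every `interval` seconds until the link finishes, so
        # the harness sees periodic output instead of hours of silence.
        while not done.wait(interval):
            minutes += interval // 60
            print "Still linking, %d minutes have passed..." % minutes

    t = threading.Thread(target=heartbeat)
    t.start()
    returncode = proc.wait()
    done.set()
    t.join()
    return returncode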

I still don't understand why the sm machine takes over 3 hours to do the pgo link, whereas on an ix machine it appears to take 1.5 - 2hrs. Are they really that different? It looks like they are both 2.4GHz, but different processor families (E21265L vs X3430).
> However, there is supposed to be a keep-alive message spit out by link.py
> every 30 minutes, which I don't actually see displayed while it's running.
> Maybe if we fix that, that'll buy us enough time to get a build? (At least
> up to the 5.5-hour full runtime timeout).

According to ehsan on IRC, the keep-alive message (from bug 915735) never actually worked in a real build, only when he invoked link.py manually. It turns out that the "Still linking, 30 minutes passed" messages from link.py are getting buffered by expandlibs_exec.py (at least since bug 837665 when expandlibs_exec was switched from subprocess.call() to subprocess.Popen()/communicate()). We should be able to fix this by printing the linker's output as it comes in, which would allow us up to the 5.5 hour max runtime.
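
For illustration, the difference is roughly between Popen()/communicate(), which collects all of the child's output and only hands it back when the child exits, and reading the pipe line by line. A minimal sketch of the latter (not the actual expandlibs_exec.py change; `args` here stands in for the wrapped linker command line, and as the review below points out this naive form only behaves well on Linux):

import subprocess
import sys

def run_streaming(args):
    # `args` is the wrapped linker command line (placeholder in this sketch).
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    # Relay each line as soon as the child produces it, instead of waiting
    # for communicate() to hand the whole buffer back at exit.
    for line in iter(proc.stdout.readline, ''):
        sys.stdout.write(line)
        sys.stdout.flush()
    return proc.wait()
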
Attached patch expandlibs_stdout.patch (obsolete) — Splinter Review
Does something like this seem reasonable? Or is there a better way to get unbuffered output from the subprocess? It looks like the .communicate() is still needed to get the return code at least, though I'm not sure if we still need the stdout from that function.

Also, as an FYI: since the failure code is determined after the output is displayed, the display might be a little misleading if the linker command fails. E.g.:

Output line 1
Output line 2
Output line 3
Output line 4
Output line 5
Executing: mybuild.sh /home/mshal/mozilla-central-git/obj-x86_64-unknown-linux-gnu/toolkit/library/tmpcAzdov.list
/home/mshal/mozilla-central-git/obj-x86_64-unknown-linux-gnu/toolkit/library/tmpcAzdov.list:
    INPUT("../components/perf/PerfMeasurement.o")

(The output lines from the script come before the "Executing: " line, which could be a little confusing. Maybe it could just be re-worded.)
Attachment #8467898 - Flags: feedback?(mh+mozilla)
Attachment #8467898 - Flags: feedback?(gps)
Comment on attachment 8467898 [details] [diff] [review]
expandlibs_stdout.patch

Review of attachment 8467898 [details] [diff] [review]:
-----------------------------------------------------------------

This approach works... on Linux.

Windows has this awful pipe buffering behavior where output isn't emitted until the buffer fills. Even if you are doing proper POSIX flush() and readline(), Windows will gladly hold on to your data for you. You essentially can't do line buffering with subprocess + pipes on Windows. Well, you can, but it requires a lot of extra code to do it right. This extra code is already written and part of mozprocess.

The correct patch is to modify expandlibs_exec.py to use mozprocess.
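
For reference, a hedged sketch of what "use mozprocess" looks like in practice (assuming mozprocess's ProcessHandlerMixin and its processOutputLine callback; this is not the attached patch):

import sys

from mozprocess import ProcessHandlerMixin

def print_line(line):
    # mozprocess invokes this for each line of child output as it arrives,
    # including on Windows, so nothing sits in a pipe buffer for hours.
    print line
    sys.stdout.flush()

def run_with_mozprocess(args):
    # `args` is the wrapped linker command line (placeholder in this sketch).
    p = ProcessHandlerMixin(args, processOutputLine=[print_line])
    p.run()
    p.wait()
    return p.returncode

The run()/wait() split also leaves the caller with a handle on the process object, which is what the vsize-measuring callback later in this bug hooks into.
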
Attachment #8467898 - Flags: feedback?(mh+mozilla)
Attachment #8467898 - Flags: feedback?(gps)
Attachment #8467898 - Flags: feedback-
Attached patch expandlibs_stdout.patch (obsolete) — Splinter Review
That's odd, I tested it on Windows too and it seemed to work as expected (each line was displayed as it was printed by the subprocess). Maybe since I was running through a shell?

In any case, using mozprocess seems nicer because we can send the keep-alive messages straight to stdout, and still buffer the rest of the output for display underneath the command-line in case the command fails (so it is more consistent with the behavior from bug 837665).
Attachment #8467898 - Attachment is obsolete: true
Attachment #8468001 - Flags: review?(gps)
Comment on attachment 8468001 [details] [diff] [review]
expandlibs_stdout.patch

Review of attachment 8468001 [details] [diff] [review]:
-----------------------------------------------------------------

Maybe I was wrong about the exact Windows buffering issue. I know for sure there is a giant red flag around line-buffered I/O on Windows and that "use mozprocess" is the answer. Ted could tell you the exact issue.

::: config/expandlibs_exec.py
@@ +339,5 @@
> +        def outputHandler(line):
> +            if line.startswith('Still linking'):
> +                # Keep-alive messages from build/link.py - print these right
> +                # away
> +                print line

We should probably throw a sys.stdout.flush() in here, just to be safe.

@@ +354,5 @@
>          if proc.returncode and not options.verbose:
>              print_command(sys.stderr, args)
> +        if output_lines:
> +            sys.stderr.write('\n'.join(output_lines) + '\n')
> +            sys.stderr.flush()

You /could/ use sys.stderr.writelines(). But we're not dealing with massive buffers or tight loops, so meh.
Attachment #8468001 - Flags: review?(gps) → review+
Note there's another bug about this issue, that had a (wrong) patch and surrounding discussion.
Ahh, should we go the route of moving the 30-minute timer to expandlibs_exec.py? My most recent attempt failed on try. Turns out processhandler.py *also* has a timeout (180 seconds?) which we obviously hit. At least, that's what I think is happening:

Traceback (most recent call last):
  File "c:/builds/moz2_slave/try-w32-0000000000000000000000/build/config/expandlibs_exec.py", line 364, in <module>
    main()
  File "c:/builds/moz2_slave/try-w32-0000000000000000000000/build/config/expandlibs_exec.py", line 354, in main
    proc.wait()
  File "c:\builds\moz2_slave\try-w32-0000000000000000000000\build\testing\mozbase\mozprocess\mozprocess\processhandler.py", line 824, in wait
    self.returncode = self.proc.wait()
  File "c:\builds\moz2_slave\try-w32-0000000000000000000000\build\testing\mozbase\mozprocess\mozprocess\processhandler.py", line 162, in wait
    self.returncode = self._wait()
  File "c:\builds\moz2_slave\try-w32-0000000000000000000000\build\testing\mozbase\mozprocess\mozprocess\processhandler.py", line 462, in _wait
    raise OSError(err)
OSError: IO Completion Port failed to signal process shutdown

I was just going to try passing in a higher timeout from expandlibs_exec, though it currently isn't a parameter for mozprocess.
It's just simpler to copy the 30-minute timer (don't move it, copy it, because expandlibs_exec is probably going to die soonish)
Actually, there might be something even simpler: invert the wrapping. That is, instead of doing expandlibs_exec.py link.py, do link.py expandlibs_exec.py. This should only require adding an include of config.mk and wrapping EXPAND_LD instead of LD in toolkit/library.
Oh, that's a good plan, I like that.
Turns out we want EXPAND_MKSHLIB, not EXPAND_LD. But EXPAND_MKSHLIB uses $@ so it needs lazy evaluation. I ended up going with modifying EXPAND_LIBS_EXEC itself, which seems to be working on a build machine (at least, I saw a "Still linking..." message after 30 mins). I still need to run it through try once that's open again, though.

Also since config.mk ends up defining LIBXUL_LIBRARY, I had to remove it from the _DEPRECATED_VARIABLES so rules.mk wouldn't complain.
Attachment #8468001 - Attachment is obsolete: true
Attachment #8468781 - Flags: review?(gps)
(In reply to Michael Shal [:mshal] from comment #16)
> Also since config.mk ends up defining LIBXUL_LIBRARY, I had to remove it
> from the _DEPRECATED_VARIABLES so rules.mk wouldn't complain.

I'd rather do something like bug 1040649 for config.mk.
(In reply to Mike Hommey [:glandium] from comment #17)
> (In reply to Michael Shal [:mshal] from comment #16)
> > Also since config.mk ends up defining LIBXUL_LIBRARY, I had to remove it
> > from the _DEPRECATED_VARIABLES so rules.mk wouldn't complain.
> 
> I'd rather do something like bug 1040649 for config.mk.

Filed bug 1050029. I am testing a patch.
(In reply to Mike Hommey [:glandium] from comment #18)
> I'd rather do something like bug 1040649 for config.mk.
> 
> Filed bug 1050029. I am testing a patch.

Since we'll presumably need to uplift this to aurora, will adding an extra dependent bug cause unnecessary risk? I'm not too familiar with that process, so I don't know if we'd just be cherry-picking the single patch, or merging from m-c.

For m-c itself I'm fine with not removing LIBXUL_LIBRARY from _DEPRECATED_VARIABLES and being dependent on 1050029 - just wondering how we handle the uplift.
I managed to sneak a PGO try run in today in between the hg.m.o craziness:

https://tbpl.mozilla.org/?tree=Try&rev=cd041b6be923
(In reply to Michael Shal [:mshal] from comment #19)
> For m-c itself I'm fine with not removing LIBXUL_LIBRARY from
> _DEPRECATED_VARIABLES and being dependent on 1050029 - just wondering how we
> handle the uplift.

We can uplift both, although uplifting 1050029 requires manual work. It's probably fine to have the version that removes LIBXUL_LIBRARY for uplift.
> We can uplift both, although uplifting 1050029 requires manual work.

To clarify, the manual work is for the lists of variables, which are likely different on each branch.
The patch in this bug also requires some manual work, since aurora has toolkit/library/build vs m-c's toolkit/library.
Depends on: 1050508
Attachment #8468781 - Flags: review?(gps) → review+
Rebased patch now that it no longer needs to remove LIBXUL_LIBRARY from _DEPRECATED_VARIABLES (see #c17).

try: https://tbpl.mozilla.org/?tree=Try&rev=d02592374be6
Attachment #8468781 - Attachment is obsolete: true
Attachment #8471042 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/5e9b13c3595b
Assignee: nobody → mshal
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Depends on: 1050029
I tried uplifting this patch (and the dependent bug 1050029) to aurora and pushed that to try, but I'm hitting a "Not enough space" error when running the LTCG:PGINSTRUMENT link. The clean_old_builds step claims there is 171.51GB of disk space free, and bug 582335 seems to suggest this is a problem that should have been fixed in VS2010. I have PGO enabled in mozconfig.common.override, per: https://wiki.mozilla.org/ReleaseEngineering/TryChooser#What_if_I_want_PGO_for_my_build

Here's the try:

https://tbpl.mozilla.org/?tree=Try&rev=1ba27541691c

Curiously, I hit the same errors even if I don't have my patches uplifted (just the mozconfig.common.override change to flip on PGO: https://tbpl.mozilla.org/?tree=Try&rev=a1f71583f9d9), so I'm guessing this is just not the right way to test aurora? Anyone have any suggestions here?
I disable sccache (comment out http://mxr.mozilla.org/mozilla-aurora/source/browser/config/mozconfigs/win32/nightly#9) to get aurora PGO to run on try.
It likely means OOM. Which means one more way unified builds are hiding problems, and also that we're screwed.
Are we not doing unified builds on Aurora?
Correct, unified builds are trunk-only.
... but that was a try push of an aurora revision, so it /should/ build fine.
Depends on: 1053188
(In reply to Phil Ringnalda (:philor) from comment #28)
> I disable sccache (comment out
> http://mxr.mozilla.org/mozilla-aurora/source/browser/config/mozconfigs/win32/
> nightly#9) to get aurora PGO to run on try.

Thanks - that seemed to help. Though I don't understand why sccache would affect linking...

In any case, I'll use that when we get a new patch out.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is untested, but I think this approach should work. It replaces the direct call to expandlibs_exec with a call to link.py, and then makes link.py import expandlibs_exec and act as a very thin wrapper around it. mshal said he'd test this patch.
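
A hedged sketch of the shape being described (pieced together from the review comments further down in this bug, not the landed attachment; the output-file name is a stand-in, and measure_vsize_threadfunc stands in for link.py's existing linker-memory poller):

import sys
import threading

import expandlibs_exec  # lives in config/ alongside this wrapper

def measure_vsize_threadfunc(proc, output_file):
    # Stand-in for link.py's existing thread that polls the linker's memory
    # use; the real implementation is unchanged by this bug.
    pass

def main(args):
    output_file = 'linker-vsize'  # hypothetical; the real name comes from link.py
    t = [None]  # one-element list so the nested callback can assign to it

    def callback(proc):
        # expandlibs_exec calls this once the real linker process exists,
        # so the vsize-measuring thread can attach to it.
        t[0] = threading.Thread(target=measure_vsize_threadfunc,
                                args=(proc, output_file))
        t[0].start()

    # args[0] is link.py itself, so only args[1:] is forwarded.
    exitcode = expandlibs_exec.main(args[1:], proc_callback=callback)
    if t[0]:
        t[0].join()
    return exitcode

if __name__ == '__main__':
    sys.exit(main(sys.argv))
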
Assignee: mshal → ted
Status: REOPENED → ASSIGNED
Assignee: ted → mshal
Move build/link.py into config/. Otherwise the next patch imports buildconfig.py from the srcdir, which means it can't find config.status. I'm separating this out to make reviewing the next patch easier.
Attachment #8473142 - Flags: review?(gps)
Updated version of ted's patch that seems to work.

try with PGO: https://tbpl.mozilla.org/?tree=Try&rev=d1c0c62ed69e
general try without PGO: https://tbpl.mozilla.org/?tree=Try&rev=f9b6c9b71a7e

ted, can you provide feedback on the changes I've made to your patch? I believe this is a rough summary:

 - use t[0] to modify t from callback
 - use args[1:] to avoid passing link.py into expandlibs_exec
 - return the returncode from expandlibs_exec.main()
 - include config.mk to override EXPAND_LIBS_EXEC
Attachment #8471042 - Attachment is obsolete: true
Attachment #8472344 - Attachment is obsolete: true
Attachment #8473143 - Flags: review?(gps)
Attachment #8473143 - Flags: feedback?(ted)
FYI, the linker vsize in the PGO job is 3064848384, which seems to be in line with existing builds (vs 76M :)
Comment on attachment 8473143 [details] [diff] [review]
0002-Bug-1047621-Have-link.py-import-and-call-expandlibs_.patch

Review of attachment 8473143 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for taking this and fixing it up!

::: config/link.py
@@ +43,5 @@
> +    def callback(proc):
> +        t[0] = threading.Thread(target=measure_vsize_threadfunc,
> +                             args=(proc, output_file))
> +        t[0].start()
> +    exitcode = expandlibs_exec.main(args[1:], proc_callback=callback)

nit: I might put the args[1:] bit down in the sys.exit line.
Attachment #8473143 - Flags: feedback?(ted) → feedback+
Attachment #8473142 - Flags: review?(gps) → review+
Attachment #8473143 - Flags: review?(gps) → review+
https://hg.mozilla.org/mozilla-central/rev/552170c05e9f
https://hg.mozilla.org/mozilla-central/rev/ff64d9bbc03c
Status: ASSIGNED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
FWIW, these patches will need some rebasing for Aurora uplift.
b-2008-ix slaves are routinely hitting 19800 second timeouts now too:
https://tbpl.mozilla.org/php/getParsedLog.php?id=46140402&tree=Mozilla-Aurora
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #44)
> b-2008-ix slaves are routinely hitting 19800 second timeouts now too:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=46140402&tree=Mozilla-Aurora

This is a different timeout - it's the 5.5 hour overall buildbot timer. Unfortunately the patches in this bug won't do anything to affect the full runtime (though bug 1050508 will speed things up for sm machines). We'd probably need a separate bug for this, with a simple short-term workaround of just upping the timeout in buildbot.
Though I should note that the second PGO link is taking about an extra hour in that log vs. what I usually see (the ix machines normally have 3 "Still linking, 30 minutes have passed" messages, but this one has 5). So, maybe something else is up.
(In reply to Michael Shal [:mshal] from comment #46)
> Though I should note that the second PGO link is taking about an extra hour
> in that log vs. what I usually see (the ix machines normally have 3 "Still
> linking, 30 minutes have passed" messages, but this one has 5). So, maybe
> something else is up.

I assume this is entirely due to non-unified builds. Can anyone explain to me why we don't do unified builds on branches? Are we just gluttons for punishment?
I had to track down said dev-platform discussion, I think this is the most relevant bit:
https://lists.mozilla.org/pipermail/dev-platform/2013-December/002393.html

"I believe that we should disable unified 
builds on mozilla-aurora and all branches past that point, perhaps 
indefinitely.  Unified builds don't really buy us that much for code 
that gets compiled on those branches anyway, and if developers want to 
have them locally then they can always pref it on in their mozconfigs."

Given that that claim is demonstrably false, we could definitely revisit this. We have --disable-unified builds on TBPL (I've been backed out for it!) which should be a safeguard against things breaking in weird ways.
Except AFAIK we still have problems on unified builds with crash reports, and probably some other things that still haven't been figured out.
Or maybe there's no problem anymore (cf. bug 943695 and the other things mentioned in https://lists.mozilla.org/pipermail/dev-platform/2013-November/002367.html)
See Also: → 1055876
Component: General Automation → General