Last Comment Bug 730196 - Frequent Windows build timeouts during make_pkg or make package pretty buildstep, often with a "--- Press `Q' to quit, or any other key to continue ---" prompt
: Frequent Windows build timeouts during make_pkg or make package pretty builds...
Status: RESOLVED FIXED
[qa-]
: intermittent-failure
Product: Core
Classification: Components
Component: General (show other bugs)
: Trunk
: x86 Windows Server 2003
: -- critical (vote)
: mozilla14
Assigned To: Nathan Froyd [:froydnj]
:
Mentors:
: 735302 (view as bug list)
Depends on:
Blocks: 438871 711297 726656
  Show dependency treegraph
 
Reported: 2012-02-23 21:23 PST by Phil Ringnalda (:philor)
Modified: 2012-11-25 19:31 PST (History)
13 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed


Attachments
patch (1.19 KB, patch)
2012-03-12 08:36 PDT, Nathan Froyd [:froydnj]
ted: review+
bugzilla: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description Phil Ringnalda (:philor) 2012-02-23 21:23:33 PST
Six of one whether this is releng, or packaging. Personally, I'm betting on some Windows zip limitation.

https://tbpl.mozilla.org/php/getParsedLog.php?id=9584873&tree=Firefox
WINNT 5.2 mozilla-central pgo-build on 2012-02-23 18:00:09 PST for push 6fe899b9b372

Compressing...
cd ../../dist && (cd firefox && rm -f omni.ja components/binary.manifest && grep -h '^binary-component' components/*.manifest > binary.manifest ; for m in components/*.manifest; do sed -e 's/^binary-component/#binary-component/' $m > tmp.manifest && mv tmp.manifest $m; done; zip -r9m omni.ja chrome chrome.manifest components/*.js components/*.xpt components/*.manifest modules res defaults greprefs.js jsloader hyphenation update.locale  -x chrome/icons/\* defaults/pref/channel-prefs.js res/cursors/\* res/MainMenu.nib/\*  &&  e:/builds/moz2_slave/m-cen-w32-pgo/build/obj-firefox/dist/bin/xpcshell.exe -g "$PWD" -a "$PWD" -f /e/builds/moz2_slave/m-cen-w32-pgo/build/toolkit/mozapps/installer/precompile_cache.js -e "populate_startupcache('GreD', 'omni.ja', 'startupCache.zip');" && rm -rf jsloader && unzip startupCache.zip && rm startupCache.zip && zip -r9m omni.ja jsloader/resource/gre && d:/mozilla-build/python25/python2.5.exe /e/builds/moz2_slave/m-cen-w32-pgo/build/config/optimizejars.py --optimize /e/builds/moz2_slave/m-cen-w32-pgo/build/obj-firefox/browser/installer/../../jarlog//en-US ./ ./ && mv binary.manifest components && printf "manifest components/binary.manifest\n" > chrome.manifest) && (cd firefox && d:/mozilla-build/python25/python2.5.exe /e/builds/moz2_slave/m-cen-w32-pgo/build/config/createprecomplete.py) && zip -r9D win32/en-US/nightly-13.0a1.zip firefox
...
  adding: update.locale (188 bytes security) (stored 0%) <-- where most of them die
...
Archive:  startupCache.zip
�\Q
--- Press `Q' to quit, or any other key to continue ---
command timed out: 1200 seconds without output, attempting to kill

We had six or eight of them on inbound yesterday, so I hounded a probably-guiltless patch out of the tree, then we had two more overnight, and now this one (first PGO I remember, the previous ones were split among Win32 opt and debug and Win64 opt).

Unfortunately, we're going to have a Mexican standoff between releng and build over who should figure out what is happening, because nobody's going to want to sit watching a slave make package over and over until it happens.
Comment 1 Phil Ringnalda (:philor) 2012-02-23 21:47:55 PST
That was w32-ix-slave27 (and possibly the only one that was in make package pretty rather than make_pkg, or possibly make package pretty is a PGO thing).

https://tbpl.mozilla.org/php/getParsedLog.php?id=9584748&tree=Mozilla-Inbound - w32-ix-slave30

https://tbpl.mozilla.org/php/getParsedLog.php?id=9560154&tree=Mozilla-Inbound - w32-ix-slave26

https://tbpl.mozilla.org/php/getParsedLog.php?id=9549136&tree=Mozilla-Inbound - w64-ix-slave15

https://tbpl.mozilla.org/php/getParsedLog.php?id=9542348&tree=Mozilla-Inbound - w32-ix-slave40
https://tbpl.mozilla.org/php/getParsedLog.php?id=9541533&tree=Mozilla-Inbound - w32-ix-slave25

https://tbpl.mozilla.org/php/getParsedLog.php?id=9538661&tree=Mozilla-Inbound - w32-ix-slave44

so it doesn't look at all good for blame-the-slave.
Comment 2 Phil Ringnalda (:philor) 2012-02-24 00:29:19 PST
At first, even though I wanted to blame optimizejars.py, I believed mak that the timing didn't work. However, bug 726656 landed on m-c on the 17th, so m-i then or maybe the 18th, either way I took last weekend off, and it's purple: we ignore purple. It could easily have happened ten or twenty times between last Friday/Saturday and yesterday's explosion without anyone ever paying it any real mind.

With any "luck" whatever broke during the time it wasn't working is also broken on m-a and m-b, and we'll get confirmation by them starting to fail too, though something only intermittent on our most-coalesced builds may take time to show up on branches like those that don't do many builds.

First shot fired :)

<nthomas>: I feel a slave loan coming on
Comment 8 Phil Ringnalda (:philor) 2012-02-24 23:14:38 PST
https://tbpl.mozilla.org/php/getParsedLog.php?id=9618613&tree=Mozilla-Inbound went a bit longer - the usual is stopping at the "adding: update.locale (188 bytes security) (stored 0%)" point, at the end of zipping up apparently, since this one continued on to

WARNING: CheckLinkStatus called on main thread! No check performed. Assuming link is up, status is unknown.: file e:/builds/moz2_slave/m-in-w32-dbg/build/netwerk/system/win32/nsNotifyAddrListener.cpp, line 414
resource:///components/txEXSLTRegExFunctions.js
...
resource:///modules/CSPUtils.jsm
WARNING: nsExceptionService ignoring thread destruction after shutdown: file e:/builds/moz2_slave/m-in-w32-dbg/build/xpcom/base/nsExceptionService.cpp, line 199
WARNING: NS_ENSURE_TRUE(mMainThread) failed: file e:/builds/moz2_slave/m-in-w32-dbg/build/xpcom/threads/nsThreadManager.cpp, line 292
WARNING: NS_ENSURE_TRUE(thread) failed: file e:/builds/moz2_slave/m-in-w32-dbg/build/obj-firefox/xpcom/build/nsThreadUtils.cpp, line 181
WARNING: NS_ENSURE_TRUE(thread) failed: file e:/builds/moz2_slave/m-in-w32-dbg/build/obj-firefox/xpcom/build/nsThreadUtils.cpp, line 181
WARNING: NS_ENSURE_TRUE(mMainThread) failed: file e:/builds/moz2_slave/m-in-w32-dbg/build/xpcom/threads/nsThreadManager.cpp, line 283
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0xC1F30001: file e:/builds/moz2_slave/m-in-w32-dbg/build/obj-firefox/xpcom/build/nsThreadUtils.cpp, line 168
nsStringStats
 => mAllocCount:          17261
 => mReallocCount:         1595
 => mFreeCount:           17261
 => mShareCount:          17419
 => mAdoptCount:           2410
 => mAdoptFreeCount:       2410
Archive:  startupCache.zip
Pú
--- Press `Q' to quit, or any other key to continue ---
command timed out: 1200 seconds without output, attempting to kill

so one zip file later.
Comment 9 Phil Ringnalda (:philor) 2012-02-24 23:46:40 PST
And, oddly enough, the build I started locally at about that time, on OS X, stopped at exactly that point too. I couldn't think of anything useful to do to diagnose it (other than to note that the terminal tab said I was running unzip), so I pressed the any key, and it happily continued packaging.
Comment 12 Phil Ringnalda (:philor) 2012-02-25 11:09:14 PST
https://tbpl.mozilla.org/php/getParsedLog.php?id=9627748&tree=Try
Comment 13 Phil Ringnalda (:philor) 2012-02-25 11:21:19 PST
https://tbpl.mozilla.org/php/getParsedLog.php?id=9617953&tree=Try
Comment 15 Phil Ringnalda (:philor) 2012-02-25 22:58:14 PST
https://tbpl.mozilla.org/php/getParsedLog.php?id=9632219&tree=Try
Comment 16 Phil Ringnalda (:philor) 2012-02-25 23:03:57 PST
https://tbpl.mozilla.org/php/getParsedLog.php?id=9634539&tree=Try
Comment 17 Phil Ringnalda (:philor) 2012-02-26 11:23:44 PST
https://tbpl.mozilla.org/php/getParsedLog.php?id=9637948&tree=Firefox

If you cause the mozilla-central nightly to fail, your severity is something other than "normal."
Comment 34 Ryan VanderMeulen [:RyanVM] 2012-02-28 16:17:37 PST
https://tbpl.mozilla.org/php/getParsedLog.php?id=9703210&tree=Try
Comment 62 Chris AtLee [:catlee] 2012-03-04 06:21:14 PST
https://hg.mozilla.org/integration/mozilla-inbound/rev/8937b91b8435 shows up as suspect after a bunch of bisecting
Comment 63 Phil Ringnalda (:philor) 2012-03-04 16:16:35 PST
https://tbpl.mozilla.org/php/getParsedLog.php?id=9813395&tree=Mozilla-Inbound

That's a pretty crazy thing to be sending unzip into spewing bits to the terminal.
Comment 72 Chris AtLee [:catlee] 2012-03-06 08:16:24 PST
I can reproduce it here:

https://tbpl.mozilla.org/?tree=Try&rev=80230f9c3fc8 (based on 8937b91b8435)

but not here:

https://tbpl.mozilla.org/?tree=Try&rev=1ba37a8e8b98 (based on 8ca3fd79d50c)
Comment 73 Chris AtLee [:catlee] 2012-03-06 09:10:47 PST
stupid bugzilla
Comment 74 Chris AtLee [:catlee] 2012-03-06 10:23:15 PST
after more retrying, looks like 8ca3fd79d50c also fails...which shifts the blame onto https://hg.mozilla.org/integration/mozilla-inbound/rev/8ca3fd79d50c
Comment 75 Nick Thomas [:nthomas] 2012-03-06 14:12:56 PST
CC' gavin as assignee of bug 729767.
Comment 77 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2012-03-07 08:44:49 PST
(In reply to Chris AtLee [:catlee] from comment #74)
> after more retrying, looks like 8ca3fd79d50c also fails...which shifts the
> blame onto
> https://hg.mozilla.org/integration/mozilla-inbound/rev/8ca3fd79d50c

That would surprise me.  I think we should try backing out the omnijar ordering thing and seeing if that fixes the problem.
Comment 93 Phil Ringnalda (:philor) 2012-03-10 21:57:33 PST
https://tbpl.mozilla.org/php/getParsedLog.php?id=9979892&tree=Try
Comment 102 Nick Thomas [:nthomas] 2012-03-11 20:24:04 PDT
* All the errors occur in the 'make installer' step, ending with
    Archive:  startupCache.zip
    <some binary data>
    --- Press `Q' to quit, or any other key to continue ---
    command timed out: 1200 seconds without output, attempting to kill
* the relevant code is 
  http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/installer/packager.mk#486  (CC mhommey)
* it affects both win32 and win64 compile jobs; pgo, non-pgo and debug

Possible regressors:
* bug 726656 started optimizing jars again at https://hg.mozilla.org/mozilla-central/rev/f88a05e00f47. This was broken by renaming omni.jar to omni.ja (bug 701875) at https://hg.mozilla.org/mozilla-central/rev/3f0b94325b80, some two months earlier

* there were a five landings in startupCache/ on Feb. 22nd from bug 711297

The latter fits pretty well and is going to be much simpler to investigate, so throwing this over to froydnj to do that. If RelEng can help out with access to a windows machine you have only to ask.
Comment 105 Nathan Froyd [:froydnj] 2012-03-12 06:22:00 PDT
(In reply to Nick Thomas [:nthomas] from comment #102)
> * All the errors occur in the 'make installer' step, ending with
>     Archive:  startupCache.zip
>     <some binary data>
>     --- Press `Q' to quit, or any other key to continue ---
>     command timed out: 1200 seconds without output, attempting to kill
>
> * there were a five landings in startupCache/ on Feb. 22nd from bug 711297

This looks like unzip is attempting to print out the binary data we stash in the zipfile comment, which is not going to end well.  Fixing that is easy enough.
Comment 107 Nathan Froyd [:froydnj] 2012-03-12 08:36:27 PDT
Created attachment 604967 [details] [diff] [review]
patch

Add -q when unzipping the startupCache.zip file so we don't print the binary data in the comment field.
Comment 108 Ted Mielczarek [:ted.mielczarek] 2012-03-12 08:41:45 PDT
Comment on attachment 604967 [details] [diff] [review]
patch

Review of attachment 604967 [details] [diff] [review]:
-----------------------------------------------------------------

Is there a reason this only happens intermittently? Does the data we jam into the comment field change, so it only causes failures sometimes?

::: toolkit/mozapps/installer/packager.mk
@@ +479,5 @@
>  PRECOMPILE_RESOURCE=gre
>  PRECOMPILE_GRE=$$PWD
>  endif
>  
> +# Silence the unzip step so we don't print any binary data from the comment field.

This comment would probably fit better right above the unzip line itself.
Comment 109 Mozilla RelEng Bot 2012-03-12 08:42:13 PDT
Autoland Patchset:
	Patches: 604967
	Branch: mozilla-central => try
	Destination: http://hg.mozilla.org/try/pushloghtml?changeset=be056fdb0132
Try run started, revision be056fdb0132. To cancel or monitor the job, see: https://tbpl.mozilla.org/?tree=Try&rev=be056fdb0132
Comment 110 Nathan Froyd [:froydnj] 2012-03-12 08:44:37 PDT
(In reply to Ted Mielczarek [:ted] from comment #108)
> Is there a reason this only happens intermittently? Does the data we jam
> into the comment field change, so it only causes failures sometimes?

We're stuffing an 8-byte timestamp in there; presumably the bytes in the timestamp every so often cause heartache for the Windows shell (cmd.exe?) when it's trying to decode them.

> > +# Silence the unzip step so we don't print any binary data from the comment field.
> 
> This comment would probably fit better right above the unzip line itself.

Duly noted.  I wasn't sure that'd work, but I guess that's what try is for!
Comment 111 Ted Mielczarek [:ted.mielczarek] 2012-03-12 09:34:09 PDT
(In reply to Nathan Froyd (:froydnj) from comment #110)
> We're stuffing an 8-byte timestamp in there; presumably the bytes in the
> timestamp every so often cause heartache for the Windows shell (cmd.exe?)
> when it's trying to decode them.

Ah, that makes sense.

> > This comment would probably fit better right above the unzip line itself.
> 
> Duly noted.  I wasn't sure that'd work, but I guess that's what try is for!

The pedantically-correct way to do this is to put the # at the left margin, so it gets parsed as a Makefile comment, but I'm not 100% sure how that works inside a continuation. You might need to just have <tab># blah \ to get it to parse properly. Or it's possible that that just won't parse and you have to live with what you have.
Comment 112 Nathan Froyd [:froydnj] 2012-03-12 10:38:50 PDT
(In reply to Ted Mielczarek [:ted] from comment #111)
> The pedantically-correct way to do this is to put the # at the left margin,
> so it gets parsed as a Makefile comment, but I'm not 100% sure how that
> works inside a continuation. You might need to just have <tab># blah \ to
> get it to parse properly. Or it's possible that that just won't parse and
> you have to live with what you have.

Try (windows only) seems to indicate this doesn't work:

https://tbpl.mozilla.org/?tree=Try&rev=dbeb08eb7573

I'm just going to call the current comment-not-quite-close-to-code version good enough.
Comment 113 Nathan Froyd [:froydnj] 2012-03-12 11:19:14 PDT
Try looks good (green builds for almost all major platforms), so marking checkin-needed.
Comment 114 Chris AtLee [:catlee] 2012-03-12 11:31:18 PDT
I needed to rebuild the windows builds on try 6 or 7 times to make sure it wasn't still happening. Can you do that?
Comment 116 Nathan Froyd [:froydnj] 2012-03-12 13:00:56 PDT
(In reply to Chris AtLee [:catlee] from comment #114)
> I needed to rebuild the windows builds on try 6 or 7 times to make sure it
> wasn't still happening. Can you do that?

Pushed a 3rd attempt to try:

https://tbpl.mozilla.org/?tree=Try&rev=cdab92f07ac7

which along with an earlier windows-only run:

https://tbpl.mozilla.org/?tree=Try&rev=b58312ae711d

and the autoland run is three runs, at least.  I'm pretty confident this will fix things, though.
Comment 117 Chris AtLee [:catlee] 2012-03-12 13:20:26 PDT
(In reply to Nathan Froyd (:froydnj) from comment #116)
> (In reply to Chris AtLee [:catlee] from comment #114)
> > I needed to rebuild the windows builds on try 6 or 7 times to make sure it
> > wasn't still happening. Can you do that?
> 
> Pushed a 3rd attempt to try:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=cdab92f07ac7

You know you can trigger rebuilds via tbpl, right? Click on the win32 opt/debug build, and on the bottom left pane there will be a blue plus icon. Click that to re-trigger it.
Comment 118 Nathan Froyd [:froydnj] 2012-03-12 13:28:26 PDT
(In reply to Chris AtLee [:catlee] from comment #117)
> You know you can trigger rebuilds via tbpl, right? Click on the win32
> opt/debug build, and on the bottom left pane there will be a blue plus icon.
> Click that to re-trigger it.

I am enlightened.  Thanks!
Comment 130 Mozilla RelEng Bot 2012-03-12 18:04:26 PDT
Try run for be056fdb0132 is complete.
Detailed breakdown of the results available here:
	https://tbpl.mozilla.org/?tree=Try&rev=be056fdb0132
Results (out of 16 total builds):
    success: 16
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/autolanduser@mozilla.com-be056fdb0132
Comment 135 Mike Hommey [:glandium] 2012-03-13 10:44:21 PDT
*** Bug 735302 has been marked as a duplicate of this bug. ***
Comment 136 Mike Hommey [:glandium] 2012-03-13 12:03:04 PDT
*** Bug 735302 has been marked as a duplicate of this bug. ***
Comment 137 Mike Hommey [:glandium] 2012-03-13 12:22:59 PDT
*** Bug 735302 has been marked as a duplicate of this bug. ***
Comment 138 Nathan Froyd [:froydnj] 2012-03-13 12:25:47 PDT
(In reply to Chris AtLee [:catlee] from comment #114)
> I needed to rebuild the windows builds on try 6 or 7 times to make sure it
> wasn't still happening. Can you do that?

https://tbpl.mozilla.org/?tree=Try&rev=b58312ae711d

says that we've been green for about that many builds with the patch.  WDYT?
Comment 139 Chris AtLee [:catlee] 2012-03-13 12:26:54 PDT
LGTM
Comment 144 Ted Mielczarek [:ted.mielczarek] 2012-03-14 05:13:26 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/c7d9848ffff5
Comment 147 Marco Bonardo [::mak] 2012-03-15 07:59:06 PDT
https://hg.mozilla.org/mozilla-central/rev/c7d9848ffff5
Comment 149 Phil Ringnalda (:philor) 2012-03-17 18:06:20 PDT
https://hg.mozilla.org/releases/mozilla-aurora/rev/abf433daab89

Note You need to log in before you can comment on or make changes to this bug.