Closed Bug 1028304 Opened 11 years ago Closed 10 years ago

Test jobs on OSX on Cedar are busted.

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgraham, Unassigned)

Lots of errors of the form:

3 not in success codes: [0, 11]
Halting on failure while running ['unzip', '-q', '-o', '/builds/slave/talos-slave/test/build/firefox-33.0a1.en-US.mac.tests.zip']

I can't see anything obvious in the mozharness changes from production.
It's possible the upload was borked. Try a new osx build.
A build failed yesterday and another one today in the same way, so unless something changed it seems unlikely retrying will work (it also appears that debug builds were broken in this way as early as Monday).
Then I think the bug here is probably that osx *builds* are broken on cedar.
This might hint at something being wrong with osx builds on cedar - https://tbpl.mozilla.org/php/getParsedLog.php?id=42094331&full=1&branch=cedar although upload and sendchange appear 'normal'.
Actually, this unexpected-args error for log spans more than one platform and more than one day/rev?
I grabbed the build and tests package from one of the problematic builds and, assuming I got the right one, they seemed to work (or at least decompress) fine.
Assuming you're downloading the same installer+test zip that the test is, I would then suspect potential disk space related bustage. I would suspect proxxy but that seems to not be live.
Do we have some ETA on fixing this? Unfortunately it is the OSX builds that I'm most interested in :(
Have you asked your team or checked Cedar's changelog to see if someone changed something that broke them? I don't think any other tree is having issues.
We could reset cedar again if needed...
I guess https://hg.mozilla.org/projects/cedar/rev/48cb1e27d9a3 is a reasonable guess, as it seems to be the only cedar-specific change implicated in this breakage. Having said that, other than making the zipfiles larger it isn't obvious to me why it should have broken anything. It was also backed out from cedar, although later reintroduced in a merge. But it wasn't in the merge on 16th June when the debug tests started failing.

The error message on tbpl is:

13:15:10 INFO - error [/builds/slave/talos-slave/test/build/firefox-33.0a1.en-US.mac64.tests.zip]: reported length of central directory is
13:15:10 INFO - -76 bytes too long (Atari STZip zipfile? J.H.Holm ZIPSPLIT 1.1
13:15:10 INFO - zipfile?). Compensating...
13:15:16 INFO - error: expected central file header signature not found (file #14081).
13:15:16 INFO - (please check that you have transferred or created the zipfile in the
13:15:16 INFO - appropriate BINARY mode and that you have compiled UnZip properly)

Which does suggest that the problem is that the zip doesn't get created correctly. But the file seemed OK when I tried it. Maybe I need to test OSX's command line unzip.
I agree that testing an osx unzip problem on non-osx is a non-valid test.
So... I found Bug 971687, with the code at http://mxr.mozilla.org/build/source/buildbot-configs/mozilla-tests/config.py#1646 which is "enable mozbase unit tests on cedar".

I note the failing code in the log that was shared with me today:

mozversion.mozversion.LocalB2GVersion WARNING | Error pulling gaia file ok
======================================================================
ERROR: test_save_path (test.TestCrash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/builds/slave/ced-osx64-00000000000000000000/build/testing/mozbase/mozcrash/tests/test.py", line 97, in test_save_path
    quiet=True))
  File "/builds/slave/ced-osx64-00000000000000000000/build/testing/mozbase/mozcrash/mozcrash/mozcrash.py", line 92, in check_for_crashes
    save_dump_file(dump_save_path, dump["minidump_path"], dump["minidump_extra"])
  File "/builds/slave/ced-osx64-00000000000000000000/build/testing/mozbase/mozcrash/mozcrash/mozcrash.py", line 109, in save_dump_file
    os.path.join(dump_save_path, os.path.basename(dump_path)))
TypeError: log() takes exactly 2 arguments (3 given)

That's a failure of "log" in a mozbase test. Is this anything to worry about, dan?
To be clear, we're also failing on android atm with:

  File "/builds/tegra-129/test/build/tests/mozbase/mozlog/mozlog/structured/commandline.py", line 5, in <module>
    import argparse
ImportError: No module named argparse

Which again, is the in-tree mozlog
Taking http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/cedar-macosx64/1403318743/firefox-33.0a1.en-US.mac.tests.zip I hit a similar error. Snippet:

inflating: xpcshell/tests/xpcom/tests/unit/TestStringAPI
inflating: xpcshell/tests/xpcom/tests/unit/TestTArray
inflating: xpcshell/tests/xpcom/tests/unit/TestTextFormatter
inflating: xpcshell/tests/xpcom/tests/unit/TestThreadPoolListener
error: expected central file header signature not found (file #66185).
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
inflating: xpcshell/tests/xpcom/tests/unit/TestThreadUtils
inflating: xpcshell/tests/xpcom/tests/unit/TestTimers
inflating: xpcshell/tests/xpcom/tests/unit/TestUnicodeArguments
inflating: xpcshell/tests/xpcom/tests/unit/xpcshell.ini
inflating: xpcshell/tests/xpcshell.ini

I am not sure if that is in xpcshell or if the error just made it to output at that point. Either way, I think this suggests that the zip is corrupt, i.e. there is something wrong with the 'build/compile' job. Looking at the build job that uploaded the zip, 'make upload' appears to have uploaded fine and the SHAs match. So I'm not sure where the corruption is happening.

Also, isn't it a sign the cedar tree is unhealthy that 'make -k check' is failing on all our cedar 'build' jobs?
(tested on an os x machine)
There are mozbuild and mozlog differences between m-c and cedar. I'm going to resolve the mozbuild changes and see if that helps.
......additionally, based on Bug 989583 and https://tbpl.mozilla.org/?tree=Cedar&rev=48cb1e27d9a3 with https://tbpl.mozilla.org/?tree=Cedar&rev=bc4d904c46b8, it looks like it's possible that the "number of things in the zip" is at fault.

I think I've exhausted all the effort I can expend as buildduty. My suggestions:

* Do a try run of cedar tip with a handful of tests disabled (in a way that doesn't add them to the zip)
* Reset cedar again[?]
* Request an OSX build and test loaner and have someone figure out what needs to change in order to fix this

I'm not sure of the relative priorities here, so which method we go for will likely depend on those factors.
(In reply to Jonathan Griffin (:jgriffin) from comment #17)
> There are mozbuild and mozlog differences between m-c and cedar. I'm going
> to resolve the mozbuild changes and see if that helps.

Let's see how this fares: https://tbpl.mozilla.org/?tree=Cedar&showall=1&rev=f747fc6077ea
There are 66185 files in the archive; this seems suspiciously close to the Zip (not Zip64) limit of 65535. A working build had 63909 files. I strongly suspect this is a file number limit and that resetting cedar won't help. Other trees don't have web-platform-tests, so they are presumably well inside the limit. A little Googling suggests that the shipped unzip with OSX might not support Zip64 (although the shipped zip does). So I guess we need a newer version of unzip on these machines; presumably it's only a matter of time before we hit this limit in other places.
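A quick way to sanity-check the file-count theory against a given tests.zip could look like the sketch below. The helper names `count_entries` and `exceeds_classic_zip_limit` are made up for illustration and are not part of mozharness. It uses Python's `zipfile` module, which reads Zip64 archives itself, so it should still be able to count entries in an archive that an older unzip rejects.

```python
import zipfile

# Largest entry count a classic (non-Zip64) end-of-central-directory
# record can describe: the count is stored in a 16-bit field.
CLASSIC_ZIP_MAX_ENTRIES = 0xFFFF  # 65535


def count_entries(path_or_file):
    """Return the number of entries in a zip archive (path or file object)."""
    with zipfile.ZipFile(path_or_file) as zf:
        return len(zf.infolist())


def exceeds_classic_zip_limit(entry_count):
    """True if an archive with this many entries needs Zip64 support."""
    return entry_count > CLASSIC_ZIP_MAX_ENTRIES


if __name__ == "__main__":
    # Entry counts quoted in this bug: the failing build and a working one.
    for count in (66185, 63909):
        print(count, exceeds_classic_zip_limit(count))
```

Running `count_entries` on the downloaded tests.zip and passing the result to `exceeds_classic_zip_limit` would show which side of the limit a particular build falls on.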
(In reply to James Graham [:jgraham] from comment #20)
> There are 66185 files in the archive; this seems suspiciously close to the
> Zip (not Zip64) limit of 65535. A working build had 63909 files.
>
> I strongly suspect this is a file number limit and that resetting cedar
> won't help. Other trees don't have web-platform-tests, so they are
> presumably well inside the limit. A little Googling suggests that the
> shipped unzip with OSX might not support Zip64 (although the shipped zip
> does). So I guess we need a newer version of unzip on these machines;
> presumably it's only a matter of time before we hit this limit in other
> places.

Sounds logical.

This is somewhat related:
http://serverfault.com/questions/454935/zip-3-0-not-backwardly-compatible-with-zip-2-3-1

although unlike that situation, I believe we zip and unzip our tests.zip against version 5.52 on our build and test machines respectively. The key thing here is, we are using < 6.0 (6.0 supports zip64)
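For checking a machine against that 6.0 threshold, a rough sketch follows. It assumes the banner printed by `unzip -v` begins with text of the form "UnZip 5.52 of ..."; that format, the example banner strings, and the helper names are assumptions for illustration and should be checked against real slave output. A version of 6.0 or later does not by itself prove Zip64 was compiled in, so treat a positive result as a hint only.

```python
import re

# Per the comment above, UnZip releases before 6.0 lack Zip64 support.
ZIP64_MIN_UNZIP_VERSION = (6, 0)


def parse_unzip_version(banner):
    """Extract (major, minor) from an assumed 'UnZip X.YY ...' banner.

    Returns None when the banner does not match the assumed format.
    """
    match = re.search(r"UnZip (\d+)\.(\d+)", banner)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def unzip_may_support_zip64(banner):
    """True if the reported version is at least 6.0 (necessary, not sufficient)."""
    version = parse_unzip_version(banner)
    return version is not None and version >= ZIP64_MIN_UNZIP_VERSION


if __name__ == "__main__":
    # Hypothetical banner text, standing in for real `unzip -v` output.
    print(unzip_may_support_zip64("UnZip 5.52 of 28 February 2005, by Info-ZIP."))
```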
I imagine backing out bug 989583 on Cedar will help get things going in the short term.
Yeah, that works for me if there isn't a short term solution. I don't know if anyone else has plans for cedar that particularly depend on the work in that bug. Getting a proper fix here is, at least, a prerequisite for landing web-platform-tests on m-c, and not doing backouts will make keeping Cedar largely consistent with m-c that much easier, so it would be great if we could find some better solution relatively quickly (more quickly than we can implement the real solution of not shipping every test for every test job, for example).
(In reply to Jordan Lund (:jlund) from comment #21)
> (In reply to James Graham [:jgraham] from comment #20)
> > There are 66185 files in the archive; this seems suspiciously close to the
> > Zip (not Zip64) limit of 65535. A working build had 63909 files.
> >
> > I strongly suspect this is a file number limit and that resetting cedar
> > won't help. Other trees don't have web-platform-tests, so they are
> > presumably well inside the limit. A little Googling suggests that the
> > shipped unzip with OSX might not support Zip64 (although the shipped zip
> > does). So I guess we need a newer version of unzip on these machines;
> > presumably it's only a matter of time before we hit this limit in other
> > places.
>
> Sounds logical.
>
> This is somewhat related:
> http://serverfault.com/questions/454935/zip-3-0-not-backwardly-compatible-with-zip-2-3-1
>
> although unlike that situation, I believe we zip and unzip our tests.zip
> against version 5.52 on our build and test machines respectively. The key
> thing here is, we are using < 6.0 (6.0 supports zip64)

How big a task is it to update the slaves with a new version of zip?
(In reply to Jonathan Griffin (:jgriffin) from comment #24)
> (In reply to Jordan Lund (:jlund) from comment #21)
> > (In reply to James Graham [:jgraham] from comment #20)
> > > There are 66185 files in the archive; this seems suspiciously close to the
> > > Zip (not Zip64) limit of 65535. A working build had 63909 files.
> > >
> > > I strongly suspect this is a file number limit and that resetting cedar
> > > won't help. Other trees don't have web-platform-tests, so they are
> > > presumably well inside the limit. A little Googling suggests that the
> > > shipped unzip with OSX might not support Zip64 (although the shipped zip
> > > does). So I guess we need a newer version of unzip on these machines;
> > > presumably it's only a matter of time before we hit this limit in other
> > > places.
> >
> > Sounds logical.
> >
> > This is somewhat related:
> > http://serverfault.com/questions/454935/zip-3-0-not-backwardly-compatible-with-zip-2-3-1
> >
> > although unlike that situation, I believe we zip and unzip our tests.zip
> > against version 5.52 on our build and test machines respectively. The key
> > thing here is, we are using < 6.0 (6.0 supports zip64)
>
> How big a task is it to update the slaves with a new version of zip?

I see we have things like http://mxr.mozilla.org/build/source/puppet/modules/toplevel/manifests/slave/releng/build/standard.pp#13 but I'm not sure if we manage unzip itself with puppet. My guess is we use the default one that's installed with the machine. We should upgrade both our build and test osx slaves if we go this route.

dustin - any ideas or thoughts WRT upgrading unzip on osx in terms of how and feasibility?
Flags: needinfo?(dustin)
If updating zip on the OSX slaves is non-trivial, we're going to try to switch tests.zip to tar files, which should get around the 64k limitation.
At least on OS X, yes, we're using the zip/unzip that ship with OS X:

[root@bld-lion-r5-068.build.releng.scl3.mozilla.com ~]# which zip
/usr/bin/zip
[root@bld-lion-r5-068.build.releng.scl3.mozilla.com ~]# which unzip
/usr/bin/unzip

Updating that just means building a PKG/DMG and installing it. You'd need to make sure the updated version is earlier in the PATH (which should be as easy as putting the results in /usr/local/bin). It does seem like tar is a better choice, though.
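To illustrate the PATH-ordering point, here is a small sketch of how lookup order decides which binary gets picked up. The helper names `make_fake_tool` and `resolve_tool` are invented for the example, and the directories in the demo are temporary stand-ins for /usr/local/bin and /usr/bin, not the real slave layout.

```python
import os
import shutil
import stat
import tempfile


def make_fake_tool(directory, name):
    """Create a trivial executable shell script standing in for a real tool."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write("#!/bin/sh\nexit 0\n")
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


def resolve_tool(name, search_dirs):
    """Return the first match for `name`, searching the dirs in PATH order."""
    return shutil.which(name, path=os.pathsep.join(search_dirs))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as local_bin, \
            tempfile.TemporaryDirectory() as system_bin:
        make_fake_tool(system_bin, "unzip")  # stand-in for the OS-shipped unzip
        make_fake_tool(local_bin, "unzip")   # stand-in for the upgraded unzip
        # With the "local" dir listed first, the upgraded copy is found first.
        print(resolve_tool("unzip", [local_bin, system_bin]))
```

The same lookup applies to whatever PATH the buildbot slave process actually runs with, which may differ from the PATH of an interactive root shell.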
I did a backout of the package-all-tests patch for now, so hopefully that will be enough to unblock me. How hard is it going to be to switch to tar?
This is a partial list of where we have a tests.zip hardcoded: http://mxr.mozilla.org/build/search?string=tests.zip There may be others where we detect if it endswith('zip') and has 'tests' in the name, or go by regex. We probably need to change all of these to allow for either a tarball or zip, reconfig, make the change on one branch, and roll it out.
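The "allow for either a tarball or zip" logic might look roughly like the sketch below. It is not taken from buildbot-configs or mozharness: the function names, the list of tarball suffixes, and the idea of dispatching on the file name are all assumptions for illustration.

```python
import os
import tarfile
import zipfile

# Hypothetical set of tarball suffixes a tests package might use.
TAR_SUFFIXES = (".tar", ".tar.gz", ".tar.bz2")


def archive_kind(filename):
    """Classify a package by file name: 'zip', 'tar', or None."""
    if filename.endswith(".zip"):
        return "zip"
    if filename.endswith(TAR_SUFFIXES):
        return "tar"
    return None


def is_tests_archive(filename):
    """True if the name looks like a tests package in either format."""
    base = os.path.basename(filename)
    return "tests" in base and archive_kind(base) is not None


def extract_tests_archive(path, dest):
    """Extract a tests package into dest, whichever format it is in."""
    kind = archive_kind(path)
    if kind == "zip":
        with zipfile.ZipFile(path) as zf:
            zf.extractall(dest)
    elif kind == "tar":
        # Mode "r" lets tarfile detect gzip/bzip2 compression itself.
        with tarfile.open(path, "r") as tf:
            tf.extractall(dest)
    else:
        raise ValueError("unrecognised tests archive: %s" % path)
```

Dispatching on the file name rather than sniffing the contents mirrors the existing endswith('zip') checks, which keeps the change small at each of the hardcoded sites.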
As Aki noted, this change would have to ride the trains, so for several releases we'd have to have logic in many places that could deal with a zip or a tarball. As Callek pointed out on irc, switching to tar files may have implications for projects like Seamonkey and Thunderbird which do not use mozharness. As Catlee pointed out, we don't use tarballs currently because of problems handling these on Windows; these problems were a couple of years ago and we don't know if they exist today. We also do not have any visibility into how this change could impact downstream consumers that aren't part of buildbot, if any. So, I don't think this option is particularly well-scoped, and I think this is likely to be at least a moderate pain, with lots of room to break things due to all the moving parts. Upgrading zip on OSX seems like a less painful option in the short term, although we may want to investigate switching to tar files for other reasons.
Note that even with the backout of Armen's patch, this is still breaking all OSX debug jobs on cedar.
Depends on: 1032391
I am currently trying to wrap up my Q2 so I won't be able to get to this this week. But to help move things along, I filed bug 1032391. There are arguments both ways, but comment 30 here sums up why this might be best for the short term.
Flags: needinfo?(dustin)
Blocks: 945222
It looks like the zip upgrade on mac was successful in bug 1032391 - can you confirm this fixed the underlying problem, and this bug can now be closed? Thanks, Pete
Flags: needinfo?(jlund)
Flags: needinfo?(james)
Yes this works fine now.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(james)
Resolution: --- → FIXED
Thanks James!
Flags: needinfo?(jlund)