Closed
Bug 693192
Opened 14 years ago
Closed 14 years ago
Nightly builds infrequently fail with "make: *** [uploadsymbols] Error 50"
Categories
(Release Engineering :: General, defect, P3)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: nthomas)
References
Details
(Whiteboard: [buildduty][symbols])
Attachments
(2 files)
|
11.53 KB,
text/plain
|
Details | |
|
1.05 KB,
patch
|
nthomas
:
review+
akeybl
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
Last time I saw one of these, bhearsum looked at it briefly, and said that unzip exiting with error 50 ought to be "the disk is (or was) full during extraction" which ought to not be possible. Somehow, either we manage it, or manage to fool unzip into thinking it is, or unzip lies.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6744315&tree=Fx-Team
| Reporter | ||
Comment 1•14 years ago
|
||
| Assignee | ||
Comment 2•14 years ago
|
||
Currently there's 717G free on a 2T partition, and the nagios check has been in an OK state for the last 27 days.
(In reply to Phil Ringnalda (:philor) from comment #0)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6744315&tree=Fx-Team
[ffxbld@dm-symbolpush01 symbols_ffx]$ ls -l TestSTLWrappers/9E7CBDE9D06B303E91884DDE9F2B59230/TestSTLWrappers.dbg.gz
-rwxrwxr-x 1 ffxbld users 67161 Oct 9 07:48 TestSTLWrappers/9E7CBDE9D06B303E91884DDE9F2B59230/TestSTLWrappers.dbg.gz
so a subsequent build was able to create this file (original failure was ~4am on the 8th), and it's not very large.
Anyone got any ideas ?
| Reporter | ||
Comment 3•14 years ago
|
||
| Reporter | ||
Comment 4•14 years ago
|
||
Comment 5•14 years ago
|
||
Something else for buildduty to keep an eye on.
Priority: -- → P3
Whiteboard: [buildduty][symbols]
| Reporter | ||
Comment 6•14 years ago
|
||
| Reporter | ||
Comment 7•14 years ago
|
||
| Reporter | ||
Comment 8•14 years ago
|
||
| Reporter | ||
Comment 9•14 years ago
|
||
| Assignee | ||
Comment 10•14 years ago
|
||
ffxbld@dm-symbolpush01.m.o now has a cron job that looks up the free space on /mnt/netapp/breakpad/ every minutes, and logs it to ~/disk_usage.log. The 690G free right now means I'm not very hopeful it'll find anything though.
| Assignee | ||
Comment 11•14 years ago
|
||
Just for kicks, the clock on this VM is 12 and a half minutes slow.
| Reporter | ||
Comment 12•14 years ago
|
||
| Assignee | ||
Comment 13•14 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #12)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=7226258&tree=Fx-Team
The upload started at Sat Nov 5 04:55:36 2011, and failed 1m 21s later. In the logging from comment #10 there's ~ 680G free at all times, although the sampling interval is 60 seconds.
The slave was linux-ix-slave19. Unfortunately it's done another fx-team nightly since so we can't go poke around at the bits.
| Reporter | ||
Comment 14•14 years ago
|
||
| Reporter | ||
Comment 15•14 years ago
|
||
| Reporter | ||
Comment 16•14 years ago
|
||
| Reporter | ||
Comment 17•14 years ago
|
||
| Assignee | ||
Comment 18•14 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #17)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=7280300&tree=Fx-Team
mv-moz2-linux-ix-slave05 - upload between Nov 8 04:55:47 and 05:00:54. Log of disk usage looks fine.
extraction log:
...
inflating: test_StatementCache/5DC87E5E7229151F93F05407F9DA91CB0/test_StatementCache.sym
inflating: TestParser/CC4E5DC62B9CD4B61F7C62022B380F640/TestParser.sym
inflating: TestParser/CC4E5DC62B9CD4B61F7C62022B380F640/TestParser.dbg.gz
error: cannot create test_statement_scoper/D8B9486FFD62DC7EF742B9D8E09EFB570/test_statement_scoper.sym
inflating: test_statement_scoper/D8B9486FFD62DC7EF742B9D8E09EFB570/test_statement_scoper.dbg.gz
inflating: TestThreadPoolListener/B31B05BA9CFDCB68E61809EED5CC7EC90/TestThreadPoolListener.sym
inflating: TestThreadPoolListener/B31B05BA9CFDCB68E61809EED5CC7EC90/TestThreadPoolListener.dbg.gz
...
Both files in test_statement_scoper/D*/ are 150K, total size after unpacking dist/bin/firefox-10.0a1.en-US.linux-i686.crashreporter-symbols-full.zip is 281M.
| Assignee | ||
Comment 19•14 years ago
|
||
I pulled this data from the status db on the hunch there might be a problem if many builds upload symbols at the same time, but haven't had time to crunch it. Status of 2 is builds which have failed, times are all in UTC.
| Reporter | ||
Comment 20•14 years ago
|
||
Comment 22•14 years ago
|
||
From the duplicate bug I just filed & then duped:
https://tbpl.mozilla.org/php/getParsedLog.php?id=7600705&tree=Mozilla-Inbound
| Reporter | ||
Comment 23•14 years ago
|
||
| Reporter | ||
Comment 24•14 years ago
|
||
| Reporter | ||
Comment 25•14 years ago
|
||
| Reporter | ||
Comment 26•14 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=7977975&tree=Mozilla-Inbound
Dunno if it's An Clue or just noise, but the error line in the log for this one was the mushed-together
(warning) cannot set timesmake: *** [uploadsymbols] Error 50
| Reporter | ||
Comment 27•14 years ago
|
||
| Reporter | ||
Comment 28•14 years ago
|
||
Comment 29•14 years ago
|
||
| Reporter | ||
Comment 30•14 years ago
|
||
Comment 31•14 years ago
|
||
| Reporter | ||
Comment 32•14 years ago
|
||
| Reporter | ||
Comment 33•14 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=8231223&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=8231227&tree=Firefox
And 2011-12-30 wins the coveted "Take Out Both Android Nightlies Award"!
| Reporter | ||
Comment 34•14 years ago
|
||
Whoa, the "Take Out Both Android Nightlies On Both Trunk And Aurora Award"!
https://tbpl.mozilla.org/php/getParsedLog.php?id=8232194&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=8232188&tree=Mozilla-Aurora
| Reporter | ||
Comment 35•14 years ago
|
||
Comment 36•14 years ago
|
||
according to http://linux.die.net/man/1/unzip, error 50 means the disk was full during extraction.
nagios claims that the main mount point for dm-symbolpush01.mozilla.org has plenty of room, so there must be something else going on here.
Comment 37•14 years ago
|
||
I should have coffee before commenting on bugs...none of my previous comment is new information.
Comment 38•14 years ago
|
||
I think unzip is lying about out-of-disk. I'm looking at the source from unzip 6, and open_outfile returns the same failure if
- it can't allocate memory
- it can't move the unpacked file into the final location
- it can't remove some tmp file
In all of these cases the calling function receives the same return code from open_outfile, and responds by exiting with PK_DISK (5).
The latter cases should generate an error message to stderr, but the memory allocation failure doesn't. It's conceivable that on a box with 512MB RAM and 2GB swap that we're running out of memory.
| Reporter | ||
Comment 39•14 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=8262753&tree=Mozilla-Inbound (which, like comment 35 as well as 34/33, is Android).
| Reporter | ||
Comment 40•14 years ago
|
||
Comment 41•14 years ago
|
||
Is this buildstep already set to retry, or do we risk hammering dm-symbolpush01 even worse if we do this?
| Reporter | ||
Comment 42•14 years ago
|
||
| Reporter | ||
Comment 43•14 years ago
|
||
| Reporter | ||
Comment 44•14 years ago
|
||
| Reporter | ||
Comment 45•14 years ago
|
||
Comment 46•14 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #19)
> Created attachment 573434 [details]
> list of recent make uploadsymbols steps
>
> I pulled this data from the status db on the hunch there might be a problem
> if many builds upload symbols at the same time, but haven't had time to
> crunch it. Status of 2 is builds which have failed, times are all in UTC.
It looks to me like all cases of failures happen when there's more than one upload symbols running at the same time. A recent check of the database reveals the same thing.
Would two zip files contain files that try to get extracted to the same path? It looks like maybe:
from http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2012-01-07-04-02-21-mozilla-inbound-android/mozilla-inbound-android-nightly-build5.txt.gz:
extracting: PrimitiveTest/F7931A07995B6DAB65D1F722A5B96D480/PrimitiveTest.dbg.gz
make: *** [uploadsymbols] Error 50
from http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2012-01-07-04-02-21-mozilla-inbound-android-xul/mozilla-inbound-android-xul-nightly-build4.txt.gz:
extracting: PrimitiveTest/F7931A07995B6DAB65D1F722A5B96D480/PrimitiveTest.dbg.gz
make: *** [uploadsymbols] Error 50
Comment 47•14 years ago
|
||
...-mozilla-inbound-andoid/...
...-mozilla-inbound-android-xul/...
Look like different paths, but still possible :-)
Comment 48•14 years ago
|
||
(In reply to Justin Wood (:Callek) from comment #47)
> ...-mozilla-inbound-andoid/...
> ...-mozilla-inbound-android-xul/...
>
> Look like different paths, but still possible :-)
I believe they're all extracted to the same root directory regardless of branch/builder names.
Comment 49•14 years ago
|
||
Right, they're both going to symbols_mob on the symbol server.
Presumably the only fix here is to stagger the Native and XUL nightly builds.
Comment 50•14 years ago
|
||
(In reply to Ted Mielczarek [:ted, :luser] from comment #49)
> Right, they're both going to symbols_mob on the symbol server.
>
> Presumably the only fix here is to stagger the Native and XUL nightly builds.
That's not a great general solution, particularly since there's no guarantee that if you schedule them to start at different times they won't end up running upload symbols at the same time anyway.
Here's a few other ideas:
- Run 'make uploadsymbols' in a loop N times until it works with some random backoff.
- unzip into a temporary directory first and then use some other tool to copy into the final location
Comment 51•14 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #50)
> - unzip into a temporary directory first and then use some other tool to
> copy into the final location
This is feasible. The upload script is pretty simple currently:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/tools/upload_symbols.sh
| Assignee | ||
Comment 52•14 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #50)
> (In reply to Ted Mielczarek [:ted, :luser] from comment #49)
> > Presumably the only fix here is to stagger the Native and XUL nightly builds.
> That's not a great general solution, particularly since there's no guarantee
> that if you schedule them to start at different times they won't end up
> running upload symbols at the same time anyway.
Some of the failures are regular linux builds too. The disk space win
> - unzip into a temporary directory first and then use some other tool to
> copy into the final location
I like this too, but perhaps we can just switch it from
'unzip -o': overwrite existing files without prompting
to
'unzip -n': never overwrite existing files. If a file already exists, skip the
extraction of that file without prompting
Comment 53•14 years ago
|
||
Sure. rs=me to make that change. Can't break it any more than it already is, right?
| Assignee | ||
Comment 54•14 years ago
|
||
Famous last words! Carrying r+ from ted.
http://hg.mozilla.org/integration/mozilla-inbound/rev/366ca21f60e0
Attachment #587215 -
Flags: review+
Comment 55•14 years ago
|
||
Comment 56•14 years ago
|
||
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•14 years ago
|
Assignee: nobody → nrthomas
Comment 57•14 years ago
|
||
Occurred again on inbound after this landed:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=8ffacd1ba6a8
https://tbpl.mozilla.org/php/getParsedLog.php?id=8443641&tree=Mozilla-Inbound
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
| Assignee | ||
Comment 58•14 years ago
|
||
Hmm, looks like the same error:
error: cannot create TestAutoPtr/D328362BBC4FB8F7747D7378A63683850/TestAutoPtr.dbg.gz
to
make: *** [uploadsymbols] Error 50
That 'Android XUL mozilla-inbound nightly' job might have collided with the m-i or birch native android build nightly, based on the times in the db. Both of those completed OK.
Maybe there was a race there with birch, so we should let the unzip -n change propagate around the branches a bit and see if we still have issues. Incidentally, bug 717029 will disable the birch nightlies which mobile had enabled.
| Reporter | ||
Comment 59•14 years ago
|
||
| Reporter | ||
Comment 60•14 years ago
|
||
| Assignee | ||
Comment 61•14 years ago
|
||
Comment on attachment 587215 [details] [diff] [review]
Swap to unzip -n
[Approval Request Comment]
Regression caused by (bug #): failing to upload symbols for some Linux & Android nightly builds, likely due to more branches doing nightlies, which causes the job to stop before uploading the bits. If someone notices they may rebuild it, otherwise we go a day without a nightly
User impact if declined: Missing nightlies
Testing completed (on m-c, etc.): On m-c since Jan 10, no regressions known
Risk to taking this patch (and alternatives if risky): this is a build time change that doesn't affect the bits users see
Attachment #587215 -
Flags: approval-mozilla-aurora?
| Reporter | ||
Comment 62•14 years ago
|
||
Comment 63•14 years ago
|
||
Comment on attachment 587215 [details] [diff] [review]
Swap to unzip -n
[Triage Comment]
Approved for Aurora.
Attachment #587215 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
| Assignee | ||
Comment 64•14 years ago
|
||
http://hg.mozilla.org/releases/mozilla-aurora/rev/8456e8edda6a
I looked in the buildbot db for other failures. Apart from the small number on aurora already mentioned here, there is one on elm on the 18th. That branch hasn't merged the fix from m-c yet, so neither place had the unzip -n fix when they failed. So lets call this fixed.
bbondy (and other branch owners!), if failing nightlies causes you problems you can merge from m-c or transplant the fix from this bug.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Updated•12 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•