Closed Bug 693192 Opened 10 years ago Closed 9 years ago

Nightly builds infrequently fail with "make: *** [uploadsymbols] Error 50"

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: nthomas)

References

Details

(Whiteboard: [buildduty][symbols])

Attachments

(2 files)

Last time I saw one of these, bhearsum looked at it briefly and said that unzip exiting with error 50 ought to mean "the disk is (or was) full during extraction", which ought not to be possible. Somehow, either we manage it, or we manage to fool unzip into thinking it is, or unzip lies.

https://tbpl.mozilla.org/php/getParsedLog.php?id=6744315&tree=Fx-Team
Currently there's 717G free on a 2T partition, and the nagios check has been in an OK state for the last 27 days.

(In reply to Phil Ringnalda (:philor) from comment #0)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6744315&tree=Fx-Team

[ffxbld@dm-symbolpush01 symbols_ffx]$ ls -l TestSTLWrappers/9E7CBDE9D06B303E91884DDE9F2B59230/TestSTLWrappers.dbg.gz
-rwxrwxr-x 1 ffxbld users 67161 Oct  9 07:48 TestSTLWrappers/9E7CBDE9D06B303E91884DDE9F2B59230/TestSTLWrappers.dbg.gz

so a subsequent build was able to create this file (original failure was ~4am on the 8th), and it's not very large.

Anyone got any ideas?
Something else for buildduty to keep an eye on.
Priority: -- → P3
Whiteboard: [buildduty][symbols]
ffxbld@dm-symbolpush01.m.o now has a cron job that looks up the free space on /mnt/netapp/breakpad/ every minute, and logs it to ~/disk_usage.log. The 690G free right now means I'm not very hopeful it'll find anything though.
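For reference, the crontab entry looks something like this; only the mount point and log file name come from the description above, the exact date/df formatting is an assumption:

  # run every minute: timestamp plus free space on the symbols mount
  * * * * * (date -u; df -h /mnt/netapp/breakpad/) >> $HOME/disk_usage.log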
Just for kicks, the clock on this VM is 12 and a half minutes slow.
(In reply to Phil Ringnalda (:philor) from comment #12)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=7226258&tree=Fx-Team

The upload started at Sat Nov 5 04:55:36 2011, and failed 1m 21s later. In the logging from comment #10 there's ~ 680G free at all times, although the sampling interval is 60 seconds. 

The slave was linux-ix-slave19. Unfortunately it's done another fx-team nightly since then, so we can't go poke around at the bits.
(In reply to Phil Ringnalda (:philor) from comment #17)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=7280300&tree=Fx-Team

mv-moz2-linux-ix-slave05 - upload between Nov 8 04:55:47 and 05:00:54. Log of disk usage looks fine.

extraction log:
  ...
  inflating: test_StatementCache/5DC87E5E7229151F93F05407F9DA91CB0/test_StatementCache.sym  
  inflating: TestParser/CC4E5DC62B9CD4B61F7C62022B380F640/TestParser.sym  
  inflating: TestParser/CC4E5DC62B9CD4B61F7C62022B380F640/TestParser.dbg.gz  
error:  cannot create test_statement_scoper/D8B9486FFD62DC7EF742B9D8E09EFB570/test_statement_scoper.sym
  inflating: test_statement_scoper/D8B9486FFD62DC7EF742B9D8E09EFB570/test_statement_scoper.dbg.gz  
  inflating: TestThreadPoolListener/B31B05BA9CFDCB68E61809EED5CC7EC90/TestThreadPoolListener.sym  
  inflating: TestThreadPoolListener/B31B05BA9CFDCB68E61809EED5CC7EC90/TestThreadPoolListener.dbg.gz  
  ...

Both files in test_statement_scoper/D*/ are 150K; total size after unpacking dist/bin/firefox-10.0a1.en-US.linux-i686.crashreporter-symbols-full.zip is 281M.
I pulled this data from the status db on the hunch there might be a problem if many builds upload symbols at the same time, but haven't had time to crunch it. A status of 2 means the build failed; times are all in UTC.
Duplicate of this bug: 705599
https://tbpl.mozilla.org/php/getParsedLog.php?id=7977975&tree=Mozilla-Inbound

Dunno if it's A Clue or just noise, but the error line in the log for this one was the mushed-together

(warning) cannot set timesmake: *** [uploadsymbols] Error 50
According to http://linux.die.net/man/1/unzip, error 50 means the disk was full during extraction.

nagios claims that the main mount point for dm-symbolpush01.mozilla.org has plenty of room, so there must be something else going on here.
I should have coffee before commenting on bugs...none of my previous comment is new information.
I think unzip is lying about out-of-disk. I'm looking at the source from unzip 6, and open_outfile returns the same failure if
- it can't allocate memory
- it can't move the unpacked file into the final location
- it can't remove some tmp file

In all of these cases the calling function receives the same return code from open_outfile, and responds by exiting with PK_DISK (50).

The latter cases should generate an error message to stderr, but the memory allocation failure doesn't. It's conceivable that on a box with 512MB RAM and 2GB swap we're running out of memory.
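If we wanted to rule the out-of-memory theory in or out, a check in the same style as the disk space logging could be added; this is hypothetical, not something that's currently set up:

  # run every minute: timestamp plus memory/swap usage in MB
  * * * * * (date -u; free -m) >> $HOME/mem_usage.log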
Is this buildstep already set to retry, or do we risk hammering dm-symbolpush01 even worse if we do this?
(In reply to Nick Thomas [:nthomas] from comment #19)
> Created attachment 573434 [details]
> list of recent make uploadsymbols steps
> 
> I pulled this data from the status db on the hunch there might be a problem
> if many builds upload symbols at the same time, but haven't had time to
> crunch it. Status of 2 is builds which have failed, times are all in UTC.

It looks to me like all the failures happen when there's more than one upload symbols step running at the same time. A recent check of the database reveals the same thing.

Would two zip files contain files that try to get extracted to the same path? It looks like maybe:
from http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2012-01-07-04-02-21-mozilla-inbound-android/mozilla-inbound-android-nightly-build5.txt.gz:
 extracting: PrimitiveTest/F7931A07995B6DAB65D1F722A5B96D480/PrimitiveTest.dbg.gz  
make: *** [uploadsymbols] Error 50

from http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2012-01-07-04-02-21-mozilla-inbound-android-xul/mozilla-inbound-android-xul-nightly-build4.txt.gz:
 extracting: PrimitiveTest/F7931A07995B6DAB65D1F722A5B96D480/PrimitiveTest.dbg.gz  
make: *** [uploadsymbols] Error 50
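If the collision theory is right, it should be possible to trigger the same "cannot create" failure by racing two unzips against the same target directory. A rough sketch (the paths and zip names are illustrative, and whether this actually reproduces the failure is unverified):

  mkdir -p /tmp/symbols_race && cd /tmp/symbols_race
  unzip -o /tmp/buildA-symbols.zip &   # e.g. the android nightly's symbols zip
  unzip -o /tmp/buildB-symbols.zip &   # e.g. the android-xul nightly's symbols zip
  wait
  # if both archives contain PrimitiveTest/<id>/PrimitiveTest.dbg.gz, one of
  # the unzips can fail with "cannot create ..." and exit with PK_DISK (50)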
...-mozilla-inbound-android/...
...-mozilla-inbound-android-xul/...

Look like different paths, but still possible :-)
(In reply to Justin Wood (:Callek) from comment #47)
> ...-mozilla-inbound-android/...
> ...-mozilla-inbound-android-xul/...
> 
> Look like different paths, but still possible :-)

I believe they're all extracted to the same root directory regardless of branch/builder names.
Right, they're both going to symbols_mob on the symbol server.

Presumably the only fix here is to stagger the Native and XUL nightly builds.
(In reply to Ted Mielczarek [:ted, :luser] from comment #49)
> Right, they're both going to symbols_mob on the symbol server.
> 
> Presumably the only fix here is to stagger the Native and XUL nightly builds.

That's not a great general solution, particularly since there's no guarantee that if you schedule them to start at different times they won't end up running upload symbols at the same time anyway.

Here are a few other ideas:
- Run 'make uploadsymbols' in a loop, up to N times with some random backoff, until it succeeds (rough sketch after this list).
- unzip into a temporary directory first and then use some other tool to copy into the final location
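
A rough sketch of the first idea, with an arbitrary attempt count and backoff window (not tested against the real buildstep):

  ok=0
  for attempt in 1 2 3 4 5; do
    if make uploadsymbols; then ok=1; break; fi
    sleep $(( RANDOM % 60 ))   # random backoff of up to ~60s between attempts
  done
  [ "$ok" -eq 1 ] || exit 1    # give up after 5 failed attempts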
(In reply to Chris AtLee [:catlee] from comment #50)
> - unzip into a temporary directory first and then use some other tool to
> copy into the final location

This is feasible. The upload script is pretty simple currently:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/tools/upload_symbols.sh
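
A minimal sketch of what the temp-dir approach could look like; this isn't the actual script, and the zip variable name and destination path are assumptions:

  tmpdir=$(mktemp -d)
  unzip -q "$SYMBOL_ZIP" -d "$tmpdir"            # unpack privately, no shared paths
  cp -R "$tmpdir"/. /mnt/netapp/breakpad/symbols_ffx/
  rm -rf "$tmpdir"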
(In reply to Chris AtLee [:catlee] from comment #50)
> (In reply to Ted Mielczarek [:ted, :luser] from comment #49)
> > Presumably the only fix here is to stagger the Native and XUL nightly builds.
> That's not a great general solution, particularly since there's no guarantee
> that if you schedule them to start at different times they won't end up
> running upload symbols at the same time anyway.

Some of the failures are regular linux builds too. The disk space win 
 
> - unzip into a temporary directory first and then use some other tool to
> copy into the final location

I like this too, but perhaps we can just switch it from 
  'unzip -o': overwrite existing files without prompting
to
  'unzip -n': never overwrite existing files.  If a file already exists, skip the
              extraction of that file without prompting
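
For illustration only (the real change is the attached patch, and the variable name here is made up), the extraction step in the upload script would go from something like

  unzip -o "$zipfile"     # overwrite existing files without prompting

to

  unzip -n "$zipfile"     # skip files that already exist, without prompting

so a second build extracting the same symbol file skips it instead of fighting over it.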
Sure. rs=me to make that change. Can't break it any more than it already is, right?
Attached patch: Swap to unzip -n
Famous last words! Carrying r+ from ted.

http://hg.mozilla.org/integration/mozilla-inbound/rev/366ca21f60e0
Attachment #587215 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/366ca21f60e0
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee: nobody → nthomas
Occurred again on inbound after this landed:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=8ffacd1ba6a8
https://tbpl.mozilla.org/php/getParsedLog.php?id=8443641&tree=Mozilla-Inbound
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Hmm, looks like the same error:
 error:  cannot create TestAutoPtr/D328362BBC4FB8F7747D7378A63683850/TestAutoPtr.dbg.gz
followed by
 make: *** [uploadsymbols] Error 50

That 'Android XUL mozilla-inbound nightly' job might have collided with the m-i or birch native Android nightly build, based on the times in the db. Both of those completed OK.

Maybe there was a race there with birch, so we should let the unzip -n change propagate around the branches a bit and see if we still have issues. Incidentally, bug 717029 will disable the birch nightlies which mobile had enabled.
Comment on attachment 587215 [details] [diff] [review]
Swap to unzip -n

[Approval Request Comment]
Regression caused by (bug #): failing to upload symbols for some Linux & Android nightly builds, likely due to more branches doing nightlies, which causes the job to stop before uploading the bits. If someone notices they may rebuild it; otherwise we go a day without a nightly
User impact if declined: Missing nightlies
Testing completed (on m-c, etc.): On m-c since Jan 10, no regressions known
Risk to taking this patch (and alternatives if risky): this is a build time change that doesn't affect the bits users see
Attachment #587215 - Flags: approval-mozilla-aurora?
Comment on attachment 587215 [details] [diff] [review]
Swap to unzip -n

[Triage Comment]
Approved for Aurora.
Attachment #587215 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
http://hg.mozilla.org/releases/mozilla-aurora/rev/8456e8edda6a

I looked in the buildbot db for other failures. Apart from the small number on aurora already mentioned here, there is one on elm on the 18th. That branch hasn't merged the fix from m-c yet, so neither place had the unzip -n fix when they failed. So let's call this fixed.

bbondy (and other branch owners!), if failing nightlies causes you problems you can merge from m-c or transplant the fix from this bug.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering