Bug 985898
Opened 11 years ago
Closed 11 years ago
Investigate why more than two thirds of try builds are recloning the shared repo
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: glandium, Unassigned)
Attachments
(1 file, 1 obsolete file)
1.31 MB, text/plain
Today, I did a bunch of try builds, and one thing that struck me is that, on one particular push I did, which failed at the very beginning of make -f client.mk, every build took more than 15 minutes to fail. That seemed like a lot. So I looked at the logs, and they were all re-unbundling the try bundle, which alone takes about 10 minutes (the remaining time is mock initialization, taking about 5 minutes).
Reading the logs and the hgtool.py code, it would seem the shared directory doesn't even exist when that happens, even on slaves that presumably are not new (but maybe I'm misreading the slave history).
Now, some scary numbers. I looked at the last 23832 build logs that run hgtool with the try.hg bundle. 9110 of them unbundle it. That's 38.2%.
Worse, if I look at Linux builds only:
6565 unbundles in 9839 logs (66.7%)
and Android builds:
2439 unbundles in 3399 logs (71.8%)
That totals 9004 unbundles, so only 106 unbundles happened in the remaining 10594 logs for other platforms.
Can someone please look into this?
Example log: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-a050000071ce/try-android-debug/try-android-debug-bm78-try1-build157.txt.gz ; note that if /builds/hg-shared/try existed, then according to the hgtool.py code we should see an "hg path default" run in /builds/hg-shared/try after the "/builds/slave/try-and-d-00000000000000000000/build doesn't appear to be a valid hg directory; clobbering" message. The slave had been idle for about 20 minutes before starting this build, after failing a b2g one.
Another log: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/masayuki@d-toybox.com-b892f88012af/try-linux64/try-linux64-bm78-try1-build224.txt.gz
The slave completed a successful valgrind build about 4 minutes before starting this one.
(I can't seem to get slave history reliably on slave health, btw; I would have correlated more slaves and logs if I could.)
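For anyone wanting to re-check the figures in the description, here is a minimal sketch that recomputes the percentages and the "other platforms" remainder. Only the raw counts come from the description; the dictionary layout and the helper name unbundle_rate are made up for illustration.

```python
# Counts copied from the description: (logs that unbundle, total logs examined).
counts = {
    "all": (9110, 23832),
    "linux": (6565, 9839),
    "android": (2439, 3399),
}


def unbundle_rate(unbundles, logs):
    """Percentage of logs in which the try bundle was unbundled."""
    return 100.0 * unbundles / logs


# Per-platform rates, rounded to one decimal place as in the description.
rates = {name: round(unbundle_rate(u, n), 1) for name, (u, n) in counts.items()}

# Whatever is neither Linux nor Android falls into "other platforms".
other_unbundles = counts["all"][0] - counts["linux"][0] - counts["android"][0]
other_logs = counts["all"][1] - counts["linux"][1] - counts["android"][1]

print(rates)
print(other_unbundles, "unbundles in", other_logs, "other logs")
```

The remainder works out to roughly 1% of the non-Linux, non-Android logs, which is what makes the Linux and Android rates stand out.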
Reporter
Comment 1•11 years ago
If you can correlate with this list... I was only able to see slave histories for 3 logs (tried to look at about 10). 2 of those builds were after a failed b2g build, one after a successful valgrind build.
Reporter
Comment 2•11 years ago
Oops, wrong file. Here's the right one, hopefully.
Attachment #8394067 - Attachment is obsolete: true
Comment 3•11 years ago
My guess is most of these are going to be spot nodes, so slave history doesn't mean much.
Rail, are we populating /builds/hg-shared on the AMI used for try spot nodes? If not, then we should!
Flags: needinfo?(rail)
Comment 4•11 years ago
(In reply to Chris AtLee [:catlee] from comment #3)
> My guess is most of these are going to be spot nodes, so slave history
> doesn't mean much.
>
> Rail, are we populating /builds/hg-shared on the AMI used for try spot
> nodes? If not, then we should!
We snapshot an existing try on-demand instance without deleting the files under hg-share and git-share. There is no explicit step to pre-populate those directories.
Flags: needinfo?(rail)
Reporter
Comment 5•11 years ago
(In reply to Chris AtLee [:catlee] from comment #3)
> My guess is most of these are going to be spot nodes, so slave history
> doesn't mean much.
>
> Rail, are we populating /builds/hg-shared on the AMI used for try spot
> nodes? If not, then we should!
If /builds/hg-shared is not in the AMI, does 70% of the builds lacking /builds/hg-shared mean that 70% of the builds are done on fresh slaves?!? That seems awfully high.
Comment 6•11 years ago
So I just checked the AMI, and it does have /builds/hg-shared/try present and populated.
Comment 7•11 years ago
So I just watched a fresh slave do this:
python tools/buildfarm/maintenance/purge_builds.py -s 12 -n info -n 'rel-*:45d' -n 'tb-rel-*:45d' .. /mock/users/cltbld/home/cltbld/build
in dir /builds/slave/try-and-0000000000000000000000/. (timeout 3600 secs)
watching logfiles {}
argv: ['python', 'tools/buildfarm/maintenance/purge_builds.py', '-s', '12', '-n', 'info', '-n', 'rel-*:45d', '-n', 'tb-rel-*:45d', '..', '/mock/users/cltbld/home/cltbld/build']
environment:
CCACHE_COMPRESS=1
CCACHE_DIR=/builds/ccache
CCACHE_HASHDIR=
CCACHE_UMASK=002
CVS_RSH=ssh
DISPLAY=:2
G_BROKEN_FILENAMES=1
HG_SHARE_BASE_DIR=/builds/hg-shared
HISTCONTROL=ignoredups
HISTSIZE=1000
HOME=/home/cltbld
HOSTNAME=try-linux64-spot-095.try.releng.use1.mozilla.com
LANG=en_US.UTF-8
LC_ALL=C
LESSOPEN=|/usr/bin/lesspipe.sh %s
LOGNAME=cltbld
MAIL=/var/spool/mail/cltbld
MOZ_CRASHREPORTER_NO_REPORT=1
MOZ_OBJDIR=obj-firefox
MOZ_SIGN_CMD=python /builds/slave/try-and-0000000000000000000000/tools/release/signing/signtool.py --cachedir /builds/slave/try-and-0000000000000000000000/signing_cache -t /builds/slave/try-and-0000000000000000000000/token -n /builds/slave/try-and-0000000000000000000000/nonce -c /builds/slave/try-and-0000000000000000000000/tools/release/signing/host.cert -H signing4.srv.releng.scl3.mozilla.com:9110 -H signing5.srv.releng.scl3.mozilla.com:9110 -H signing6.srv.releng.scl3.mozilla.com:9110
PATH=/opt/local/bin:/tools/python/bin:/tools/buildbot/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/
POST_SYMBOL_UPLOAD_CMD=/usr/local/bin/post-symbol-upload.py
PWD=/builds/slave/try-and-0000000000000000000000
SHELL=/bin/bash
SHIP_LICENSED_FONTS=1
SHLVL=1
SYMBOL_SERVER_HOST=symbolpush.mozilla.org
SYMBOL_SERVER_PATH=/mnt/netapp/breakpad/symbols_mob/
SYMBOL_SERVER_SSH_KEY=/home/mock_mozilla/.ssh/ffxbld_dsa
SYMBOL_SERVER_USER=ffxbld
TERM=linux
TINDERBOX_OUTPUT=1
TMOUT=86400
USER=cltbld
_=/tools/buildbot/bin/python
using PTY: False
Deleting /builds/hg-shared/try/.hg
Cleaning up /builds/hg-shared/try
70.49 GB of space available
program finished with exit code 0
elapsedTime=8.439207
Comment 8•11 years ago
/builds/hg-shared/try/.hg was dated Mar 31 11:22, and we had plenty of disk space free...
Comment 9•11 years ago
So purge_builds doesn't know that /builds/slave and /builds are on different partitions.
It's trying to free up 12GB of space, and sees that /builds only has 9.2GB free. /builds/slave has 71GB free.
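To illustrate the kind of check that avoids this, here is a rough sketch, assuming Python 3 and a POSIX filesystem. It is not the purge_builds.py code: the function names free_gb, on_same_filesystem, and worth_purging, and the policy itself, are hypothetical. The idea is to measure free space on the filesystem that actually holds the build directory, and to delete from another directory only when it lives on that same filesystem.

```python
import os
import shutil

GB = 1024 ** 3


def free_gb(path):
    """Free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / GB


def on_same_filesystem(path_a, path_b):
    """True when both paths live on the same device (partition)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def worth_purging(target_dir, candidate_dir, wanted_gb):
    """Hypothetical policy: purge candidate_dir only if that can help target_dir.

    Deleting files on a different partition cannot free space for target_dir,
    so the candidate is skipped unless both share a device.
    """
    if free_gb(target_dir) >= wanted_gb:
        return False  # enough room already, nothing to delete
    return on_same_filesystem(target_dir, candidate_dir)
```

With the numbers above, a call such as worth_purging("/builds/slave", "/builds/hg-shared/try", 12) would return False under this policy, because /builds/slave already has far more than 12GB free.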
Updated•11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED