Closed Bug 985898 Opened 8 years ago Closed 8 years ago

Investigate why more than two thirds of try builds are recloning the shared repo

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Unassigned)

Attachments

(1 file, 1 obsolete file)

Today I did a bunch of try builds, and one thing struck me: on one particular push, which failed at the very beginning of make -f client.mk, every build took more than 15 minutes to fail. That seemed like a lot. So I looked at the logs, and they were all re-unbundling the try bundle, which alone takes about 10 minutes (the remaining 5 minutes or so is mock initialization).

Reading the logs and the hgtool.py code, it would seem the shared directory doesn't even exist when that happens! Even on slaves that presumably are not new (but maybe I'm misreading the slave history).

Now, some scary numbers. I looked at the last 23832 build logs that run hgtool with the try.hg bundle. 9110 of them unbundle it. That's 38.2%.
Worse, if I look at linux builds only:
  6565 unbundles in 9839 logs (66.7%)
and android builds:
  2439 unbundles in 3399 logs (71.8%)
That totals 9004 unbundles, so only 106 unbundles happened in the remaining 10594 logs for other platforms.
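
As a sanity check, the percentages above can be re-derived from the raw counts. This is only a small Python sketch; the dictionary keys and the pct helper are made-up names, and the numbers are the ones quoted in this comment.

```python
# (logs that unbundle, total logs running hgtool with the try.hg bundle),
# copied from the figures quoted in this comment.
counts = {
    "all": (9110, 23832),
    "linux": (6565, 9839),
    "android": (2439, 3399),
}


def pct(unbundles, logs):
    """Share of logs that unbundle, as a percentage rounded to one decimal."""
    return round(100.0 * unbundles / logs, 1)


rates = {name: pct(u, n) for name, (u, n) in counts.items()}

# Everything that is neither linux nor android.
other_unbundles = counts["all"][0] - counts["linux"][0] - counts["android"][0]
other_logs = counts["all"][1] - counts["linux"][1] - counts["android"][1]

for name, rate in rates.items():
    print("%-8s %5.1f%%" % (name, rate))
print("other    %d unbundles in %d logs" % (other_unbundles, other_logs))
```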

Can someone please look into this?

Example log: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-a050000071ce/try-android-debug/try-android-debug-bm78-try1-build157.txt.gz
Note that if /builds/hg-shared/try existed, then according to the hgtool.py code we should see an "hg path default" run in /builds/hg-shared/try after the "/builds/slave/try-and-d-00000000000000000000/build doesn't appear to be a valid hg directory; clobbering" message. The slave was idle for about 20 minutes before starting this build, after failing a b2g one.
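
The check described above can be sketched as a small log scan. This is a rough Python sketch, not anything from hgtool.py: the function name is hypothetical, and it assumes the command appears in the log literally as "hg path default", which may not match the exact log formatting.

```python
# Strings taken from the log discussion above; their exact rendering in the
# build log is an assumption.
CLOBBER_MARKER = "doesn't appear to be a valid hg directory; clobbering"
SHARE_PROBE = "hg path default"


def shared_repo_probed(log_text):
    """Return True if 'hg path default' appears after the clobber message,
    False if the clobber message is present but the probe never follows it,
    and None if the log has no clobber message at all."""
    start = log_text.find(CLOBBER_MARKER)
    if start == -1:
        return None
    return SHARE_PROBE in log_text[start + len(CLOBBER_MARKER):]
```

A result of False would match what the example log shows: the build directory gets clobbered, but the shared repository is never queried.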

Another log: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/masayuki@d-toybox.com-b892f88012af/try-linux64/try-linux64-bm78-try1-build224.txt.gz
The slave completed a successful valgrind build about 4 minutes before starting this one.

(I can't seem to get slave history reliably on slave health, btw; I would have looked at more slave/log correlations if I could.)
If you can correlate with this list... I was only able to see slave histories for 3 logs (tried to look at about 10). 2 of those builds were after a failed b2g build, one after a successful valgrind build.
Oops, wrong file. Here's the right one, hopefully.
Attachment #8394067 - Attachment is obsolete: true
My guess is most of these are going to be spot nodes, so slave history doesn't mean much.

Rail, are we populating /builds/hg-shared on the AMI used for try spot nodes? If not, then we should!
Flags: needinfo?(rail)
(In reply to Chris AtLee [:catlee] from comment #3)
> My guess is most of these are going to be spot nodes, so slave history
> doesn't mean much.
> 
> Rail, are we populating /builds/hg-shared on the AMI used for try spot
> nodes? If not, then we should!

We snapshot existing try on-demand without deleting the files under hg-share and git-share. There is no explicit step to pre-populate those directories.
Flags: needinfo?(rail)
(In reply to Chris AtLee [:catlee] from comment #3)
> My guess is most of these are going to be spot nodes, so slave history
> doesn't mean much.
> 
> Rail, are we populating /builds/hg-shared on the AMI used for try spot
> nodes? If not, then we should!

If /builds/hg-shared is not in the AMI, then 70% of builds lacking /builds/hg-shared would mean that 70% of builds run on fresh slaves?!? That seems awfully high.
See Also: → 990344
So I just checked the AMI, and it does have /builds/hg-shared/try present and populated.
So I just watched a fresh slave do this:

python tools/buildfarm/maintenance/purge_builds.py -s 12 -n info -n 'rel-*:45d' -n 'tb-rel-*:45d' .. /mock/users/cltbld/home/cltbld/build
 in dir /builds/slave/try-and-0000000000000000000000/. (timeout 3600 secs)
 watching logfiles {}
 argv: ['python', 'tools/buildfarm/maintenance/purge_builds.py', '-s', '12', '-n', 'info', '-n', 'rel-*:45d', '-n', 'tb-rel-*:45d', '..', '/mock/users/cltbld/home/cltbld/build']
 environment:
  CCACHE_COMPRESS=1
  CCACHE_DIR=/builds/ccache
  CCACHE_HASHDIR=
  CCACHE_UMASK=002
  CVS_RSH=ssh
  DISPLAY=:2
  G_BROKEN_FILENAMES=1
  HG_SHARE_BASE_DIR=/builds/hg-shared
  HISTCONTROL=ignoredups
  HISTSIZE=1000
  HOME=/home/cltbld
  HOSTNAME=try-linux64-spot-095.try.releng.use1.mozilla.com
  LANG=en_US.UTF-8
  LC_ALL=C
  LESSOPEN=|/usr/bin/lesspipe.sh %s
  LOGNAME=cltbld
  MAIL=/var/spool/mail/cltbld
  MOZ_CRASHREPORTER_NO_REPORT=1
  MOZ_OBJDIR=obj-firefox
  MOZ_SIGN_CMD=python /builds/slave/try-and-0000000000000000000000/tools/release/signing/signtool.py --cachedir /builds/slave/try-and-0000000000000000000000/signing_cache -t /builds/slave/try-and-0000000000000000000000/token -n /builds/slave/try-and-0000000000000000000000/nonce -c /builds/slave/try-and-0000000000000000000000/tools/release/signing/host.cert -H signing4.srv.releng.scl3.mozilla.com:9110 -H signing5.srv.releng.scl3.mozilla.com:9110 -H signing6.srv.releng.scl3.mozilla.com:9110
  PATH=/opt/local/bin:/tools/python/bin:/tools/buildbot/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/
  POST_SYMBOL_UPLOAD_CMD=/usr/local/bin/post-symbol-upload.py
  PWD=/builds/slave/try-and-0000000000000000000000
  SHELL=/bin/bash
  SHIP_LICENSED_FONTS=1
  SHLVL=1
  SYMBOL_SERVER_HOST=symbolpush.mozilla.org
  SYMBOL_SERVER_PATH=/mnt/netapp/breakpad/symbols_mob/
  SYMBOL_SERVER_SSH_KEY=/home/mock_mozilla/.ssh/ffxbld_dsa
  SYMBOL_SERVER_USER=ffxbld
  TERM=linux
  TINDERBOX_OUTPUT=1
  TMOUT=86400
  USER=cltbld
  _=/tools/buildbot/bin/python
 using PTY: False
Deleting /builds/hg-shared/try/.hg
Cleaning up /builds/hg-shared/try
70.49 GB of space available
program finished with exit code 0
elapsedTime=8.439207
/builds/hg-shared/try/.hg was dated Mar 31 11:22, and we had plenty of disk space free...
So purge_builds doesn't know that /builds/slave and /builds are on different partitions.

It's trying to free up 12GB of space, and sees that /builds only has 9.2GB free. /builds/slave has 71GB free.
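
A minimal sketch of a partition-aware check, in Python. This is not the purge_builds.py code, nor the fix tracked in bug 991230; the function names are hypothetical. The idea is that deleting a directory only frees space for the build if both sit on the same filesystem.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """True if both paths live on the same mounted filesystem (same device id)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def free_gb(path):
    """Free space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024.0 ** 3


def should_purge(candidate, build_dir, wanted_gb):
    """Deleting `candidate` only helps if it shares a filesystem with
    `build_dir` and that filesystem has less than `wanted_gb` free."""
    if not same_filesystem(candidate, build_dir):
        return False
    return free_gb(build_dir) < wanted_gb
```

With the layout seen here, should_purge("/builds/hg-shared/try", "/builds/slave", 12) would return False, because the two paths are on different partitions and /builds/slave already has 71GB free.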
Depends on: 991230
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED