Closed Bug 1112262 Opened 10 years ago Closed 10 years ago

https://github.com/mozilla/gecko-projects is out-of-sync with hg project repos due to no free inodes on disk of vcssync2.srv.releng.usw2.mozilla.com

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kgrandon, Unassigned)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4235] )

Attachments

(1 file)

We're currently using cypress for gecko feature work, and I'd like to be able to access it from git. I could not find a branch under gecko-dev, so I'm wondering if I missed it or it's tracked somewhere else. I've found this, but it looks old, and I'm not sure if it's the correct repo to track: https://github.com/mozilla/gecko-projects/tree/cypress
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4235]
Correct - "disposable branches" (aka "twigs" & "project branches") are converted and pushed to the gecko-projects repository on github. It does look as if there are problems with that mirror -- cypress was only recently returned as a normal twig, and may have been over looked. Moving to the correct product & component for the git mirroring.
Component: Mercurial: hg.mozilla.org → Tools
OS: Mac OS X → All
Product: Developer Services → Release Engineering
Hardware: x86 → All
Blocks: spark
Pete: cypress is configured, but not converting - it seems some debugging is warranted before following the project branch reset procedure (<https://wiki.mozilla.org/ReleaseEngineering/VCSSync/HowTo#How_to_deal_with_project_branch_reset>)
Flags: needinfo?(pmoore)
So this is quite bizarre for several reasons.

The project branches are not syncing due to the following error, and have not been since 08:39 PT on 8 December 2014:

[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com vcs2vcs]$ cat /opt/vcs2vcs/projects.log
2014-12-17 05:43:01 pid-27007 Acquiring lock
2014-12-17 05:43:01 pid-27007 Updating mozharness
pulling from http://hg.mozilla.org/build/mozharness
abort: could not lock repository /opt/vcs2vcs/mozharness: No space left on device

However the disk seems fine:

[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com vcs2vcs]$ df -h /opt/vcs2vcs/mozharness/
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdj        99G   57G   37G  61% /opt
[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com vcs2vcs]$

The run script is:

[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$ cat /opt/vcs2vcs/run_projects.sh
#!/bin/bash
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
#
# This file is managed by puppet

set -e
cd /opt/vcs2vcs

exec > projects.log 2>&1

function log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') pid-$$ $*"
}

log "Acquiring lock"
lockfile -s60 -r5 projects.lock
trap "rm -f $PWD/projects.lock" EXIT

# Get mozharness updated / checked out and working
log "Updating mozharness"
(timeout 20 hg --cwd mozharness pull -u)

log "Running hg_git.py"
python mozharness/scripts/vcs-sync/vcs_sync.py -c mozharness/configs/vcs_sync/project-branches.py

# Touch our timestamp file so nagios can check if we're fresh
touch projects.stamp

log "Done"
[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$

So the curious parts are:

1) Why does the hg pull think there is "No space left on device"?
2) Why are we not getting nagios alerts?
3) If no projects are syncing for a week, how come we haven't heard about this?

I can probably fix this by blowing away the mozharness checkout and recloning, but it is not clear what caused this problem.

I have also checked that permissions are ok (i.e. mozharness is owned by vcs2vcs):

[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$ ls -ltrA /opt/vcs2vcs/
total 3750100
-rw-r-----  1 asasaki asasaki 3840072439 Oct  7  2013 initial3.tar.bz2
drwxr-xr-x  9 vcs2vcs vcs2vcs       4096 Oct 11  2013 git-mozharness
drwxr-x---  6 vcs2vcs vcs2vcs       4096 Oct 11  2013 build
-rwxr-x---  1 vcs2vcs vcs2vcs        789 Jan  7  2014 run_projects.sh
drwxrwxr-x 13 vcs2vcs vcs2vcs       4096 Dec  5 18:48 mozharness
-rw-r--r--  1 vcs2vcs vcs2vcs          0 Dec  8 08:30 projects.stamp
drwxrwxr-x  2 vcs2vcs vcs2vcs       4096 Dec  8 08:31 logs
-rw-r--r--  1 vcs2vcs vcs2vcs        229 Dec 17 05:48 projects.log

A manual update attempt resulted in the following:

[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$ hg -R /opt/vcs2vcs/mozharness pull -u
pulling from http://hg.mozilla.org/build/mozharness
searching for changes
abort: No space left on device: /opt/vcs2vcs/mozharness/.hg/journal.dirstate
[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$
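On question 2: run_projects.sh touches projects.stamp on success precisely so that a nagios freshness check can alert when the sync goes stale, and the stamp above was last touched on Dec 8, so a freshness check of the usual shape should have fired. The actual nagios check isn't shown in this bug; a hypothetical mtime-based check of that general kind might look like:

# hypothetical freshness check (not the real nagios plugin):
# go CRITICAL if the stamp file is more than 4 hours old
if [ -n "$(find /opt/vcs2vcs/projects.stamp -mmin +240)" ]; then
    echo "CRITICAL: gecko-projects vcs sync has not completed in over 4 hours"
    exit 2
fi
echo "OK: gecko-projects vcs sync is fresh"
exit 0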
Flags: needinfo?(pmoore)
A fresh clone exhibits the same problem:

[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$ hg clone -r production http://hg.mozilla.org/build/mozharness /opt/vcs2vcs/mozharness
abort: No space left on device: /opt/vcs2vcs/mozharness/.hg
[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$ df -h /opt/vcs2vcs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdj        99G   57G   37G  61% /opt
[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$

:/
Thanks bhearsum for the tip! Out of inodes...

[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$ df -i /opt/vcs2vcs
Filesystem      Inodes   IUsed IFree IUse% Mounted on
/dev/xvdj      6553600 6553599     1  100% /opt
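For the record, the kernel returns the same "No space left on device" (ENOSPC) error whether a filesystem is out of data blocks or out of inodes, which is why df -h looked healthy while every write failed. A minimal way to check both at once (just a restatement of the commands already used above, not from the original comment):

# block usage looked fine (61%), but inode usage was the real cause of ENOSPC
df -h /opt/vcs2vcs
df -i /opt/vcs2vcs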
So possible solutions I see at the moment:

1) Rebuild the partition with more inodes, and reinstall. It would make sense to do this in bug 927199 where this machine is being puppetised.
2) Blow stuff away that is no longer needed (no obvious candidates I see at the moment).
3) Split the gecko-projects vcs sync jobs across other machines.
4) Shrink the existing partition and create a new partition.
5) Since this is EC2 hosted, maybe attaching additional storage is possible.
See Also: → 927199
OK, I'm currently running a report to double-check why we have so many small files / such high inode usage, to make sure there is not some fundamental problem there. If all looks in order and that many inodes are really needed, I will create a new volume in usw2 with the same disk size (99GB), via https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Volumes:sort=desc:createTime, with double the number of inodes (13107200), rsync the data across, and then swap out the old volume for the new one.

I've temporarily disabled the vcs sync cron job for gecko projects until this is done. It hasn't run since December 8th due to this problem anyway.
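The actual report command isn't shown in the bug; for illustration, a minimal sketch of the kind of per-directory entry count that would produce a "top inode eaters" list like the one in the next comment (the path and the tail size are assumptions):

# count entries in every directory under /opt/vcs2vcs and list the largest
find /opt/vcs2vcs -xdev -type d -print0 \
  | while IFS= read -r -d '' d; do
      printf '%s %s\n' "$(ls -A "$d" | wc -l)" "$d"
    done \
  | sort -n | tail -20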
It turns out to be genuine usage, e.g. the report identified directories with over 10,000 files, such as:

[vcs2vcs@vcssync2.srv.releng.usw2.mozilla.com ~]$ ls /opt/vcs2vcs/build/conversion/project-branches/.git/objects/67 | wc -l
10097

So I'll proceed as proposed above by migrating to a new volume with more inodes.
Top inode eaters:

10017 /opt/vcs2vcs/build/conversion/project-branches/.git/objects/71
10021 /opt/vcs2vcs/build/conversion/project-branches/.git/objects/00
10022 /opt/vcs2vcs/build/conversion/project-branches/.git/objects/0c
10022 /opt/vcs2vcs/build/conversion/project-branches/.git/objects/25
10036 /opt/vcs2vcs/build/conversion/project-branches/.git/objects/bc
10060 /opt/vcs2vcs/build/conversion/project-branches/.git/objects/2c
10099 /opt/vcs2vcs/build/conversion/project-branches/.git/objects/67
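These are git's fan-out directories for loose objects, so the inode pressure comes from the conversion repository holding millions of unpacked objects. A hedged way to confirm that (not something run in this bug) is git's own object count; repacking with git gc would fold loose objects into packfiles and free most of those inodes, though that route wasn't taken here and may not be appropriate for a vcs-sync conversion repo:

# report loose ("count") vs packed ("in-pack") objects in the conversion repo
cd /opt/vcs2vcs/build/conversion/project-branches && git count-objects -v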
Summary: Git mirror for cypress branch → https://github.com/mozilla/gecko-projects is out-of-sync with hg project repos due to no free inodes on disk of vcssync2.srv.releng.usw2.mozilla.com
Created 100GB Volume vol-3ff24e2e (Magnetic, not encrypted) in us-west-2b and attached as /dev/sdg to instance i-b0d76287 (vcssync2.srv.releng.usw2.mozilla.com).
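This was done through the AWS web console; a roughly equivalent AWS CLI sketch, assuming the CLI were used instead (the volume ID below is a placeholder, since create-volume returns a new one), would be:

# create a 100 GB magnetic ("standard") volume in the instance's AZ and attach it
aws ec2 create-volume --size 100 --volume-type standard --availability-zone us-west-2b
aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-b0d76287 --device /dev/sdg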
bash-4.1# lsblk
NAME  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvde1 202:65   0   10G  0 disk /
xvdj  202:144  0  100G  0 disk /opt
xvdk  202:160  0  100G  0 disk
bash-4.1# mkfs.ext4 -N 13107200 /dev/xvdk
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
13107200 inodes, 26214400 blocks
1310720 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
800 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
bash-4.1#
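The -N flag asks mke2fs for an explicit inode count (13107200, double the old 6553600). A simple sanity check before cutting over, assuming the superblock is readable, is to read the count back from the new filesystem:

# confirm the new filesystem really has the requested number of inodes
tune2fs -l /dev/xvdk | grep -i 'inode count'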
bash-4.1# mkdir /opt_new
bash-4.1# ls -ltrd /opt*
drwxr-xr-x 4 root root 4096 Oct 11  2013 /opt
drwxr-xr-x 2 root root 4096 Dec 22 05:03 /opt_new
bash-4.1# sudo mount /dev/xvdk /opt_new/
bash-4.1# cat /etc/fstab
LABEL=root_dev /         ext4    defaults,noatime  1 1
none           /proc     proc    defaults          0 0
none           /sys      sysfs   defaults          0 0
none           /dev/pts  devpts  gid=5,mode=620    0 0
none           /dev/shm  tmpfs   defaults          0 0
/dev/xvdj      /opt      ext4    defaults,noatime  1 2
bash-4.1# ls /opt_new/
lost+found
bash-4.1# rsync -gloptruc /opt /opt_new
It looks like this rsync process could take a couple of days... It is running in a screen session as root.
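Two details of that rsync invocation are worth noting (my reading of the flags, not stated in the bug): the source path has no trailing slash, so the tree lands under /opt_new/opt rather than directly in /opt_new, which is why the files have to be moved up a level after the remount below; and -c makes any re-run verify every file by checksum rather than by size and mtime. A variant with a trailing slash on the source, which would copy the contents directly, would be:

# copy the contents of /opt straight into /opt_new (note the trailing slash)
rsync -gloptruc /opt/ /opt_new/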
The rsync has completed, and I remounted:

bash-4.1# mount -l
/dev/xvde1 on / type ext4 (rw,noatime) [root_dev]
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)
/dev/xvdj on /opt type ext4 (rw,noatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/dev/xvdk on /opt_new type ext4 (rw)
bash-4.1# umount /dev/xvdj
bash-4.1# umount /dev/xvdk
bash-4.1# cat /etc/fstab
LABEL=root_dev /         ext4    defaults,noatime  1 1
none           /proc     proc    defaults          0 0
none           /sys      sysfs   defaults          0 0
none           /dev/pts  devpts  gid=5,mode=620    0 0
none           /dev/shm  tmpfs   defaults          0 0
/dev/xvdj      /opt      ext4    defaults,noatime  1 2
bash-4.1# sed -i 's/\/dev\/xvdj/\/dev\/xvdk/' /etc/fstab
bash-4.1# mount -a
bash-4.1# mount -l
/dev/xvde1 on / type ext4 (rw,noatime) [root_dev]
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/dev/xvdk on /opt type ext4 (rw,noatime)
bash-4.1# cd /opt
bash-4.1# ls
lost+found  opt
bash-4.1# mv opt/vcs2vcs/ .
bash-4.1# rm -rf opt
bash-4.1# ls -ltrA
total 20
drwxr-xr-x 6 vcs2vcs root  4096 Dec 17 05:53 vcs2vcs
drwx------ 2 root    root 16384 Dec 22 03:12 lost+found

I also then su'd to the vcs2vcs user and re-enabled the crontab. I'm now watching the job to check all is ok.
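A final sanity check at this point, before relying on the re-enabled cron job, is simply to confirm the new /opt has inode headroom (the same checks used to diagnose the problem, now against the new volume):

# the new filesystem should now report plenty of free inodes
df -i /opt
df -h /opt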
I've also deleted /opt_new
All seems to be working. I've detached and deleted the previous 100GB volume, so now only the new volume is attached, and the old storage has been given back. I'll close this bug when we've had a successful run and all the project branches have been brought up-to-date.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: cypress
No longer blocks: spark
Component: Tools → General