Closed
Bug 1090885
Opened 10 years ago
Closed 10 years ago
B2G Bumper nagios alerts in #buildduty /builds/b2g_bumper/b2g_bumper.stamp is CRITICAL: FILE_AGE CRITICAL: /builds/b2g_bumper/b2g_bumper.stamp due to hg commit failing due to no changes in working directory to commit
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Unassigned)
References
Details
Errors, like this in all the b2g bumper log files (/builds/b2g_bumper/v*.log on buildbot-master66.srv.releng.usw2.mozilla.com): 04:06:06 INFO - Running command: ['/usr/local/bin/hg', '--config', 'web.cacerts=/etc/pki/tls/certs/ca-bundle.crt', 'commit', '-u', 'B2G Bumper Bot <release+b2gbumper@mozilla.com>', '-m', 'Bumping manifests a=b2g-bump'] in /builds/b2g_bumper/v1.4/build/mozilla-b2g30_v1_4 04:06:06 INFO - Copy/paste: /usr/local/bin/hg --config web.cacerts=/etc/pki/tls/certs/ca-bundle.crt commit -u "B2G Bumper Bot <release+b2gbumper@mozilla.com>" -m "Bumping manifests a=b2g-bump" 04:06:06 INFO - Using env: {'HOME': '/home/cltbld', 04:06:06 INFO - 'LANG': 'en_US.UTF-8', 04:06:06 INFO - 'LOGNAME': 'cltbld', 04:06:06 INFO - 'OLDPWD': '/builds/b2g_bumper', 04:06:06 INFO - 'PATH': '/usr/local/bin:/usr/bin:/bin', 04:06:06 INFO - 'PWD': '/builds/b2g_bumper/v1.4', 04:06:06 INFO - 'SHELL': '/bin/sh', 04:06:06 INFO - 'SHLVL': '2', 04:06:06 INFO - 'USER': 'cltbld', 04:06:06 INFO - '_': '/usr/bin/python'} 04:06:14 INFO - nothing changed 04:06:14 ERROR - Return code: 1 04:06:14 INFO - No changes - all done 04:06:14 INFO - Copying logs to upload dir... Not clear why b2g bumper is trying to commit when there are no changes to commit. It looks like this is the reason the stamp file is not getting touched, i.e. last lines of /usr/local/bin/run_b2g_bumper.sh look like: # Now wait for them to finish retval=0 for pid in $pids; do wait $pid if [ $? != 0 ]; then retval=$? fi done if [ $retval = 0 ]; then # Touch our timestamp file so nagios can check if we're fresh touch b2g_bumper.stamp log "Done" fi And /usr/local/bin/run_b2g_bumper.sh is called from cron: [cltbld@buildbot-master66.srv.releng.usw2.mozilla.com b2g_bumper]$ cat /etc/cron.d/run_b2g_bumper # This file is managed by puppet */5 * * * * cltbld /usr/local/bin/run_b2g_bumper.sh [cltbld@buildbot-master66.srv.releng.usw2.mozilla.com b2g_bumper]$
Reporter | ||
Comment 1•10 years ago
|
||
This is therefore causing alerts in #buildduty such as: nagios-releng 12:02:50 Wed 04:02:50 PDT [4444] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is CRITICAL: FILE_AGE CRITICAL: /builds/b2g_bumper/b2g_bumper.stamp is 4814 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
Reporter | ||
Comment 2•10 years ago
|
||
So it looks like the ERROR entries might be a red herring, as in the code, we simply test whether there were changes (http://hg.mozilla.org/build/mozharness/file/190eb44f4d0e/scripts/b2g_bumper.py#l567) and then exit the push loop if there weren't any. So the exit code from the mozharness script should not be affected by this ERROR entry in the log, I believe. Now trying to work out why the b2g_bumper.stamp file is not getting touched.
Reporter | ||
Comment 3•10 years ago
|
||
I've had a look through the mozharness framework (mostly script.py) to see if I can find anything that would cause the mozharness framework to throw a non-zero return code, but not found anything obvious. I gunzipped previous b2g log file archives to see if I could find similar ERROR messages in there, but could not. I cd'd into the working directory where the commit was being attempted for one of the log files, and could confirm that the working directory was clean, and in sync with hg.mozilla.org, and that commits had been made and pushed since the b2g bumper stamp file was generated, so there have been successful commits/pushes to hg.m.o without b2g bumper stamp getting updated.
Reporter | ||
Comment 4•10 years ago
|
||
Getting vcs sync failure emails - may be related: e.g. report from /opt/vcs2vcs/bin/check_process_delay on github-sync3.dmz.scl3.mozilla.com: process not running for stale file '/opt/vcs2vcs/b2g/details.log' -rw-r----- 1 vcs2vcs vcs2vcs 54023543 Oct 29 10:20 /opt/vcs2vcs/b2g/details.log process not running for stale file '/opt/vcs2vcs/b2g/update.log' -rw-r----- 1 vcs2vcs vcs2vcs 87759091 Oct 29 10:20 /opt/vcs2vcs/b2g/update.log
Reporter | ||
Comment 5•10 years ago
|
||
[vcs2vcs@github-sync3.dmz.scl3 ~]$ ps -ef | grep 15[5]95 vcs2vcs 15595 10601 0 10:20 pts/10 00:00:00 /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg id http://hg.mozilla.org/releases/mozilla-b2g30_v1_4
Reporter | ||
Comment 6•10 years ago
|
||
Calling in dustin for assistance with diagnostics! Not clear why this process has hung...
Comment 7•10 years ago
|
||
closed gaia and b2g-inbound now for the problems
Comment 8•10 years ago
|
||
[root@github-sync3.dmz.scl3 ~]# lsof -p 15595 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ... hg 15595 vcs2vcs 0u CHR 136,10 0t0 13 /dev/pts/10 hg 15595 vcs2vcs 1w FIFO 0,8 0t0 209787502 pipe hg 15595 vcs2vcs 2u CHR 136,10 0t0 13 /dev/pts/10 hg 15595 vcs2vcs 3u IPv4 209787513 0t0 TCP github-sync3.dmz.scl3.mozilla.com:60367->hg-zlb.vips.scl3.mozilla.com:http (ESTABLISHED) [root@github-sync3.dmz.scl3 ~]# strace -p 15595 Process 15595 attached - interrupt to quit recvfrom(3, [root@github-sync3.dmz.scl3 ~]# netstat | grep 60367 tcp 0 0 github-sync3.dmz.scl3:60367 hg-zlb.vips.scl3.mozil:http ESTABLISHED so it looks like that process is hung waiting for that TCP connection. I wasn't able to confirm or deny that the connection is actually hung on the other end because it's going out through NAT to the ZLB, but it looks like this connection got "wedged" somehow -- either by NAT timing out or by some combination of packet loss between the two -- and the hg process is happily waiting forever for a response from the server, while the server has forgotten all about the transaction.
Reporter | ||
Comment 9•10 years ago
|
||
Strangely, the b2g bumper stamp age problem recovered a minute or two before I killed the hung process on the vcs sync server, so it may be that they were not related after all. I will continue investigating.
Comment 10•10 years ago
|
||
Seeing this in master.log: 06:15:40 INFO - Got output: 06:15:40 FATAL - Uncaught exception: Traceback (most recent call last): 06:15:40 FATAL - File "/builds/b2g_bumper/mozharness/mozharness/base/script.py", line 1268, in run 06:15:40 FATAL - self.run_action(action) 06:15:40 FATAL - File "/builds/b2g_bumper/mozharness/mozharness/base/script.py", line 1210, in run_action 06:15:40 FATAL - self._possibly_run_method(method_name, error_if_missing=True) 06:15:40 FATAL - File "/builds/b2g_bumper/mozharness/mozharness/base/script.py", line 1151, in _possibly_run_method 06:15:40 FATAL - return getattr(self, method_name)() 06:15:40 FATAL - File "../mozharness/scripts/b2g_bumper.py", line 563, in push_loop 06:15:40 FATAL - self.massage_manifests() 06:15:40 FATAL - File "../mozharness/scripts/b2g_bumper.py", line 484, in massage_manifests 06:15:40 FATAL - self.resolve_refs(manifest) 06:15:40 FATAL - File "../mozharness/scripts/b2g_bumper.py", line 223, in resolve_refs 06:15:40 FATAL - abs_revision = result.get() 06:15:40 FATAL - File "/usr/lib64/python2.6/multiprocessing/pool.py", line 422, in get 06:15:40 FATAL - raise self._value 06:15:40 FATAL - IndexError: list index out of range 06:15:40 FATAL - Running post_fatal callback... 06:15:40 FATAL - Exiting -1 Exception in thread Thread-42 (most likely raised during interpreter shutdown): Traceback (most recent call last): File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner File "/usr/lib64/python2.6/threading.py", line 484, in run File "/usr/lib64/python2.6/multiprocessing/pool.py", line 71, in worker File "/usr/lib64/python2.6/Queue.py", line 138, in put File "/usr/lib64/python2.6/threading.py", line 274, in notify <type 'exceptions.TypeError'>: 'NoneType' object is not callable Exception in thread Thread-37 (most likely raised during interpreter shutdown): Traceback (most recent call last): File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner File "/usr/lib64/python2.6/threading.py", line 484, in run File "/usr/lib64/python2.6/multiprocessing/pool.py", line 71, in worker File "/usr/lib64/python2.6/Queue.py", line 138, in put File "/usr/lib64/python2.6/threading.py", line 274, in notify <type 'exceptions.TypeError'>: 'NoneType' object is not callable Exception in thread Thread-41 (most likely raised during interpreter shutdown): Traceback (most recent call last): File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner File "/usr/lib64/python2.6/threading.py", line 484, in run File "/usr/lib64/python2.6/multiprocessing/pool.py", line 71, in worker File "/usr/lib64/python2.6/Queue.py", line 138, in put File "/usr/lib64/python2.6/threading.py", line 274, in notify <type 'exceptions.TypeError'>: 'NoneType' object is not callable Also, it was waiting for mapper earlier: 06:10:52 INFO - Mapping gaia revision to git using http://cruncher.build.mozilla.org/mapper/gaia/git/618fb3177226f535b9e4151646e3639aa9611818 06:10:52 WARNING - Error: Mapper returned a revision of None; maybe it needs more time. 06:10:52 INFO - Sleeping 30 seconds before retrying 06:10:53 INFO - Got output: 827214fcf38d6569aeb5c6d6f31cb296d1f09272 refs/heads/master 06:10:53 INFO - Got output: d50ae982b19f42f0b66d08b9eb306be81687869f refs/heads/master 06:10:53 INFO - Got output: 8986df0f82e15ac2798df0b6c2ee3435400677ac refs/heads/b2g-4.3_r2.1 06:11:23 WARNING - Error: Mapper returned a revision of None; maybe it needs more time. 06:11:23 INFO - Sleeping 30 seconds before retrying 06:11:53 WARNING - Error: Mapper returned a revision of None; maybe it needs more time. 06:11:53 INFO - Sleeping 30 seconds before retrying 06:12:23 WARNING - Error: Mapper returned a revision of None; maybe it needs more time. 06:12:23 INFO - Sleeping 30 seconds before retrying 06:12:53 WARNING - Error: Mapper returned a revision of None; maybe it needs more time. 06:12:53 INFO - Sleeping 30 seconds before retrying 06:13:23 WARNING - Error: Mapper returned a revision of None; maybe it needs more time. 06:13:23 INFO - Sleeping 30 seconds before retrying 06:13:53 WARNING - Error: Mapper returned a revision of None; maybe it needs more time. 06:13:53 INFO - Sleeping 30 seconds before retrying 06:14:23 WARNING - Error: Mapper returned a revision of None; maybe it needs more time. 06:14:23 INFO - Sleeping 30 seconds before retrying 06:14:53 WARNING - Error: Mapper returned a revision of None; maybe it needs more time. 06:14:53 INFO - Sleeping 30 seconds before retrying 06:15:23 INFO - Using 91003c59ab1ddde6ebab4dad6f035ee05881a53b for gaia to match 618fb3177226f535b9e4151646e3639aa9611818 in gaia.json
Reporter | ||
Comment 11•10 years ago
|
||
So this event highlighted several shortfallings of the b2g bumper mechanism, which I will follow up with in separate bugs. However, the immediate issue to solve is rolling back bug 1034147, which should enable trees to be opened again. git ls-remote https://git.mozilla.org/external/sprd-aosp/platform/system/media refs/heads/LNX.LF.3.5.2.1_1 is not returning any results, and this is attempted because the b2g manifests point to: git ls-remote http://sprdsource.spreadtrum.com:8085/b2g/android/platform/system/media refs/heads/LNX.LF.3.5.2.1_1 Therefore, this is not related to mirroring - even not using the git.mozilla.org mirrors, I would expect builds of dolphin to break with the changes from bug 1034147.
Reporter | ||
Comment 12•10 years ago
|
||
Closing, as there are other bugs for the b2g bumper improvements, and the tree-closing problem was resolved with the bug 1034147 backout from yesterday.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•