Closed Bug 1090885 Opened 10 years ago Closed 10 years ago

B2G Bumper nagios alerts in #buildduty /builds/b2g_bumper/b2g_bumper.stamp is CRITICAL: FILE_AGE CRITICAL: /builds/b2g_bumper/b2g_bumper.stamp due to hg commit failing due to no changes in working directory to commit

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Unassigned)

References

Details

Errors, like this in all the b2g bumper log files (/builds/b2g_bumper/v*.log on buildbot-master66.srv.releng.usw2.mozilla.com):

04:06:06     INFO - Running command: ['/usr/local/bin/hg', '--config', 'web.cacerts=/etc/pki/tls/certs/ca-bundle.crt', 'commit', '-u', 'B2G Bumper Bot <release+b2gbumper@mozilla.com>', '-m', 'Bumping manifests a=b2g-bump'] in /builds/b2g_bumper/v1.4/build/mozilla-b2g30_v1_4
04:06:06     INFO - Copy/paste: /usr/local/bin/hg --config web.cacerts=/etc/pki/tls/certs/ca-bundle.crt commit -u "B2G Bumper Bot <release+b2gbumper@mozilla.com>" -m "Bumping manifests a=b2g-bump"
04:06:06     INFO - Using env: {'HOME': '/home/cltbld',
04:06:06     INFO -  'LANG': 'en_US.UTF-8',
04:06:06     INFO -  'LOGNAME': 'cltbld',
04:06:06     INFO -  'OLDPWD': '/builds/b2g_bumper',
04:06:06     INFO -  'PATH': '/usr/local/bin:/usr/bin:/bin',
04:06:06     INFO -  'PWD': '/builds/b2g_bumper/v1.4',
04:06:06     INFO -  'SHELL': '/bin/sh',
04:06:06     INFO -  'SHLVL': '2',
04:06:06     INFO -  'USER': 'cltbld',
04:06:06     INFO -  '_': '/usr/bin/python'}
04:06:14     INFO -  nothing changed 
04:06:14    ERROR - Return code: 1 
04:06:14     INFO - No changes - all done
04:06:14     INFO - Copying logs to upload dir...

Not clear why b2g bumper is trying to commit when there are no changes to commit.

It looks like this is the reason the stamp file is not getting touched, i.e. last lines of /usr/local/bin/run_b2g_bumper.sh look like:

# Now wait for them to finish
retval=0
for pid in $pids; do
    wait $pid
    if [ $? != 0 ]; then
        retval=$?
    fi
done

if [ $retval = 0 ]; then
    # Touch our timestamp file so nagios can check if we're fresh
    touch b2g_bumper.stamp
    log "Done"
fi

And /usr/local/bin/run_b2g_bumper.sh is called from cron:

[cltbld@buildbot-master66.srv.releng.usw2.mozilla.com b2g_bumper]$ cat /etc/cron.d/run_b2g_bumper 
# This file is managed by puppet
*/5 * * * *	cltbld	/usr/local/bin/run_b2g_bumper.sh
[cltbld@buildbot-master66.srv.releng.usw2.mozilla.com b2g_bumper]$
This is therefore causing alerts in #buildduty such as:

nagios-releng
12:02:50 Wed 04:02:50 PDT [4444] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is CRITICAL: FILE_AGE CRITICAL: /builds/b2g_bumper/b2g_bumper.stamp is 4814 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
So it looks like the ERROR entries might be a red herring, as in the code, we simply test whether there were changes (http://hg.mozilla.org/build/mozharness/file/190eb44f4d0e/scripts/b2g_bumper.py#l567) and then exit the push loop if there weren't any. So the exit code from the mozharness script should not be affected by this ERROR entry in the log, I believe. Now trying to work out why the b2g_bumper.stamp file is not getting touched.
I've had a look through the mozharness framework (mostly script.py) to see if I can find anything that would cause the mozharness framework to throw a non-zero return code, but not found anything obvious. I gunzipped previous b2g log file archives to see if I could find similar ERROR messages in there, but could not.

I cd'd into the working directory where the commit was being attempted for one of the log files, and could confirm that the working directory was clean, and in sync with hg.mozilla.org, and that commits had been made and pushed since the b2g bumper stamp file was generated, so there have been successful commits/pushes to hg.m.o without b2g bumper stamp getting updated.
Getting vcs sync failure emails - may be related:

e.g.

report from /opt/vcs2vcs/bin/check_process_delay on github-sync3.dmz.scl3.mozilla.com:
process not running for stale file '/opt/vcs2vcs/b2g/details.log'
-rw-r----- 1 vcs2vcs vcs2vcs 54023543 Oct 29 10:20 /opt/vcs2vcs/b2g/details.log
process not running for stale file '/opt/vcs2vcs/b2g/update.log'
-rw-r----- 1 vcs2vcs vcs2vcs 87759091 Oct 29 10:20 /opt/vcs2vcs/b2g/update.log
[vcs2vcs@github-sync3.dmz.scl3 ~]$ ps -ef | grep 15[5]95
vcs2vcs  15595 10601  0 10:20 pts/10   00:00:00 /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg id http://hg.mozilla.org/releases/mozilla-b2g30_v1_4
Calling in dustin for assistance with diagnostics! Not clear why this process has hung...
closed gaia and b2g-inbound now for the problems
[root@github-sync3.dmz.scl3 ~]# lsof -p 15595
COMMAND   PID    USER   FD   TYPE    DEVICE SIZE/OFF      NODE NAME
...
hg      15595 vcs2vcs    0u   CHR    136,10      0t0        13 /dev/pts/10
hg      15595 vcs2vcs    1w  FIFO       0,8      0t0 209787502 pipe
hg      15595 vcs2vcs    2u   CHR    136,10      0t0        13 /dev/pts/10
hg      15595 vcs2vcs    3u  IPv4 209787513      0t0       TCP github-sync3.dmz.scl3.mozilla.com:60367->hg-zlb.vips.scl3.mozilla.com:http (ESTABLISHED)

[root@github-sync3.dmz.scl3 ~]# strace -p 15595
Process 15595 attached - interrupt to quit
recvfrom(3,

[root@github-sync3.dmz.scl3 ~]# netstat | grep 60367
tcp        0      0 github-sync3.dmz.scl3:60367 hg-zlb.vips.scl3.mozil:http ESTABLISHED

so it looks like that process is hung waiting for that TCP connection.  I wasn't able to confirm or deny that the connection is actually hung on the other end because it's going out through NAT to the ZLB, but it looks like this connection got "wedged" somehow -- either by NAT timing out or by some combination of packet loss between the two -- and the hg process is happily waiting forever for a response from the server, while the server has forgotten all about the transaction.
Strangely, the b2g bumper stamp age problem recovered a minute or two before I killed the hung process on the vcs sync server, so it may be that they were not related after all. I will continue investigating.
Seeing this in master.log:
06:15:40     INFO - Got output: 
06:15:40    FATAL - Uncaught exception: Traceback (most recent call last):
06:15:40    FATAL -   File "/builds/b2g_bumper/mozharness/mozharness/base/script.py", line 1268, in run
06:15:40    FATAL -     self.run_action(action)
06:15:40    FATAL -   File "/builds/b2g_bumper/mozharness/mozharness/base/script.py", line 1210, in run_action
06:15:40    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
06:15:40    FATAL -   File "/builds/b2g_bumper/mozharness/mozharness/base/script.py", line 1151, in _possibly_run_method
06:15:40    FATAL -     return getattr(self, method_name)()
06:15:40    FATAL -   File "../mozharness/scripts/b2g_bumper.py", line 563, in push_loop
06:15:40    FATAL -     self.massage_manifests()
06:15:40    FATAL -   File "../mozharness/scripts/b2g_bumper.py", line 484, in massage_manifests
06:15:40    FATAL -     self.resolve_refs(manifest)
06:15:40    FATAL -   File "../mozharness/scripts/b2g_bumper.py", line 223, in resolve_refs
06:15:40    FATAL -     abs_revision = result.get()
06:15:40    FATAL -   File "/usr/lib64/python2.6/multiprocessing/pool.py", line 422, in get
06:15:40    FATAL -     raise self._value
06:15:40    FATAL - IndexError: list index out of range
06:15:40    FATAL - Running post_fatal callback...
06:15:40    FATAL - Exiting -1
Exception in thread Thread-42 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
  File "/usr/lib64/python2.6/threading.py", line 484, in run
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 71, in worker
  File "/usr/lib64/python2.6/Queue.py", line 138, in put
  File "/usr/lib64/python2.6/threading.py", line 274, in notify
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread Thread-37 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
  File "/usr/lib64/python2.6/threading.py", line 484, in run
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 71, in worker
  File "/usr/lib64/python2.6/Queue.py", line 138, in put
  File "/usr/lib64/python2.6/threading.py", line 274, in notify
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread Thread-41 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
  File "/usr/lib64/python2.6/threading.py", line 484, in run
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 71, in worker
  File "/usr/lib64/python2.6/Queue.py", line 138, in put
  File "/usr/lib64/python2.6/threading.py", line 274, in notify
<type 'exceptions.TypeError'>: 'NoneType' object is not callable


Also, it was waiting for mapper earlier:
06:10:52     INFO - Mapping gaia revision to git using http://cruncher.build.mozilla.org/mapper/gaia/git/618fb3177226f535b9e4151646e3639aa9611818
06:10:52  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
06:10:52     INFO - Sleeping 30 seconds before retrying
06:10:53     INFO - Got output: 827214fcf38d6569aeb5c6d6f31cb296d1f09272        refs/heads/master
06:10:53     INFO - Got output: d50ae982b19f42f0b66d08b9eb306be81687869f        refs/heads/master
06:10:53     INFO - Got output: 8986df0f82e15ac2798df0b6c2ee3435400677ac        refs/heads/b2g-4.3_r2.1
06:11:23  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
06:11:23     INFO - Sleeping 30 seconds before retrying
06:11:53  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
06:11:53     INFO - Sleeping 30 seconds before retrying
06:12:23  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
06:12:23     INFO - Sleeping 30 seconds before retrying
06:12:53  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
06:12:53     INFO - Sleeping 30 seconds before retrying
06:13:23  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
06:13:23     INFO - Sleeping 30 seconds before retrying
06:13:53  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
06:13:53     INFO - Sleeping 30 seconds before retrying
06:14:23  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
06:14:23     INFO - Sleeping 30 seconds before retrying
06:14:53  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
06:14:53     INFO - Sleeping 30 seconds before retrying
06:15:23     INFO - Using 91003c59ab1ddde6ebab4dad6f035ee05881a53b for gaia to match 618fb3177226f535b9e4151646e3639aa9611818 in gaia.json
Depends on: 1034147
So this event highlighted several shortfallings of the b2g bumper mechanism, which I will follow up with in separate bugs.

However, the immediate issue to solve is rolling back bug 1034147, which should enable trees to be opened again.

git ls-remote https://git.mozilla.org/external/sprd-aosp/platform/system/media refs/heads/LNX.LF.3.5.2.1_1

is not returning any results, and this is attempted because the b2g manifests point to:

git ls-remote http://sprdsource.spreadtrum.com:8085/b2g/android/platform/system/media refs/heads/LNX.LF.3.5.2.1_1

Therefore, this is not related to mirroring - even not using the git.mozilla.org mirrors, I would expect builds of dolphin to break with the changes from bug 1034147.
Closing, as there are other bugs for the b2g bumper improvements, and the tree-closing problem was resolved with the bug 1034147 backout from yesterday.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.