Closed Bug 1034807 Opened 10 years ago Closed 10 years ago

mobile imaging returning 500 errors

Categories

(Data & BI Services Team :: DB: MySQL, task, P1)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nigelb, Assigned: mpressman)

References

()

Details

(Whiteboard: [data: consultative])

Seeing this all over at the moment and trees are closed pending this being fixed.

Android 4.0 Panda fx-team opt test mochitest-1 on 2014-07-04 23:47:23 PDT for push 5eb2e8471e28

slave: panda-0704

https://tbpl.mozilla.org/php/getParsedLog.php?id=43179853&tree=Fx-Team

00:04:36     INFO -  07-05 00:04:32.773 I/SUTAgentAndroid( 1897): 10.12.135.17 : isdir /mnt/sdcard/tests/profile/minidumps/
00:04:36     INFO -  07-05 00:04:32.796 I/SUTAgentAndroid( 1897): 10.12.135.17 : isdir /mnt/sdcard/tests/profile/minidumps
00:04:36     INFO -  07-05 00:04:32.812 I/SUTAgentAndroid( 1897): 10.12.135.17 : cd /mnt/sdcard/tests/profile/minidumps
00:04:36     INFO -  07-05 00:04:32.843 I/SUTAgentAndroid( 1897): 10.12.135.17 : ls
00:04:36     INFO -  07-05 00:04:32.867 I/SUTAgentAndroid( 1897): 10.12.135.17 : isdir /mnt/sdcard/tests/logs
00:04:36     INFO -  07-05 00:04:32.882 I/SUTAgentAndroid( 1897): 10.12.135.17 : cd /mnt/sdcard/tests/logs
00:04:36     INFO -  07-05 00:04:32.906 I/SUTAgentAndroid( 1897): 10.12.135.17 : ls
00:04:36     INFO -  07-05 00:04:32.929 I/SUTAgentAndroid( 1897): 10.12.135.17 : pull /mnt/sdcard/tests/logs/mochitest.log
00:04:36     INFO -  07-05 00:04:32.953 I/SUTAgentAndroid( 1897): 10.12.135.17 : hash /mnt/sdcard/tests/logs/mochitest.log
00:04:36     INFO -  07-05 00:04:32.984 I/SUTAgentAndroid( 1897): 10.12.135.17 : isdir /mnt/sdcard/tests/logs
00:04:36     INFO -  07-05 00:04:33.007 I/SUTAgentAndroid( 1897): 10.12.135.17 : cd /mnt/sdcard/tests/logs
00:04:36     INFO -  07-05 00:04:33.023 I/SUTAgentAndroid( 1897): 10.12.135.17 : ls
00:04:36     INFO -  07-05 00:04:33.046 I/SUTAgentAndroid( 1897): 10.12.135.17 : rm /mnt/sdcard/tests/logs/mochitest.log
00:04:36     INFO -  07-05 00:04:33.070 I/SUTAgentAndroid( 1897): 10.12.135.17 : isdir /mnt/sdcard/tests/profile
00:04:36     INFO -  07-05 00:04:33.085 I/SUTAgentAndroid( 1897): 10.12.135.17 : rmdr /mnt/sdcard/tests/profile
00:04:36     INFO -  07-05 00:04:34.617 I/SUTAgentAndroid( 1897): 10.12.135.17 : execsu /system/bin/logcat -v time -d dalvikvm:I ConnectivityService:S WifiMonitor:S WifiStateTracker:S wpa_supplicant:S NetworkStateTracker:S
00:04:36     INFO -  Mochi-Remote INFO | Device info: {'uptime': ['0 days 0 hours 11 minutes 46 seconds 635 ms'], 'sutuserinfo': [], 'power': ['Power status:', ' AC power OFFLINE', ' Battery charge NO BATTERY', ' Remaining charge: 0%', ' Battery Temperature: 0.0 (c)'], 'process': [['10043', '1897', 'com.mozilla.SUTAgentAndroid'], ['10038', '1730', 'com.android.providers.calendar'], ['10021', '1851', 'com.android.calendar'], ['10047', '2299', 'org.mozilla.fennec.UpdateService'], ['10036', '2019', 'com.android.quicksearchbox'], ['10017', '1987', 'com.android.gallery3d'], ['10008', '2004', 'com.android.musicfx'], ['10045', '1530', 'net.rocboronat.android.wallpaper.npe'], ['10027', '1958', 'com.android.defcontainer'], ['10014', '1545', 'com.android.inputmethod.latin'], ['10007', '1652', 'com.android.smspush'], ['1001', '1571', 'com.android.phone'], ['1000', '1402', 'system'], ['10001', '1635', 'android.process.acore'], ['10028', '1973', 'com.svox.pico'], ['10030', '1806', 'com.android.deskclock'], ['10024', '1598', 'com.android.launcher'], ['10000', '1513', 'android.process.media'], ['1000', '1619', 'com.android.settings'], ['10031', '1774', 'com.android.email'], ['10044', '1832', 'com.mozilla.watcher'], ['10033', '1745', 'com.android.exchange'], ['10032', '1759', 'com.android.mms'], ['1000', '1498', 'com.android.systemui']], 'screen': ['X:1280 Y:672'], 'memory': ['PA:741183488, FREE: 545468416'], 'systime': ['2014/07/05 12:04:35:872'], 'rotation': ['ROTATION:0'], 'disk': ['/data: 534761472 total, 263081984 available', '/system: 534761472 total, 185090048 available', '/mnt/sdcard: 8573165568 total, 8539742208 available'], 'os': ['pandaboard-eng 4.0.4 IMM76I 5 test-keys'], 'id': ['2e:60:30:63:4e:01'], 'uptimemillis': ['706685'], 'temperature': ['Temperature: 42.0']}
00:04:36     INFO -  Mochi-Remote INFO | Test root: /mnt/sdcard/tests
00:04:36     INFO - Return code: 0
00:04:36     INFO - TinderboxPrint: mochitest-1<br/>40282/0/105
00:04:36     INFO - # TBPL SUCCESS #
00:04:36     INFO - The mochitest suite: mochitest-1 ran with return status: SUCCESS
00:04:36     INFO - Running tests...
00:04:36     INFO - Running tests...
00:04:36     INFO - Running tests...
00:04:36     INFO - Running tests...
00:04:36     INFO - Running tests...
00:04:36     INFO - Running tests...
00:04:36     INFO - Running tests...
00:04:36     INFO - logcat.py output:
00:04:36     INFO - reconnecting socket
00:04:36     INFO - 
00:04:36     INFO - 07/04/2014 23:55:39: DEBUG: logcat.py running SUT command: execext su t=3600 logcat -v time
00:04:36     INFO - 
00:04:36     INFO - Running post-action listener: _resource_record_post_action
00:04:36     INFO - #####
00:04:36     INFO - ##### Running close-request step.
00:04:36     INFO - #####
00:04:36     INFO - Running pre-action listener: _resource_record_pre_action
00:04:36     INFO - Running main action method: close_request
00:04:36 CRITICAL - Bad return status from http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api/request/1659890/return/: 500!
00:04:36     INFO - Running post-action listener: _resource_record_post_action
00:04:36    FATAL - Uncaught exception: Traceback (most recent call last):
00:04:36    FATAL -   File "/builds/panda-0704/test/scripts/mozharness/base/script.py", line 1230, in run
00:04:36    FATAL -     self.run_action(action)
00:04:36    FATAL -   File "/builds/panda-0704/test/scripts/mozharness/base/script.py", line 1172, in run_action
00:04:36    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
00:04:36    FATAL -   File "/builds/panda-0704/test/scripts/mozharness/base/script.py", line 1113, in _possibly_run_method
00:04:36    FATAL -     return getattr(self, method_name)()
00:04:36    FATAL -   File "scripts/scripts/android_panda.py", line 473, in close_request
00:04:36    FATAL -     mph.close_request(self.request_url)
00:04:36    FATAL -   File "/builds/panda-0704/test/build/venv/lib/python2.7/site-packages/mozpoolclient.py", line 258, in close_request
00:04:36    FATAL -     check_mozpool_status(status)
00:04:36    FATAL -   File "/builds/panda-0704/test/build/venv/lib/python2.7/site-packages/mozpoolclient.py", line 57, in check_mozpool_status
00:04:36    FATAL -     raise MozpoolException('mozpool status not ok, code %s' % pprint.pformat(status))
00:04:36    FATAL - MozpoolException: mozpool status not ok, code 500
00:04:36    FATAL - Running post_fatal callback...
00:04:36    FATAL - Exiting -1
00:04:36     INFO - Running post-run listener: _resource_record_post_run
00:04:36     INFO - Running post-run listener: _upload_blobber_files
00:04:36     INFO - Blob upload gear active.
00:04:36     INFO - Preparing to upload files from /builds/panda-0704/test/build/blobber_upload_dir.
00:04:36     INFO - Files from /builds/panda-0704/test/build/blobber_upload_dir are to be uploaded with <fx-team> branch at the following location(s): https://blobupload.elasticbeanstalk.com
00:04:36     INFO - Running command: ['/builds/panda-0704/test/build/venv/bin/python', '/builds/panda-0704/test/build/venv/bin/blobberc.py', '-u', 'https://blobupload.elasticbeanstalk.com', '-a', '/builds/panda-0704/test/oauth.txt', '-b', 'fx-team', '-d', '/builds/panda-0704/test/build/blobber_upload_dir', '--output-manifest-url', '/builds/panda-0704/test/build/blobber_manifest.txt']
00:04:36     INFO - Copy/paste: /builds/panda-0704/test/build/venv/bin/python /builds/panda-0704/test/build/venv/bin/blobberc.py -u https://blobupload.elasticbeanstalk.com -a /builds/panda-0704/test/oauth.txt -b fx-team -d /builds/panda-0704/test/build/blobber_upload_dir --output-manifest-url /builds/panda-0704/test/build/blobber_manifest.txt
00:04:37     INFO -  (blobuploader) - INFO - Open directory for files ...
00:04:37     INFO -  (blobuploader) - INFO - Uploading /builds/panda-0704/test/build/blobber_upload_dir/logcat.log ...
00:04:37     INFO -  (blobuploader) - INFO - Using https://blobupload.elasticbeanstalk.com
00:04:37     INFO -  (blobuploader) - INFO - Uploading, attempt #1.
00:04:38     INFO -  (blobuploader) - INFO - TinderboxPrint: <a href='http://mozilla-releng-blobs.s3.amazonaws.com/blobs/fx-team/sha512/9a890619fa603f89de8bc645e89a53ed182816831376a9652cbfcd1ac280e0f3055368604a6b757c62c3b8a7081f3ef2d5af8a419f901ee5d8333493c30744ec'>logcat.log</a>: uploaded
00:04:38     INFO -  (blobuploader) - INFO - Blobserver returned 202. File uploaded!
00:04:38     INFO -  (blobuploader) - INFO - Done attempting.
00:04:38     INFO -  (blobuploader) - INFO - Uploading to blobber a json structured file with all the files that have been uploaded.
00:04:38     INFO -  (blobuploader) - INFO - Here are the contents: {"logcat.log": "http://mozilla-releng-blobs.s3.amazonaws.com/blobs/fx-team/sha512/9a890619fa603f89de8bc645e89a53ed182816831376a9652cbfcd1ac280e0f3055368604a6b757c62c3b8a7081f3ef2d5af8a419f901ee5d8333493c30744ec"}
00:04:38     INFO -  (blobuploader) - INFO - Uploading uploaded_files.json ...
00:04:38     INFO -  (blobuploader) - INFO - Using https://blobupload.elasticbeanstalk.com
00:04:38     INFO -  (blobuploader) - INFO - Uploading, attempt #1.
00:04:39     INFO -  (blobuploader) - INFO - TinderboxPrint: <a href='http://mozilla-releng-blobs.s3.amazonaws.com/blobs/fx-team/sha512/d1a09d85e4827a9acd9ce7effd29d32717aabca59a216496396210d8e8c3a9966e21e62eded45d0a72b0d43deeaac976b28aac5c65f034f3033720ef6f7e36db'>uploaded_files.json</a>: uploaded
00:04:39     INFO -  (blobuploader) - INFO - Blobserver returned 202. File uploaded!
00:04:39     INFO -  (blobuploader) - INFO - Done attempting.
00:04:39     INFO -  (blobuploader) - INFO - Iteration through files over.
00:04:39     INFO - Return code: 0
00:04:39     INFO - Reading from file /builds/panda-0704/test/build/blobber_manifest.txt
00:04:39     INFO - Contents:
00:04:39     INFO -  http://mozilla-releng-blobs.s3.amazonaws.com/blobs/fx-team/sha512/d1a09d85e4827a9acd9ce7effd29d32717aabca59a216496396210d8e8c3a9966e21e62eded45d0a72b0d43deeaac976b28aac5c65f034f3033720ef6f7e36db
00:04:39     INFO - Setting buildbot property blobber_manifest_url to http://mozilla-releng-blobs.s3.amazonaws.com/blobs/fx-team/sha512/d1a09d85e4827a9acd9ce7effd29d32717aabca59a216496396210d8e8c3a9966e21e62eded45d0a72b0d43deeaac976b28aac5c65f034f3033720ef6f7e36db
00:04:39     INFO - Writing buildbot properties ['blobber_manifest_url'] to /builds/panda-0704/test/properties/blobber_manifest_url
00:04:39     INFO - Writing to file /builds/panda-0704/test/properties/blobber_manifest_url
00:04:39     INFO - Contents:
00:04:39     INFO -  blobber_manifest_url:http://mozilla-releng-blobs.s3.amazonaws.com/blobs/fx-team/sha512/d1a09d85e4827a9acd9ce7effd29d32717aabca59a216496396210d8e8c3a9966e21e62eded45d0a72b0d43deeaac976b28aac5c65f034f3033720ef6f7e36db
00:04:39     INFO - rmtree: /builds/panda-0704/test/build/blobber_manifest.txt
00:04:39     INFO - retry: Calling <built-in function remove> with args: ('/builds/panda-0704/test/build/blobber_manifest.txt',), kwargs: {}, attempt #1
program finished with exit code 255
elapsedTime=1030.833952
========= Finished '/tools/buildbot/bin/python scripts/scripts/android_panda.py ...' failed (results: 2, elapsed: 17 mins, 10 secs) (at 2014-07-05 00:04:39.151787) =========
Severity: normal → blocker
Priority: -- → P1
From mobile-imaging-001.p1.releng.scl3:/var/log/mozpool.log:
InternalError: (InternalError) (1526, u'Table has no partition for value 1404518434') 'INSERT INTO device_logs (device_id, ts, source, message) VALUES (%s, %s, %s, %s)' (3368L, datetime.datetime(2014, 7, 5, 0, 0, 34, 68973), 'sut', 'connecting to SUT agent')

This started happening 2014-07-05 00:00 Pacific, so likely some sort of failure in creating new partitions per
  http://hg.mozilla.org/build/mozpool/file/default/sql/schema.sql

IT doc for mozpool is https://mana.mozilla.org/wiki/display/IT/Mozpool
Plan - transfer this to mozilla.org :: Server Operations: Database at 0730 Mountain time (1330 UTC), and page mpressman. 

The trees are closed, but it's a relatively quiet time and we can afford to wait 2.5 hours for daylight in DBA-land.
Assignee: nobody → server-ops-database
Component: Buildduty → Server Operations: Database
Product: Release Engineering → mozilla.org
QA Contact: bugspam.Callek → scabral
Version: unspecified → other
Assignee: server-ops-database → mpressman
ran call dbprocs() manually and now I'm seeing dates greater than 2014-07-04 23:59 in device_logs:
mysql> select max(ts) from device_logs;

+---------------------+
| max(ts)             |
+---------------------+
| 2014-07-05 07:48:46 |
+---------------------+
1 row in set (0.00 sec)

mysql>
cc-ing Amy so we can post-mortem why this failed on Monday.
This is now fixed. Restarted jobs are now green.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Jul  5 00:00:01 mobile-imaging-001 CROND[27330]: (apache) CMD (/opt/mozpool/dbcron.sh)

So, the crontask ran, but didn't do anything.

[root@mobile-imaging-001.p1.releng.scl3.mozilla.com ~]# ls -al /tmp/dbcron.output 
-rw-r--r-- 1 root root 160 Jul  1 18:28 /tmp/dbcron.output

which was when we determined that the crontask wasn't enabled because cron had seen the crontask before 'apache' was created (bug 1033112), restarted cron, and ran the crontask manually (7/1/2014 irc log):

[21:28:38] <dustin> I ran dbcron there, so it should clear

So that was the root cause - I ran that as root.  D'oh.  Nagios history shows that the service went green again shortly thereafter.

According to subsequent nagios history, this alerted again on July 3:

 Service Critical[07-03-2014 18:09:20] SERVICE ALERT: generic1.db.scl3.mozilla.com;Mozpool Log MySQL Partitions - device_logs;CRITICAL;SOFT;1;WARNING: last partition for device_logs expires in 23 hours

and was ack'd by dgarvey at 07-03-2014 18:51:43:

21:51 < nagios-scl3> Thu 18:51:43 PDT generic1.db.scl3.mozilla.com:Mozpool Log MySQL Partitions - device_logs is ACKNOWLEDGEMENT (CRITICAL): WARNING: last partition for device_logs expires in 23 hours  (http://m.mozilla.org/Mozpool+Log+MySQL+Partitions+-+device_logs) (nagiosadmin) (dgarvey)known

I don't know what ,if anything, was done at that point.  The playbook (linked above) indicates running 'dbcron' on the DB host (admittedly, on the wrong host since we moved it to generic; now corrected).  On 7/1 when it failed, ashish paged me asking if he should run that.

The crontask did send a mail about the failure, but:

Jul  5 00:00:02 mobile-imaging-001 postfix/smtp[27340]: 442CC3804F0: to=<root@mobile-imaging-001.p1.releng.scl3.mozilla.com>, relay=none, delay=0.05, delays=0.04/0/0/0, dsn=5.4.6, status=bounced (mail for mobile-imaging-001.p1.releng.scl3.mozilla.com loops back to myself

So that will need to be fixed - bug 1034848.
I don't think the escalations worked quite right here, either -- looks like nigelb paged nthomas and called coop.  Ideally that would have gotten shunted to the MOC more quickly than that, as they're the first-responders, and have the playbook that could have re-opened trees in a few minutes.
My first attempt was to ping on-call to see if I can get contact information for anyone from releng, according to the wiki page. dgarvey was on call and I did not get a response to my ping at 08:34 UTC / 01:34 PDT. At 08:50 UTC / 01:50 PDT, the on-call changed to Usul and he pinged me following a conversation on Twitter.

He took a quick look and offered to page nthomas, which he did. That didn't actually work, so he gave me nthomas's phone number and I called him up. Nick took a look at the logs and determined we needed a DBA. I believe at this point, if I had communicated the results of our investigation to Usul, we would have discovered the playbook to fix the issue. Or did the investigation in conjunction with Usul.
FWIW, I wasn't aware of the nagios alert because it was some time before the event, so not in the recent history that I did check. And the permissions on mana prevent me from seeing the playbook (m.mozilla.org), so I'm not sure it would have helped in engaging oncall instead of a DBA.
Thanks for the details.  Do we want to meet to wrap up the postmortem, or call this good enough?
(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> Thanks for the details.  Do we want to meet to wrap up the postmortem, or
> call this good enough?

I think this is good enough. Here are my takeaways:

* MOC should be the first point of contact if no one is around in-channel. They should be the first point of escalation even if someone is in channel.
* playbook is not visible to at least some members of releng and sheriffs. I don't think we necessarily need write access, but read access might help us debug issues better (or find holes in docs/procedures).
* should be be able to send mail from mobile-imaging: bug 1034848
* should verify that all docs/procedures related to pandas are still accurate after the panda move from scl1 completes
Brady, I'm just looping you in here because of the MOC involvement. There was a check acked at 07-03-2014 18:51:43, but I'm not sure if there was any follow up or someone attempted the remediation that was listed in the playbook. The error in the playbook has now been fixed, so if we run into this issue again, the docs should be correct for fixing it.
Whiteboard: [ consultative ]
Whiteboard: [ consultative ] → [consultative]
Whiteboard: [consultative] → [data: consultative]
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.