Closed Bug 1353509 Opened 7 years ago Closed 7 years ago

Autophone - autophone process is hanging intermittently

Categories

(Testing Graveyard :: Autophone, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(10 files)

After recovering from the network outage for the devices and restarting, autophone-3 is hanging on status and other control messages. I have attempted to restart it several times and it has continued to lock up. I have force-stopped it by killing the processes and clearing the buildcache in an attempt to clear the problem.
I finally deleted the pulse queues and restarted. It appears to be working now, but I don't understand the cause or why this would fix the hang. I enabled debug output, and it appeared to be hanging during/after the get_build_data call:

2017-04-04 13:35:21,843|17879|Thread-5|root|DEBUG|get_build_data: {'changeset': u'https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e', 'repo': u'autoland', 'date': datetime.datetime(2017, 4, 3, 14, 50, 14, tzinfo=<UTC>), 'build_type': 'opt', 'sdk': u'api-15', 'nightly': True, 'url': 'https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk', 'abi': u'arm', 'builder_type': 'taskcluster', 'id': '20170403145014', 'platform': u'android-api-15', 'changeset_dirs': ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], 'revision': u'0b45ef5f7702377bded5a67c6af0a713eccfc41e'}
2017-04-04 13:35:21,865|17879|Thread-5|root|DEBUG|BuildMetadata: url: https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk, directory: builds/aHR0cHM6Ly9xdWV1ZS50YXNrY2x1c3Rlci5uZXQvdjEvdGFzay9ZZ2V5YUVPV1NxQzVLa1B4aTBTZ3lBL3J1bnMvMC9hcnRpZmFjdHMvcHVibGljL2J1aWxkL3RhcmdldC5hcGs=, tree: autoland, buildid: 20170403145014, revision: 0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset: https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset_dirs: ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], app_name: org.mozilla.fennec, version: 55.0a1, build_type: opt, treeherder_url: https://treeherder.mozilla.org, abi: arm, sdk: api-15, nightly: True, platform: android-api-15, builder_type: taskcluster
autophone-1's and autophone-2's pulse queues were deleted before the servers could process the outstanding messages which accumulated during the outage. autophone-3's pulse queue was not deleted, however, since it hadn't grown as fast as the others. Perhaps that is the root cause: autophone-3 appears to have been working properly since its queue was deleted. Marking fixed for now until proven otherwise.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Blocks: 1352892
This is not resolved. Overnight, each server has encountered periods where it stopped consuming the pulse queues. Fortunately, they recovered but this issue is ongoing.
Severity: normal → blocker
Status: RESOLVED → REOPENED
Priority: -- → P1
Resolution: FIXED → ---
Summary: Autophone - autophone-3 is hanging on startup → Autophone - autophone process is hanging intermittently
Couldn't figure out an easy way to capture the message summaries other than screenshotting them.
Unfortunately, we updated the test manifests at the same time as the office network maintenance, and I think this was conflated with the network change. I believe we are experiencing a deadlock due to the changes in the test manifests and device assignments. Removing bug 1349904.
No longer blocks: 1352892
bug 1353760 made the autophone production queues unbounded. This allowed us to keep processing even though the total number of messages temporarily exceeded 20,000. All of the production queues have now been processed and are back to normal sizes. Therefore, I believe we can say that the pure processing of pulse messages may be slow, but it does not deadlock on its own. I did not issue any ap commands to the processes during this time.

There was still the previous situation where an ap command, which contacts the autophone process and sends/receives messages from the server process, would appear to hang. This may be due to a blocking http request to one of our other services such as hg, treeherder, or taskcluster, and I just didn't wait long enough for it to resolve. I'm investigating that possibility.
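For illustration (not a change to Autophone itself), a blocking urllib2 request can be bounded with an explicit timeout so that an apparent hang becomes a catchable error; the URL is just one of the endpoints seen in our logs:

    import socket
    import urllib2

    url = 'https://treeherder.mozilla.org/api/jobdetail/?repository=autoland'
    try:
        # timeout= turns an indefinitely blocking request into an error
        # that can be logged and retried.
        response = urllib2.urlopen(url, timeout=300)
        body = response.read()
        response.close()
    except (urllib2.URLError, socket.timeout) as e:
        print('request failed: %s' % e)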

One other thing: we are using the default pyamqp transport, which uses the pure-Python library amqp. librabbitmq is supposedly a high-performance transport written in C. Perhaps I could improve performance by installing librabbitmq; it will be used if available, with a fallback to pyamqp if it is not.
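A hedged sketch of how the transport choice looks with kombu (the connection URL is illustrative): with the plain amqp scheme, kombu prefers the librabbitmq transport when it is importable and falls back to pyamqp otherwise, and a transport can also be pinned explicitly:

    from kombu import Connection

    # Let kombu choose: librabbitmq if installed, else pure-Python pyamqp.
    auto = Connection('amqp://guest:guest@pulse.mozilla.org//')

    # Pin the pure-Python transport explicitly, e.g. for comparison.
    pure_py = Connection('pyamqp://guest:guest@pulse.mozilla.org//')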
Attached file hang.log
This is happening again on autophone-3. It began around 4:15 AM PDT this morning for some unknown reason when we stopped consuming pulse messages. We are not in danger of losing our queue now that the queues are unbounded. Looking at the tail of the logs, it appears there is a deadlock, with the workers attempting to send status messages to the parent process.
The parent process is constantly consuming about 5% CPU while the children are pretty much idle. I killed the parent process and restarted.
Attachment #8855892 - Flags: review?(jmaher)
This patch just removes some platforms which we no longer support or don't test. This reduces the number of bindings we have for the exchanges/queues, but is not meant to fix the underlying issue. Might as well get it in now while we continue to investigate.
Comment on attachment 8855892 [details] [diff] [review]
bug-1353509-v1.patch

Review of attachment 8855892 [details] [diff] [review]:
-----------------------------------------------------------------

win/win!
Attachment #8855892 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/55d214bb3bfade38dc1b11cf1c136cffd62aa030

leaving open for the underlying issue.

deployed 2017-04-08 17:45
This has been behaving much better since the removal of some of the platforms, but we started having issues again today with the queues growing overly large. They recovered temporarily, but the queues remained too large. autophone-1 was effectively hung and I rebooted it, but autophone-2 and autophone-3 were responding to ap commands, though their queues were also too large for comfort.
Actually, the change to the bindings had no effect, since the queues are now durable and the bindings are fixed until I delete them.
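For reference, a minimal kombu sketch of why (the exchange/queue names are illustrative, not our real ones): a durable queue's binding lives on the broker and persists until the queue is deleted, so changing the binding list in the config does not affect an already-declared queue:

    from kombu import Connection, Exchange, Queue

    exchange = Exchange('exchange/build/normalized', type='topic',
                        durable=True)
    queue = Queue('queue/autophone/builds', exchange=exchange,
                  routing_key='build.#', durable=True)

    with Connection('amqp://guest:guest@pulse.mozilla.org//') as conn:
        # declare() creates the exchange, queue, and binding on the
        # broker; the binding stays until the queue itself is deleted.
        queue(conn.default_channel).declare()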

This may also be associated with being DoSed via try. I'm investigating.
Attachment #8860013 - Flags: review?(jmaher)
Comment on attachment 8860013 [details] [diff] [review]
bug-1353509.patch

Review of attachment 8860013 [details] [diff] [review]:
-----------------------------------------------------------------

nice to see durable queues being used
Attachment #8860013 - Flags: review?(jmaher) → review+
autophone-2 began hanging and accumulating pulse messages this morning. It appears it began after:

2017-04-20 10:00:02,007|957|PulseMonitorThread|root|ERROR|Unable to open https://treeherder.mozilla.org/api/jobdetail/?repository=autoland&job_id=91626182
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/utils.py", line 53, in get_remote_text
    conn = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1241, in https_open
    context=self._context)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:590)>
Now autophone-3 is hung up with repeated

2017-04-21 08:05:51,226|920|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=93298011 missing keys: set(['config_file', 'chunk', 'build_url', 'builder_type']) job: {u'submit_timestamp': 1492775164, u'build_system_type': u'taskcluster', u'machine_name': u'i-0492d2009f4e5d6e6', u'job_group_symbol': u'tc-M', u'logs': [{u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/logs/live_backing.log', u'name': u'builds-4h'}, {u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/test_info/browser-chrome-chunked_errorsummary.log', u'name': u'errorsummary_json'}], u'job_group_name': u'Mochitests executed by TaskCluster', u'platform_option': u'opt', u'job_type_description': u'', u'result_set_id': 195226, u'build_platform_id': 215, u'result': u'success', u'machine_platform_os': u'-', u'push_id': 195226, u'ref_data_name': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'machine_platform_architecture': u'-', u'job_type_id': 33664, u'build_platform': u'windows7-32-vm', u'running_eta': 752, u'job_type_name': u'test-windows7-32-vm/opt-mochitest-browser-chrome-2', u'id': 93298011, u'platform': u'windows7-32-vm', u'state': u'completed', u'end_timestamp': 1492776487, u'build_os': u'-', u'who': u'fliu@mozilla.com', u'taskcluster_metadata': {u'task_id': u'MwddINb-R9iguE0-lyjp1w', u'retry_id': 0}, u'option_collection_hash': u'102210fe594ee9b33d82058545b1ed14f4c8206e', u'failure_classification_id': 1, u'autoclassify_status': u'crossreferenced', u'job_type_symbol': u'bc2', u'reason': u'scheduled', u'last_modified': u'2017-04-21T12:08:08.122351', u'tier': 2, u'job_coalesced_to_guid': None, u'job_guid': u'33075d20-d6fe-47d8-a0b8-4d3e9728e9d7/0', u'start_timestamp': 1492775644, u'build_architecture': u'-', u'job_group_description': u'', u'signature': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'job_group_id': 257, u'resource_uri': u'/api/project/try/jobs/93298011/'}

which isn't even an autophone job.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f7336ed7ed2eee0d144cb9ac762e92cfc2c2726
All three job action queues are growing this morning. Looking at the logs, we are getting lots of

2017-04-26 05:00:55,498|921|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=94342282 missing keys: set([
2

These are for jobs which do not belong to us, and we appear not to be processing them quickly enough. Perhaps we don't need to log as much here.
Make this a debug-only message.
Attachment #8861917 - Flags: review?(jmaher)
Comment on attachment 8861917 [details] [diff] [review]
bug-1353509-privatebuild-v1.patch

Review of attachment 8861917 [details] [diff] [review]:
-----------------------------------------------------------------

great
Attachment #8861917 - Flags: review?(jmaher) → review+
autophone-1 hung this morning. It appeared to be downloading/installing builds on the workers. The log messages weren't informative. I killed the process and restarted.
This autophone-1 hang has persisted across reboots and a selective delete of a single job. I've enabled DEBUG logging and started with a fresh jobs.sqlite database (saving the old version for later resubmissions). Hopefully we'll catch up with pulse before too long.
This appears to be related to queuing up the treeherder messages in the database.

2017-05-04 12:17:44,220|8990|PulseMonitorThread|root|DEBUG|new_job: worker phoneid pixel-05
2017-05-04 12:17:44,290|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder.queue_request: {'endpoint_base': u'jobs', 'data': [<thclient.client.TreeherderJob object at 0x7f041009f7d0>]}
2017-05-04 12:17:44,291|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder shared_lock.acquire

https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py#L130 should log the lock request, acquire the lock, insert the row into the treeherder table, then log the lock release and release the lock.

We never see the release request, which leads me to believe that the insertion into the treeherder table is hanging. In the meantime, anyone who wants to communicate with treeherder will do the same, which will block since the shared lock is never released. Sounds like sqlite is the deadlock to me.

In the meantime, I have deleted the autophone-1 pulse queues and restarted. If necessary, we'll have to resubmit the missing jobs.
We have been using a blocking shared lock to serialize access
across threads and processes for the Pulse messages and
Treeherder submissions.

One assumption has been that the Kombu consumer did not call the
connection consumer's callback functions until
connection.drain_events was called. This is not always the case,
however: the callbacks can be called before the drain_events call
is made.

If the callback was called after the lock was acquired in the
listen method but before the lock was released and the
drain_events method was called, then handle_message would be
called with the shared lock already locked, which violates the
assumption. This would result in a deadlock if the message
resulted in a call to a function which also attempted to lock the
shared lock.

This patch attempts to prevent the deadlock by unlocking the
shared lock upon entering handle_message. If it is not locked, no
foul, no error. However, if it is locked, we relock after
completing the handling of the message to keep our locking
assumptions in listen true.
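In outline, the patch behaves like this sketch (names follow the
description above; process_message stands in for the real message
handling, and shared_lock is the multiprocessing lock):

    def handle_message(self, data, message):
        relock = False
        try:
            # If the callback fired while listen() still held the
            # shared lock, release it so that any code this message
            # triggers can acquire the lock without deadlocking.
            self.shared_lock.release()
            relock = True
        except ValueError:
            # The lock was not held: no foul, no error.
            pass
        try:
            self.process_message(data, message)
        finally:
            if relock:
                # Restore the lock so listen()'s assumptions hold.
                self.shared_lock.acquire()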

I do have another patch which adds a new shared multiprocessing Lock for serializing access to the jobs database, but I don't think it is absolutely necessary and would like to try this one first.
Attachment #8864751 - Flags: review?(jmaher)
Comment on attachment 8864751 [details] [diff] [review]
bug-1353509-pulse-listen-locking.patch

Review of attachment 8864751 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me!
Attachment #8864751 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/2c5298003ccfd0c120e18244cc9e1461d025de7f

deployed 2017-05-06 ~11:25.

Marking fixed with the hope that it is true this time. If not, I'll open a new bug.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
autophone-2 hung this morning...

2017-05-09 09:27:09,477       nexus-4-9 INFO     <2017-05-09T16:27:09+00:00> nexus-4-9 (WORKING): logcontrol: flush log
2017-05-09 09:27:09,477       nexus-4-9 INFO     Waiting for log_flushed_event
2017-05-09 09:27:09,477       nexus-4-9 INFO     Got log_flushed_event
2017-05-09 09:34:43,256            root WARNING  check_for_unrecoverable_errors: Purging hung phone pixel-02
2017-05-09 17:51:42,256     nexus-6p-11 ERROR    Exception while checking path /sdcard/tests
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/worker.py", line 652, in _check_path
    self.dm.mkdir(d, parents=True, root=True)
  File "/mozilla/autophone/autophone/adb.py", line 1684, in mkdir
    root=root):
  File "/mozilla/autophone/autophone/adb.py", line 1104, in shell_bool
    raise ADBTimeoutError("%s" % adb_process)
ADBTimeoutError: args: adb -s 84B7N16421000012 wait-for-device shell su 0 mkdir -p /sdcard/tests/autopho

The "check_for_unrecoverable_errors: Purging hung phone pixel-02" message might be the hint.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
autophone-1 hung last night:

2017-05-15 20:19:24,777            root WARNING  check_for_unrecoverable_errors: Purging hung phone nexus-6p-1
autophone-2:

2017-05-16 19:05:47,025        pixel-02 ERROR    TalosTestJob 20170517010617 opt remote-tsvg Exception tearing down job
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/phonetest.py", line 878, in teardown_job
    tests=[self])
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 442, in submit_complete
    self.queue_request(machine, project, tjc)
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request
    self.shared_lock.release()
ValueError: semaphore or lock released too many times

2017-05-15 21:27:32,884       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets 53 ERROR Automation Error: Exception caught while running tests
2017-05-15 21:27:32,884       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last):
2017-05-15 21:27:32,884       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 365, in run_test_harness
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     retVal = mochitest.runTests(options)
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return self.runMochitests(options, testsToRun)
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     result = self.doTests(options, testsToRun)
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return status, self.lastTestSeen
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets AttributeError: 'MochiRemote' object has no attribute 'lastTestSeen'
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web server
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Failed to stop web server on http://10.252.73.231:38852/server/shutdown
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last):
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     with closing(urllib2.urlopen(self.shutdownURL)) as c:
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return opener.open(url, data, timeout)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 429, in open
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     response = self._open(req, data)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 447, in _open
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     '_open', req)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     result = func(*args)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return self.do_open(httplib.HTTPConnection, req)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     raise URLError(err)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets URLError: <urlopen error [Errno 111] Connection refused>
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web socket server
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping ssltunnel
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last):
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 393, in <module>
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     sys.exit(main())
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 389, in main
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return run_test_harness(parser, options)
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 371, in run_test_harness
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     mochitest.cleanup(options)
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 67, in cleanup
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     MochitestDesktop.cleanup(self, options)
2017-05-15 21:27:32,888       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     os.remove(self.manifest)
2017-05-15 21:27:32,888       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets OSError: [Errno 2] No such file or directory: '/tmp/tmpc59Co3.mozrunner/tests.manifest'
Attached file 2017-05-18 09:50 hang
Not purging but related to rebooting a disconnected device?
2017-05-22 18:00:27,737        pixel-07 ERROR    UnitTestJob mozilla-beta 20170522120019 debug api-25 android-api-15 mochitest-dom-browser-element Exception tearing down job
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/phonetest.py", line 904, in teardown_job
    tests=[self])
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 466, in submit_complete
    self.queue_request(machine, project, tjc)
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request
    self.shared_lock.release()
ValueError: semaphore or lock released too many times

2017-05-22 18:01:21,261      nexus-6p-2 INFO     ========= Started Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261154) =========
2017-05-22 18:01:21,261      nexus-6p-2 INFO     ========= Finished Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261475) =========
2017-05-22 18:14:50,697            root WARNING  check_for_unrecoverable_errors: Purging hung phone nexus-6p-2

Maybe this is a lock contention issue. We screw up by releasing too many times, but later try to acquire too many times? And what's up with the purging? I'll look into this more now.
In AutophoneTreeherder

    def queue_request(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('AutophoneTreeherder.queue_request: %s', job_collection.__dict__)
        logger.debug('AutophoneTreeherder shared_lock.acquire')
        self.shared_lock.acquire()
        try:
            self.jobs.new_treeherder_job(machine, project, job_collection)
        finally:
            logger.debug('AutophoneTreeherder shared_lock.release')
            self.shared_lock.release()


In Jobs

    def new_treeherder_job(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('jobs.new_treeherder_job: %s %s %s',
                     machine, project, job_collection.__dict__)
        attempts = 0
        now = datetime.datetime.utcnow().isoformat()
        conn = self._conn()
        job_cursor = self._execute_sql(
            conn,
            'insert into treeherder values (?, ?, ?, ?, ?, ?)',
            values=(None, attempts, now, machine, project, job_collection.to_json()))
        job_cursor.close()
        self._commit_connection(conn)
        self._close_connection(conn)

There is no way these could get out of sync, so it must be the case that someone else is releasing the shared lock without first obtaining it. Must be a thread in the Main process.
Apart from bug 1367199, I've been looking into this in a variety of ways. One realization was that simply having the shared lock as an attribute which is "passed" to the worker subprocess created by invoking the PhoneWorkerSubProcess.run method probably isn't correct. I changed the run method to take a shared_lock argument and changed the Process call to include args=(shared_lock,) so that we use the "correct" method of sharing an object via shared memory. With various other changes, I managed to accidentally create deadlocks and also managed to pretty much lock up my gnome session while Autophone was running. This leads me to believe that the previous method of just reusing the PhoneWorkerSubProcess instance's shared_lock attribute didn't do anything to serialize access, as the locks probably weren't shared at all.
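A minimal standalone sketch of that change (names here are illustrative; PhoneWorkerSubProcess itself carries much more machinery):

    import multiprocessing

    class WorkerSubProcess(object):
        def run(self, shared_lock):
            # The child receives the same underlying semaphore that the
            # parent created, via multiprocessing's supported channel.
            with shared_lock:
                print('worker holds the shared lock')

    if __name__ == '__main__':
        shared_lock = multiprocessing.Lock()
        worker = WorkerSubProcess()
        p = multiprocessing.Process(target=worker.run, args=(shared_lock,))
        p.start()
        p.join()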

I've been investigating exactly why I think I need the shared lock, and it seems to me that really all I cared about was inserting rows into the jobs, tests and treeherder tables in the jobs.sqlite database. But sqlite3 already handles locking the database, which leads me to wonder if I really need the shared lock at all.
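As a quick sketch of that point (the column list here is assumed from the insert statement above, not the real schema): sqlite3 serializes writers itself, so a second process inserting into jobs.sqlite simply blocks on the database lock, up to the connection timeout, with no external lock required:

    import sqlite3

    # timeout= is how long a writer waits on a locked database before
    # raising sqlite3.OperationalError.
    conn = sqlite3.connect('jobs.sqlite', timeout=30)
    conn.execute('create table if not exists treeherder '
                 '(id integer primary key, attempts, created, machine, '
                 'project, job_collection)')
    conn.execute('insert into treeherder values (?, ?, ?, ?, ?, ?)',
                 (None, 0, '2017-05-26T00:00:00', 'pixel-05', 'autoland',
                  '{}'))
    conn.commit()
    conn.close()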

That is what I am investigating now.

Additional thoughts welcome.
Attachment #8870901 - Flags: review?(jmaher)
Attachment #8870902 - Flags: review?(jmaher)
Comment on attachment 8870901 [details] [diff] [review]
bug-1353509-shared-lock-v1.patch

Review of attachment 8870901 [details] [diff] [review]:
-----------------------------------------------------------------

bummer to remove the shared locks
Attachment #8870901 - Flags: review?(jmaher) → review+
Comment on attachment 8870902 [details] [diff] [review]
bug-1353509-purge-logging-v1.patch

Review of attachment 8870902 [details] [diff] [review]:
-----------------------------------------------------------------

simple!
Attachment #8870902 - Flags: review?(jmaher) → review+
Attachment #8870903 - Flags: review?(jmaher) → review+
Seems to be resolved. I've retriggered jobs and had no issues.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard