Closed Bug 1353509 Opened 7 years ago Closed 7 years ago

Autophone - autophone process is hanging intermittently

Categories

(Testing Graveyard :: Autophone, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(10 files)

After recovering from the network outage for the devices and restarting, autophone-3 is hanging on status and other control messages. I have attempted to restart it several times and it has continued to lock up. I have force-stopped it by killing the processes and clearing the buildcache in an attempt to clear the problem.
I finally deleted the pulse queues and restarted. It appears to be working now, but I don't understand the cause or why this would fix the hang. I enabled debug output, and it appeared to be hanging during/after the get_build_data call:

2017-04-04 13:35:21,843|17879|Thread-5|root|DEBUG|get_build_data: {'changeset': u'https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e', 'repo': u'autoland', 'date': datetime.datetime(2017, 4, 3, 14, 50, 14, tzinfo=<UTC>), 'build_type': 'opt', 'sdk': u'api-15', 'nightly': True, 'url': 'https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk', 'abi': u'arm', 'builder_type': 'taskcluster', 'id': '20170403145014', 'platform': u'android-api-15', 'changeset_dirs': ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], 'revision': u'0b45ef5f7702377bded5a67c6af0a713eccfc41e'}
2017-04-04 13:35:21,865|17879|Thread-5|root|DEBUG|BuildMetadata: url: https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk, directory: builds/aHR0cHM6Ly9xdWV1ZS50YXNrY2x1c3Rlci5uZXQvdjEvdGFzay9ZZ2V5YUVPV1NxQzVLa1B4aTBTZ3lBL3J1bnMvMC9hcnRpZmFjdHMvcHVibGljL2J1aWxkL3RhcmdldC5hcGs=, tree: autoland, buildid: 20170403145014, revision: 0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset: https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset_dirs: ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], app_name: org.mozilla.fennec, version: 55.0a1, build_type: opt, treeherder_url: https://treeherder.mozilla.org, abi: arm, sdk: api-15, nightly: True, platform: android-api-15, builder_type: taskcluster
autophone-1's and autophone-2's pulse queues were deleted before the servers could process the outstanding messages which accumulated during the outage. autophone-3's pulse queue was not deleted, however, since it hadn't grown as fast as the others. Perhaps that is the root cause: autophone-3 appears to have been working properly since its queue was deleted. Marking fixed for now until proven otherwise.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Blocks: 1352892
This is not resolved. Overnight, each server has encountered periods where it stopped consuming the pulse queues. Fortunately, they recovered but this issue is ongoing.
Severity: normal → blocker
Status: RESOLVED → REOPENED
Priority: -- → P1
Resolution: FIXED → ---
Summary: Autophone - autophone-3 is hanging on startup → Autophone - autophone process is hanging intermittently
Couldn't figure out an easy way to capture the message summaries other than screenshotting them.
Unfortunately, we updated the test manifests at the same time as the office network maintenance, and I think this was conflated with the network change. I believe we are experiencing a deadlock due to the changes in the test manifests and device assignments. Removing bug 1349904.
No longer blocks: 1352892
bug 1353760 made the autophone production queues unbounded. This allowed us to keep processing even though the total number of messages temporarily exceeded 20,000. All of the production queues have now been processed and are back to normal sizes. Therefore, I believe we can say that the pure processing of pulse messages may be slow, but it does not deadlock on its own. I did not issue any ap commands to the processes during this time.

There was still the previous situation where an ap command, which contacts the autophone process and sends/receives messages from the server process, would appear to hang. This may be due to a blocking http request to one of our other services such as hg, treeherder, or taskcluster, and I just didn't wait long enough for it to resolve. I'm investigating that possibility.
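For illustration (not a change to Autophone itself), a blocking urllib2 request can be bounded with an explicit timeout so that an apparent hang becomes a catchable error; the URL is just one of the endpoints seen in our logs:

    import socket
    import urllib2

    url = 'https://treeherder.mozilla.org/api/jobdetail/?repository=autoland'
    try:
        # timeout= turns an indefinitely blocking request into an error
        # that can be logged and retried.
        response = urllib2.urlopen(url, timeout=300)
        body = response.read()
        response.close()
    except (urllib2.URLError, socket.timeout) as e:
        print('request failed: %s' % e)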

One other thing: we are using the default pyamqp transport, which uses the pure-Python library amqp. librabbitmq is supposedly a high-performance transport written in C. Perhaps I could improve performance by installing librabbitmq; it will be used if available, with a fallback to pyamqp if it is not.
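A hedged sketch of how the transport choice looks with kombu (the connection URL is illustrative): with the plain amqp scheme, kombu prefers the librabbitmq transport when it is importable and falls back to pyamqp otherwise, and a transport can also be pinned explicitly:

    from kombu import Connection

    # Let kombu choose: librabbitmq if installed, else pure-Python pyamqp.
    auto = Connection('amqp://guest:guest@pulse.mozilla.org//')

    # Pin the pure-Python transport explicitly, e.g. for comparison.
    pure_py = Connection('pyamqp://guest:guest@pulse.mozilla.org//')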
Attached file hang.log
This is happening again on autophone-3. It began around 4:15 AM PDT this morning for some unknown reason when we stopped consuming pulse messages. We are not in danger of losing our queue now that the queues are unbounded. Looking at the tail of the logs, it appears there is a deadlock, with the workers attempting to send status messages to the parent process.
The parent process is constantly consuming about 5% CPU while the children are pretty much idle. I killed the parent process and restarted.
Attachment #8855892 - Flags: review?(jmaher)
This patch just removes some platforms which we no longer support or don't test. This reduces the number of bindings we have for the exchanges/queues, but is not meant to fix the underlying issue. Might as well get it in now while we continue to investigate.
Comment on attachment 8855892 [details] [diff] [review]
bug-1353509-v1.patch

Review of attachment 8855892 [details] [diff] [review]:
-----------------------------------------------------------------

win/win!
Attachment #8855892 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/55d214bb3bfade38dc1b11cf1c136cffd62aa030

leaving open for the underlying issue.

deployed 2017-04-08 17:45
This has been behaving much better since the removal of some of the platforms, but we started having issues again today with the queues growing overly large. They recovered temporarily, but the queues remained too large. autophone-1 was effectively hung and I rebooted it, but autophone-2 and autophone-3 were responding to ap commands, though their queues were also too large for comfort.
Actually, the change to the bindings had no effect, since the queues are now durable and the bindings are fixed until I delete them.
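For reference, a minimal kombu sketch of why (the exchange/queue names are illustrative, not our real ones): a durable queue's binding lives on the broker and persists until the queue is deleted, so changing the binding list in the config does not affect an already-declared queue:

    from kombu import Connection, Exchange, Queue

    exchange = Exchange('exchange/build/normalized', type='topic',
                        durable=True)
    queue = Queue('queue/autophone/builds', exchange=exchange,
                  routing_key='build.#', durable=True)

    with Connection('amqp://guest:guest@pulse.mozilla.org//') as conn:
        # declare() creates the exchange, queue, and binding on the
        # broker; the binding stays until the queue itself is deleted.
        queue(conn.default_channel).declare()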

This may also be associated with being DoSed via try. I'm investigating.
Attachment #8860013 - Flags: review?(jmaher)
Comment on attachment 8860013 [details] [diff] [review]
bug-1353509.patch

Review of attachment 8860013 [details] [diff] [review]:
-----------------------------------------------------------------

nice to see durable queues being used
Attachment #8860013 - Flags: review?(jmaher) → review+
autophone-2 began hanging and accumulating pulse messages this morning. It appears it began after:

2017-04-20 10:00:02,007|957|PulseMonitorThread|root|ERROR|Unable to open https://treeherder.mozilla.org/api/jobdetail/?repository=autoland&job_id=91626182
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/utils.py", line 53, in get_remote_text
    conn = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1241, in https_open
    context=self._context)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:590)>
Now autophone-3 is hung up with repeated

2017-04-21 08:05:51,226|920|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=93298011 missing keys: set(['config_file', 'chunk', 'build_url', 'builder_type']) job: {u'submit_timestamp': 1492775164, u'build_system_type': u'taskcluster', u'machine_name': u'i-0492d2009f4e5d6e6', u'job_group_symbol': u'tc-M', u'logs': [{u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/logs/live_backing.log', u'name': u'builds-4h'}, {u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/test_info/browser-chrome-chunked_errorsummary.log', u'name': u'errorsummary_json'}], u'job_group_name': u'Mochitests executed by TaskCluster', u'platform_option': u'opt', u'job_type_description': u'', u'result_set_id': 195226, u'build_platform_id': 215, u'result': u'success', u'machine_platform_os': u'-', u'push_id': 195226, u'ref_data_name': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'machine_platform_architecture': u'-', u'job_type_id': 33664, u'build_platform': u'windows7-32-vm', u'running_eta': 752, u'job_type_name': u'test-windows7-32-vm/opt-mochitest-browser-chrome-2', u'id': 93298011, u'platform': u'windows7-32-vm', u'state': u'completed', u'end_timestamp': 1492776487, u'build_os': u'-', u'who': u'fliu@mozilla.com', u'taskcluster_metadata': {u'task_id': u'MwddINb-R9iguE0-lyjp1w', u'retry_id': 0}, u'option_collection_hash': u'102210fe594ee9b33d82058545b1ed14f4c8206e', u'failure_classification_id': 1, u'autoclassify_status': u'crossreferenced', u'job_type_symbol': u'bc2', u'reason': u'scheduled', u'last_modified': u'2017-04-21T12:08:08.122351', u'tier': 2, u'job_coalesced_to_guid': None, u'job_guid': u'33075d20-d6fe-47d8-a0b8-4d3e9728e9d7/0', u'start_timestamp': 1492775644, u'build_architecture': u'-', u'job_group_description': u'', u'signature': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'job_group_id': 257, u'resource_uri': u'/api/project/try/jobs/93298011/'}

which isn't even an autophone job.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f7336ed7ed2eee0d144cb9ac762e92cfc2c2726
All three job action queues are growing this morning. Looking at the logs, we are getting lots of

2017-04-26 05:00:55,498|921|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=94342282 missing keys: set([
2

These are for jobs which do not belong to us, and we appear not to be processing them quickly enough. Perhaps we don't need to log as much here.
Make this a debug-only message.
Attachment #8861917 - Flags: review?(jmaher)
Comment on attachment 8861917 [details] [diff] [review]
bug-1353509-privatebuild-v1.patch

Review of attachment 8861917 [details] [diff] [review]:
-----------------------------------------------------------------

great
Attachment #8861917 - Flags: review?(jmaher) → review+
autophone-1 hung this morning. It appeared to be downloading/installing builds on the workers. The log messages weren't informative. I killed the process and restarted.
This autophone-1 hang has persisted across reboots and a selective delete of a single job. I've enabled DEBUG logging and started with a fresh jobs.sqlite database (saving the old version for later resubmissions). Hopefully we'll catch up with pulse before too long.
This appears to be related to queuing up the treeherder messages in the database.

2017-05-04 12:17:44,220|8990|PulseMonitorThread|root|DEBUG|new_job: worker phoneid pixel-05
2017-05-04 12:17:44,290|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder.queue_request: {'endpoint_base': u'jobs', 'data': [<thclient.client.TreeherderJob object at 0x7f041009f7d0>]}
2017-05-04 12:17:44,291|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder shared_lock.acquire

https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py#L130 should log the lock request, acquire the lock, insert the row into the treeherder table, then log the lock release and release the lock.

We never see the release request, which leads me to believe that the insertion into the treeherder table is hanging. In the meantime, anyone who wants to communicate with treeherder will do the same, which will block since the shared lock is never released. Sounds like sqlite is the deadlock to me.

In the meantime, I have deleted the autophone-1 pulse queues and restarted. If necessary, we'll have to resubmit the missing jobs.
We have been using a blocking shared lock to serialize access
across threads and processes for the Pulse messages and
Treeherder submissions.

One assumption has been that the Kombu consumer did not call the
connection consumer's callback functions until
connection.drain_events was called. This is not always the case,
however: the callbacks can be called before the drain_events call
is made.

If the callback was called after the lock was acquired in the
listen method but before the lock was released and the
drain_events method was called, then handle_message would be
called with the shared lock already locked, which violates the
assumption. This would result in a deadlock if the message
resulted in a call to a function which also attempted to lock the
shared lock.

This patch attempts to prevent the deadlock by unlocking the
shared lock upon entering handle_message. If it is not locked, no
foul, no error. However, if it is locked, we relock after
completing the handling of the message to keep our locking
assumptions in listen true.
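In outline, the patch behaves like this sketch (names follow the
description above; process_message stands in for the real message
handling, and shared_lock is the multiprocessing lock):

    def handle_message(self, data, message):
        relock = False
        try:
            # If the callback fired while listen() still held the
            # shared lock, release it so that any code this message
            # triggers can acquire the lock without deadlocking.
            self.shared_lock.release()
            relock = True
        except ValueError:
            # The lock was not held: no foul, no error.
            pass
        try:
            self.process_message(data, message)
        finally:
            if relock:
                # Restore the lock so listen()'s assumptions hold.
                self.shared_lock.acquire()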

I do have another patch which adds a new shared multiprocessing Lock for serializing access to the jobs database, but I don't think it is absolutely necessary and would like to try this one first.
Attachment #8864751 - Flags: review?(jmaher)
Comment on attachment 8864751 [details] [diff] [review]
bug-1353509-pulse-listen-locking.patch

Review of attachment 8864751 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me!
Attachment #8864751 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/2c5298003ccfd0c120e18244cc9e1461d025de7f

deployed 2017-05-06 ~11:25.

Marking fixed with the hope that it is true this time. If not, I'll open a new bug.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
autophone-2 hung this morning...

2017-05-09 09:27:09,477       nexus-4-9 INFO     <2017-05-09T16:27:09+00:00> nexus-4-9 (WORKING): logcontrol: flush log
2017-05-09 09:27:09,477       nexus-4-9 INFO     Waiting for log_flushed_event
2017-05-09 09:27:09,477       nexus-4-9 INFO     Got log_flushed_event
2017-05-09 09:34:43,256            root WARNING  check_for_unrecoverable_errors: Purging hung phone pixel-02
2017-05-09 17:51:42,256     nexus-6p-11 ERROR    Exception while checking path /sdcard/tests
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/worker.py", line 652, in _check_path
    self.dm.mkdir(d, parents=True, root=True)
  File "/mozilla/autophone/autophone/adb.py", line 1684, in mkdir
    root=root):
  File "/mozilla/autophone/autophone/adb.py", line 1104, in shell_bool
    raise ADBTimeoutError("%s" % adb_process)
ADBTimeoutError: args: adb -s 84B7N16421000012 wait-for-device shell su 0 mkdir -p /sdcard/tests/autopho

The "check_for_unrecoverable_errors: Purging hung phone pixel-02" message might be the hint.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
autophone-1 hung last night:

2017-05-15 20:19:24,777            root WARNING  check_for_unrecoverable_errors: Purging hung phone nexus-6p-1
autophone-2:

2017-05-16 19:05:47,025        pixel-02 ERROR    TalosTestJob 20170517010617 opt remote-tsvg Exception tearing down job
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/phonetest.py", line 878, in teardown_job
    tests=[self])
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 442, in submit_complete
    self.queue_request(machine, project, tjc)
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request
    self.shared_lock.release()
ValueError: semaphore or lock released too many times

2017-05-15 21:27:32,884       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets 53 ERROR Automation Error: Exception caught while running tests
2017-05-15 21:27:32,884       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last):
2017-05-15 21:27:32,884       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 365, in run_test_harness
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     retVal = mochitest.runTests(options)
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return self.runMochitests(options, testsToRun)
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     result = self.doTests(options, testsToRun)
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return status, self.lastTestSeen
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets AttributeError: 'MochiRemote' object has no attribute 'lastTestSeen'
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web server
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Failed to stop web server on http://10.252.73.231:38852/server/shutdown
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last):
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     with closing(urllib2.urlopen(self.shutdownURL)) as c:
2017-05-15 21:27:32,885       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return opener.open(url, data, timeout)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 429, in open
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     response = self._open(req, data)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 447, in _open
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     '_open', req)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     result = func(*args)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return self.do_open(httplib.HTTPConnection, req)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     raise URLError(err)
2017-05-15 21:27:32,886       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets URLError: <urlopen error [Errno 111] Connection refused>
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web socket server
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping ssltunnel
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last):
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 393, in <module>
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     sys.exit(main())
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 389, in main
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     return run_test_harness(parser, options)
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 371, in run_test_harness
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     mochitest.cleanup(options)
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets   File "mochitest/runtestsremote.py", line 67, in cleanup
2017-05-15 21:27:32,887       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     MochitestDesktop.cleanup(self, options)
2017-05-15 21:27:32,888       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets     os.remove(self.manifest)
2017-05-15 21:27:32,888       nexus-5-6 INFO     UnitTestJob 20170516031557 debug mochitest-toolkit-widgets OSError: [Errno 2] No such file or directory: '/tmp/tmpc59Co3.mozrunner/tests.manifest'
Attached file 2017-05-18 09:50 hang
Not purging but related to rebooting a disconnected device?
2017-05-22 18:00:27,737        pixel-07 ERROR    UnitTestJob mozilla-beta 20170522120019 debug api-25 android-api-15 mochitest-dom-browser-element Exception tearing down job
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/phonetest.py", line 904, in teardown_job
    tests=[self])
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 466, in submit_complete
    self.queue_request(machine, project, tjc)
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request
    self.shared_lock.release()
ValueError: semaphore or lock released too many times

2017-05-22 18:01:21,261      nexus-6p-2 INFO     ========= Started Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261154) =========
2017-05-22 18:01:21,261      nexus-6p-2 INFO     ========= Finished Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261475) =========
2017-05-22 18:14:50,697            root WARNING  check_for_unrecoverable_errors: Purging hung phone nexus-6p-2

Maybe this is a lock contention issue. We screw up by releasing too many times, but later try to acquire too many times? And what's up with the purging? I'll look into this more now.
In AutophoneTreeherder

    def queue_request(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('AutophoneTreeherder.queue_request: %s', job_collection.__dict__)
        logger.debug('AutophoneTreeherder shared_lock.acquire')
        self.shared_lock.acquire()
        try:
            self.jobs.new_treeherder_job(machine, project, job_collection)
        finally:
            logger.debug('AutophoneTreeherder shared_lock.release')
            self.shared_lock.release()


In Jobs

    def new_treeherder_job(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('jobs.new_treeherder_job: %s %s %s',
                     machine, project, job_collection.__dict__)
        attempts = 0
        now = datetime.datetime.utcnow().isoformat()
        conn = self._conn()
        job_cursor = self._execute_sql(
            conn,
            'insert into treeherder values (?, ?, ?, ?, ?, ?)',
            values=(None, attempts, now, machine, project, job_collection.to_json()))
        job_cursor.close()
        self._commit_connection(conn)
        self._close_connection(conn)

There is no way these could get out of sync, so it must be the case that someone else is releasing the shared lock without first obtaining it. Must be a thread in the Main process.
Apart from bug 1367199, I've been looking into this in a variety of ways. One realization was that simply having the shared lock as an attribute which is "passed" to the worker subprocess created by invoking the PhoneWorkerSubProcess.run method probably isn't correct. I changed the run method to take a shared_lock argument and changed the Process call to include args=(shared_lock,) so that we use the "correct" method of sharing an object via shared memory. With various other changes, I managed to accidentally create deadlocks and also managed to pretty much lock up my gnome session while Autophone was running. This leads me to believe that the previous method of just reusing the PhoneWorkerSubProcess instance's shared_lock attribute didn't do anything to serialize access, as the locks probably weren't shared at all.
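A minimal standalone sketch of that change (names here are illustrative; PhoneWorkerSubProcess itself carries much more machinery):

    import multiprocessing

    class WorkerSubProcess(object):
        def run(self, shared_lock):
            # The child receives the same underlying semaphore that the
            # parent created, via multiprocessing's supported channel.
            with shared_lock:
                print('worker holds the shared lock')

    if __name__ == '__main__':
        shared_lock = multiprocessing.Lock()
        worker = WorkerSubProcess()
        p = multiprocessing.Process(target=worker.run, args=(shared_lock,))
        p.start()
        p.join()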

I've been investigating exactly why I think I need the shared lock, and it seems to me that really all I cared about was inserting rows into the jobs, tests and treeherder tables in the jobs.sqlite database. But sqlite3 already handles locking the database, which leads me to wonder if I really need the shared lock at all.
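As a quick sketch of that point (the column list here is assumed from the insert statement above, not the real schema): sqlite3 serializes writers itself, so a second process inserting into jobs.sqlite simply blocks on the database lock, up to the connection timeout, with no external lock required:

    import sqlite3

    # timeout= is how long a writer waits on a locked database before
    # raising sqlite3.OperationalError.
    conn = sqlite3.connect('jobs.sqlite', timeout=30)
    conn.execute('create table if not exists treeherder '
                 '(id integer primary key, attempts, created, machine, '
                 'project, job_collection)')
    conn.execute('insert into treeherder values (?, ?, ?, ?, ?, ?)',
                 (None, 0, '2017-05-26T00:00:00', 'pixel-05', 'autoland',
                  '{}'))
    conn.commit()
    conn.close()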

That is what I am investigating now.

Additional thoughts welcome.
Attachment #8870901 - Flags: review?(jmaher)
Attachment #8870902 - Flags: review?(jmaher)
Comment on attachment 8870901 [details] [diff] [review]
bug-1353509-shared-lock-v1.patch

Review of attachment 8870901 [details] [diff] [review]:
-----------------------------------------------------------------

bummer to remove the shared locks
Attachment #8870901 - Flags: review?(jmaher) → review+
Comment on attachment 8870902 [details] [diff] [review]
bug-1353509-purge-logging-v1.patch

Review of attachment 8870902 [details] [diff] [review]:
-----------------------------------------------------------------

simple!
Attachment #8870902 - Flags: review?(jmaher) → review+
Attachment #8870903 - Flags: review?(jmaher) → review+
Seems to be resolved. I've retriggered jobs and had no issues.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard