Open Bug 1269483 Opened 8 years ago Updated 7 years ago

pushdataaggregator.service got stuck, apparently didn't shut down

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)

Details

May 02 18:25:55 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: 2016-05-02T18:25:55Z 14829 vcsreplicator.aggregator copying hg-changegroup-1 from partition 0
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: 2016-05-02T18:26:08Z 14829 vcsreplicator.aggregator exiting main consume loop with error
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: Traceback (most recent call last):
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/bin/vcsreplicator-aggregator", line 9, in <module>
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: load_entry_point('vcsreplicator==0.1', 'console_scripts', 'vcsreplicator-aggregator')()
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/aggregator.py", line 197, in cli
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: producer_topic=aggregate_topic)
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/daemon.py", line 44, in run_in_loop
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: fn(alive=alive, **kwargs)
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/aggregator.py", line 41, in _run_aggregation
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: alive=alive)
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/aggregator.py", line 69, in synchronize_fully_consumed_messages
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: offsets = consumer_offsets(client, consumer_topic, groups)
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/util.py", line 70, in consumer_offsets
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: offsets = fetch_partition_offsets(client, topic)
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/util.py", line 42, in fetch_partition_offsets
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: resps = client.send_offset_request(reqs)
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 510, in send_offset_request
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: if not fail_on_error or not self._raise_on_response_error(resp)]
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 251, in _raise_on_response_error
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: kafka.common.check_error(resp)
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/common.py", line 233, in check_error
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: raise error_class(response)
May 02 18:26:08 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14829]: kafka.common.NotLeaderForPartitionError: OffsetResponse(topic='pushdata', partition=3, error=6, offsets=())
May 02 20:38:50 hgssh3.dmz.scl3.mozilla.com systemd[1]: Stopping Aggregate replicated Mercurial pushes to a new Kafka topic...
May 02 20:38:50 hgssh3.dmz.scl3.mozilla.com systemd[1]: Started Aggregate replicated Mercurial pushes to a new Kafka topic.
May 02 20:38:50 hgssh3.dmz.scl3.mozilla.com systemd[1]: Starting Aggregate replicated Mercurial pushes to a new Kafka topic...
May 02 20:38:50 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[14390]: 2016-05-02T20:38:50Z 14390 vcsreplicator.aggregator 4965 unacked messages in 8 partition: [0, 1, 2, 3, 4, 5, 6, 7]

It looks like the service started shutdown but didn't exit cleanly. The process got wedged and basically hanged. This caused pulse notification to stall for ~2 hours.
I've added a check_pushdataaggregator_lag nagios check to hgssh. 

fubar: could you please install it on hgssh3. Should run every minute or 2. I documented the check at https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/ops.html#check-pushdataaggregator-lag. Although I guess we'll need to create a Mana page too.
done. check_pushdataaggregator_lag is set to only run on hg-ssh-master like pulsenotifier_lag.

also removed the kafka/zookeeper checks on hgssh1
https://hg.mozilla.org/hgcustom/version-control-tools/rev/8a25a181ce4a2de4470428b6e0a634ebfa8bb1ac
vcsreplicator: refactor exit code handling for run_in_loop daemons (bug 1269483)
Had a few more hangs in production in the last few days.

I installed debug symbols and got a stack from a stuck process after it said it was exiting.

May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: vcsreplicator.aggregator copying heartbeat-1 from partition 0
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: kafka.producer Unable to send messages
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: Traceback (most recent call last):
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 390, in _send_messages
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: fail_on_error=self.sync_fail_on_error
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 480, in send_produce_request
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: (not fail_on_error or not self._raise_on_response_error(resp))]
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 251, in _raise_on_response_error
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: kafka.common.check_error(resp)
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/common.py", line 233, in check_error
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: raise error_class(response)
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: NotLeaderForPartitionError: ProduceResponse(topic='replicatedpushdata', partition=0, error=6, offset=-1)
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: vcsreplicator.aggregator exception in daemon loop function
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: vcsreplicator.aggregator executing loop exiting after error
May 06 18:19:04 hgssh3.dmz.scl3.mozilla.com pushdataaggregator[20825]: vcsreplicator.aggregator process exiting code 1
May 06 18:46:20 hgssh3.dmz.scl3.mozilla.com systemd[1]: Stopping Aggregate replicated Mercurial pushes to a new Kafka topic...
May 06 18:46:20 hgssh3.dmz.scl3.mozilla.com systemd[1]: Started Aggregate replicated Mercurial pushes to a new Kafka topic.
May 06 18:46:20 hgssh3.dmz.scl3.mozilla.com systemd[1]: Starting Aggregate replicated Mercurial pushes to a new Kafka topic...


#0  0x00007f3b889bbee1 in try_rich_compare (v=<function at remote 0xa374de8>, w=<function at remote 0xff8b230>, op=2) at /usr/src/debug/Python-2.7.5/Objects/object.c:613
#1  0x00007f3b889bdf9e in do_richcmp (op=2, w=<function at remote 0xff8b230>, v=<function at remote 0xa374de8>) at /usr/src/debug/Python-2.7.5/Objects/object.c:930
#2  PyObject_RichCompare (v=<function at remote 0xa374de8>, w=<function at remote 0xff8b230>, op=2) at /usr/src/debug/Python-2.7.5/Objects/object.c:982
#3  0x00007f3b889be12f in PyObject_RichCompareBool (v=<optimized out>, w=<optimized out>, op=op@entry=2) at /usr/src/debug/Python-2.7.5/Objects/object.c:1004
#4  0x00007f3b889d12d4 in tuplerichcompare (
    v=(<function at remote 0xa374de8>, (<Consumer(commit_lock=<thread.lock at remote 0xa375150>, _cleanup_func=<function at remote 0xa374de8>, auto_commit_every_n=100, group='hgweb11', commit_timer=None, auto_commit=False, offsets={0: 109834, 1: 106751, 2: 113444, 3: 101934, 4: 108848, 5: 98855, 6: 99068, 7: 98929}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34258>, (u'hgweb14.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x...(truncated),
    w=(<function at remote 0xff8b230>, (<Consumer(commit_lock=<thread.lock at remote 0xfdd2cb0>, _cleanup_func=<function at remote 0xff8b230>, auto_commit_every_n=100, group='hgweb12', commit_timer=None, auto_commit=False, offsets={0: 110284, 1: 107189, 2: 113906, 3: 102372, 4: 109261, 5: 99230, 6: 99440, 7: 99301}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34258>, (u'hgweb14.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x...(truncated), op=2) at /usr/src/debug/Python-2.7.5/Objects/tupleobject.c:614
#5  0x00007f3b889bdfe9 in PyObject_RichCompare (
    v=(<function at remote 0xa374de8>, (<Consumer(commit_lock=<thread.lock at remote 0xa375150>, _cleanup_func=<function at remote 0xa374de8>, auto_commit_every_n=100, group='hgweb11', commit_timer=None, auto_commit=False, offsets={0: 109834, 1: 106751, 2: 113444, 3: 101934, 4: 108848, 5: 98855, 6: 99068, 7: 98929}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34258>, (u'hgweb14.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x...(truncated),
    w=(<function at remote 0xff8b230>, (<Consumer(commit_lock=<thread.lock at remote 0xfdd2cb0>, _cleanup_func=<function at remote 0xff8b230>, auto_commit_every_n=100, group='hgweb12', commit_timer=None, auto_commit=False, offsets={0: 110284, 1: 107189, 2: 113906, 3: 102372, 4: 109261, 5: 99230, 6: 99440, 7: 99301}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34258>, (u'hgweb14.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x...(truncated), op=2) at /usr/src/debug/Python-2.7.5/Objects/object.c:961
#6  0x00007f3b889be12f in PyObject_RichCompareBool (v=<optimized out>,
    w=w@entry=(<function at remote 0xff8b230>, (<Consumer(commit_lock=<thread.lock at remote 0xfdd2cb0>, _cleanup_func=<function at remote 0xff8b230>, auto_commit_every_n=100, group='hgweb12', commit_timer=None, auto_commit=False, offsets={0: 110284, 1: 107189, 2: 113906, 3: 102372, 4: 109261, 5: 99230, 6: 99440, 7: 99301}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34258>, (u'hgweb14.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x...(truncated), op=op@entry=2) at /usr/src/debug/Python-2.7.5/Objects/object.c:1004
---Type <return> to continue, or q <return> to quit---
#7  0x00007f3b889aca49 in listremove (self=0x176bb48,
    v=(<function at remote 0xff8b230>, (<Consumer(commit_lock=<thread.lock at remote 0xfdd2cb0>, _cleanup_func=<function at remote 0xff8b230>, auto_commit_every_n=100, group='hgweb12', commit_timer=None, auto_commit=False, offsets={0: 110284, 1: 107189, 2: 113906, 3: 102372, 4: 109261, 5: 99230, 6: 99440, 7: 99301}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34258>, (u'hgweb14.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x...(truncated)) at /usr/src/debug/Python-2.7.5/Objects/listobject.c:2351
#8  0x00007f3b88a17b2c in call_function (oparg=<optimized out>, pp_stack=0x7ffe52fbbd90) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4086
#9  PyEval_EvalFrameEx (
    f=f@entry=Frame 0x1b43200, for file /var/hg/venv_tools/lib/python2.7/site-packages/kafka/consumer/base.py, line 191, in stop (self=<Consumer(commit_lock=<thread.lock at remote 0xfdd2cb0>, _cleanup_func=<function at remote 0xff8b230>, auto_commit_every_n=100, group='hgweb12', commit_timer=None, auto_commit=False, offsets={0: 110284, 1: 107189, 2: 113906, 3: 102372, 4: 109261, 5: 99230, 6: 99440, 7: 99301}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at re...(truncated), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740
#10 0x00007f3b88a190bd in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>, argcount=argcount@entry=1, kws=0x1b431c8, kwcount=0, defs=0x0, defcount=0, closure=closure@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Python/ceval.c:3330
#11 0x00007f3b88a1776f in fast_function (nk=<optimized out>, na=1, n=1, pp_stack=0x7ffe52fbbf90, func=<function at remote 0x19d4aa0>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4194
#12 call_function (oparg=<optimized out>, pp_stack=0x7ffe52fbbf90) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4119
#13 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x1b43040, for file /var/hg/venv_tools/lib/python2.7/site-packages/kafka/consumer/base.py, line 82, in cleanup (obj=<Consumer(commit_lock=<thread.lock at remote 0xfdd2cb0>, _cleanup_func=<function at remote 0xff8b230>, auto_commit_every_n=100, group='hgweb12', commit_timer=None, auto_commit=False, offsets={0: 110284, 1: 107189, 2: 113906, 3: 102372, 4: 109261, 5: 99230, 6: 99440, 7: 99301}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at r...(truncated), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740
#14 0x00007f3b88a190bd in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x30aff6e8, argcount=1, kws=kws@entry=0x7f3b88ed2068, kwcount=kwcount@entry=0, defs=defs@entry=0x0,
    defcount=defcount@entry=0, closure=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3330
#15 0x00007f3b889a605d in function_call (func=<function at remote 0xff8b230>,
    arg=(<Consumer(commit_lock=<thread.lock at remote 0xfdd2cb0>, _cleanup_func=<function at remote 0xff8b230>, auto_commit_every_n=100, group='hgweb12', commit_timer=None, auto_commit=False, offsets={0: 110284, 1: 107189, 2: 113906, 3: 102372, 4: 109261, 5: 99230, 6: 99440, 7: 99301}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34258>, (u'hgweb14.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34328>, (u'hgssh3.dmz.scl3.mozi...(truncated), kw={}) at /usr/src/debug/Python-2.7.5/Objects/funcobject.c:526
#16 0x00007f3b889810b3 in PyObject_Call (func=func@entry=<function at remote 0xff8b230>,
---Type <return> to continue, or q <return> to quit---
    arg=arg@entry=(<Consumer(commit_lock=<thread.lock at remote 0xfdd2cb0>, _cleanup_func=<function at remote 0xff8b230>, auto_commit_every_n=100, group='hgweb12', commit_timer=None, auto_commit=False, offsets={0: 110284, 1: 107189, 2: 113906, 3: 102372, 4: 109261, 5: 99230, 6: 99440, 7: 99301}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34258>, (u'hgweb14.dmz.scl3.mozilla.com', 9092): <KafkaConnection at remote 0x1b34328>, (u'hgssh3.dmz.scl3.mozi...(truncated), kw=kw@entry={}) at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#17 0x00007f3b88a142f7 in ext_do_call (nk=<optimized out>, na=<optimized out>, flags=<optimized out>, pp_stack=0x7ffe52fbc250, func=<function at remote 0xff8b230>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4411
#18 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x3119f1c0, for file /usr/lib64/python2.7/atexit.py, line 24, in _run_exitfuncs (exc_info=None, func=<function at remote 0xff8b230>, targs=(<Consumer(commit_lock=<thread.lock at remote 0xfdd2cb0>, _cleanup_func=<function at remote 0xff8b230>, auto_commit_every_n=100, group='hgweb12', commit_timer=None, auto_commit=False, offsets={0: 110284, 1: 107189, 2: 113906, 3: 102372, 4: 109261, 5: 99230, 6: 99440, 7: 99301}, topic='pushdata', auto_commit_every_t=5000, client=<KafkaClient(client_id='pushdataaggregator', correlation_id=1808789, hosts=[('hgweb13.dmz.scl3.mozilla.com', 9092), ('hgweb12.dmz.scl3.mozilla.com', 9092), ('hgweb14.dmz.scl3.mozilla.com', 9092), ('hgweb11.dmz.scl3.mozilla.com', 9092), ('hgssh3.dmz.scl3.mozilla.com', 9092)], brokers={8: <BrokerMetadata at remote 0x1b33998>, 9: <BrokerMetadata at remote 0x1b337e0>, 10: <BrokerMetadata at remote 0x1b33050>, 6: <BrokerMetadata at remote 0x1b33e68>, 7: <BrokerMetadata at remote 0x1b339f0>}, timeout=10, conns={(u'hgweb12.dmz.scl3.mozilla.com', 9092...(truncated), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2779
#19 0x00007f3b88a190bd in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f3b88ed2068, argcount=0, kws=kws@entry=0x0, kwcount=kwcount@entry=0, defs=defs@entry=0x0,
    defcount=defcount@entry=0, closure=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3330
#20 0x00007f3b889a5f68 in function_call (func=<function at remote 0x1876320>, arg=(), kw=0x0) at /usr/src/debug/Python-2.7.5/Objects/funcobject.c:526
#21 0x00007f3b889810b3 in PyObject_Call (func=func@entry=<function at remote 0x1876320>, arg=arg@entry=(), kw=<optimized out>) at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#22 0x00007f3b88a12f07 in PyEval_CallObjectWithKeywords (func=func@entry=<function at remote 0x1876320>, arg=(), arg@entry=0x0, kw=kw@entry=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3967
#23 0x00007f3b88a34207 in call_sys_exitfunc () at /usr/src/debug/Python-2.7.5/Python/pythonrun.c:1754
#24 Py_Finalize () at /usr/src/debug/Python-2.7.5/Python/pythonrun.c:419
#25 0x00007f3b88a33b48 in Py_Exit (sts=sts@entry=1) at /usr/src/debug/Python-2.7.5/Python/pythonrun.c:1781
#26 0x00007f3b88a33c87 in handle_system_exit () at /usr/src/debug/Python-2.7.5/Python/pythonrun.c:1155
#27 0x00007f3b88a33f4d in handle_system_exit () at /usr/src/debug/Python-2.7.5/Python/pythonrun.c:1177
#28 PyErr_PrintEx (set_sys_last_vars=set_sys_last_vars@entry=1) at /usr/src/debug/Python-2.7.5/Python/pythonrun.c:1165
#29 0x00007f3b88a3414a in PyErr_Print () at /usr/src/debug/Python-2.7.5/Python/pythonrun.c:1068
#30 0x00007f3b88a34b6e in PyRun_SimpleFileExFlags (fp=<optimized out>, fp@entry=0x12f9740, filename=filename@entry=0x7ffe52fbcede "/var/hg/venv_tools/bin/vcsreplicator-aggregator", closeit=closeit@entry=1, flags=flags@entry=0x7ffe52fbc5c0)
    at /usr/src/debug/Python-2.7.5/Python/pythonrun.c:956
#31 0x00007f3b88a34f63 in PyRun_AnyFileExFlags (fp=fp@entry=0x12f9740, filename=filename@entry=0x7ffe52fbcede "/var/hg/venv_tools/bin/vcsreplicator-aggregator", closeit=closeit@entry=1, flags=flags@entry=0x7ffe52fbc5c0)
    at /usr/src/debug/Python-2.7.5/Python/pythonrun.c:756
#32 0x00007f3b88a45b9f in Py_Main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/Python-2.7.5/Modules/main.c:640
#33 0x00007f3b87c72b15 in __libc_start_main (main=0x4006f0 <main>, argc=3, ubp_av=0x7ffe52fbc788, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe52fbc778) at libc-start.c:274
#34 0x0000000000400721 in _start ()
Frame 18 says we're in atexit.py. Looks like the Kafka consumer registers an atexit handler and this is hanging, preventing process termination. I'll have to dig into the source code of python-kafka to see what's up. We /might/ not be using a context manager or try..finally or something.
QA Contact: hwine → klibby
You need to log in before you can comment on or make changes to this bug.