Bug 1492453 Comment 1 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

I also found something funny with Zookeeper in hgssh1.dmz.mdc1: some writes to the log are very delayed, up to more than 2 seconds! When I checked the server, the IO was very low. Note that this is only happening in that precise machine where is the others is not. Maybe a faulty disk?

One possible thing to test is to remove this instance from the Zookeeper ensemble (and possibly other one to leave 3 members) and see if it improves the situation. In case reads are not fasts enough (which I doubt) these servers can be converted into Observers[1]

[adelbarrio@hgssh1.dmz.mdc1 ~]$ grep -v INFO /var/log/zookeeper/zookeeper.log.2018-09-1*
/var/log/zookeeper/zookeeper.log.2018-09-10:2018-09-10 07:24:30,641 [myid:20] - WARN  [SyncThread:20:FileTxnLog@342] - fsync-ing the write ahead log in SyncThread:20 took 1755ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
/var/log/zookeeper/zookeeper.log.2018-09-10:2018-09-10 11:21:02,918 [myid:20] - WARN  [SyncThread:20:FileTxnLog@342] - fsync-ing the write ahead log in SyncThread:20 took 1633ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
.......



Now, I found some logs which relate the Zookeeper errors with Kafka and finally leading to the errors we are experiencing in the producers  pushdataaggregator and  pushdataaggregator-pending. This might be due to long GC from Zookeeper, but I could not confirm it (the GC log in zookeeper is not enabled).
At some point Kafka running on hgweb2.dmz.mdc1 experiences a SessionExpired from Zookeeper, and fails for a moment in renewing it. This cause pushdataaggregator to throw NotLeaderForPartitionError. Check the next logs and timing:


kafka in hgweb2.dmz.mdc1:
[2018-09-18 08:41:17,461] WARN Attempting to send response via channel for which there is no open connection, connection id 10.48.74.52:9092-
10.48.74.45:39874-110896 (kafka.network.Processor)
[2018-09-18 08:41:17,504] INFO [Partition pushdata-5 broker=15] Shrinking ISR from 14,20,17,16,15 to 15 (kafka.cluster.Partition)
[2018-09-18 08:41:17,650] INFO [ZooKeeperClient] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:17,660] INFO [ZooKeeperClient] Connected. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:17,661] INFO [ZooKeeperClient] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:17,661] INFO [ZooKeeperClient] Connected. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:19,993] INFO [ZooKeeperClient] Session expired. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:20,015] INFO [ZooKeeperClient] Initializing a new session to localhost:2181/hgmoreplication. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:20,026] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /brokers/topics/pushdata/partitions/5/state
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:465)
        at kafka.zk.KafkaZkClient.conditionalUpdatePath(KafkaZkClient.scala:621)
        at kafka.utils.ReplicationUtils$.updateLeaderAndIsr(ReplicationUtils.scala:33)
        at kafka.cluster.Partition.kafka$cluster$Partition$$updateIsr(Partition.scala:669)
        at kafka.cluster.Partition$$anonfun$4.apply$mcZ$sp(Partition.scala:513)
        ... Stacktrace continues

That error lead to pushdataaggregator throwing:
Sep 18 08:40:47 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.producer producer.stop() called, but producer is not async
Sep 18 08:41:16 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.conn <BrokerConnection node_id=0 host=hgweb2.dmz.mdc1.mozilla.com:9092 <connected> [IPv4 ('10.48.74.52', 9092)]> timed out after 10000 ms. Closing connection.
Sep 18 08:41:16 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.client Request failed: [Error 7] RequestTimedOutError: Request timed out after 10000 ms
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: vcsreplicator.aggregator 16 unacked messages in 1 partition: [0]
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: vcsreplicator.aggregator copying heartbeat-1 from partition 0
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.producer Unable to send messages
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: Traceback (most recent call last):
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 423, in _send_messages
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: fail_on_error=self.sync_fail_on_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 647, in send_produce_request
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: (not fail_on_error or not self._raise_on_response_error(resp))]
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 407, in _raise_on_response_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.errors.check_error(resp)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/errors.py", line 515, in check_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: raise error_class(response)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: NotLeaderForPartitionError: [Error 6] NotLeaderForPartitionError: ProduceResponsePayload(topic=u'replicatedpushdata', partition=0, error=6, offset=-1)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.producer error sending message to Kafka; reinitializing client to retry
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: Traceback (most recent call last):
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/producer.py", line 51, in send_message
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: self.topic, partition, msg)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 380, in send_messages
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: return self._send_messages(topic, partition, *msg)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 423, in _send_messages
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: fail_on_error=self.sync_fail_on_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 647, in send_produce_request
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: (not fail_on_error or not self._raise_on_response_error(resp))]
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 407, in _raise_on_response_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.errors.check_error(resp)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/errors.py", line 515, in check_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: raise error_class(response)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: NotLeaderForPartitionError: [Error 6] NotLeaderForPartitionError: ProduceResponsePayload(topic=u'replicatedpushdata', partition=0, error=6, offset=-1)


And pushdataaggregator-pending throwing:
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: Traceback (most recent call last):
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/daemon.py", line 45, in run_in_loop
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: fn(alive=alive, **kwargs)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/aggregator.py", line 48, in _run_aggregation
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: alive=alive)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/aggregator.py", line 113, in synchronize_fully_consumed_messages
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: ack_group=ack_group)[2]
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/aggregator.py", line 73, in get_aggregation_counts
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: offsets = consumer_offsets(client, consumer_topic, groups)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/util.py", line 76, in consumer_offsets
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: auto_commit=False)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/consumer/base.py", line 78, in __init__
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: self.fetch_last_known_offsets(partitions)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/consumer/base.py", line 112, in fetch_last_known_offsets
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: check_error(resp)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/errors.py", line 515, in check_error
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: raise error_class(response)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: UnknownError: [Error -1] UnknownError: OffsetFetchResponsePayload(topic=u'pushdata', partition=0, offset=-1, metadata=u'', error=-1)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: vcsreplicator.aggregator executing loop exiting after error
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: vcsreplicator.aggregator process exiting code 1
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com systemd[1]: pushdataaggregator-pending.service: main process exited, code=exited, status=1/FAILURE
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com systemd[1]: Unit pushdataaggregator-pending.service entered failed state.




[1] https://zookeeper.apache.org/doc/r3.4.13/zookeeperObservers.html

Back to Bug 1492453 Comment 1