Open Bug 1492453 Opened 7 years ago Updated 7 years ago

Debugging Kafka issue NotLeaderForPartitionError

Categories: Developer Services :: Mercurial: hg.mozilla.org (defect)
Priority: Not set
Severity: normal
Tracking: Not tracked
People: Reporter: alberto; Assignee: Unassigned
Attachments: 4 files

Hi, I'm creating this issue to track the debugging efforts to pinpoint the cause of Kafka partitions without a leader.

At first glance I found a misconfigured topic which has no leader because the replicas listed are not part of the Kafka cluster. This is most likely not the cause of the issue, but it would be good to have Kafka in a clean state, either by deleting the topic (if it's not needed) or by assigning it a valid list of replica servers (14,15,16,17,20).

[adelbarrio@hgssh1.dmz.mdc1 ~]$ /opt/kafka/bin/kafka-topics.sh --zookeeper localhost:2181/hgmoreplication --topic pushlog --describe
Topic:pushlog  PartitionCount:1  ReplicationFactor:5  Configs:unclean.leader.election.enable=false,min.insync.replicas=3,max.message.bytes=104857600
    Topic: pushlog  Partition: 0  Leader: -1  Replicas: 12,13,9,10,11  Isr: 13
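If we go the reassignment route, it would look roughly like this (a sketch: the JSON file name is made up, and since the partition currently has no leader the reassignment may stall until at least one of the listed replicas is up):

    # write the desired replica assignment for pushlog partition 0
    cat > /tmp/pushlog-reassign.json <<'EOF'
    {"version": 1,
     "partitions": [{"topic": "pushlog", "partition": 0, "replicas": [14, 15, 16, 17, 20]}]}
    EOF
    /opt/kafka/bin/kafka-reassign-partitions.sh \
        --zookeeper localhost:2181/hgmoreplication \
        --reassignment-json-file /tmp/pushlog-reassign.json --execute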
I also found something funny with Zookeeper on hgssh1.dmz.mdc1: some writes to the log are very delayed, by more than 2 seconds in some cases! When I checked the server, the I/O load was very low. Note that this is only happening on that particular machine; the others are fine. Maybe a faulty disk?

One possible thing to test is to remove this instance from the Zookeeper ensemble (and possibly another one, to leave 3 members) and see if it improves the situation. If reads are not fast enough (which I doubt), these servers can be converted into Observers[1]; see the sketch after the logs below.

[adelbarrio@hgssh1.dmz.mdc1 ~]$ grep -v INFO /var/log/zookeeper/zookeeper.log.2018-09-1*
/var/log/zookeeper/zookeeper.log.2018-09-10:2018-09-10 07:24:30,641 [myid:20] - WARN [SyncThread:20:FileTxnLog@342] - fsync-ing the write ahead log in SyncThread:20 took 1755ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
/var/log/zookeeper/zookeeper.log.2018-09-10:2018-09-10 11:21:02,918 [myid:20] - WARN [SyncThread:20:FileTxnLog@342] - fsync-ing the write ahead log in SyncThread:20 took 1633ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
.......

Now, I found some logs which connect the Zookeeper errors to Kafka, finally leading to the errors we are experiencing in the producers pushdataaggregator and pushdataaggregator-pending. This might be due to long GC pauses in Zookeeper, but I could not confirm it (GC logging is not enabled in Zookeeper). At some point Kafka running on hgweb2.dmz.mdc1 gets a SessionExpired from Zookeeper and briefly fails to renew its session. This causes pushdataaggregator to throw NotLeaderForPartitionError. Check the following logs and their timing.

Kafka on hgweb2.dmz.mdc1:

[2018-09-18 08:41:17,461] WARN Attempting to send response via channel for which there is no open connection, connection id 10.48.74.52:9092-10.48.74.45:39874-110896 (kafka.network.Processor)
[2018-09-18 08:41:17,504] INFO [Partition pushdata-5 broker=15] Shrinking ISR from 14,20,17,16,15 to 15 (kafka.cluster.Partition)
[2018-09-18 08:41:17,650] INFO [ZooKeeperClient] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:17,660] INFO [ZooKeeperClient] Connected. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:17,661] INFO [ZooKeeperClient] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:17,661] INFO [ZooKeeperClient] Connected. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:19,993] INFO [ZooKeeperClient] Session expired. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:20,015] INFO [ZooKeeperClient] Initializing a new session to localhost:2181/hgmoreplication. (kafka.zookeeper.ZooKeeperClient)
[2018-09-18 08:41:20,026] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /brokers/topics/pushdata/partitions/5/state
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:465)
    at kafka.zk.KafkaZkClient.conditionalUpdatePath(KafkaZkClient.scala:621)
    at kafka.utils.ReplicationUtils$.updateLeaderAndIsr(ReplicationUtils.scala:33)
    at kafka.cluster.Partition.kafka$cluster$Partition$$updateIsr(Partition.scala:669)
    at kafka.cluster.Partition$$anonfun$4.apply$mcZ$sp(Partition.scala:513)
    ...
Stacktrace continues.

That error led to pushdataaggregator throwing:

Sep 18 08:40:47 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.producer producer.stop() called, but producer is not async
Sep 18 08:41:16 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.conn <BrokerConnection node_id=0 host=hgweb2.dmz.mdc1.mozilla.com:9092 <connected> [IPv4 ('10.48.74.52', 9092)]> timed out after 10000 ms. Closing connection.
Sep 18 08:41:16 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.client Request failed: [Error 7] RequestTimedOutError: Request timed out after 10000 ms
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: vcsreplicator.aggregator 16 unacked messages in 1 partition: [0]
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: vcsreplicator.aggregator copying heartbeat-1 from partition 0
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.producer Unable to send messages
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: Traceback (most recent call last):
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 423, in _send_messages
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: fail_on_error=self.sync_fail_on_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 647, in send_produce_request
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: (not fail_on_error or not self._raise_on_response_error(resp))]
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 407, in _raise_on_response_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.errors.check_error(resp)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/errors.py", line 515, in check_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: raise error_class(response)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: NotLeaderForPartitionError: [Error 6] NotLeaderForPartitionError: ProduceResponsePayload(topic=u'replicatedpushdata', partition=0, error=6, offset=-1)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.producer error sending message to Kafka; reinitializing client to retry
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: Traceback (most recent call last):
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/producer.py", line 51, in send_message
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: self.topic, partition, msg)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 380, in send_messages
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: return self._send_messages(topic, partition, *msg)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 423, in _send_messages
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: fail_on_error=self.sync_fail_on_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 647, in send_produce_request
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: (not fail_on_error or not self._raise_on_response_error(resp))]
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 407, in _raise_on_response_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: kafka.errors.check_error(resp)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/errors.py", line 515, in check_error
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: raise error_class(response)
Sep 18 08:41:21 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator[13617]: NotLeaderForPartitionError: [Error 6] NotLeaderForPartitionError: ProduceResponsePayload(topic=u'replicatedpushdata', partition=0, error=6, offset=-1)

And pushdataaggregator-pending throwing:

Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: Traceback (most recent call last):
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/daemon.py", line 45, in run_in_loop
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: fn(alive=alive, **kwargs)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/aggregator.py", line 48, in _run_aggregation
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: alive=alive)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/aggregator.py", line 113, in synchronize_fully_consumed_messages
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: ack_group=ack_group)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/aggregator.py", line 73, in get_aggregation_counts
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: offsets = consumer_offsets(client, consumer_topic, groups)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/util.py", line 76, in consumer_offsets
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: auto_commit=False)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/consumer/base.py", line 78, in __init__
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: self.fetch_last_known_offsets(partitions)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/consumer/base.py", line 112, in fetch_last_known_offsets
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: check_error(resp)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/errors.py", line 515, in check_error
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: raise error_class(response)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: UnknownError: [Error -1] UnknownError: OffsetFetchResponsePayload(topic=u'pushdata', partition=0, offset=-1, metadata=u'', error=-1)
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: vcsreplicator.aggregator executing loop exiting after error
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com pushdataaggregator-pending[16941]: vcsreplicator.aggregator process exiting code 1
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com systemd[1]: pushdataaggregator-pending.service: main process exited, code=exited, status=1/FAILURE
Sep 18 08:41:20 hgssh1.dmz.mdc1.mozilla.com systemd[1]: Unit pushdataaggregator-pending.service entered failed state.

[1] https://zookeeper.apache.org/doc/r3.4.13/zookeeperObservers.html
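If we end up converting the extra ensemble members to Observers, the config change is small per the doc in [1] (a sketch; the server ID and hostname below are illustrative, not our actual ensemble layout):

    # zoo.cfg on the host being demoted (e.g. the member with myid 20):
    peerType=observer

    # and in the server list shared by every ensemble member, tag that entry:
    server.20=hgssh1.dmz.mdc1.mozilla.com:2888:3888:observer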
This is excellent debugging and *extremely* helpful: thank you!!!

We don't have GC logging enabled for Zookeeper. We should definitely enable that. My experience with running Java server processes is that GC is a persistent problem and you should *always* have GC logging enabled to help diagnose issues.

We do have GC logging enabled for Kafka. And from its logs, I noticed that we're coming nowhere close to hitting the -Xmx1G ceiling. The most memory we appear to be using after GC is ~320MB. I also noticed that Kafka starts by default with -Xms1G -Xmx1G.

So, immediate action items:

1. Enable GC logging for Zookeeper.
2a. Lower -Xmx for Kafka from 1G to 512m to reflect actual peak memory utilization requirements.
2b. Tweak Kafka GC settings so GC kicks in sooner (so we keep memory utilization low and don't flirt with the 1G heap limit).
3. Remove -Xms1G from Kafka.

#3 will help with the test environment, which spawns N Kafka processes. Right now, each of those processes will reserve 1GB, and this will quickly lead to swapping. Removing -Xms will have Kafka use only the memory it initially needs. I like the idea of using -Xms in server environments so you aren't surprised when a process's memory footprint grows and you start swapping. But I'm not worried about this for this use of Kafka and the composition of our servers. Besides, swapping will occur regardless of whether the Java process started with 1GB pre-allocated or grew to that. I think the main thing that -Xms gets you is a pre-allocated heap that doesn't require any "warm up" to grow to its max size on initial start. I don't think our performance requirements are anywhere close to caring. So let's just nuke -Xms from Kafka.

We'll probably want to tune Zookeeper's memory/GC settings as well. But let's get GC logs from Zookeeper first. https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting has some nice info about GC.

Finally, I'm worried about that long fsync for the Zookeeper write-ahead log. That's definitely worth investigating further. We shouldn't incur that much write I/O on hgssh1 outside of the NFS volume. And we should be using a fast SSD, so I'm not sure what would be causing ~2s of pending write I/O (which would be hundreds of megabytes). And I'd like to think that fsync()ing the write-ahead log (which isn't on the NFS volume) won't have to wait on the NFS volume to do anything. But I could be wrong: fsync() on Linux is wonky. We should definitely figure out what is behind the unflushed I/O and what's causing fsync to be slow. We might be able to mitigate that by tuning the background sync/flush settings of the filesystem...
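On that last point, these are the standard knobs to inspect first (stock Linux sysctls and /proc files; the 64MB value below is just an example threshold, not a recommendation):

    # how much dirty data is currently waiting for writeback?
    grep -E '^(Dirty|Writeback):' /proc/meminfo

    # current writeback thresholds
    sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs

    # example mitigation: start background writeback sooner, so fsync()
    # has less backlog to wait behind
    sudo sysctl -w vm.dirty_background_bytes=67108864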
Oh, can this bug be made public? There's nothing in here that warrants a private bug, IMO. Most of the hg.mo infra configuration is in public repositories, so things like hostnames, paths, processes, etc. are already effectively public.
We think the issues noted in Bug 1492453 may be a result of the JVM GC configuration we currently have. A good first step to verifying this theory is to enable GC logs. This commit adds CLI options to the JVM as recommended at https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting (see the section titled "GC pressure").
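For the record, the flags that page recommends are along these lines (a sketch for a Java 8 era JVM; the log path is illustrative, and the exact flags landed are in the attachment):

    # passed to the ZooKeeper JVM, e.g. via the JVMFLAGS environment variable
    JVMFLAGS="-Xloggc:/var/log/zookeeper/zookeeper_gc.log \
      -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
      -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M"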
Comment on attachment 9010755 [details]
zookeeper: enable GC logging (Bug 1492453) r?gps

Gregory Szorc [:gps] has approved the revision.
Attachment #9010755 - Flags: review+
I also noticed that Kafka is using the G1 garbage collector. Zookeeper seems to be using the JVM's default collector, which doesn't appear to be CMS. We might just want to preemptively switch Zookeeper to the G1 collector. I think it is sufficient to copy the GC-related arguments that Kafka is using. We should also remove -Xms1G from Zookeeper to limit initial memory utilization.
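For reference, the GC arguments that stock Kafka launcher scripts (kafka-run-class.sh) ship with are along these lines; worth double-checking against our deployed version before copying them into Zookeeper's JVMFLAGS:

    KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 \
      -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent \
      -Djava.awt.headless=true"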
This commit adds the `KAFKA_HEAP_OPTS` environment variable to the `kafka` systemd unit. Setting this variable overrides the default heap allocation settings, from "-Xmx1G -Xms1G" to "-Xmx512m". Essentially, this does two things.

First, it lowers the maximum heap allocation pool from 1GB to 512MB. We are doing this in the hope that GC triggers on the Kafka processes sooner than in our current deployment, which should make GC take less time as there will be less stale memory to collect.

Second, it removes the initial memory allocation setting altogether. This should force Kafka to use only as much memory as it needs (it rarely requires 1GB, especially not at program startup) and should provide a small performance boost in the test suite.
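Concretely, the unit change amounts to something like this (a sketch; the real unit file is managed by our config management and may differ in layout):

    # drop-in for the kafka systemd unit (illustrative path)
    # /etc/systemd/system/kafka.service.d/heap.conf
    [Service]
    Environment="KAFKA_HEAP_OPTS=-Xmx512m"

Note that kafka-server-start.sh only applies its "-Xmx1G -Xms1G" default when KAFKA_HEAP_OPTS is unset, so setting the variable replaces both values at once.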
In a previous commit we added GC logging to the Zookeeper processes by setting environment variables in the systemd unit for those processes. It slipped my mind that we don't use systemd in Docker, so we need to update a second location to keep the test suite at parity. This commit should fix that issue.
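The fix boils down to exporting the same variables wherever the container entrypoint launches Zookeeper, along these lines (illustrative only; the actual paths and scripts are whatever the test-suite images use):

    # hypothetical container start script
    export JVMFLAGS="-Xloggc:/var/log/zookeeper/zookeeper_gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    exec /opt/zookeeper/bin/zkServer.sh start-foreground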
Comment on attachment 9010807 [details]
kafka: modify heap memory allocation config (Bug 1492453) r?gps

Gregory Szorc [:gps] has approved the revision.
Attachment #9010807 - Flags: review+
Comment on attachment 9010808 [details]
zookeeper: mirror systemd config in test suite (Bug 1492453) r?gps

Gregory Szorc [:gps] has approved the revision.
Attachment #9010808 - Flags: review+
Group: infra
You guys are moving this bug fast, I like it! I've made the bug public as you suggested.

A bit more information regarding the very slow fsync() on hgssh1.dmz.mdc1. Something is bad on the I/O side and it is affecting all the applications running there. For example auditd: compare how many times "rate limit exceeded" appears on hgssh1 vs. hgssh2 (hgssh1 has been running much longer than hgssh2, but I think the comparison is still valid):

[adelbarrio@hgssh1.dmz.mdc1 log]$ dmesg | grep "rate limit exceeded" | wc -l
3137
[adelbarrio@hgssh2.dmz.mdc1 ~]$ dmesg | grep "rate limit exceeded" | wc -l
16

Finally, I built a graph showing "weighted_io_time"* on the 3 hgssh servers, and the results are pretty crazy! Running a very light disk health test with "sudo smartctl -t short -a /dev/sda" didn't show any problems. I didn't want to run a longer, more thorough test to avoid interfering with the applications running there. I think it's worth rebooting the server and maybe doing an extensive disk health test before letting it rejoin the Zookeeper and Kafka clusters.

*"weighted_io_time" is a measure of both I/O completion time and the backlog that may be accumulating.
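For anyone wanting to reproduce the graph: weighted_io_time is field 14 of /proc/diskstats (a standard kernel interface; /dev/sda assumed here, matching the smartctl command above):

    # weighted time spent doing I/O, in milliseconds, for sda
    awk '$3 == "sda" { print $14 }' /proc/diskstats

    # or sample every 5 seconds to watch the backlog accumulate (uses gawk's systime())
    while sleep 5; do awk '$3 == "sda" { print systime(), $14 }' /proc/diskstats; done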