Closed Bug 1147100 Opened 9 years ago Closed 9 years ago

Pipeline does not gracefully recover from network failures

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Assigned: sphilp)

Details

The following occurred on 2015-03-20 (from http://status.aws.amazon.com/):

7:23 AM PDT We are investigating networking connectivity issues for some instances in the US-WEST-2 Region.
7:41 AM PDT We continue to investigate network connectivity and DNS resolution issues affecting some instances within a VPC in the US-WEST-2 Region.
8:07 AM PDT We have resolved the network connectivity and DNS resolution issues for the majority of instances within a VPC in the US-WEST-2 Region.
8:22 AM PDT Between 7:03 AM and 7:49 AM PDT we experienced network connectivity and DNS resolution issues affecting some instances within a VPC in the US-WEST-2 Region. The issue has been resolved and the service is operating normally.

The data pipeline was not able to gracefully recover, and I had to manually restart all Heka processes. We should pinpoint the cause (which I suspect is in KafkaInput, HttpListenInput, or both) and get a code fix in. On the edge nodes I saw file descriptor limits being hit:

Mar 20 15:58:46 ip-172-31-13-73 hekad: 2015/03/20 15:58:46 http: Accept error: accept tcp [::]:80: too many open files; retrying in 1s

And for Kafka input/output on all hekas:

Mar 20 14:32:00 ip-172-31-13-73 hekad: 2015/03/20 14:32:00 Plugin 'KafkaOutput' error: kafka server: Replica infomation not available, one or more brokers are down.

The latter is expected in a network failure scenario, but once network connectivity was restored, the heka instances did not recover gracefully.
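
One mitigation for the file descriptor exhaustion seen on the edge nodes, independent of the recovery bug itself, is to check the process's NOFILE limit at startup and raise the soft limit toward the hard limit. A minimal Go sketch of that check follows; the 65536 target is illustrative and this is not taken from hekad's code:

package main

import (
	"log"
	"syscall"
)

// Illustrative only: inspect the soft RLIMIT_NOFILE at startup and raise it
// toward the hard limit so connection churn is less likely to hit
// "too many open files". Raising past the hard limit needs root/systemd help.
func raiseFDLimit(target uint64) error {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return err
	}
	log.Printf("nofile limits: soft=%d hard=%d", rl.Cur, rl.Max)
	if rl.Cur >= target {
		return nil // soft limit already high enough
	}
	if target > rl.Max {
		target = rl.Max // cannot exceed the hard limit from inside the process
	}
	rl.Cur = target
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl)
}

func main() {
	if err := raiseFDLimit(65536); err != nil {
		log.Printf("could not raise fd limit: %v", err)
	}
}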
Priority: -- → P1
Not able to reproduce on stage yet, but here are a few notes that may trigger something...

Dropping the kafka server connections on ports 80 and 6667 results in Kafka logging:

[2015-03-26 15:35:04,855] INFO Reconnect due to socket error: java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)

[2015-03-26 15:35:33,420] WARN [ReplicaFetcherThread-1-2034221], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 1036205; ClientId: ReplicaFetcherThread-1-2034221; ReplicaId: 2040115; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [telemetry,0] -> PartitionFetchInfo(162983,16777216). Possible cause: java.nio.channels.ClosedChannelException (kafka.server.ReplicaFetcherThread)

So Kafka seems to handle retrying when the connection drops: https://issues.apache.org/jira/browse/KAFKA-1461

Heka doesn't seem to have an issue here; it keeps submitting results.

Moving to the edge servers: if they can't resolve DNS, hekad logs:

hekad: 2015/03/26 16:42:03 Plugin 'KafkaOutput' error: kafka: Client has run out of available brokers to talk to. Is your cluster reachable?
hekad: 2015/03/26 16:42:06 Plugin 'KafkaOutput' error: kafka: Dropped 50 messages: kafka: broker: not connected

and 

hekad: 2015/03/26 16:52:43 Plugin 'LandfillOutput' error: Error publishing /media/ephemeral0/split_out/http_raw/finalized/pipeline_prototype_incoming.stage.mozaws.net/20150326164738.850_ip-172-31-16-110 to s3://net-mozaws-stage-us-west-2-pipeline-data/landfill/pipeline_prototype_incoming.stage.mozaws.net/20150326164738.850_ip-172-31-16-110: Put https://s3-us-west-2.amazonaws.com/net-mozaws-stage-us-west-2-pipeline-data/landfill/pipeline_prototype_incoming.stage.mozaws.net/20150326164738.850_ip-172-31-16-110: dial tcp: lookup s3-us-west-2.amazonaws.com: no such host
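
The "Client has run out of available brokers" wording looks like it comes from the sarama Go client underneath KafkaOutput. A minimal sketch, assuming sarama and using illustrative broker addresses, topic, and retry values (not the production config), of how the metadata and producer retry settings can be tuned so transient DNS or connectivity loss is retried rather than surfaced immediately:

package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

// newProducer builds a SyncProducer that keeps retrying metadata refreshes
// and produce requests while DNS/connectivity is flapping, instead of giving
// up on the broker list right away. All values here are illustrative.
func newProducer(brokers []string) (sarama.SyncProducer, error) {
	cfg := sarama.NewConfig()
	cfg.Metadata.Retry.Max = 10
	cfg.Metadata.Retry.Backoff = 5 * time.Second
	cfg.Producer.Retry.Max = 5
	cfg.Producer.Retry.Backoff = 500 * time.Millisecond
	cfg.Producer.RequiredAcks = sarama.WaitForLocal
	cfg.Producer.Return.Successes = true // required by SyncProducer
	return sarama.NewSyncProducer(brokers, cfg)
}

func main() {
	p, err := newProducer([]string{"kafka-1:6667", "kafka-2:6667"})
	if err != nil {
		log.Fatalf("kafka producer: %v", err)
	}
	defer p.Close()
	if _, _, err := p.SendMessage(&sarama.ProducerMessage{
		Topic: "telemetry",
		Value: sarama.StringEncoder("ping"),
	}); err != nil {
		log.Printf("send failed after retries: %v", err)
	}
}

Even with aggressive retries, messages are still dropped once the retry budget is exhausted, so this only narrows the window rather than closing it.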

Thinking of other ways to reproduce...
Another idea was to terminate the Kafka services while load was being applied to the edge servers:

Mar 30 15:36:02 ip-172-31-35-102 hekad: 2015/03/30 15:36:02 Plugin 'KafkaOutput' error: kafka: Dropped 1 messages: EOF
Mar 30 15:36:02 ip-172-31-35-102 hekad: 2015/03/30 15:36:02 Plugin 'KafkaOutput' error: kafka: Dropped 10 messages: read tcp 172.31.10.45:6667: connection reset by peer
Mar 30 15:36:02 ip-172-31-35-102 hekad: 2015/03/30 15:36:02 Plugin 'KafkaOutput' error: kafka: Dropped 61 messages: write tcp 172.31.10.45:6667: broken pipe
Mar 30 15:38:05 ip-172-31-35-102 hekad: 2015/03/30 15:38:05 Diagnostics: 13 packs have been idle more than 120 seconds.
Mar 30 15:38:05 ip-172-31-35-102 hekad: 2015/03/30 15:38:05 Diagnostics: (input) Plugin names and quantities found on idle packs:
Mar 30 15:38:05 ip-172-31-35-102 hekad: 2015/03/30 15:38:05 Diagnostics: 	KafkaOutput: 13

It then continues to spit out:

Mar 30 15:40:35 ip-172-31-0-130 hekad: 2015/03/30 15:40:35 Diagnostics: 48 packs have been idle more than 120 seconds.
Mar 30 15:40:35 ip-172-31-0-130 hekad: 2015/03/30 15:40:35 Diagnostics: (input) Plugin names and quantities found on idle packs:
Mar 30 15:40:35 ip-172-31-0-130 hekad: 2015/03/30 15:40:35 Diagnostics: 	KafkaOutput: 32
Mar 30 15:40:35 ip-172-31-0-130 hekad: 2015/03/30 15:40:35 Diagnostics: 	LandfillOutput: 16

Mar 30 15:43:31 ip-172-31-0-130 hekad: 2015/03/30 15:43:31 Plugin 'KafkaOutput' error: kafka: Client has run out of available brokers to talk to. Is your cluster reachable?
Mar 30 15:43:31 ip-172-31-0-130 hekad: 2015/03/30 15:43:31 Plugin 'KafkaOutput' error: kafka: Dropped 1 messages: read tcp 172.31.10.45:6667: connection reset by peer
Mar 30 15:43:31 ip-172-31-0-130 hekad: 2015/03/30 15:43:31 Plugin 'KafkaOutput' error: kafka: Dropped 1 messages: EOF

It doesn't seem to retry those packs, so maybe that's in the ballpark...
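
For reference, a hypothetical sketch of what retrying those packs could look like at the producer level, assuming sarama's AsyncProducer sits underneath; the function and channel wiring below are illustrative, not Heka's actual KafkaOutput code:

// Package kafkaretry sketches feeding failed deliveries back into sarama's
// AsyncProducer input instead of dropping them, so the "Dropped N messages"
// counts above would instead become retries once the brokers return.
package kafkaretry

import (
	"log"

	"github.com/Shopify/sarama"
)

// RetryFailedDeliveries loops over the producer's error channel and
// re-enqueues each failed message. Naive on purpose: a real fix would cap
// retry attempts or spill to disk while the brokers stay unreachable.
// Requires cfg.Producer.Return.Errors = true (sarama's default).
func RetryFailedDeliveries(p sarama.AsyncProducer, stop <-chan struct{}) {
	for {
		select {
		case perr, ok := <-p.Errors():
			if !ok {
				return // producer was closed
			}
			log.Printf("delivery failed, re-enqueueing: %v", perr.Err)
			p.Input() <- perr.Msg
		case <-stop:
			return
		}
	}
}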
"Plugin 'KafkaOutput' error: kafka: Dropped 1 messages: EOF" is more or less expected in this failure scenario presently. Rob is working presently on generic queueing in heka that should hopefully fix this. We're concerned here with the lack of graceful recovery after the issues were resolved.

The network failures so far have been induced with iptables rules. Stuart's going to try using AWS security groups for the next test, keeping edge->kafka connectivity but removing edge->zookeeper.
Assignee: nobody → whd
Priority: P1 → P2
Priority: P2 → P3
Assignee: whd → nobody
Assignee: nobody → sphilp
Wes and I homed in on the issue here, but he's going to throw away the staging stack so we can try reproducing on a clean cluster. We should be able to test tomorrow to confirm, once Wes has the new stack up.
Flags: needinfo?(whd)
The new staging stack is up, a bit later than projected.
Flags: needinfo?(whd)
Came back to this this afternoon. Wes and I reproduced the scenario and were able to verify that everything fixed itself after the outage, although it took about 25 minutes to come back online (Wes will look into this separately).

The data received during the outage was placed in the landfill, and when things came back online the pipeline continued processing incoming data as normal. The landfill could be processed at a later time, but it doesn't look like there was any data loss, as long as the edge servers can continue to write to disk.

Everything looks okay here; marking this fixed.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard