Closed
Bug 1147100
Opened 9 years ago
Closed 9 years ago
Pipeline does not gracefully recover from network failures
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: whd, Assigned: sphilp)
Details
The following occurred on 2015-03-20 (from http://status.aws.amazon.com/):

7:23 AM PDT We are investigating networking connectivity issues for some instances in the US-WEST-2 Region.
7:41 AM PDT We continue to investigate network connectivity and DNS resolution issues affecting some instances within a VPC in the US-WEST-2 Region.
8:07 AM PDT We have resolved the network connectivity and DNS resolution issues for the majority of instances within a VPC in the US-WEST-2 Region.
8:22 AM PDT Between 7:03 AM and 7:49 AM PDT we experienced network connectivity and DNS resolution issues affecting some instances within a VPC in the US-WEST-2 Region. The issue has been resolved and the service is operating normally.

The data pipeline was not able to recover gracefully, and I had to manually restart all Heka processes. We should pinpoint the cause (which I suspect is in KafkaInput, HttpListenInput, or both) and get a code fix in.

On the edge nodes I saw file descriptor limits being hit:

Mar 20 15:58:46 ip-172-31-13-73 hekad: 2015/03/20 15:58:46 http: Accept error: accept tcp [::]:80: too many open files; retrying in 1s

And for Kafka input/output on all hekas:

Mar 20 14:32:00 ip-172-31-13-73 hekad: 2015/03/20 14:32:00 Plugin 'KafkaOutput' error: kafka server: Replica infomation not available, one or more brokers are down.

The latter is expected in a network failure scenario, but when network connectivity was restored, the Heka instances did not gracefully recover.
Updated•9 years ago
Priority: -- → P1
Assignee
Comment 1•9 years ago
Not able to reproduce yet on stage, but a few notes that may trigger something...

Dropping kafka server connections on ports 80 and 6667 results in kafka spitting out:

[2015-03-26 15:35:04,855] INFO Reconnect due to socket error: java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
[2015-03-26 15:35:33,420] WARN [ReplicaFetcherThread-1-2034221], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 1036205; ClientId: ReplicaFetcherThread-1-2034221; ReplicaId: 2040115; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [telemetry,0] -> PartitionFetchInfo(162983,16777216). Possible cause: java.nio.channels.ClosedChannelException (kafka.server.ReplicaFetcherThread)

So Kafka seems to handle retrying when the connection drops: https://issues.apache.org/jira/browse/KAFKA-1461

Heka doesn't seem to have an issue here; it keeps submitting results.

Moving to the edge servers, if they can't resolve DNS they throw:

hekad: 2015/03/26 16:42:03 Plugin 'KafkaOutput' error: kafka: Client has run out of available brokers to talk to. Is your cluster reachable?
hekad: 2015/03/26 16:42:06 Plugin 'KafkaOutput' error: kafka: Dropped 50 messages: kafka: broker: not connected

and

hekad: 2015/03/26 16:52:43 Plugin 'LandfillOutput' error: Error publishing /media/ephemeral0/split_out/http_raw/finalized/pipeline_prototype_incoming.stage.mozaws.net/20150326164738.850_ip-172-31-16-110 to s3://net-mozaws-stage-us-west-2-pipeline-data/landfill/pipeline_prototype_incoming.stage.mozaws.net/20150326164738.850_ip-172-31-16-110: Put https://s3-us-west-2.amazonaws.com/net-mozaws-stage-us-west-2-pipeline-data/landfill/pipeline_prototype_incoming.stage.mozaws.net/20150326164738.850_ip-172-31-16-110: dial tcp: lookup s3-us-west-2.amazonaws.com: no such host

Thinking of other ways to reproduce...
Assignee
Comment 2•9 years ago
Another idea was to try terminating the Kafka services while load is being applied to the edge servers:

Mar 30 15:36:02 ip-172-31-35-102 hekad: 2015/03/30 15:36:02 Plugin 'KafkaOutput' error: kafka: Dropped 1 messages: EOF
Mar 30 15:36:02 ip-172-31-35-102 hekad: 2015/03/30 15:36:02 Plugin 'KafkaOutput' error: kafka: Dropped 10 messages: read tcp 172.31.10.45:6667: connection reset by peer
Mar 30 15:36:02 ip-172-31-35-102 hekad: 2015/03/30 15:36:02 Plugin 'KafkaOutput' error: kafka: Dropped 61 messages: write tcp 172.31.10.45:6667: broken pipe
Mar 30 15:38:05 ip-172-31-35-102 hekad: 2015/03/30 15:38:05 Diagnostics: 13 packs have been idle more than 120 seconds.
Mar 30 15:38:05 ip-172-31-35-102 hekad: 2015/03/30 15:38:05 Diagnostics: (input) Plugin names and quantities found on idle packs:
Mar 30 15:38:05 ip-172-31-35-102 hekad: 2015/03/30 15:38:05 Diagnostics: KafkaOutput: 13

It then continues to spit out:

Mar 30 15:40:35 ip-172-31-0-130 hekad: 2015/03/30 15:40:35 Diagnostics: 48 packs have been idle more than 120 seconds.
Mar 30 15:40:35 ip-172-31-0-130 hekad: 2015/03/30 15:40:35 Diagnostics: (input) Plugin names and quantities found on idle packs:
Mar 30 15:40:35 ip-172-31-0-130 hekad: 2015/03/30 15:40:35 Diagnostics: KafkaOutput: 32
Mar 30 15:40:35 ip-172-31-0-130 hekad: 2015/03/30 15:40:35 Diagnostics: LandfillOutput: 16
Mar 30 15:43:31 ip-172-31-0-130 hekad: 2015/03/30 15:43:31 Plugin 'KafkaOutput' error: kafka: Client has run out of available brokers to talk to. Is your cluster reachable?
Mar 30 15:43:31 ip-172-31-0-130 hekad: 2015/03/30 15:43:31 Plugin 'KafkaOutput' error: kafka: Dropped 1 messages: read tcp 172.31.10.45:6667: connection reset by peer
Mar 30 15:43:31 ip-172-31-0-130 hekad: 2015/03/30 15:43:31 Plugin 'KafkaOutput' error: kafka: Dropped 1 messages: EOF

It doesn't seem to retry those packs, so maybe that's in the ballpark...
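The "idle packs" diagnostics show messages stuck without being retried after the brokers came back. A generic way to avoid permanently dropping a message is retry with exponential backoff; the following Go sketch is illustrative only (it is not Heka's actual delivery code, and `send`, `deliverWithBackoff`, and the retry parameters are made up for the example):

```go
package main

import (
	"fmt"
	"time"
)

// send simulates a delivery attempt that fails until the broker
// "recovers" on the third retry. A real implementation would call the
// Kafka producer here.
func send(attempt int) error {
	if attempt < 3 {
		return fmt.Errorf("broker: not connected")
	}
	return nil
}

// deliverWithBackoff retries a failed delivery with exponential backoff
// instead of dropping the message outright, giving up only after
// maxRetries attempts.
func deliverWithBackoff(maxRetries int) error {
	backoff := 10 * time.Millisecond
	for attempt := 0; ; attempt++ {
		err := send(attempt)
		if err == nil {
			return nil
		}
		if attempt >= maxRetries {
			return fmt.Errorf("giving up after %d attempts: %v", attempt+1, err)
		}
		time.Sleep(backoff)
		backoff *= 2 // double the wait between reconnect attempts
	}
}

func main() {
	if err := deliverWithBackoff(5); err != nil {
		fmt.Println("dropped:", err)
	} else {
		fmt.Println("delivered")
	}
}
```

With a scheme like this, packs held during an outage would drain once connectivity returns instead of sitting idle.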
Reporter
Comment 3•9 years ago
"Plugin 'KafkaOutput' error: kafka: Dropped 1 messages: EOF" is more or less expected in this failure scenario at present. Rob is working on generic queueing in heka that should hopefully fix that. What concerns us here is the lack of graceful recovery after the issues were resolved. The network failures induced so far have been simulated using iptables rules. Stuart's going to try using AWS security groups for the next test, keeping edge->kafka connectivity but removing edge->zookeeper.
Updated•9 years ago
Assignee: nobody → whd
Updated•9 years ago
Priority: P1 → P2
Updated•9 years ago
Priority: P2 → P3
Updated•9 years ago
Assignee: whd → nobody
Assignee
Updated•9 years ago
Assignee: nobody → sphilp
Assignee
Comment 4•9 years ago
Wes and I narrowed in on the issue here, but he's going to throw away the staging stack so we can try reproducing on a clean cluster. We should be able to test tomorrow to confirm, once Wes has the new stack up.
Flags: needinfo?(whd)
Reporter
Comment 5•9 years ago
The new staging stack is up, a bit later than projected.
Flags: needinfo?(whd)
Assignee
Comment 6•9 years ago
Came back to this this afternoon. Wes and I reproduced the scenario and were able to verify that everything fixed itself after the outage, although it took about 25 minutes to come back online (Wes will look into this separately). The data during the outage was placed in the landfill, and when things came back online the pipeline continued processing incoming data as normal. The landfill could be processed at a later time, but it doesn't look like there was any data loss, as long as the edge servers can continue to write to disk. Everything looks okay here, marking this fixed.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard