Closed Bug 1161422 Opened 9 years ago Closed 9 years ago

[tracking] FHR incident (May 4th-5th, 2015)

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dgarvey, Assigned: Atoll)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1111] )

      No description provided.
Group: infra
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1111]
Depends on: 1161420
extremely brief tl;dr:

ssl cert expired; fixed; tuned zeus to handle 2x connection volume (300k/min); load average eventually recovered, likely due to conntrack shrink; stable until morning

detailed post-mortem during business hours
Assignee: server-ops-webops → rsoderberg
Component: WebOps: Product Delivery → WebOps: Other
Flags: needinfo?(rsoderberg)
See Also: → 1161369
Summary: zlb3.ops.scl3.mozilla.com:Load is WARNING & CRITICAL → [tracking] FHR incident (May 4th-5th, 2015)
Depends on: 1161893
# Summary

The SSL certificate for 'fhr.data.mozilla.com' expired at 18:48 yesterday, leading to a steady increase in client retry traffic until a renewed certificate was published. Various bugs have been filed to improve monitoring and performance of the components involved in this incident.

# Timeline (May 4th-5th; US/Pacific, -0700)

18:48 - Certificate expires.
19:35 - Bug filed by third party.
22:48 - Nagios load average warning for zlb3.ops.scl3.
22:55 - Alert observed on IRC by Webops.
22:58 - Nagios load critical warning for zlb3.
23:01 - Identified increase in hits per minute to zlb3.
23:06 - Ruled out hg/git/ftp/download as source of increase.
23:08 - Identified fhr as source of increase.
23:09 - Found expired SSL certificate.
23:16 - Replacement SSL certificate issued by Digicert.
23:17 - SSL certificate deployed to zlb3.
23:29 - Zeus still slow to respond at SSL negotiation phase.
23:36 - Identified performance issues with Zeus configuration for FHR worker pool.
00:03 - Completed three sets of alterations to Zeus worker configuration.
00:13 - Conntrack queue and load dropped to normal levels.

# Q&A

Q: How was the issue eventually detected?
A: Nagios reported a load alarm for one of the Zeus servers. Further investigation revealed a significant increase in hits per minute to that server, eventually traced to the fhr.data.mozilla.com VIP, whose certificate had expired.

Q: Why didn't Zeus bandwidth monitoring catch this?
A: Bytes sent to clients actually decreased slightly, because clients terminated their connections before any (encrypted) bytes could be sent in response to their POST requests. With no increase in bytes sent, the bandwidth alarms never fired.

Q: Why wasn't the SSL certificate expiration detected?
A: We monitor each SSL-enabled hostname individually, which has led to gaps in coverage as various hostnames were issued a certificate without corresponding monitoring. Global monitoring is being considered to detect impending expirations without requiring individual checks. A third party reported the issue to a component that is not monitored during our team's evening hours.
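The kind of per-hostname expiration check described above can be sketched as follows. This is an illustrative monitor, not the Nagios check we actually run; the function names (`cert_days_remaining`, `check_host`) and the 30-day warning threshold are assumptions. The date format is the one returned by Python's `ssl.SSLSocket.getpeercert()`.

```python
import socket
import ssl
from datetime import datetime, timezone

def cert_days_remaining(not_after, now=None):
    """Days until a certificate's notAfter timestamp expires.

    `not_after` uses the format returned by ssl.getpeercert(),
    e.g. "May  4 18:48:00 2015 GMT".
    """
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400

def check_host(hostname, port=443, warn_days=30.0):
    """Fetch the live certificate and return days remaining.

    Requires network access; warn_days is an assumed threshold.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    days = cert_days_remaining(cert["notAfter"])
    if days < warn_days:
        print(f"WARNING: {hostname} cert expires in {days:.1f} days")
    return days
```

A global variant would iterate `check_host` over every hostname found in the Zeus cluster configuration rather than relying on individually filed checks.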

Q: Could we have discovered the third party bug sooner?
A: Only by random chance. The Webops queues are monitored for 'blocker' issues, which automatically page MOC. The bug was not filed with 'blocker' priority, and thus did not page. Webops team timezones are such that, when the bug was filed, all team members (US and UK) were off work for the day.

Q: Is this normal behavior for FHR clients?
A: Yes. The client is designed to retry submissions on a randomized backoff interval. In situations such as this, even the best-case retry/backoff behavior leads to a 2-3x increase in connection attempts. We will increase our SSL negotiation capacity to handle this. FHR traffic is currently processed by a single load balancer; we will likely spread the load across multiple load balancers.
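A randomized backoff of the sort the client uses can be sketched with a full-jitter scheme. The constants here (base, cap, growth factor) are illustrative only; the actual FHR client's values and jitter strategy are not documented in this bug.

```python
import random

def backoff_delays(base=60.0, cap=3600.0, factor=2.0, attempts=6,
                   rng=random.random):
    """Yield randomized retry delays in seconds (full jitter).

    Each attempt waits a uniform random time in [0, delay), then the
    ceiling doubles up to `cap`. Constants are illustrative, not the
    real client's values.
    """
    delay = base
    for _ in range(attempts):
        yield rng() * delay           # full jitter: uniform in [0, delay)
        delay = min(cap, delay * factor)
```

Even with jitter, a fleet of clients that all start failing at the same instant (as when a certificate expires) re-converges on the server at a multiple of the normal rate, which matches the 2-3x increase observed here.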

Q: What were the Zeus / FHR worker misconfigurations?
A: Several settings allowed unbounded work to accumulate:
- We permitted an unlimited number of connections to each backend node; now capped.
- We permitted an unlimited queue of connections pending to each backend node; now capped, with expiration.
- We permitted multiple retries of a request in case of timeouts; no longer.
- We do not use HTTP keepalive to the backend workers, which remains an issue.
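The "capped queue with expiration" change can be illustrated with a small sketch. Zeus's actual implementation and limits are not public; the class name, the cap of 1000 pending connections, and the 10-second wait limit below are all assumptions chosen for illustration.

```python
import time
from collections import deque

class BoundedPendingQueue:
    """Sketch of a capped, expiring pending-connection queue.

    Illustrative only: limits and semantics are assumptions, not
    Zeus's actual behavior.
    """

    def __init__(self, max_pending=1000, max_wait=10.0, clock=time.monotonic):
        self.max_pending = max_pending
        self.max_wait = max_wait
        self.clock = clock
        self._queue = deque()  # entries are (enqueue_time, conn)

    def enqueue(self, conn):
        """Queue a connection, or reject it if the queue is full."""
        self._expire()
        if len(self._queue) >= self.max_pending:
            return False      # reject instead of queueing without bound
        self._queue.append((self.clock(), conn))
        return True

    def dequeue(self):
        """Hand the oldest still-valid connection to a backend worker."""
        self._expire()
        return self._queue.popleft()[1] if self._queue else None

    def _expire(self):
        """Drop connections that have waited longer than max_wait."""
        now = self.clock()
        while self._queue and now - self._queue[0][0] > self.max_wait:
            self._queue.popleft()
```

The design point is that rejecting or expiring excess work early keeps the load balancer responsive, instead of letting an unbounded backlog of stale requests consume connection-tracking and worker capacity.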

Q: Was conntrack part of the issue?
A: No, it was a reflection of the increase in traffic combined with the performance issues in our Zeus configuration for FHR workers. Once Zeus was tuned to pass requests to the backend nodes more efficiently, the conntrack table shrank from ~1000k connections to ~400k, and load average dropped from 26 to 10. This is approximately in line with previously observed behavior on the Zeus clusters.

# Bugs

1161420 - fhr.data.m.c cert expired (geotrust)
1161422 - [tracking] FHR incident (May 4th-5th, 2015)
1161423 - add SSL expiration monitoring for fhr.data.mozilla.com
1161875 - enable HTTP keepalive to FHR backend
1161890 - add SSL expiration log check for Zeus cluster audit logs
1161893 - FHR clients negative feedback loop triggered by SSL certificate expiration
1161894 - Increase fhr.data.mozilla.com to 3 traffic IPs
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Depends on: 1161894
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard