Investigate Root Cause Of developer3.webapp.scl3.mozilla.com not reporting to New Relic

Status

RESOLVED WORKSFORME
2 years ago
2 years ago

People

(Reporter: bensternthal, Assigned: rwatson)

Tracking

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3242])

(Reporter)

Description

2 years ago
On July 22 the server developer3.webapp.scl3.mozilla.com did not report to New Relic from 8:47 to 8:52 AM CDT (13:47 to 13:52 UTC), and an alert was raised.

An incident report was filed by John, and we would like to understand the cause of this failure.

https://docs.google.com/document/d/1PlRqtBVYaZg_FCGVeKUMOFY7ApT_ZVq8M2GopRBb_7A/edit

Updated

2 years ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3242]
Component: WebOps: Engagement → WebOps: Other

Updated

2 years ago
Assignee: server-ops-webops → rwatson
(Assignee)

Comment 1

2 years ago
I had a quick look through the doc and grepped through the logs myself, and I can't find any indication that the proxy is the cause.
I would ask what led you to that conclusion (other than the error in the doc), but I don't think this is worth much effort since it was a blip. We also had some New Relic key changes that week that might have caused it. Re-open or add more detail here if needed, but it looks like we haven't seen this since.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME
As stated in the incident report, the error in /var/log/newrelic/nrsysmond.log is:

2016-07-22 11:01:47.542 (7734) error: RPM cmd='metric_data' for 'Infrastructure' failed: Proxy CONNECT aborted due to timeout
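To pick these errors out of the log programmatically, something like the following sketch could split each line into its fields. The line layout (date, time with milliseconds, PID in parentheses, level, message) is an assumption inferred from the single line quoted above; other nrsysmond.log entries may be formatted differently. The function name parse_nrsysmond_line is hypothetical.

```python
import re
from datetime import datetime

# Layout inferred from the one quoted nrsysmond.log line (an assumption):
#   <YYYY-MM-DD> <HH:MM:SS.mmm> (<pid>) <level>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\((?P<pid>\d+)\) "
    r"(?P<level>\w+): "
    r"(?P<message>.*)$"
)


def parse_nrsysmond_line(line):
    """Hypothetical helper: split one log line into fields, or return None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "timestamp": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
        "pid": int(m.group("pid")),
        "level": m.group("level"),
        "message": m.group("message"),
    }


sample = ("2016-07-22 11:01:47.542 (7734) error: RPM cmd='metric_data' for "
          "'Infrastructure' failed: Proxy CONNECT aborted due to timeout")
entry = parse_nrsysmond_line(sample)
print(entry["level"], entry["pid"], "Proxy CONNECT" in entry["message"])
```

Lines that do not match the assumed layout return None, so unexpected formats are skipped rather than misparsed.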

This is a rolling log file, and today's file has no recent entries. On developer3 there were 91 Proxy CONNECT errors on 7/28, meaning roughly 6% of the once-a-minute reporting attempts failed (91 of 1,440).
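The failure-rate arithmetic can be reproduced with a short count over the log lines. This is a minimal sketch that assumes one reporting attempt per minute (1,440 per day) and that each line begins with a YYYY-MM-DD date, as in the quoted entry; the function name proxy_connect_failure_rates is hypothetical.

```python
from collections import Counter

# Assumes exactly one reporting attempt per minute, i.e. 1,440 per day.
MINUTES_PER_DAY = 24 * 60


def proxy_connect_failure_rates(lines, needle="Proxy CONNECT"):
    """Hypothetical helper: count lines containing `needle` per date and
    return {date: (count, count / attempts_per_day)}."""
    counts = Counter()
    for line in lines:
        if needle in line:
            counts[line[:10]] += 1  # first 10 characters are the YYYY-MM-DD date
    return {day: (n, n / MINUTES_PER_DAY) for day, n in counts.items()}


# Synthetic example: 91 matching lines on 2016-07-28.
lines = ["2016-07-28 00:00:00.000 (1) error: RPM cmd='metric_data' for "
         "'Infrastructure' failed: Proxy CONNECT aborted due to timeout"] * 91
rates = proxy_connect_failure_rates(lines)
count, rate = rates["2016-07-28"]
print(count, f"{rate:.1%}")  # 91 6.3%
```

Against the real log it could be called as `proxy_connect_failure_rates(open("/var/log/newrelic/nrsysmond.log"))`. Because the log rolls over, older days would need the rotated files read as well.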