Closed Bug 1093757 (opened 10 years ago, closed 9 years ago)

Install a RabbitMQ monitoring plugin for New Relic on stage and prod

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fubar, Assigned: fubar)

ensure we're reporting data to newrelic for memcached, rabbitmq, etc., so devs can have more insight into the production environment, both for development and for supporting ops
newrelic plugin agent is also not actually collecting data from apache. the plugin config has port 80 hardcoded, but apache's on 8080.
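For anyone chasing a similar port mismatch, here is a rough sketch of checking which local ports actually accept TCP connections. It is not the plugin agent's own code; `port_is_open` is a made-up helper name, and the host and ports are simply the values mentioned in the comment above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection resolves the address and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed values from the comment above: the plugin config expects
    # port 80, while apache was listening on 8080.
    for port in (80, 8080):
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"localhost:{port} is {state}")
```

This only shows that something is listening on the port; it does not confirm that the status endpoint the agent polls is being served there.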
Priority: -- → P2
Blocks: 1059325
all staging hosts are now reporting to newrelic correctly (proxy acl was blocking outbound data). 

apache was also configured to listen on port 80 so that the agent could collect data.
We hit another situation today where two of the processors had stopped taking tasks (even though we hadn't deployed), resulting in these queue sizes:
log_parser      19702
log_parser_fail 375
log_parser_hp   16337

Having the queues in new relic would mean we could (presumably) set up email alerts, and so not have to wait until the sheriffs say "is there a problem with log parsing", by which time there is a 35000 job backlog - which takes a fair time to clear even after a |restart-jobs -p log|.
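As a stopgap illustration of the kind of check an alert would automate, here is a minimal sketch that parses `name count` lines like the ones pasted above and flags queues over a threshold. The assumptions are loud ones: the input format is guessed from the paste (the comment does not say which command produced it), the function names are invented for this example, and the 10000 threshold is purely hypothetical since no sensible value has been agreed.

```python
def parse_queue_depths(text):
    """Parse whitespace-separated 'name count' lines into a dict."""
    depths = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not 'name <integer>'.
        if len(parts) == 2 and parts[1].isdigit():
            depths[parts[0]] = int(parts[1])
    return depths


def queues_over(depths, threshold):
    """Return the sorted names of queues with more than `threshold` messages."""
    return sorted(name for name, count in depths.items() if count > threshold)


# Sample input copied from the comment above.
SAMPLE = """\
log_parser      19702
log_parser_fail 375
log_parser_hp   16337
"""

depths = parse_queue_depths(SAMPLE)
print("total backlog:", sum(depths.values()))   # 36414 for the sample
# Hypothetical threshold, chosen only for illustration.
print("over threshold:", queues_over(depths, 10000))
```

With the sample numbers this reports `log_parser` and `log_parser_hp` as over the threshold, which matches the roughly 35000-job backlog described above.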

Also - is it expected that everything other than the webapp nodes has "0 rpm" on https://rpm.newrelic.com/accounts/677903/applications/4180461 ? Is there any way we can get that to report the actual number of tasks handled per second?
OS: Mac OS X → All
Priority: P2 → P1
Hardware: x86 → All
Is this rabbitmq new relic plugin what we need?
https://rpm.newrelic.com/accounts/677903/plugins/directory/95
:edmorley the webapp nodes should have rpm == 0 for non-web transactions and rpm > 0 for web transactions.
The opposite is true for all the other nodes: rpm == 0 for web transactions and rpm > 0 for non-web transactions.
(In reply to Mauro Doglio [:mdoglio] from comment #6)
> :edmorley the webapp nodes should have rpm == 0 for non-web transactions and
> rpm > 0 for web transactions.
> The opposite is true for all the other nodes: rpm == 0 for web transactions
> and rpm > 0 for non-web transactions.

The table on https://rpm.newrelic.com/accounts/677903/applications/4180461 has 0 rpm for all nodes apart from webapp, so it seems like something needs tweaking.
Priority: P1 → P2
Please can we install either of these:
https://rpm.newrelic.com/accounts/677903/plugins/directory/25
https://rpm.newrelic.com/accounts/677903/plugins/directory/95

The former is what is used on the Mozilla General New Relic account:
https://rpm.newrelic.com/accounts/263620/plugins/11697

...so failing any other ideas, shall we go with that one?

Added bonus: once this is set up, we can set up alerts for message queue sizes that don't require access to Nagios (plus when the alerts _do_ fire, they'll link to the pretty graphs).
Summary: newrelic monitoring for memcache, rabbitmq, etc → Install a RabbitMQ monitoring plugin for New Relic on stage and prod
It's been installed but is apparently failing to connect:

ERROR      2015-03-10 19:22:41,395 27769  MainProcess     MainThread newrelic_plugin_agent.agent                   send_components           L235   : Error reporting stats: HTTPSConnectionPool(host='platform-api.newrelic.com', port=443): Max retries exceeded with url: /platform/v1/metrics (Caused by ProxyError('Cannot connect to proxy.', error('Tunnel connection failed: 403 Forbidden',)))

which is messed up because I can connect to that directly. newrelic has fast become my least favorite part of this project.
proxy fixed and rabbitmq is finally reporting.
Blocks: 1141993
That's great - thank you :-)

@sheriffs:
Check this page if you ever think tasks are getting behind:
https://rpm.newrelic.com/accounts/677903/dashboard/6293241/page/4

Have filed bug 1141993 for setting up new relic alerts once we know what sensible values are for the thresholds.
Assignee: nobody → klibby
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED