monitoring the sync cluster

RESOLVED INVALID

Status

P1
normal
RESOLVED INVALID
7 years ago
2 months ago

People

(Reporter: tarek, Assigned: gozer)

Tracking

Details

(Whiteboard: devPreviewNonBlocker)

(Reporter)

Description

7 years ago
Here's what we need to make sure we have for the developer preview:

- monitor the heartbeat page for appsync, every 10 s
- monitor the heartbeat page for the node.js server, every 10 s
- monitor the heartbeat page for every HBase node, every 10 s
 
In case of a timeout or a non-200 response, send a mail to:

- <Bill's cell> (sent by email/in the ldap index) 
- appsync-has-no-bugs@googlegroups.com

If possible (not a priority), also notify on IRC in #openwebapps.
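The check requested above can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the Nagios configuration that was actually deployed. The function names (`http_status`, `check_heartbeat`, `run_checks`) and the `notify` callback are hypothetical; the 5-second timeout is an assumption, since the bug only specifies the 10-second polling interval.

```python
import urllib.error
import urllib.request


def http_status(url, timeout):
    """Return the HTTP status code for url (performs a real network call)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; report the code instead.
        return exc.code


def check_heartbeat(url, timeout=5.0, fetch=http_status):
    """Return (ok, detail); ok is True only for an HTTP 200 within the timeout."""
    try:
        status = fetch(url, timeout)
    except OSError as exc:
        # Covers timeouts, refused connections and DNS failures (URLError is an OSError).
        return False, "unreachable: %s" % exc
    if status != 200:
        return False, "HTTP %d" % status
    return True, "OK"


def run_checks(urls, notify, fetch=http_status, timeout=5.0):
    """Check every URL once; call notify(url, detail) for each failure."""
    failed = []
    for url in urls:
        ok, detail = check_heartbeat(url, timeout, fetch)
        if not ok:
            notify(url, detail)
            failed.append(url)
    return failed
```

Calling `run_checks` in a loop with `time.sleep(10)` between rounds would give the requested 10-second interval. The `notify` callback is where the mail (and optionally IRC) notification would go; how that is delivered is left open here.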

nice to have:
- CPU/memory/fd monitoring with an alarm on low resource
if there's a web page showing the status of these pieces, i'll put it up on my 2nd floor MV kiosk...
(Assignee)

Comment 2

7 years ago
(In reply to Tarek Ziadé (:tarek) from comment #0)
> Here's what we need to make sure we have for the developer preview:

Exact urls would be nice ;-)
 
> - monitor the heartbeat page for appsync, every 10 s

http://appsync-stage1.vm1.labs.sjc1.mozilla.com/__heartbeat__

> - monitor the heartbeat page for the node.js server, every 10 s

http://sauropod-stage1.vm1.labs.sjc1.mozilla.com:8001/__heartbeat__ ??

> - monitor the heartbeat page for every HBase node, every 10 s

??

> in case of a timeout or non-200, send a mail to :
> 
> - <Bill's cell> (sent by email/in the ldap index) 
> - appsync-has-no-bugs@googlegroups.com

Easy.

> If possible (not a priority) on IRC on #openwebapps

Will need to check with whoever is already running the Nagios IRC bots.

> nice to have:
> - CPU/memory/fd monitoring with an alarm on low resource

Will do.
Assignee: nobody → gozer
Status: NEW → ASSIGNED
(Reporter)

Comment 3

7 years ago
https://wiki.mozilla.org/Apps/ServerArchitecture will give you the URLs of all the nodes.

The URL is /__heartbeat__ for the two app servers; I don't know about the HBase server. CCing Ryan.
Priority: -- → P1
FYI, the sauropod __heartbeat__ page pings HBase and errors out if it's not reachable.
(Reporter)

Comment 5

7 years ago
I am not certain we want this. If possible, the HBase nodes should have their own heartbeat and the node.js one should be standalone, so we can tell from the monitoring which box is really down.
Fair enough.  The hbase rest server has a "cluster status" page which should be useful for this:

  http://appsync-hbase-stage1.vm1.labs.sjc1.mozilla.com:8080/status/cluster
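To illustrate why separate checks help, here is a small Python sketch of how two independent results could be combined to say which piece is down. The `diagnose` function and its messages are hypothetical, and it assumes (per the comment above) that the sauropod heartbeat currently fails whenever HBase is unreachable.

```python
def diagnose(node_ok, hbase_ok):
    """Map two independent check results to a human-readable diagnosis.

    node_ok:  result of the node.js (sauropod) heartbeat check
    hbase_ok: result of the HBase cluster status check
    """
    if node_ok and hbase_ok:
        return "all up"
    if hbase_ok:
        # HBase answers on its own, so the failure is in the node.js server.
        return "node.js server down"
    if node_ok:
        # The node.js server can still reach HBase; only the status page fails.
        return "HBase status check failing, node.js server still healthy"
    # Both fail: HBase is the likely root cause, since the node.js heartbeat
    # depends on it, but the node.js server may be down as well.
    return "HBase down, node.js heartbeat failing too"
```

With a single combined heartbeat, the last two cases could not be told apart from a plain node.js outage.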
(Assignee)

Comment 7

7 years ago
Checks are now in place and will send notifications to:

- <Bill's cell> (sent by email/in the ldap index) 
- appsync-has-no-bugs@googlegroups.com
- Gozer's cell
awesome!
Whiteboard: devPreviewNonBlocker
(Reporter)

Comment 9

7 years ago
I tried shutting down gunicorn and then nginx, and we did not receive any mail.

Did you get the SMS?
Blocks: 710342
No longer blocks: 700492
(Assignee)

Updated

7 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
The old app sync codebase is no longer going to be supported. All resolved fixed bugs are being marked as invalid, as they no longer apply to the new apps in the cloud service.
Resolution: FIXED → INVALID

Updated

2 months ago
Product: Web Apps → Web Apps Graveyard