659509 - Arecibo instance for production

Reporter

Description

•

13 years ago

We're exploring Andy McKay's http://www.areciboapp.com/ for organizing our tracebacks. You probably already know there's an instance up on khan, but, if it turns out to be a useful thing, we'd eventually like a production-suitable instance as well.

No action needed yet; just wanted to give you a heads-up. Cheers!

Erik Rose [:erik][:erikrose]

Reporter

Comment 1

•

13 years ago

"our" being SUMO

James Socol [:jsocol, :james]

Comment 2

•

13 years ago

This can almost certainly run on a VM in PHX, but there's some work to do before the app is easily deployable.

Corey Shields [:cshields]

Comment 3

•

13 years ago

(In reply to comment #2)
> This can almost certainly run on a VM in PHX, but there's some work to do
> before the app is easily deployable.

+1

Also (I know nothing of this app right now), is this something that can work in multiple locations, i.e., when we *eventually* multi-home SUMO between data centers simultaneously? Will we be able to run an instance in each data center without causing split-brain in your traceback monitoring?

Andy McKay

Comment 4

•

13 years ago

It does an HTTP POST (other options are available) of the data from a queue to the app. So it doesn't really matter which data center it lives in, as long as there's an HTTP route of some kind between them.
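
A minimal sketch of that flow, for illustration only: errors sit in a local queue and each one is sent to the collector as a form-encoded HTTP POST. The collector URL, the field names, and the helper names `build_post` and `drain` are all assumptions made up for this example; none of them is taken from Arecibo's actual client or API.

```python
from collections import deque
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical collector URL; the real endpoint path is not given in this thread.
COLLECTOR_URL = "http://arecibo.example.com/collect/"


def build_post(error, url=COLLECTOR_URL):
    """Build (but do not send) a form-encoded POST for one error record."""
    body = urlencode(error).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def drain(queue, send):
    """Pop every queued error and hand its POST request to `send`."""
    sent = 0
    while queue:
        send(build_post(queue.popleft()))
        sent += 1
    return sent


if __name__ == "__main__":
    # Field names here are illustrative, not a documented schema.
    pending = deque([{"status": "500", "url": "/kb/some-page", "msg": "boom"}])
    # Print instead of calling urllib.request.urlopen so the sketch runs offline.
    drain(pending, lambda req: print(req.get_method(), req.full_url, req.data))
```

Because the sender only needs an HTTP route to the collector, swapping `COLLECTOR_URL` is all it would take to point a site at an instance in another data center.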

Corey Shields [:cshields]

Comment 5

•

13 years ago

Cool. Next question, then (and adding to the CC list): how can we scale this out to all of our sites and not just SUMO? A lot of people could benefit from a centralized error reporting interface (devs, qa, admins), but I'd rather it not be done on a one-off basis per site. Let's do this for all sites.

And as we look to do this, is Arecibo the best tool for the job?  Are there others to investigate?

James Socol [:jsocol, :james]

Comment 6

•

13 years ago

I don't know of anything else off-the-shelf that does what Arecibo does, and quick Googling doesn't come up with anything--maybe Andy knows of its competitors?

Scaling it out to multiple apps should be relatively simple. I've heard of successful large-scale tests on khan (200k errors/day) without issue, and the app has the concept of not just apps but multiple app installs built-in. We could easily push dev, stage, prod, or prod-phx/prod-ams, etc, for multiple apps to the same Arecibo, as far as I can tell.

Erik Rose [:erik][:erikrose]

Reporter

Comment 7

•

13 years ago

And Arecibo has clients for Python, Django itself, PHP, JS, and several other platforms and languages, so it could certainly eventually cover "all our sites".

Andy McKay

Comment 8

•

13 years ago

The main competition includes: 

django-sentry (https://github.com/dcramer/django-sentry) - which is catching up to Arecibo in some areas and ahead in others. It can log more than just errors, mind.

There are plenty of hosted, non-open-source solutions: http://hoptoadapp.com/pages/home, https://errormator.com/ etc. There are also open source solutions in different stacks: http://code.google.com/p/elmah/.

I haven't done an extensive analysis of them all in a while.

Regarding performance, it's coped with 200k a day in the past, but I haven't tried it on MySQL etc. Something like sending all the 404s from some sites might just be a challenge, but I'm not sure that's something we really want to be doing with this anyway.

Justin Dow [:jabba]

Assignee

Comment 9

•

13 years ago

Perhaps could add to the generic ganglia/graphite VMs in each data center?

Assignee: server-ops → cshields

Corey Shields [:cshields]

Comment 10

•

13 years ago

(In reply to comment #9)
> Perhaps could add to the generic ganglia/graphite VMs in each data center?

exactly.

James Socol [:jsocol, :james]

Comment 11

•

13 years ago

(In reply to comment #9)
> Perhaps could add to the generic ganglia/graphite VMs in each data center?

At some point we should just name them "reporting" or something, heh ;)

But that WFM.

Dave Dash [:davedash, :dd] (assign all bugs to mbrandt)

Updated

•

13 years ago

Blocks: 658118

Corey Shields [:cshields]

Updated

•

13 years ago

Assignee: cshields → jdow

Justin Dow [:jabba]

Assignee

Updated

•

13 years ago

Severity: enhancement → normal

Dave Dash [:davedash, :dd] (assign all bugs to mbrandt)

Comment 12

•

13 years ago

Where are we at with this?

Justin Dow [:jabba]

Assignee

Comment 13

•

13 years ago

This is now up.

There is an instance at each primary data center, at the following URLs:

https://arecibo-sjc.mozilla.org/
https://arecibo-phx.mozilla.org/

both of which require LDAP auth.

There is a landing page directing users to those, at http://arecibo.mozilla.org/ .

The landing page also describes how to bypass LDAP auth, from within the network via VPN, or if a webapp is talking directly to arecibo:

http://arecibo1.dmz.sjc1.mozilla.com/ 
and
http://arecibo1.dmz.phx1.mozilla.com/
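
As a rough sketch of how a webapp might pick the internal, no-LDAP endpoint for its own data center: the two hostnames are the ones listed above, while the `"sjc"`/`"phx"` keys and the `endpoint_for` helper are assumptions for illustration, not part of the deployed setup.

```python
# Internal hostnames copied from the list above; the dictionary keys and the
# lookup helper are illustrative assumptions.
INTERNAL_ENDPOINTS = {
    "sjc": "http://arecibo1.dmz.sjc1.mozilla.com/",
    "phx": "http://arecibo1.dmz.phx1.mozilla.com/",
}


def endpoint_for(data_center):
    """Return the internal Arecibo URL for a data center name (case-insensitive)."""
    try:
        return INTERNAL_ENDPOINTS[data_center.lower()]
    except KeyError:
        raise ValueError(f"no Arecibo instance known for {data_center!r}") from None
```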

I've done the entire thing with puppet, in a slightly hacky fashion using puppet exec statements to install the required dependencies, but doing it this way should make the entire system auto-update from github. Ping me on IRC if you want the main puppet manifests for inclusion in the github project for future users.

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Andy McKay

Comment 14

•

13 years ago

This is awesome, thanks!

Could you check celery on sjc (or show me how to check the logs)? I've sent in some sample data, but it's not showing up. This would happen if the celery or rabbitmq stack was having an issue.

Justin Dow [:jabba]

Assignee

Comment 15

•

13 years ago

I'm not sure how celery works. There is a rabbitmq server running and its log from today has the following:

=INFO REPORT==== 12-Jul-2011::13:14:53 ===
accepted TCP connection on [::]:5672 from 127.0.0.1:44994

=INFO REPORT==== 12-Jul-2011::13:14:53 ===
starting TCP connection <0.15728.8> from 127.0.0.1:44994

=INFO REPORT==== 12-Jul-2011::13:14:53 ===
closing TCP connection <0.15728.8> from 127.0.0.1:44994

=INFO REPORT==== 12-Jul-2011::13:17:59 ===
accepted TCP connection on [::]:5672 from 127.0.0.1:45214

=INFO REPORT==== 12-Jul-2011::13:17:59 ===
starting TCP connection <0.15996.8> from 127.0.0.1:45214

=INFO REPORT==== 12-Jul-2011::13:17:59 ===
closing TCP connection <0.15996.8> from 127.0.0.1:45214

=INFO REPORT==== 12-Jul-2011::13:18:07 ===
accepted TCP connection on [::]:5672 from 127.0.0.1:45216

=INFO REPORT==== 12-Jul-2011::13:18:07 ===
starting TCP connection <0.16015.8> from 127.0.0.1:45216

=INFO REPORT==== 12-Jul-2011::13:18:07 ===
closing TCP connection <0.16015.8> from 127.0.0.1:45216


What other logs should I be looking for?

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Justin Dow [:jabba]

Assignee

Comment 16

•

13 years ago

Jeremy, I think we're missing a celery daemon, and I didn't quite understand how to set that up. I think you have some puppet modules for it? Can you take a look?

Assignee: jdow → jeremy.orem+bugs

Jeremy Orem [:oremj]

Comment 17

•

13 years ago

Celery is set up now.

Assignee: jeremy.orem+bugs → jdow

Andy McKay

Comment 18

•

13 years ago

Working great, thanks

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: mozilla.org → mozilla.org Graveyard

Categories

(mozilla.org Graveyard :: Server Operations, task)

Tracking

(Not tracked)

People

(Reporter: erik, Assigned: jabba)

Security

(public)
