We're exploring Andy McKay's http://www.areciboapp.com/ for organizing our tracebacks. You probably already know there's an instance up on khan, but, if it turns out to be a useful thing, we'd eventually like a production-suitable instance as well. No action needed yet; just wanted to give you a heads-up. Cheers!
"our" being SUMO
This can almost certainly run on a VM in PHX, but there's some work to do before the app is easily deployable.
(In reply to comment #2)
> This can almost certainly run on a VM in PHX, but there's some work to do
> before the app is easily deployable.

+1

Also (I know nothing of this app right now): is this something that can work in multiple locations, i.e. when we *eventually* multi-home SUMO between data centers simultaneously? Will we be able to run an instance in each data center without causing split-brain in your traceback monitoring?
It does an HTTP POST (other options are available) of the data from a queue to the app. So it doesn't really matter which data center it lives in, as long as there's an HTTP route of some kind between them.
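The queue-then-POST flow might look something like this minimal sketch (the endpoint URL and report fields here are hypothetical, not Arecibo's actual wire format):

```python
# Sketch: errors are queued locally, then flushed to the collector over
# HTTP POST. Field names and the collector URL are illustrative only.
import queue
import urllib.parse
import urllib.request

error_queue = queue.Queue()

def record_error(status, url, traceback_text):
    """Queue an error report locally; nothing is sent yet."""
    error_queue.put({"status": status, "url": url, "traceback": traceback_text})

def flush(post=None):
    """Drain the queue, POSTing each report to the collector.

    `post` is injectable so the flow can be exercised without a network.
    """
    sent = []
    while not error_queue.empty():
        report = error_queue.get()
        body = urllib.parse.urlencode(report).encode("utf-8")
        if post is None:
            # Hypothetical collector URL; in practice this would be
            # whichever per-data-center Arecibo host is reachable.
            req = urllib.request.Request("http://arecibo.example.com/report/", data=body)
            urllib.request.urlopen(req)
        else:
            post(body)
        sent.append(report)
    return sent
```

Because the reports sit in a queue until flushed, a brief outage of the route to the collector just delays delivery rather than losing data.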
Cool. Next question then (and adding to the CC list): how can we scale this out to all of our sites and not just SUMO? A lot of people could benefit from a centralized error reporting interface (devs, qa, admins), but I'd rather it not be done on a one-off basis per site. Let's do this for all sites. And as we look to do this, is Arecibo the best tool for the job? Are there others to investigate?
I don't know of anything else off-the-shelf that does what Arecibo does, and quick Googling doesn't come up with anything; maybe Andy knows of its competitors? Scaling it out to multiple apps should be relatively simple. I've heard of successful large-scale tests on khan (200k errors/day) without issue, and the app has the concept of not just multiple apps but multiple installs per app built in. We could easily push dev, stage, prod, or prod-phx/prod-ams, etc., for multiple apps to the same Arecibo, as far as I can tell.
And Arecibo has clients for Python, Django itself, PHP, JS, and several other platforms and languages, so it could certainly eventually cover "all our sites".
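For a Django app, pointing at a shared instance is then mostly configuration. A rough sketch follows; the setting names here are illustrative only, so check the django-arecibo client for the exact names and middleware it expects:

```python
# settings.py (sketch): point the error-reporting client at the shared,
# per-data-center Arecibo endpoint. Both setting names are illustrative,
# not confirmed against the client's documentation.
ARECIBO_SERVER_URL = "http://arecibo1.dmz.phx1.mozilla.com/"
ARECIBO_ACCOUNT_NUMBER = "sumo-prod"  # hypothetical key identifying this app install
```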
The main competition includes:

django-sentry (https://github.com/dcramer/django-sentry), which is catching up to Arecibo in some areas and ahead in others. It can log more than just errors, mind.

There are plenty of hosted, non-open-source solutions: http://hoptoadapp.com/pages/home, https://errormator.com/, etc. There are also open source solutions on different stacks: http://code.google.com/p/elmah/.

I haven't done an extensive analysis of them all in a while.

Regarding performance, it's coped with 200k a day in the past, but I haven't tried it on MySQL etc. Something like sending all the 404s from some sites might be a challenge, but I'm not sure that's something we really want to be doing with this anyway.
Perhaps could add to the generic ganglia/graphite VMs in each data center?
Assignee: server-ops → cshields
(In reply to comment #9)
> Perhaps could add to the generic ganglia/graphite VMs in each data center?

Exactly.
(In reply to comment #9)
> Perhaps could add to the generic ganglia/graphite VMs in each data center?

At some point we should just name them "reporting" or something, heh ;) But that WFM.
Where are we at with this?
This is now up. There is an instance at each primary data center, at the following URLs:

https://arecibo-sjc.mozilla.org/
https://arecibo-phx.mozilla.org/

both of which require LDAP auth. There is a landing page directing users to those at http://arecibo.mozilla.org/ . The landing page also describes how to bypass LDAP auth from within the network via VPN, or if a webapp is talking directly to arecibo:

http://arecibo1.dmz.sjc1.mozilla.com/
http://arecibo1.dmz.phx1.mozilla.com/

I've done the entire thing with puppet, in a slightly hacky fashion using puppet exec statements to install the required dependencies, but doing it this way should make the entire system auto-update from github. Ping me on IRC if you want the main puppet manifests for inclusion in the github project for future users.
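For anyone reproducing this elsewhere, the auto-update piece boils down to something like the following sketch (a stand-in for what the puppet exec resources run; the path and repo URL are illustrative, not taken from the actual manifests):

```python
# Sketch of the update step a puppet exec resource might run on each
# puppet run: clone the Arecibo checkout the first time, fast-forward
# it on subsequent runs. Paths and repo URL are illustrative.
import os

def update_command(checkout_path, repo_url):
    """Return the git command line that refreshes the checkout."""
    if os.path.isdir(os.path.join(checkout_path, ".git")):
        # Existing checkout: fast-forward to the latest revision.
        return ["git", "-C", checkout_path, "pull", "--ff-only"]
    # First run: clone the project.
    return ["git", "clone", repo_url, checkout_path]
```

Running this from puppet on every agent run is what makes the deployment track github without a separate release step.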
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
This is awesome, thanks! Could you check celery on sjc (or show me how to check the logs)? I've sent in some sample data, but it's not showing up. This would happen if the celery or rabbitmq stack was having an issue.
I'm not sure how celery works. There is a rabbitmq server running, and its log from today has the following:

=INFO REPORT==== 12-Jul-2011::13:14:53 ===
accepted TCP connection on [::]:5672 from 127.0.0.1:44994
=INFO REPORT==== 12-Jul-2011::13:14:53 ===
starting TCP connection <0.15728.8> from 127.0.0.1:44994
=INFO REPORT==== 12-Jul-2011::13:14:53 ===
closing TCP connection <0.15728.8> from 127.0.0.1:44994
=INFO REPORT==== 12-Jul-2011::13:17:59 ===
accepted TCP connection on [::]:5672 from 127.0.0.1:45214
=INFO REPORT==== 12-Jul-2011::13:17:59 ===
starting TCP connection <0.15996.8> from 127.0.0.1:45214
=INFO REPORT==== 12-Jul-2011::13:17:59 ===
closing TCP connection <0.15996.8> from 127.0.0.1:45214
=INFO REPORT==== 12-Jul-2011::13:18:07 ===
accepted TCP connection on [::]:5672 from 127.0.0.1:45216
=INFO REPORT==== 12-Jul-2011::13:18:07 ===
starting TCP connection <0.16015.8> from 127.0.0.1:45216
=INFO REPORT==== 12-Jul-2011::13:18:07 ===
closing TCP connection <0.16015.8> from 127.0.0.1:45216

What other logs should I be looking for?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Jeremy, I think we're missing a celery daemon, and I didn't quite understand how to set that up. I think you have some puppet modules for it? Can you take a look?
Assignee: jdow → jeremy.orem+bugs
Celery is set up now.
Assignee: jeremy.orem+bugs → jdow
Working great, thanks
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago → 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard