Bug 784106: Skunkworks server for processing Firefox+Flash hang reports
Status: RESOLVED FIXED
Opened 13 years ago, closed 13 years ago
Categories: Developer Services :: General (task)
Tracking: Not tracked
People: Reporter: benjamin, Assigned: fox2mike
Attachments: 1 file (1.63 KB, text/plain)
This is a request for a single server that will end up being temporarily used in a semi-production environment.
For background, see https://wiki.mozilla.org/Socorro/Hang_Processing_Proposal, a proposal that will allow Socorro to produce useful information from Flash hang reports. But that will require some significant rearrangement of Socorro and won't be done until Q4, and we need to get the useful data sooner. So what I'd like to do is run a one-off server which processes just the hang reports from the Firefox nightly and aurora channels, and decommission the server when Socorro grows the necessary functionality.
This server won't need the normal uptime guarantees of Socorro, and will be processing a relatively low volume of reports. So I think the server requirements are like this:
* 100G of storage (150k/minidump * 4 minidumps per hang * 1500 reports/day * 90 days of data * wiggle room)
* 2 cores
* public-facing HTTP collector frontend at hang-reports.mozilla.org (or similar DNS). After the experiment we will probably keep the DNS name and point it at the existing crash-reports.mozilla.com Socorro collector cluster.
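The storage figure in the first bullet can be checked with a quick calculation. This is only a sketch of the arithmetic: it assumes "150k" means 150 kB in decimal units, and the names `storage_bytes` and `to_gb` are made up for illustration.

```python
# Back-of-the-envelope storage estimate using the figures from the request.
# Assumption: "150k" means 150 kB with decimal units (1 kB = 1000 bytes).
MINIDUMP_BYTES = 150 * 1000
MINIDUMPS_PER_HANG = 4
REPORTS_PER_DAY = 1500
RETENTION_DAYS = 90


def storage_bytes(retention_days=RETENTION_DAYS):
    """Raw minidump storage needed for the given retention window."""
    return MINIDUMP_BYTES * MINIDUMPS_PER_HANG * REPORTS_PER_DAY * retention_days


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


if __name__ == "__main__":
    print("per day: %.1f GB" % to_gb(storage_bytes(1)))   # per day: 0.9 GB
    print("90 days: %.1f GB" % to_gb(storage_bytes()))    # 90 days: 81.0 GB
```

Under those assumptions the raw minidumps come to roughly 81 GB over 90 days, which is consistent with asking for 100G once wiggle room is added.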
Software:
* modern python
* java 1.6 or newer (the same version that is on gsgw1000.metrics.phx1 is fine)
* git
* hg
* svn
I think that with java and python I can install all the other requirements within a user account with CLASSPATH and PYTHONPATH.
cc'ing cshields and joes since I want to make sure I have the opsec/deployment involved from the getgo. The code isn't written yet, but it will be extremely simple so shouldn't be hard to get a sec review once it's running.
Assignee | ||
Updated•13 years ago
|
Assignee: server-ops → server-ops-devservices
Component: Server Operations → Server Operations: Developer Services
QA Contact: jdow → shyam
Assignee | ||
Comment 1•13 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #0)
> cc'ing cshields and joes since I want to make sure I have the
> opsec/deployment involved from the getgo. The code isn't written yet, but it
> will be extremely simple so shouldn't be hard to get a sec review once it's
> running.
Once opsec signs off, I can spin up something for you.
If the 100GB disk is a hard requirement, we might need to find some hardware. If we can live with 60-70GB, we'll just go with a SeaMicro Xeon node.
Assignee: server-ops-devservices → shyam
Reporter | ||
Comment 2•13 years ago
|
||
60-70G should be fine; I can shorten the retention window or trim older minidump storage as necessary.
I also forgot one important machine requirement: mounting the netapp breakpad in read-only mode (same as symbolpush.mozilla.org:/mnt/netapp/breakpad).
Comment 3•13 years ago
|
||
It sounds like this is moving along nicely, but if this ever needs prioritization help, please holler: solving Flash hangs is one of the most important things we'll do this year on desktop.
Reporter | ||
Comment 4•13 years ago
|
||
One thing that came up during discussions: the frontend will need to be HTTPS not HTTP; is that going to require a cert purchase, or will an existing wildcard cert be acceptable?
Assignee | ||
Comment 5•13 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> One thing that came up during discussions: the frontend will need to be
> HTTPS not HTTP; is that going to require a cert purchase, or will an
> existing wildcard cert be acceptable?
HTTPS is fine; I'll get you a cert for this domain (a wildcard is way more expensive and needs an infrasec ack).
What are hg/git/svn used for?
Reporter | ||
Comment 7•13 years ago
|
||
They're for fetching the various tools I'll need in order to install the app and its dependencies and to process the data (pig, jython, socorro, configman).
Reporter | ||
Comment 8•13 years ago
|
||
https://github.com/bsmedberg/skunky-hangprocessor is the initial-cut code. Currently the config is hardcoded, so I'm planning on adding configuration file support; otherwise it's pretty straightforward.
Is it possible to rpmify all the important applications and libraries which aren't currently available as rpm in our repos (like jython)?
This would allow us to track versions and be notified when the packages/libs have vulnerabilities.
Reporter | ||
Comment 10•13 years ago
|
||
There are two pieces to this: the actual collector, which is dirt-simple, and all the analysis scripts I need to run on the data, which require all the extra stuff.
As you can see, all the collector will use is web.py and configman: I don't know whether those are available as RPMs or what (I usually get them via easy_install), but I'm happy for those to be tracked/etc.
OTOH, all the rest of the data analysis bits require custom or trunk versions of things and aren't going to be exposed to the outside consumer, so I'd like to avoid as much overhead and process as possible for something that's supposed to be a temporary project.
blocking-basecamp: --- → -
:fox2mike, can you work on installing the rpm packages where we already have them, and building new ones where possible (i.e. for the items that don't need customization, such as easy_install'd/pip-installed stuff)?
Note also that any system that goes in production is "full production" for us; there is no "in the middle" state.
Other than that, as I'm told this is a priority, you have my ack.
Assignee | ||
Comment 12•13 years ago
|
||
:bsmedberg,
The host is online, would you like access to it to set stuff up? Or I can help with that too. Probably needs a netops hole for you to reach it via mpt-vpn. Host is hangprocessor1.webapp.scl3.mozilla.com and has 8 cores, 16GB RAM and a 70GB SSD on it.
Reporter | ||
Comment 13•13 years ago
|
||
Yes I'd love to start setting it up.
Assignee | ||
Comment 14•13 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #13)
> Yes I'd love to start setting it up.
Sure thing. I've filed a bug for netops and I'll add your account to the machine.
Can you please keep a wiki page/notes on what changes you make to the box, so that if this ever has to be "proper" production, I'll have all that's needed to make it happen?
Thanks!
Assignee | ||
Comment 15•13 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #13)
> Yes I'd love to start setting it up.
All yours.
[shyam@hangprocessor1.webapp.scl3 ~]$ id bsmedberg
uid=509(bsmedberg) gid=509(bsmedberg) groups=509(bsmedberg),100(users)
You have sudo as well, which will give you root on the box. I'll help you set up a "public" URL for this, as well as SSL, when you're done with your setup. As I mentioned in the previous comment, please document everything you do.
Thanks!
Assignee | ||
Updated•13 years ago
|
Assignee: shyam → benjamin
Assignee | ||
Comment 16•13 years ago
|
||
bsmedberg, let me know if you can't log in to hangprocessor1.webapp.phx1.mozilla.com; that's the new box, with the NFS mount on it as well.
Reporter | ||
Comment 17•13 years ago
|
||
ok, this is set up. The collector is running on port 8000. I'll attach the install/setup instructions shortly. Shyam, I believe this is ready for external HTTPS setup.
Currently both the collector and the processor are running as nohup commands under my account. For now I think this is fine, but if there are ways to make this more automatic using services or monit or puppet or something, I'd love to know the easiest way to make my scripts friendly for those types of systems.
If it's not too much trouble, it would be great to have some basic nagios alerts:
* check that the HTTP server is running on port 8000
* check that disk space isn't getting very low
Extra bonus:
* check if any of the symlinks in /home/bsmedberg/hangprocessor/queue/* have a ctime more than 20 minutes ago (checks whether the processor is down or seriously backlogged)
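The "extra bonus" queue check could be sketched as a small Nagios-style plugin along these lines. This is an illustration, not an existing check: the function names and messages are hypothetical, the 20-minute threshold and queue path come from the request above, and the return values follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
"""Nagios-style check: are any queue symlinks older than a threshold?"""
import os
import time

QUEUE_DIR = "/home/bsmedberg/hangprocessor/queue"  # path from the request
MAX_AGE_SECONDS = 20 * 60                          # 20 minutes, per the request


def stale_links(queue_dir, max_age, now=None):
    """Return names of symlinks in queue_dir whose ctime is older than max_age."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(queue_dir)):
        path = os.path.join(queue_dir, name)
        if not os.path.islink(path):
            continue
        # lstat() inspects the symlink itself rather than its target.
        if now - os.lstat(path).st_ctime > max_age:
            stale.append(name)
    return stale


def main(queue_dir=QUEUE_DIR, max_age=MAX_AGE_SECONDS):
    """Print a one-line status and return a Nagios-style exit code."""
    try:
        stale = stale_links(queue_dir, max_age)
    except OSError as exc:
        print("UNKNOWN: cannot read %s: %s" % (queue_dir, exc))
        return 3
    if stale:
        print("CRITICAL: %d queue entries older than %d minutes"
              % (len(stale), max_age // 60))
        return 2
    print("OK: no stale queue entries")
    return 0
```

To use it as a plugin, a wrapper script would call `sys.exit(main())` so that the monitoring system sees the exit code.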
Assignee: benjamin → shyam
Reporter | ||
Comment 18•13 years ago
|
||
Assignee | ||
Comment 19•13 years ago
|
||
What do we want to call this (publicly)? (So I can set up DNS and get the SSL certs.)
Reporter | ||
Comment 20•13 years ago
|
||
hang-reports.mozilla.org
Assignee | ||
Comment 21•13 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #20)
> hang-reports.mozilla.org
I've set up Zeus + SSL, and I think we still have some work left, since hitting https://hang-reports.mozilla.org now tells me:
This is not the droid you are looking for. Perhaps crash-stats.mozilla.com is what you wanted?HTTP/1.1 408 Request Timeout
Content-Length: 0
Content-Type: text/plain
:D
Reporter | ||
Comment 22•13 years ago
|
||
This is in fact almost the correct response. This isn't really a navigable website, it just accepts POSTs to https://hang-reports.mozilla.org/submit . I'll look into making sure that the Content-Length and Content-Type headers are not included there... I'm not sure why that's happening.
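For readers unfamiliar with this kind of endpoint, a submit-only collector can be sketched roughly as follows. This is not the skunky-hangprocessor code: it uses Python's standard `http.server` instead of web.py, and the class name, response bodies, and status codes are all assumptions made for illustration. Only the `/submit` path and port 8000 come from this bug.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class SubmitOnlyHandler(BaseHTTPRequestHandler):
    """Accepts POST /submit; everything else gets a short plain-text 404."""

    def _reply(self, status, body):
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        # Hypothetical message: the site is not meant to be browsed.
        self._reply(404, "Not a navigable site; POST reports to /submit\n")

    def do_POST(self):
        if self.path != "/submit":
            self._reply(404, "Unknown endpoint\n")
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        # A real collector would parse the form data and store the minidumps here.
        self._reply(200, "received %d bytes\n" % len(payload))

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet; a real service would log requests


def make_server(port=8000):
    """Build a server bound to localhost; call serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), SubmitOnlyHandler)
```

Running `make_server().serve_forever()` starts the sketch on port 8000; a GET to any path then gets the short 404 message, while a POST to `/submit` is acknowledged with the byte count.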
Assignee | ||
Comment 23•13 years ago
|
||
Oh alright. I was expecting a front-end, so I was probably in error there.
If things look fine, I'll go ahead and file bugs for monitoring etc and get some documentation filed on our end.
Also, when you say submit, who's submitting here? I hope this isn't going to accept submissions from end users, coz this is really just one machine and will go cuckoo if we send a lot of traffic to it :)
Reporter | ||
Comment 24•13 years ago
|
||
It is getting submissions from end users on nightly and aurora channels only: as noted in comment 0 I expect to get about 1500 submissions per day, which should be below any capacity limits for this machine.
Reporter | ||
Updated•13 years ago
|
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services