Closed Bug 538206 Opened 15 years ago Closed 15 years ago

Add code to crash-report web heads to emit raw crash data to Hadoop cluster in addition to NFS

Categories

(Socorro :: General, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dre, Assigned: lars)

References

Details

Attachments

(4 files, 1 obsolete file)

We can't keep up with processing if we hammer the NFS server by scanning directories. The best way to solve this is to have the web heads write each crash report to the Hadoop cluster as soon as it is received. For the immediate future, this would be done in addition to the existing storage in NFS. Eventually, once we are 100% integrated, this could become the only storage method. We have a fairly simple Python script that uses the Thrift API to interact with the cluster; I'll attach it as sample code for what would need to be done on your side. We'd like to get this first integration point in as soon as we can, ideally by the end of January, so that we can continue testing and development without further impact to your existing infrastructure.
Extract and run python crashreports.py --help
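For readers without the attachment handy, here is a minimal sketch of what a Thrift-based submission like crashreports.py might look like. This is an illustration, not the attached file: the table name "crash_reports", the column families "meta_data:json" / "raw_data:dump", and the generated gen-py module layout are all assumptions.

# Illustrative only: write one crash report to HBase over Thrift.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase            # generated Thrift bindings (assumed layout)
from hbase.ttypes import Mutation

def store_crash(host, port, ooid, json_text, dump_bytes):
    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()
    try:
        mutations = [
            Mutation(column="meta_data:json", value=json_text),
            Mutation(column="raw_data:dump", value=dump_bytes),
        ]
        # One row per crash, keyed by the ooid; table name is a placeholder.
        client.mutateRow("crash_reports", ooid, mutations)
    finally:
        transport.close()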
Target Milestone: 1.3 → 1.4
Assignee: nobody → lars
Questions: 1) Are the 'connections' based on HBaseConnection long-lived and reusable? Are they multithread-safe? My instinct is to cache one per thread and reuse it within that thread rather than pooling them and sharing between threads. Is establishing a connection a fast or slow process? 2) It looks like the 'create_ooid' methods are used to create new entries?
btw, the generated code Hbase.py is an utter abomination and makes the infant forms of all the world's great prophets cry.
I asked around a bit and have heard that lots of people will connect, process, and disconnect all in the context of a single web request, so I imagine it should be lightweight enough to do that. I haven't heard anything about thread safety, so I'll do a bit more research there. Yes, create_ooid and create_ooid_from_files are the two methods we've provided for you to create new entries. Poorly named, I'm sorry. :/ Feel free to rename all the methods in this thing to names that are consistent with your code. Nothing has to be locked down in the API yet.
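A rough sketch of the "cache one connection per thread" idea discussed above, using Python's threading.local. The connection_factory argument stands in for constructing the project's HBaseConnection wrapper; all names here are assumptions, not the shipped code.

import threading

_local = threading.local()

def get_connection(connection_factory):
    # Reuse a connection already opened by the calling thread, if any;
    # otherwise open one via the supplied factory and cache it.
    conn = getattr(_local, "hbase_connection", None)
    if conn is None:
        conn = connection_factory()
        _local.hbase_connection = conn
    return conn

# usage (hypothetical): conn = get_connection(lambda: HBaseConnection(host, port))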
I've a bifurcated version of the collector ready for testing...
(In reply to comment #5) To review or pick up the code, see: http://code.google.com/p/socorro/source/detail?r=1702
Pushing to 1.5. Waiting on Hadoop VM.
Target Milestone: 1.4 → 1.5
Shoot. I forgot that my last comment regarding this was an e-mail to Lars rather than on this bug. Trying to get a usable VM is taking some time. I was hoping that for the moment, we could test the code by pointing it at the staging Hadoop cluster. I can give connection details immediately if you guys can plug them in.
You can email me the details. We're working on shipping 1.4 Thursday and then I will focus on this. Thanks.
Note: we want a throttle in the collector code for how much gets sent to Hadoop.
Assignee: lars → ozten.bugs
Blocks: 464775
Blocks: 439679
Update: this is set up on khan and we've submitted a few thousand crashes and verified that they are getting stored. The processor spends on average 40% of its time writing to HBase. Very little time is spent getting a connection, and connections are reused in mod_python, so all the overhead is in the single call to write the JSON and dump data. We need retry logic for establishing connections and for calls to mutateRow that fail. I think we should stage the current version of the code, but I'm going to explore pulling the HBase writes out of the collector, as 40% almost doubles the amount of time it takes to process a request. The attached CSV files show the number of seconds it takes to 1) write to disk, 2) write to HBase, and 3) process a request from start to end (total time). Requests were against cm-hadoop09 and not thrift-hadoop; khan has poor performance and there is probably poor network connectivity between them. We can check stage performance to see how different it is from this dev test.
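As a rough illustration of the retry logic mentioned here (for connection setup and mutateRow calls), something along these lines could work. The exception caught and the backoff policy are assumptions, not the code that shipped.

import time
from thrift.transport.TTransport import TTransportException

def with_retries(operation, attempts=3, delay=1.0):
    # Run operation(); on a transport failure, retry with linear backoff.
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TTransportException:
            if attempt == attempts:
                raise                      # out of retries: let the caller decide
            time.sleep(delay * attempt)    # back off a little more each time

# usage (names from the earlier sketch, hypothetical):
# with_retries(lambda: client.mutateRow("crash_reports", ooid, mutations))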
40% sounds like a lot, but looking at your CSV, the average time to write a report is still only 0.07 seconds, right? That doesn't sound really bad. Also, does the write-to-disk section really take less than one millisecond? I'm surprised that writing to NFS is that fast.
(In reply to comment #13) No NFS writes, it's to khan's disk.
Ah. That sounds like it ought to be re-run in staging then. The staging setup is NFS like production, right?
Staging is finally free. Merged Lars's original code to trunk (r1761). Staging request made in bug 544912.
Depends on: 544912
@aravind, can you confirm comment #15?
On staging, it writes to local disk. However, we have a pretty fast NFS filer in prod, so writing to NFS isn't (or shouldn't be) a factor for us.
Patch is the transmitter side of a proposed change to trunk. Design review: http://code.google.com/p/socorro/wiki/SocorroTransmitter For brevity, changes to the Collector are not included, but they are minimal and described in SocorroTransmitter. (Of course, I will provide a full patch once we get further along.) Unit tests need to be fuller and integration tests written. Looking for design and code review. Thanks. Will point @aravind to this comment for operations/design review also.
Attachment #426404 - Flags: review?(ted.mielczarek)
Attachment #426404 - Flags: review?(tbuckner)
Attachment #426404 - Flags: review?(lars)
Attachment #426404 - Flags: review?(griswolf)
Per the doc linked, we want to decouple the hbase side of things from the collector. Have we verified that hbase is really too slow for this job and we simply cannot rely on it? It seems like it currently runs on a pool of 16 servers; surely, they ought to be able to keep up!

Also, the proposed design looks/feels overly complicated. We are moving things from the nfs store onto the webheads now, this will mean that we have to run these sweeper processes on multiple webheads. We will add to the load on the webheads, and add more variables into the mix.

Maybe, I don't understand the problem correctly here, so feel free to correct me. Why don't we begin by shoving data into hbase directly, and don't store anything on the NFS store. If we find that hbase is down (or isn't able to keep up), we fall back to writing them to NFS like we currently do. We can change (or add a new script) the monitor to sweep the nfs store like it currently does and push these dump/json pairs to the hbase store instead of keeping track of them in the db. The monitors would still be scanning the fs like it does now - only it now shoves these dump/json pairs into hbase and deletes them from the nfs store. In most cases, we would expect the NFS sweeps (by the monitor) to be empty. It would find stuff there only when hbase isn't able to keep up.

I feel like the proposed solution in comment 19 is pretty complicated and would involve some fancy coding and a ton of corner cases that would make it less robust. We already have a well debugged monitor and an engineered filesystem layout, I feel like we should (re)use that stuff.
I definitely think there are good questions brought up by Aravind, and we need to have a clear understanding of the requirements and design choices. Some points I'd like to make about the HBase Thrift layer:

1. We have heard some word of potential slowness in submitting, but I don't believe we yet have a good comparison benchmark that shows whether we are "too slow" and by how much. If we are falling behind, that needs to be taken care of by tuning pre-production first and, once we are satisfied with the per-node performance, by scaling the cluster as needed.

2. Right now, using the round-robin A records, we have a weak point in creating connections to HBase. A machine could be malfunctioning, or some of the machines might be under heavier load than the others. Having a real load balancer in place would help there.

3. In a production environment (excepting #2 above), being unable to connect to HBase should be as likely as being unable to connect to the NFS mount or to the PostgreSQL DB. We should think about that while designing failure recovery.

4. The most worrisome aspect of data storage at the moment is that there is currently a small chance of data loss in the following scenario:
   1. The collector submits a crash report to the cluster and the transaction is completed.
   2. Very shortly thereafter (a window of minutes), the node that received that crash report dies before the crash report is replicated out to the cluster.
   3. The data on the local disks of that node is not recoverable (i.e. the disk is fried).
If the node comes back up with the data on disk intact, there is no problem. The next release of Hadoop/HBase will support the distributed append log that is necessary to close this hole. That won't be for several months though, so we should be aware of this possibility even if it is rare, and we should plan accordingly.
Comment on attachment 426404 [details] [diff] [review]
First attempt at a transmitter to complement the Collector

comment #20, comment #21 apply.

testTransmitter.py:

self.rolledJournalFilenames ... has your own directory hard coded. You should use
testDir = os.path.dirname(__file__)
testDir = os.path.abspath(os.path.dirname(__file__))
to avoid that sort of thing.

nosetests can get shirty about removing logfiles before it is quite done with them. Better not to remove logfiles in a tearDown phase, just in case. Instead remove them during setUp().

transmitter.py:

line 53: That should not be a warn, but an info log call.
line 55: Should this sleep time be configurable?
line 56: After sleeping a maximum number of seconds or loops, you should log the problem as an error and quit.
line 106: This should be a debug logging call.
line 110: same. We expect both these behaviors from time to time.
line 207 more or less: Similar to line 56: After you try some number of times, give up. You could do it via raise Exception('too many retries') to fall into line 214.
line 214: should be "except Exception,x:"
line 215: should mention type(x) (line 216 does the rest)
line 235: Who writes these files? Could they send a signal when a file is ready? Sleep wait isn't actually too awful.
Attachment #426404 - Flags: review?(griswolf) → review-
(In reply to comment #20) Thanks @aravind, great feedback and questions.

> Have we verified that hbase is really too slow for this job and we simply
> cannot rely on it?

Sorry, I never stated that hbase was too slow, nor that hbase was unreliable. Two goals here:

1) The production Collector has no network dependencies. We've banished it from talking to Postgres. I want to preserve this property. It allows us to take much of Mozilla's infrastructure down and still continue to accept crashes without issue. In comment #21 Daniel makes a good point about the VIP or load balancer fronting Hadoop. What if we want to service it?

2) I have done some primitive perf testing. I'm not hugely concerned here, as I asked you whether doubling the collector webheads would be an issue if we decreased the number of crashes per second that a collector could process, and you had no concerns. So until we have reason for concern, we're good to go on performance.

> Also, the proposed design looks/feels overly complicated.

Let's make it simpler. How?

The transmitter in a nutshell:
* a daemon which reads text files
* transmits crashes from disk into hbase
* cleans up text files
* can be killed at any time
* cleans up after itself or a previous transmitter
* can be run in parallel
* can be deployed on any machine with text files and crashes

There is very little care and feeding required; I think I put too much detail into the wiki. It will be easy to monitor:
* how many rolled journals exist on the filesystem
* how many transmitter pids are alive
* how many CRITICAL log lines in the last 5 minutes

> We are moving things from the nfs store onto the webheads now, this will
> mean that we have to run these sweeper processes on multiple webheads.
> We will add to the load on the webheads,

Sorry, I didn't mean to imply that we move .dump and .json off of NFS and onto webheads. Other than logging, I don't think there is more load. This is where I need your help and feedback. A collector logs to a text file (journal). We roll the journal files every 5 minutes from the collector (webhead) and they get processed by transmitters (running on ? box) which need nfs access. The text file contains the UUIDs, and the transmitter reuses the dumpStorage logic for finding the crash. We could transmit these rolled journals anywhere for processing. It decouples the collector webheads.

> and add more variables into the mix.

A transmitter is a standalone process, so we can reason about and troubleshoot it more easily than if we layer it onto an existing system like the collector or monitor. I agree adding more variables into the mix isn't a good idea. We want simple processes that do only one thing and which have known requirements and limitations.

> Maybe, I don't understand the problem correctly here, so feel free to correct
> me. Why don't we begin by shoving data into hbase directly, and don't store
> anything on the NFS store. If we find that hbase is down
> (or isn't able to keep up), we fall back to writing them to NFS like
> we currently do.

This is a great point. We can do this. What are the elements of retrying crashes which were written to NFS? We need a list of crashes and a daemon or cron job to send them. The list of crashes could come from walking the file system or be read from a text file. If you want to parallelize the resubmission into HBase, then you need a coordination mechanism. Keeping HBase submission in the collector, the elements we've described mean either overloading the monitor with this task or creating another sub-system (transmitter).
We don't know what the next major phase of the monitor / processor will look like, but let's assume the monitor and NFS disappear. Does the transmitter design need to be modified? If we layer these responsibilities onto the monitor, won't we need to extract it or rewrite it?

> We can change (or add a new script) the monitor to sweep the nfs store
> like it currently does and push these dump/json pairs to the hbase store
> instead of keeping track of them in the db. The monitors would still be
> scanning the fs like it does now - only it now shoves these dump/json pairs
> into hbase and deletes them from the nfs store.

This is a good stepwise refinement of our current architecture. I can see this working once we've replaced our current system completely with HBase, but how does this work with our requirement of storing the crashes in NFS and HBase at the same time (the interim hybrid system)? We cannot delete the files to mark them as transferred, so how do we capture this information? We could use symlinks, but won't that require more FS code and increase inode counts? We could use an embedded database (SQLite), a simple message queue, or a journal (text file). I think we need a solution that will work now (while raw crashes are in both systems), and bonus points if it doesn't need to be modified if and when we switch over.

> I feel like the proposed solution in comment 19 is pretty complicated
> and would involve some fancy coding and a ton of corner cases that
> would make it less robust. We already have a well debugged monitor
> and an engineered filesystem layout, I feel like we should
> (re)use that stuff.

I agree and disagree. I agree that production code is better than new code. I disagree in that the monitor, processor, and collector are well designed and highly focused sub-systems, and adding new responsibilities to them will increase coupling and reduce cohesion.
(In reply to comment #22) @griswolf thanks. I'll update the logging and unit tests.

> line 55: Should this sleep time be configurable?

Ya, I'll change that.

> line 56: After sleeping a maximum number of seconds or loops,
> you should log the problem as an error and quit.

and

> line 207 more or less: Similar to line 56: After you try
> some number of times, give up. You could do it via raise
> Exception('too many retries') to fall into line 214.

I struggled with this. If we do exit, how does the transmitter get restarted? The current design is a daemon so it has an infinite loop.

line 207 the intent is that we are here only because of networking errors. Any other type of error will cause the entry to be marked as FAIL and continue and not have a retry/loop.

If we give up after N tries and return to 214, (with the current design) we will lose that entry and assuming the network issue is still present go through the same N tries and lose the next entry, etc. Is there something better to do than loop? A benefit of the CRITICAL entry is we can use "checklog.pl" or some other Nagios alarm on it as an indicator that the network between transmitter and HBase is down.

> line 214: should be "except Exception,x:"
> line 215: should mention type(x) (line 216 does the rest)

Good call, I didn't know this pattern. Cool.

> line 235: Who writes these files? Could they send a signal when a file is ready? Sleep wait isn't actually too awful.

So the modpython-collector is writing the files. It probably lives on some other box. If we have to support real-time transmission of crashes, then we'll need something like this, but I think sleep is okay for now.

I'll make these changes, thanks again.
(In reply to comment #24)
> (In reply to comment #22)
> > line 207 more or less: Similar to line 56: After you try
> > some number of times, give up. You could do it via raise
> > Exception('too many retries') to fall into line 214.
>
> I struggled with this. If we do exit, how does the transmitter get restarted?
> The current design is a daemon so it has an infinite loop.
>
> line 207 the intent is that we are here only because of networking errors. Any
> other type of error will cause the entry to be marked as FAIL and continue and
> not have a retry/loop.
>
> If we give up after N tries and return to 214, (with the current design) we
> will lose that entry and assuming the network issue is still present go through
> the same N tries and lose the next entry, etc. Is there something better to do
> than loop? A benefit of the CRITICAL entry is we can use "checklog.pl" or some
> other Nagios alarm on it as an indicator that the network between transmitter
> and HBase is down.

Logging: You really don't want to log every success or step except at debug: you can fill a 1000000 character logfile in a few seconds (experience speaking), which means interesting stuff gets rolled out of sight in a few minutes.

Daemon process versus failures: At some point, if things are all poo, you want your daemon to give up. A non-running daemon is as easy or easier to find with nagios as a log entry, and a running one can lose data or otherwise muddy the water. From other comments, I thought this was going to be a short-run process (every 5 minutes?)... letting cron be the daemon, in essence, in which case dying from troubles is not a big deal: wait for the next time around and try again.

I'll look at the program flow again and see if I can find a nice(er) path through error conditions.
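A minimal sketch of the pattern being debated here: retry a bounded number of times, then raise so the process exits and cron or Nagios notices. submit_one and the limits are placeholders, not the actual transmitter code.

import time

def submit_with_limit(submit_one, max_tries=5, sleep_seconds=30):
    # Try the HBase submission a bounded number of times, then give up
    # loudly instead of looping forever.
    for attempt in range(1, max_tries + 1):
        try:
            submit_one()
            return
        except Exception:
            if attempt == max_tries:
                break
            time.sleep(sleep_seconds)
    raise Exception('too many retries')   # falls into the caller's except block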
I've found a bug and am not able to use TimedRotatingFileHandler for rolling over the journal files. @aravind, oremj - What would be the best way to handle these journals? http://code.google.com/p/socorro/wiki/SocorroTransmitter#Collector_Details The question is described in the Journal Rollover Details section.
One option is ordinary RotatingFileHandler which rotates when it gets too big. In general when it overfills, it works backwards renaming basename.N to basename.N+1 (and basename to basename.1), then begins filling the new basename. You can set a limit on the number of such files (50 seems common in our code). This technique is used by modpython-collector
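For reference, a size-based rotation as described above would be configured roughly like this; the path and limits are examples only.

import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "/var/log/socorro/collector-journal.log",  # example path
    maxBytes=10 * 1024 * 1024,   # rotate when the file reaches ~10 MB
    backupCount=50,              # keep journal.log.1 ... journal.log.50
)
journal = logging.getLogger("collectorJournal")
journal.setLevel(logging.INFO)
journal.addHandler(handler)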
We still need input from Aravind on comment #26, but I've made the log handling a simple FileHandler that could be rolled via logroller or another system tool outside of Python. This patch incorporates the changes from the patch 1 code review, as well as adding the minor changes to the collector, unit tests, and an integration test.
Attachment #426404 - Attachment is obsolete: true
Attachment #427139 - Flags: review?(ted.mielczarek)
Attachment #427139 - Flags: review?(tbuckner)
Attachment #427139 - Flags: review?(lars)
Attachment #427139 - Flags: review?(griswolf)
Attachment #426404 - Flags: review?(ted.mielczarek)
Attachment #426404 - Flags: review?(tbuckner)
Attachment #426404 - Flags: review?(lars)
(In reply to comment #28)
> We still need input from Aravind on comment #26

I don't understand this design well enough to comment on what's needed.
(In reply to comment #29) I'm happy to add more detail to http://code.google.com/p/socorro/wiki/SocorroTransmitter#Collector_Details What's missing?
(In reply to comment #28) The patch doesn't apply cleanly, and I'm not sure how to generate a cleaner one, so here are the steps for applying this patch. When prompted, answer 'socorro/transmitter/hbaseClient.py':

Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: socorro/transmitter/__init__.py
|===================================================================
|Index: socorro/transmitter/hbaseClient.py
|===================================================================
|--- socorro/transmitter/hbaseClient.py  (revision 0)
|+++ socorro/transmitter/hbaseClient.py  (working copy)
--------------------------
File to patch: socorro/transmitter/hbaseClient.py
socorro/transmitter/hbaseClient.py: No such file or directory

You will also need 'socorro/transmitter/hbaseClient.py'. You can copy it from khan: /home/aking/breakpad/socorro-hdfs/socorro/transmitter/hbaseClient.py or I'll attach it to this bug, also.
Attaching the original socorro/transmitter/hbaseClient.py
(In reply to comment #30)
> I'm happy to add more detail to
> http://code.google.com/p/socorro/wiki/SocorroTransmitter#Collector_Details
>
> What's missing?

I guess I specifically don't understand some of the details. If I understand the proposed design correctly, you are having the webheads log stuff to some journal files? and then a sweeper program runs on each webhead and submits data to hbase. Is that correct?

Before I can comment on #26 - do all the collectors on a single box write to the same log file? If so, I am assuming you are somehow accounting for multiple collectors wanting to log to the same file? You could shove this off to syslog - but syslog does not make guarantees of the order in which stuff comes in - and sometimes it can drop data. It does this to ensure that it does not become the bottleneck. We frequently run it in this mode on the webheads. It can also run in a mode where it can be asked to run more correctly, but that would need a dedicated remote syslog host, etc. What we pick depends on how critical these logs are?

Also, it appears that the primary motivation for this design is that the previous collector didn't depend on external factors, and so this one shouldn't either. The previous design worked the way it did for a few reasons; we chose to keep it independent because we wanted the flexibility to re-architect the backend without impacting the collector. With HBase, we don't have that many options to change in the backend. I still say we should revisit the design and have it shove data directly into hbase and fall back to nfs if things don't work there. Also, you asked about how we would transition this into the new system; we could have the collector send some X% of crashes into hbase and ramp it up from there, or some other combination of things like that.
(In reply to comment #33) Cool, thanks Aravind.

> I guess I specifically don't understand some of the details.
> If I understand
> the proposed design correctly, you are having the webheads log
> stuff to some journal files? and then a sweeper program
> runs on each webhead and submits data to hbase. Is that correct?

Yes, webheads log to journal files. No, a sweeper runs wherever you want to deploy it.* Yes, the sweeper submits data to hbase. Occasionally the journal logs are rolled and given a date extension. These rolled logs are the inputs to the transmitter.

* Requirements for the transmitter deployment environment:
- rolled journal logs (read/write access to local disk)
- read access to NFS (dump, json)
- network connection to hbase

Having 1 log per webhead and rolling every 5 minutes might be a good way to break up the files, so that multiple transmitters could be run in parallel to submit the crashes.

> Before I can comment on #26 - do all the collectors on a single box
> write to the same log file? If so, I am assuming you are somehow
> accounting for multiple collectors wanting to log to the same file?

This is configurable and up to you. On a single webhead, I think FileHandler is up to the task of having N mod_python threads write to the same log, where N is the number of worker threads you have configured for mod_python.

> You could shove this off to syslog - but syslog does not make guarantees
> of the order in which stuff comes in - and sometimes it can drop data.
> It does this to ensure that it does not become the bottleneck. We
> frequently run it in this mode on the webheads. It can also run
> in a mode where it can be asked to run more correctly, but that would
> need a dedicated remote syslog host, etc. What we pick depends on
> how critical these logs are?

Maintaining order in the journal isn't important. Dropping data is unacceptable (in the long-term system; short term it might be acceptable). Yes, the journal log is critical; it is the main way to track a UUID which hasn't been submitted successfully to HBase/Hadoop.
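A sketch of the per-webhead journal logger described above: the logging module serializes writes from multiple threads within one process, so a plain FileHandler behind a module-level logger should suffice. The path and logger name are examples, not the project's configuration.

import logging

journal = logging.getLogger("crashJournal")
if not journal.handlers:
    journal.addHandler(logging.FileHandler("/var/log/socorro/journal.log"))
    journal.setLevel(logging.INFO)

def record_received(ooid):
    # One UUID per line, for a transmitter to pick up later.
    journal.info(ooid)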
A couple of questions (because they aren't immediately obvious to me): I'm sensitive to any performance impact on the pm-app-generic web cluster. I like the idea of removing external dependencies, including NFS. Do we have a handle on performance metrics, specifically disk I/O? My current understanding is that that's hard to test in stage because the workload isn't as high as prod. Should we instead build out a separate collector "cluster"? The current platform is shared with a number of other production websites. I'm okay if this process affects itself, less so if it affects other sites. Even if this is short-term, perhaps a dedicated cluster is the right move?
I've said so outside this bug but will say it here to make it official. A lot of the contention is that the crash-report system shares infrastructure with the generic web cluster and there are competing demands on it. For the next release we should build out a 3-node cluster to handle this system. The HBase fallback should be to local disk. This also lets us build out another cluster in Phoenix that has no NFS dependencies. Yell if there's any opposition; otherwise add the code, make the write path configurable, and we'll figure out how and where to write it.
Assignee: ozten.bugs → lars
Here's my take on the design of this new system. For now, we need a collector that is two pronged: it writes to the standard NFS just as we've always done, and it writes to HBase. I've refactored the collector to segregate these two abilities so that in the future we can just eliminate the NFS code. As we move to HBase, we should add the following enhancements to the collector:

1) The HBaseConnection class should be robust and predictable: Bug 548355

2) The collector employing the HBaseConnection class should react to known failures and switch to saving crashes on the local disk. This local storage should use the same JsonDumpStorage system that we employ for NFS, since it is known to be robust, edge cases in sharing were addressed long ago, and it already contains methods that allow discovery of new entries. Bug 548359

3) A new process should be written that runs independently on each box that runs a collector. This process (probably cron) will watch the local JsonDumpStorage and attempt to submit any entries that it finds to HBase. Failure just means quitting and trying again later. Bug 548362

This system has several advantages over the previously proposed 'transmitter' system. It completely eliminates the dependency on NFS. By employing JsonDumpStorage on local storage, we eliminate the need for the transmission of streams of uuids via a logging mechanism.
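A condensed sketch of the fallback flow described above. create_ooid is named earlier in this bug and JsonDumpStorage is the existing class; newEntry, the argument list, and the blanket exception handling are placeholders rather than the final implementation.

def save_crash(hbase_connection, local_storage, ooid, json_doc, dump):
    # Try HBase first; on a known failure, park the crash in local
    # JsonDumpStorage so the cron process in item 3 can resubmit it later.
    try:
        hbase_connection.create_ooid(ooid, json_doc, dump)
    except Exception:
        # local_storage is a JsonDumpStorage instance on local disk;
        # newEntry is a placeholder method name for illustration.
        local_storage.newEntry(ooid, json_doc, dump)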
Target Milestone: 1.5 → 1.6
Why was this bumped to 1.6? I think this should be marked as fixed 1.5
I bumped all remaining open bugs in 1.5 to 1.6 after the Thursday release. If it's fixed, just change it to 1.5 and close it out.
Comment on attachment 427139 [details] [diff] [review] Second attempt at transmitter I think you've got enough other expertise here requested, re-request if you really need me.
Attachment #427139 - Flags: review?(ted.mielczarek)
Attachment #427139 - Flags: review?(tbuckner)
Attachment #427139 - Flags: review?(lars)
Attachment #427139 - Flags: review?(griswolf)
Attachment #427139 - Flags: review?
This is done, though some points in the discussion will come up again (comment #36 and comment #37).
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro