Closed Bug 1041263 Opened 10 years ago Closed 10 years ago

processing broken on stage

Categories

(Socorro :: Backend, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Unassigned)

Details

Lars, I think we broke hbase processing while getting Ceph processing working ;) 2014-07-19 20:00:55,918 ERROR - Thread-12 - Error in processing a job Traceback (most recent call last): File "/data/socorro/application/socorro/lib/threaded_task_manager.py", line 336, in run function(*args, **kwargs) # execute the task File "/data/socorro/application/socorro/processor/processor_app.py", line 166, in transform crash_id File "/data/socorro/application/socorro/external/crashstorage_base.py", line 465, in save_raw_and_processed crash_id File "/data/socorro/application/socorro/external/crashstorage_base.py", line 170, in save_raw_and_processed self.save_raw_crash(raw_crash, dumps, crash_id) File "/data/socorro/application/socorro/external/crashstorage_base.py", line 879, in save_raw_crash self.wrapped_crashstore.save_raw_crash(raw_crash, dumps, crash_id) File "/data/socorro/application/socorro/external/hb/crashstorage.py", line 253, in save_raw_crash return transaction() File "/data/socorro/application/socorro/external/hb/crashstorage.py", line 104, in <lambda> return lambda *args, **kwargs: self.transaction(lambda conn_ctx: f(conn_ctx.client, *args, **kwargs)) File "/data/socorro/application/socorro/database/transaction_executor.py", line 104, in __call__ result = function(connection, *args, **kwargs) File "/data/socorro/application/socorro/external/hb/crashstorage.py", line 104, in <lambda> return lambda *args, **kwargs: self.transaction(lambda conn_ctx: f(conn_ctx.client, *args, **kwargs)) File "/data/socorro/application/socorro/external/hb/crashstorage.py", line 166, in transaction for key, dump in dumps.iteritems(): AttributeError: 'NoneType' object has no attribute 'iteritems'
Lonnen, we should not release master to prod until this is either fixed or backed out
Flags: needinfo?(chris.lonnen)
Also, I think it is bad that neither our testing nor nagios noticed this... I first noticed in while testing processing with FS so I am surprised the integration test script didn't fail.
The plan -- back the changes out write a test to check for this in the future reland when the test is passing
Flags: needinfo?(chris.lonnen)
there's nothing to backout here. The problem is in the benchmarking code in a unfortunate interaction with HBase. At the end of processing a crash, we call a method called "save_raw_and_processed" rather than just "save_processed". Most crashstores make do with the base class version of this method which calls the "save_raw" and "save_processed" methods in turn. HBase, though has it's own "save_raw_and_processed" because the raw crash is already in HBase - we don't want to save it a second time (HBase doesn't really delete, we'd end up with wasted space with two copies of the raw crash in HBase). The Benchmarking code subverted HBase's self protection, by overriding HBase's "save_raw_and_processed" with the base class "save_raw_and_processed" and causing a second storage attempt at the raw_crash. Meanwhile, the processor knows too much and is being clever in a bad way. It knows that we're using the "save_raw_and_processed" instead of the simple "save_processed" that is used it the past. It also knows that HBase isn't going to save the raw crash. It knows that PG is only going to save the json part of the raw crash. So the too-clever-for-its-own-good processor doesn't send the dumps into the "save_raw_and_processed" call. With the BenchmarkingCrashStore by-passing HBase's dont-save-the-raw-crash intent, HBase gets a raw crash to store without the dumps. This causes HBase to lose its lunch... the solution: 1) implement "save_raw_and_processed" in the BenchmarkingCrashStore. 2) stop being clever in the processor and pass the dumps along with the raw crash into the "save_raw_and_processed" why wasn't any of this stuff caught? We don't IntegrationTest the benchmarking class. The problem will not appear in any other situation.
Actually, I take back my comment that the processor is being too clever for its own good. The reason the processor doesn't pass the dumps to the save method is that it doesn't have them. It order to pass them to the save routine, it would have to first fetch them from HBase. Since it knows that it won't be saving them back to HBase and PG doesn't want them, it doesn't go through the overhead of fetching them only to just throw them away.
(In reply to K Lars Lohn [:lars] [:klohn] from comment #4) > why wasn't any of this stuff caught? We don't IntegrationTest the > benchmarking class. The problem will not appear in any other situation. Ah! Thanks lars, I didn't think that enabling benchmarking on stage *caused* this, I assumed it was just a regression in the main code (which definitely should have been caught by the integration test). Lowering severity, we should fix but isn't as critical as I had thought when I saw stage failing to process. I do wish our monitoring could catch this, though.
Severity: blocker → normal
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → 95
You need to log in before you can comment on or make changes to this bug.