1041263 - processing broken on stage

Reporter

Description

•

10 years ago

Lars, I think we broke hbase processing while getting Ceph processing working ;) 2014-07-19 20:00:55,918 ERROR - Thread-12 - Error in processing a job Traceback (most recent call last): File "/data/socorro/application/socorro/lib/threaded_task_manager.py", line 336, in run function(*args, **kwargs) # execute the task File "/data/socorro/application/socorro/processor/processor_app.py", line 166, in transform crash_id File "/data/socorro/application/socorro/external/crashstorage_base.py", line 465, in save_raw_and_processed crash_id File "/data/socorro/application/socorro/external/crashstorage_base.py", line 170, in save_raw_and_processed self.save_raw_crash(raw_crash, dumps, crash_id) File "/data/socorro/application/socorro/external/crashstorage_base.py", line 879, in save_raw_crash self.wrapped_crashstore.save_raw_crash(raw_crash, dumps, crash_id) File "/data/socorro/application/socorro/external/hb/crashstorage.py", line 253, in save_raw_crash return transaction() File "/data/socorro/application/socorro/external/hb/crashstorage.py", line 104, in <lambda> return lambda *args, **kwargs: self.transaction(lambda conn_ctx: f(conn_ctx.client, *args, **kwargs)) File "/data/socorro/application/socorro/database/transaction_executor.py", line 104, in __call__ result = function(connection, *args, **kwargs) File "/data/socorro/application/socorro/external/hb/crashstorage.py", line 104, in <lambda> return lambda *args, **kwargs: self.transaction(lambda conn_ctx: f(conn_ctx.client, *args, **kwargs)) File "/data/socorro/application/socorro/external/hb/crashstorage.py", line 166, in transaction for key, dump in dumps.iteritems(): AttributeError: 'NoneType' object has no attribute 'iteritems'

Robert Helmer [:rhelmer]

Reporter

Comment 1

•

10 years ago

Lonnen, we should not release master to prod until this is either fixed or backed out

Flags: needinfo?(chris.lonnen)

Robert Helmer [:rhelmer]

Reporter

Comment 2

•

10 years ago

Also, I think it is bad that neither our testing nor nagios noticed this... I first noticed in while testing processing with FS so I am surprised the integration test script didn't fail.

Lonnen :lonnen

Comment 3

•

10 years ago

The plan -- back the changes out write a test to check for this in the future reland when the test is passing

Flags: needinfo?(chris.lonnen)

K Lars Lohn [:lars] [:klohn]

Comment 4

•

10 years ago

there's nothing to backout here. The problem is in the benchmarking code in a unfortunate interaction with HBase. At the end of processing a crash, we call a method called "save_raw_and_processed" rather than just "save_processed". Most crashstores make do with the base class version of this method which calls the "save_raw" and "save_processed" methods in turn. HBase, though has it's own "save_raw_and_processed" because the raw crash is already in HBase - we don't want to save it a second time (HBase doesn't really delete, we'd end up with wasted space with two copies of the raw crash in HBase). The Benchmarking code subverted HBase's self protection, by overriding HBase's "save_raw_and_processed" with the base class "save_raw_and_processed" and causing a second storage attempt at the raw_crash. Meanwhile, the processor knows too much and is being clever in a bad way. It knows that we're using the "save_raw_and_processed" instead of the simple "save_processed" that is used it the past. It also knows that HBase isn't going to save the raw crash. It knows that PG is only going to save the json part of the raw crash. So the too-clever-for-its-own-good processor doesn't send the dumps into the "save_raw_and_processed" call. With the BenchmarkingCrashStore by-passing HBase's dont-save-the-raw-crash intent, HBase gets a raw crash to store without the dumps. This causes HBase to lose its lunch... the solution: 1) implement "save_raw_and_processed" in the BenchmarkingCrashStore. 2) stop being clever in the processor and pass the dumps along with the raw crash into the "save_raw_and_processed" why wasn't any of this stuff caught? We don't IntegrationTest the benchmarking class. The problem will not appear in any other situation.

K Lars Lohn [:lars] [:klohn]

Comment 5

•

10 years ago

Actually, I take back my comment that the processor is being too clever for its own good. The reason the processor doesn't pass the dumps to the save method is that it doesn't have them. It order to pass them to the save routine, it would have to first fetch them from HBase. Since it knows that it won't be saving them back to HBase and PG doesn't want them, it doesn't go through the overhead of fetching them only to just throw them away.

Robert Helmer [:rhelmer]

Reporter

Comment 6

•

10 years ago

(In reply to K Lars Lohn [:lars] [:klohn] from comment #4) > why wasn't any of this stuff caught? We don't IntegrationTest the > benchmarking class. The problem will not appear in any other situation. Ah! Thanks lars, I didn't think that enabling benchmarking on stage *caused* this, I assumed it was just a regression in the main code (which definitely should have been caught by the integration test). Lowering severity, we should fix but isn't as critical as I had thought when I saw stage failing to process. I do wish our monitoring could catch this, though.

Severity: blocker → normal

[github robot]

Comment 7

•

10 years ago

Commit pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/33dd2c55f4428b239722b7078569d312ea8cbb7e Remove extra whitespace in crashstorage tests. Fixes bug 1041263

[github robot]

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Lonnen :lonnen

Updated

•

10 years ago

Target Milestone: --- → 95

Bugzilla

processing broken on stage

Categories

(Socorro :: Backend, task)

Tracking

(Not tracked)

People

(Reporter: rhelmer, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Updated

Updated