Race condition between crashmover and processor when raw crashes stored on S3

Status: RESOLVED FIXED

Product: Socorro
Component: Backend
Reported: 8 months ago
Last modified: 8 months ago

People

(Reporter: zolenkoe, Unassigned)

Tracking

Firefox Tracking Flags: (Not tracked)

Attachments

(1 attachment)

Description (Reporter, 8 months ago)

User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36

Steps to reproduce:

We have an internal Socorro setup (version 263.1487279675-1) on AWS, with crashes and symbols stored on S3.

The collector hands the crash to the crashmover; the crashmover copies the crash to S3 and publishes the crash ID to RabbitMQ; the processor picks it up from there and processing continues.

In a small number of cases, the processor gets the job before the crashmover has finished saving the data to S3 and rejects the job because the raw crash can't be found.

I "fixed" it by adding a sleep when the processor fails with CrashIDNotFound and then retrying once. I'm not sure whether this can be fixed in a better place or through configuration.

        try:
            raw_crash = self.source.get_raw_crash(crash_id)
            dumps = self.source.get_raw_dumps_as_files(crash_id)
        except CrashIDNotFound:
            time.sleep(2)
            try:
                raw_crash = self.source.get_raw_crash(crash_id)
                dumps = self.source.get_raw_dumps_as_files(crash_id)
            except CrashIDNotFound:
                self.processor.reject_raw_crash(
                    crash_id,
                    'this crash cannot be found in raw crash storage'
                )
                return
        except Exception as x:
        ...
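
A retry loop with a short backoff would arguably be a bit more robust than a single fixed sleep. Here is a sketch of that idea only -- not actual Socorro code; RETRY_ATTEMPTS and RETRY_DELAY are placeholder names, and it assumes the same self.source/self.processor objects and an `import time` at the top of the module:

        RETRY_ATTEMPTS = 3   # placeholder constant, not real Socorro configuration
        RETRY_DELAY = 2      # seconds between attempts; also a placeholder

        for attempt in range(RETRY_ATTEMPTS):
            try:
                raw_crash = self.source.get_raw_crash(crash_id)
                dumps = self.source.get_raw_dumps_as_files(crash_id)
                break
            except CrashIDNotFound:
                if attempt == RETRY_ATTEMPTS - 1:
                    # give up after the last attempt, same as the original rejection
                    self.processor.reject_raw_crash(
                        crash_id,
                        'this crash cannot be found in raw crash storage'
                    )
                    return
                time.sleep(RETRY_DELAY)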


Actual results:

This is how the failed case looks (notice the crashmover reporting "Thread-3 - saved" after the processor has already failed):

Apr 21 17:28:17 ip-10-0-0-85 bash: 2017-04-21 17:28:17,675 INFO - collector - - uWSGIWorker1Core0 - a5119ffb-80fa-46d2-8c8f-c90da2170421 received
Apr 21 17:28:17 ip-10-0-0-85 bash: 2017-04-21 17:28:17,676 INFO - collector - - uWSGIWorker1Core0 - a5119ffb-80fa-46d2-8c8f-c90da2170421 accepted
Apr 21 17:28:24 ip-10-0-0-85 bash: 2017-04-21 17:28:24,603 DEBUG - crashmover - - QueuingThread - received (('a5119ffb-80fa-46d2-8c8f-c90da2170421',), {})
Apr 21 17:28:24 ip-10-0-0-85 bash: 2017-04-21 17:28:24,604 DEBUG - crashmover - - Thread-3 - RabbitMQCrashStorage saving crash a5119ffb-80fa-46d2-8c8f-c90da2170421
Apr 21 17:28:24 ip-10-0-0-85 bash: 2017-04-21 17:28:24,863 DEBUG - processor - - QueuingThread - received (('a5119ffb-80fa-46d2-8c8f-c90da2170421',), {'finished_func': <functools.partial object at 0x7fa1b83fbe10>})
Apr 21 17:28:24 ip-10-0-0-85 bash: 2017-04-21 17:28:24,875 INFO - processor - - Thread-4 - starting job: a5119ffb-80fa-46d2-8c8f-c90da2170421
Apr 21 17:28:24 ip-10-0-0-85 bash: 2017-04-21 17:28:24,875 WARNING - processor - - Thread-4 - a5119ffb-80fa-46d2-8c8f-c90da2170421 rejected: this crash cannot be found in raw crash storage
Apr 21 17:28:24 ip-10-0-0-85 bash: 2017-04-21 17:28:24,875 INFO - processor - - Thread-4 - finishing failed job: a5119ffb-80fa-46d2-8c8f-c90da2170421
Apr 21 17:28:26 ip-10-0-0-85 bash: 2017-04-21 17:28:26,933 INFO - crashmover - - Thread-3 - saved - a5119ffb-80fa-46d2-8c8f-c90da2170421
Apr 21 17:28:31 ip-10-0-0-85 bash: 2017-04-21 17:28:31,873 DEBUG - processor - - QueuingThread - RabbitMQCrashStorage acking a5119ffb-80fa-46d2-8c8f-c90da2170421 with delivery_tag 51816


Expected results:

This is how a successful case looks:

Apr 21 17:50:14 ip-10-0-0-85 bash: 2017-04-21 17:50:14,330 INFO - collector - - uWSGIWorker1Core0 - 0ca051cf-a7e3-47fb-8171-19c122170421 received
Apr 21 17:50:14 ip-10-0-0-85 bash: 2017-04-21 17:50:14,331 INFO - collector - - uWSGIWorker1Core0 - 0ca051cf-a7e3-47fb-8171-19c122170421 accepted
Apr 21 17:50:16 ip-10-0-0-85 bash: 2017-04-21 17:50:16,611 DEBUG - crashmover - - QueuingThread - received (('0ca051cf-a7e3-47fb-8171-19c122170421',), {})
Apr 21 17:50:16 ip-10-0-0-85 bash: 2017-04-21 17:50:16,612 DEBUG - crashmover - - Thread-1 - RabbitMQCrashStorage saving crash 0ca051cf-a7e3-47fb-8171-19c122170421
Apr 21 17:50:18 ip-10-0-0-85 bash: 2017-04-21 17:50:18,143 INFO - crashmover - - Thread-1 - saved - 0ca051cf-a7e3-47fb-8171-19c122170421
Apr 21 17:50:19 ip-10-0-0-85 bash: 2017-04-21 17:50:19,955 DEBUG - processor - - QueuingThread - received (('0ca051cf-a7e3-47fb-8171-19c122170421',), {'finished_func': <functools.partial object at 0x7feca4726af8>})
Apr 21 17:50:20 ip-10-0-0-85 bash: 2017-04-21 17:50:20,046 INFO - processor - - Thread-3 - starting job: 0ca051cf-a7e3-47fb-8171-19c122170421
Apr 21 17:50:20 ip-10-0-0-85 bash: 2017-04-21 17:50:20,046 DEBUG - processor - - Thread-3 - BreakpadStackwalkerRule: upload_file_minidump, /home/socorro/temp/0ca051cf-a7e3-47fb-8171-19c122170421.upload_file_minidump.TEMPORARY.dump
Apr 21 17:50:20 ip-10-0-0-85 bash: 2017-04-21 17:50:20,046 DEBUG - processor - - Thread-3 - External Command: timeout -s KILL 600 /data/socorro/stackwalk/bin/stackwalker --raw-json /tmp/0ca051cf-a7e3-47fb-8171-19c122170421.Thread-3.TEMPORARY.json --symbols-url https://s3-us-west-2.amazonaws.com/socorro2-symbols2/symbols/ --symbols-cache /home/socorro/symbols-cache --symbols-tmp /home/socorro/symb
Apr 21 17:50:21 ip-10-0-0-85 bash: 2017-04-21 17:50:21,842 INFO - processor - - Thread-3 - finishing successful job: 0ca051cf-a7e3-47fb-8171-19c122170421
Apr 21 17:50:22 ip-10-0-0-85 bash: 2017-04-21 17:50:22,219 INFO - processor - - Thread-3 - saved - 0ca051cf-a7e3-47fb-8171-19c122170421
Apr 21 17:50:26 ip-10-0-0-85 bash: 2017-04-21 17:50:26,965 DEBUG - processor - - QueuingThread - RabbitMQCrashStorage acking 0ca051cf-a7e3-47fb-8171-19c122170421 with delivery_tag 59
Comment 1 (Reporter, 8 months ago)

(the "fix" is in processor_app.py)
As I recall, the race condition comes from socorro.external.crashstorage_base.PolyCrashStorage saving to multiple destinations, but not doing it as a transaction and allowing any of those destinations to fail. For example, if you have PolyCrashStorage set up to save to S3 and then to RabbitMQ, and the S3 save fails but the RabbitMQ save succeeds, you end up with a crash_id in the RabbitMQ queue for a crash that hasn't made it to S3 yet.

Thus the processor needs to be OK with crash IDs in the queue for which there are no crashes on S3 yet.

For us, my understanding was that the processor would see that the crash isn't on S3, put the crash ID back in the queue, and try again later--it wouldn't ack the crash ID.

However, I don't know the processor code well enough to know whether that's true or where that code is.
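
Something along these lines, as an illustration only -- this uses the pika RabbitMQ client (1.x API) directly rather than the Socorro processor code, and the queue name and helper functions (raw_crash_exists_on_s3, process_crash) are made up:

    import pika

    def handle_message(channel, method, properties, body):
        crash_id = body.decode('utf-8')
        if raw_crash_exists_on_s3(crash_id):      # hypothetical helper
            process_crash(crash_id)               # hypothetical helper
            channel.basic_ack(delivery_tag=method.delivery_tag)
        else:
            # crash isn't on S3 yet: requeue the message so it is retried
            # later instead of being rejected permanently
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.basic_consume(queue='socorro.normal', on_message_callback=handle_message)
    channel.start_consuming()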
I'm surprised that the S3 save isn't synchronous. The whole PolyCrashStorage might run in threads, but I thought the individual tasks were synchronous.

There was a period where we had a bug in the configuration on, I think, the collector. I can't remember exactly which component, but its configuration took a LIST of Python classes to invoke. The component would iterate over the list, turn the Python module names into callable objects, and call them one after the other. However, the order in the configuration was wrong in that it put the crash ID into Rabbit first, THEN proceeded to upload to S3. The S3 upload was normally so fast that even though it was started AFTERWARDS, it finished BEFORE the RabbitMQ job worker started to depend on it. Thus we didn't notice it until it was in production.

We solved that by rearranging the order of the list of classes to make sure the S3 upload happened first. Also, a time.sleep solution sounds sane in case there's a slight lag between S3 reporting the save as complete and the object actually being available for download.
https://github.com/mozilla/socorro/blob/e950f32510f940ae6a203ab90df4e4c57d2f5839/socorro/external/crashstorage_base.py#L579

Sure seems like it iterates over the destinations, calls "save_raw_crash" on each one, and collects failures--but keeps iterating even after a failure. So it does save the crash to all destinations in a serialized fashion, but it doesn't stop if one of them errors out. Pretty sure that runs in a single thread.

Am I reading the PolyCrashStorage code that the crashmover uses correctly?
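
To make that concrete, here is a heavily simplified sketch of the loop-and-collect-errors behavior described above -- not the real PolyCrashStorage class, just an illustration of the pattern:

    class PolyStorageError(Exception):
        """Raised after all destinations were tried, carrying any collected failures."""

    class SimplePolyStorage(object):
        def __init__(self, stores):
            # e.g. [s3_store, rabbitmq_store] -- the configured order matters
            self.stores = stores

        def save_raw_crash(self, raw_crash, dumps, crash_id):
            errors = []
            for store in self.stores:                 # serialized, single thread
                try:
                    store.save_raw_crash(raw_crash, dumps, crash_id)
                except Exception as exc:
                    errors.append((store, exc))       # remember the failure, keep going
            if errors:
                raise PolyStorageError(errors)        # only raised after every store ran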
Comment 5 (Reporter, 8 months ago)

"However, the order (in the configuration) was wrong in that it tried put the crash ID into Rabbit, THEN proceed to upload to S3."

Um, I think that's exactly what I did too. My configuration for collector is rabbitmq then S3...

socorro.external.rabbitmq.crashstorage.RabbitMQCrashStorage, socorro.external.boto.crashstorage.BotoS3CrashStorage

I'll try reverting the sleep and swapping the order; looks like there's a high chance this was user error :). Thanks!
Comment 6 (Reporter, 8 months ago)

OK, I reverted the sleep and fixed the crashmover configuration order to S3 then RabbitMQ, and now I can't reproduce the race. I'll monitor for a while, but I think that fixes it. Thanks!

This is how my config looks now for the record:

source.crashstorage_class=socorro.external.fs.crashstorage.FSTemporaryStorage
destination.crashstorage_class=socorro.external.crashstorage_base.PolyCrashStorage
destination.storage_classes=socorro.external.boto.crashstorage.BotoS3CrashStorage, socorro.external.rabbitmq.crashstorage.RabbitMQCrashStorage
destination.storage0.crashstorage_class=socorro.external.boto.crashstorage.BotoS3CrashStorage
destination.storage1.crashstorage_class=socorro.external.rabbitmq.crashstorage.RabbitMQCrashStorage


BTW, in the socorro-infra repository RabbitMQ does come before S3. I think that's where I got it from.

https://github.com/mozilla/socorro-infra/blob/master/socorro-config/crashmover.conf
(In reply to zolenkoe from comment #6)
> BTW, in socorro-infra repository rabbitmq does come before S3. I think
> that's where I got it from.
> 
> https://github.com/mozilla/socorro-infra/blob/master/socorro-config/
> crashmover.conf

Yuck! I'll fix this now, then once that's landed, close this out.
Created attachment 8861093 [details] [review]
Link to Github pull-request: https://github.com/mozilla/socorro-infra/pull/305

Comment 9 (8 months ago)

Commits pushed to master at https://github.com/mozilla/socorro-infra

https://github.com/mozilla/socorro-infra/commit/03b76ec976337f30458d88d064e7d3919b4997ad
Bug 1358577: Reverse order of crashmover crashstorage classes

It's better to have the BotoS3CrashStorage *before* the RabbitMQCrashStorage so
as to reduce the race condition where the processor goes to process a crash that
isn't on S3, yet.

https://github.com/mozilla/socorro-infra/commit/b9178a9d2eee5916577b949827846bfa26e68a2d
Merge pull request #305 from willkg/1358577-example-fix

Bug 1358577: Reverse order of crashmover crashstorage classes
I landed that change. Going to mark this as FIXED now.
Status: UNCONFIRMED → RESOLVED
Last Resolved: 8 months ago
Resolution: --- → FIXED