Bug 1087414 (Closed) - Opened 5 years ago - Closed 5 years ago

back-fill unprocessed crashes from disk

Categories

(Socorro :: Backend, task)

Hardware: XScale
OS: BeOS
Type: task
Priority: Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dmaher, Assigned: rhelmer)

References

Details

Attachments

(1 file)

Following the manipulation in bug 1079642, there was a period of a few hours where crashes were received and written to disk, but not processed.  This is manually fixable (WIP).
I'm currently working on getting these crashes moved out of the file system on each collector and properly on to HBase & RabbitMQ.  The crashes became "orphans" in the filesystem because the collectors and crashmovers were configured to use different file system types.  Although the crashmovers had the correct config, they were not restarted for several hours and so didn't pick it up; they continued to run but were unable to find the new crashes.

The method I'm using to fix this problem is here: https://etherpad.mozilla.org/I1OM4MiPfg
socorro-collector1.webapp.phx1 re-enabled in Zeus (from "draining").
socorro-collector2.webapp.phx1 set to "Draining" in Zeus.
socorro-collector3.webapp.phx1 was drained and re-added.
socorro-collector4.webapp.phx1 currently in "Draining"
socorro-collector4.webapp.phx1 re-added.
socorro-collector5.webapp.phx1 draining.
socorro-collector6.webapp.phx1 was drained but back in now.
We got everything in before the daily matviews are generated, but we need to make sure reports_clean is up to date:

SELECT backfill_reports_clean('2014-10-22 00:00:00', '2014-10-22 23:00:00');

I think this will be done before the nightly jobs run; if not, we'll need to run backfill_matviews for 2014-10-22.
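If the nightly jobs do beat us, the matview backfill for that day might look roughly like this sketch (the backfill_matviews argument list is an assumption; check the stored procedure definition before running it):

    -- Hedged sketch: re-run the daily matviews for the affected day if the
    -- nightly jobs ran before reports_clean was backfilled. The argument list
    -- for backfill_matviews() is assumed; verify it against the stored
    -- procedure definition in the Socorro PG schema first.
    SELECT backfill_matviews('2014-10-22', '2014-10-22');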
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
About 1/6th of the crashes were submitted after UTC midnight (the original date_processed is not preserved) and will appear the following day - not sure if this problem is worth worrying about.
We can make a time correction in PG.

If the crash_id ends in 141022 and the processed_crash date is 141023, take the processed_crash date back by 8 hours.

That won't have any effect on the ES and HB versions of the crashes, but it will make the correction for the materialized views.
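A PG-only version of that correction might look like the sketch below; the partition and column names (reports_20141020, uuid, date_processed) are assumptions based on our weekly partitioning scheme, so spot-check the row count before committing:

    -- Sketch of the PG-only correction described above: push the affected rows
    -- back 8 hours so they land on 2014-10-22. Partition and column names are
    -- assumptions; verify them, and check the row count before COMMIT.
    BEGIN;
    UPDATE reports_20141020
       SET date_processed = date_processed - interval '8 hours'
     WHERE uuid LIKE '%141022'
       AND date_processed >= '2014-10-23';
    COMMIT;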
According to "select count(*) from reports_20141020 where uuid like '%141022' and date_processed > '2014-10-23'", about 19907 crashes have the next day's 'date_processed'.

also concerning:

reports_20141020 has 367946 entries for 10/22.  That's a normal number. 
reports_clean_20141020 has only 314163 entries for 10/22.  That number is quite low.

I suspect that the reports_clean stored procedure has not handled this well.  We'll investigate in the morning.
(In reply to K Lars Lohn [:lars] [:klohn] from comment #9)
> according to "select count(*) from reports_20141020 where uuid like
> '%141022' and date_processed > '2014-10-23'" about 19907 crash have the next
> days 'date_processed'.
> 
> also concerning:
> 
> reports_20141020 has 367946 entries for 10/22.  That's a normal number. 
> reports_clean_20141020 has only 314163 entries for 10/22.  That number is
> quite low.
> 
> I suspect that the reports clean stored procedure has not handled this well.
> we'll investigate in the morning.

The ones that were filtered out should be in reports_bad, probably worth starting there.
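Something along these lines should show whether the missing ~53k rows landed there (a sketch only; the reports_bad column names are assumptions, adjust to the actual schema):

    -- Hedged sketch: count rejected rows for the affected day in reports_bad.
    -- Column names are an assumption; verify against the actual table definition.
    SELECT count(*)
      FROM reports_bad
     WHERE date_processed >= '2014-10-22'
       AND date_processed <  '2014-10-23';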
(In reply to Robert Helmer [:rhelmer] from comment #7)
> About 1/6th of the crashes were submitted after UTC midnight (the original
> date_processed is not preserved) and will appear the following day - not
> sure if this problem is worth worrying about.

In terms of the matviews and our data, that sounds like it probably will end up as quite an issue. :(
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #11)
> (In reply to Robert Helmer [:rhelmer] from comment #7)
> > About 1/6th of the crashes were submitted after UTC midnight (the original
> > date_processed is not preserved) and will appear the following day - not
> > sure if this problem is worth worrying about.
> 
> In terms of the matviews and our data, that sounds like it probably will end
> up as quite an issue. :(

OK we should be able to correct for this per comment 8, in that case.
My concern about 'reports_clean_20141020' having too few crashes for 10/22 has been allayed.  This morning the number is up to 350128, which is well within the normal range compared to other Wednesdays.
(In reply to Robert Helmer [:rhelmer] from comment #12)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #11)
> > (In reply to Robert Helmer [:rhelmer] from comment #7)
> > > About 1/6th of the crashes were submitted after UTC midnight (the original
> > > date_processed is not preserved) and will appear the following day - not
> > > sure if this problem is worth worrying about.
> > 
> > In terms of the matviews and our data, that sounds like it probably will end
> > up as quite an issue. :(
> 
> OK we should be able to correct for this per comment 8, in that case.

My guesstimate of 1/6th was wrong - it turns out 5.6% are in the wrong day (18801 of 331327).
It's not going to be trivial to move these crashes around. We could do it in postgres only, but then it would not match ES and HBase.

5.6% of these delayed crashes ended up on the 23rd instead of the 22nd. Is this still worth fixing?
Flags: needinfo?(kairo)
Flags: needinfo?(chris.lonnen)
We are in a very critical phase of figuring out stability on two release trains (33 and 34); having two days of skewed data (one with too few and one with too many crashes counted) is extremely suboptimal.
Flags: needinfo?(kairo)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #16)
> We are in a very critical phase of figuring out stability on two release
> trains (33 and 34), having two days of skewed data (one with too few and one
> with too many crashes counted) is extremely suboptimal.

OK. I know how to fix this right now for Postgres, but not for ES/HBase. I guess we could get this fixed in Postgres and start backfilling while we figure out the other stores.
Actually lars has a new processor rule he's been working on which would correct all the data stores:

https://github.com/twobraids/socorro/blob/time-machine/socorro/processor/timemachine.py

This will allow us to change the date_processed for all of these crashes in all data stores, and we can backfill to make the reports consistent.

Not sure how long this will take; several hours at minimum.
I've written a processing rule that loads a processed crash from HB and decrements its 'date_processed' by a configurable time delta.  I've configured the processor to have this as the only rule.  I'm calling this one-off processor the TimeMachine Processor.

Then I created a 'new_crash_source' that will iterate over crashes specified by a configurable SQL statement.

When run together with an SQL statement that selects only the crashes with the "date_processed" problem, this processor will push the date_processed back to the correct day, and then save the result back to PG, HB & ES.  

Here is a transcript of a test running on stage:  http://pastebin.com/JJKVS2kp

In this transcript, I show the HB and PG versions of a single crash, run the TimeMachine Processor, then show the change in HB & PG.  The result shows that the TimeMachine Processor successfully pushes the crash's 'date_processed' eight hours into the past, correcting the date.

ES doesn't have a command line interface that I know of, so I verified the change in the webapp and found it to be successful.


Action Plan:  pending approval of interested parties, I will configure this to run in production using a command line like this:

    $ socorro processor --admin.conf=k.ini --new_crash_source.crash_id_query="select uuid from reports_20141020 where uuid like '%141022' and date_processed > '2014-10-23';"

This will take all the crashes with the bad 'date_processed' and push them back into the past to the correct day.  

When it completes, we can run the backfill to correct the 'reports_clean...' tables and then the materialized views.
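That follow-up pass would cover both affected days, roughly along the lines of this sketch (the two-day window and the backfill_matviews argument list are assumptions; verify against the stored procedure definitions first):

    -- Sketch: rebuild reports_clean for both affected days, then the matviews.
    -- The call signatures and the two-day window are assumptions; verify before running.
    SELECT backfill_reports_clean('2014-10-22 00:00:00', '2014-10-23 23:00:00');
    SELECT backfill_matviews('2014-10-22', '2014-10-23');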
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/1fdf0e3fd278942062464db7bffec0c6a9ccfc2b
Merge pull request #2453 from twobraids/time-machine

associated with Bug 1087414 - added time machine class
We've tested the "time machine" processing rule on stage and everything looks good. Currently running on production.

All modified crashes have "DateProcessedTimeMachine" in their processor_notes, so we'll know which were modified.

This process will probably take about another hour, and then we'll backfill (which will take at least a few hours). I'd anticipate it being finished this evening PDT.
Attached file timemachine.stdout.txt
log from production "timemachine" run
(In reply to Robert Helmer [:rhelmer] from comment #22)
> Created attachment 8511412 [details]
> timemachine.stdout.txt
> 
> log from production "timemachine" run

lars - there are 10 crashes left over; presumably they are the ones that failed with error messages in the log.
Flags: needinfo?(chris.lonnen) → needinfo?(lars)
Backfilling 2014-10-22 through 2014-10-23 in Postgres now.
I see the errant 10 crashes; all are degenerate in some manner: the processed crash is irretrievable from HB.  The raw_crash and dumps look OK, so I queued them for reprocessing.  They all fail processing in peculiar ways, taking extraordinarily long to process in the stackwalker and then seeming to hang the processor thread.  Eventually the threads finish and say they're successful, but then don't manage to save a processed crash.

These crashes warrant more study.  Both the HybridProcessor and Processor2015 fail to process these crashes, though Processor2015 doesn't spend as long not saving them as the HybridProcessor does.
Flags: needinfo?(lars)
ES shows 19,900 results for "DateProcessedTimeMachine":

https://crash-stats.mozilla.com/search/?processor_notes=DateProcessedTimeMachine&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform

And it looks like ES, HBase and PG were all reprocessed; based on spot checks so far, the date_processed "time travel" appears to have worked.
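For a PG cross-check of the same figure, something like this should work (a sketch; the assumption is that processor_notes lives on the weekly reports partition):

    -- Hedged sketch: count time-machined crashes in PG to cross-check the
    -- ~19,900 figure from ES. The processor_notes column location is an assumption.
    SELECT count(*)
      FROM reports_20141020
     WHERE processor_notes LIKE '%DateProcessedTimeMachine%';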
This was completed on 2014-10-25 and looks fine; I was out yesterday, which is why I haven't closed this yet.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED