Bug 1087414 (Closed) - Opened 5 years ago - Closed 5 years ago

back-fill unprocessed crashes from disk

Categories

(Socorro :: Backend, task)

Hardware: XScale
OS: BeOS
Type: task
Priority: Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dmaher, Assigned: rhelmer)

References

Details

Attachments

(1 file)

Following the manipulation in bug 1079642, there was a period of a few hours where crashes were received and written to disk, but not processed.  This is manually fixable (WIP).
I'm currently working on getting these crashes moved out of the file system on each collector and properly on to HBase & RabbitMQ.  The crashes became "orphans" in the filesystem because the collectors and crashmovers were configured to use different file system types.  Although the crashmovers had the correct config, they were not restarted for several hours and so didn't pick it up; they continued to run but were unable to find the new crashes.

The method I'm using to fix this problem is here: https://etherpad.mozilla.org/I1OM4MiPfg
socorro-collector1.webapp.phx1 re-enabled in Zeus (from "draining").
socorro-collector2.webapp.phx1 set to "Draining" in Zeus.
socorro-collector3.webapp.phx1 was drained and re-added.
socorro-collector4.webapp.phx1 currently in "Draining"
socorro-collector4.webapp.phx1 re-added.
socorro-collector5.webapp.phx1 draining.
socorro-collector6.webapp.phx1 was drained but back in now.
We got everything in before the daily matviews are generated, but we need to make sure reports_clean is up to date:

SELECT backfill_reports_clean('2014-10-22 00:00:00', '2014-10-22 23:00:00');

I think this will be done before the nightly jobs run; if not, we'll need to run backfill_matviews for 2014-10-22.
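If the nightly jobs do beat us, the matview backfill for that day might look roughly like this sketch (the backfill_matviews argument list is an assumption; check the stored procedure definition before running it):

    -- Hedged sketch: re-run the daily matviews for the affected day if the
    -- nightly jobs ran before reports_clean was backfilled. The argument list
    -- for backfill_matviews() is assumed; verify it against the stored
    -- procedure definition in the Socorro PG schema first.
    SELECT backfill_matviews('2014-10-22', '2014-10-22');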
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
About 1/6th of the crashes were submitted after UTC midnight (the original date_processed is not preserved) and will appear the following day - not sure if this problem is worth worrying about.
We can make a time correction in PG.

If the crash_id ends in 141022 and the processed_crash date is 141023, take the processed_crash date back by 8 hours.

That won't have any effect on the ES and HB versions of the crashes, but it will make the correction for the materialized views.
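A PG-only version of that correction might look like the sketch below; the partition and column names (reports_20141020, uuid, date_processed) are assumptions based on our weekly partitioning scheme, so spot-check the row count before committing:

    -- Sketch of the PG-only correction described above: push the affected rows
    -- back 8 hours so they land on 2014-10-22. Partition and column names are
    -- assumptions; verify them, and check the row count before COMMIT.
    BEGIN;
    UPDATE reports_20141020
       SET date_processed = date_processed - interval '8 hours'
     WHERE uuid LIKE '%141022'
       AND date_processed >= '2014-10-23';
    COMMIT;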
According to "select count(*) from reports_20141020 where uuid like '%141022' and date_processed > '2014-10-23'", about 19907 crashes have the next day's 'date_processed'.

also concerning:

reports_20141020 has 367946 entries for 10/22.  That's a normal number. 
reports_clean_20141020 has only 314163 entries for 10/22.  That number is quite low.

I suspect that the reports_clean stored procedure has not handled this well.  We'll investigate in the morning.
(In reply to K Lars Lohn [:lars] [:klohn] from comment #9)
> according to "select count(*) from reports_20141020 where uuid like
> '%141022' and date_processed > '2014-10-23'" about 19907 crash have the next
> days 'date_processed'.
> 
> also concerning:
> 
> reports_20141020 has 367946 entries for 10/22.  That's a normal number. 
> reports_clean_20141020 has only 314163 entries for 10/22.  That number is
> quite low.
> 
> I suspect that the reports clean stored procedure has not handled this well.
> we'll investigate in the morning.

The ones that were filtered out should be in reports_bad, probably worth starting there.
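Something along these lines should show whether the missing ~53k rows landed there (a sketch only; the reports_bad column names are assumptions, adjust to the actual schema):

    -- Hedged sketch: count rejected rows for the affected day in reports_bad.
    -- Column names are an assumption; verify against the actual table definition.
    SELECT count(*)
      FROM reports_bad
     WHERE date_processed >= '2014-10-22'
       AND date_processed <  '2014-10-23';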
(In reply to Robert Helmer [:rhelmer] from comment #7)
> About 1/6th of the crashes were submitted after UTC midnight (the original
> date_processed is not preserved) and will appear the following day - not
> sure if this problem is worth worrying about.

In terms of the matviews and our data, that sounds like it probably will end up as quite an issue. :(
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #11)
> (In reply to Robert Helmer [:rhelmer] from comment #7)
> > About 1/6th of the crashes were submitted after UTC midnight (the original
> > date_processed is not preserved) and will appear the following day - not
> > sure if this problem is worth worrying about.
> 
> In terms of the matviews and our data, that sounds like it probably will end
> up as quite an issue. :(

OK we should be able to correct for this per comment 8, in that case.
My concern about 'reports_clean_20141020' having too few crashes for 10/22 has been allayed.  This morning the number is up to 350128, which is well within the normal range compared to other Wednesdays.
(In reply to Robert Helmer [:rhelmer] from comment #12)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #11)
> > (In reply to Robert Helmer [:rhelmer] from comment #7)
> > > About 1/6th of the crashes were submitted after UTC midnight (the original
> > > date_processed is not preserved) and will appear the following day - not
> > > sure if this problem is worth worrying about.
> > 
> > In terms of the matviews and our data, that sounds like it probably will end
> > up as quite an issue. :(
> 
> OK we should be able to correct for this per comment 8, in that case.

My guesstimate of 1/6th was wrong - it turns out 5.6% are in the wrong day (18801 of 331327).
It's not going to be trivial to move these crashes around. We could do it in postgres only, but then it would not match ES and HBase.

5.6% of these delayed crashes ended up on the 23rd instead of the 22nd. Is this still worth fixing?
Flags: needinfo?(kairo)
Flags: needinfo?(chris.lonnen)
We are in a very critical phase of figuring out stability on two release trains (33 and 34); having two days of skewed data (one with too few and one with too many crashes counted) is extremely suboptimal.
Flags: needinfo?(kairo)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #16)
> We are in a very critical phase of figuring out stability on two release
> trains (33 and 34), having two days of skewed data (one with too few and one
> with too many crashes counted) is extremely suboptimal.

OK. I know how to fix this right now for Postgres, but not for ES/HBase. I guess we could get this fixed in Postgres and start backfilling while we figure out the other stores.
Actually lars has a new processor rule he's been working on which would correct all the data stores:

https://github.com/twobraids/socorro/blob/time-machine/socorro/processor/timemachine.py

This will allow us to change the date_processed for all of these crashes in all data stores, and we can backfill to make the reports consistent.

Not sure how long this will take; several hours at minimum.
I've written a processing rule that loads a processed crash from HB and decrements its 'date_processed' by a configurable time delta.  I've configured the processor to have this as the only rule.  I'm calling this one-off processor the TimeMachine Processor.

Then I created a 'new_crash_source' that will iterate over crashes specified by a configurable SQL statement.

When run together with an SQL statement that selects only the crashes with the "date_processed" problem, this processor will push the date_processed back to the correct day, and then save the result back to PG, HB & ES.  

Here is a transcript of a test running on stage:  http://pastebin.com/JJKVS2kp

In this transcript, I show the HB and PG versions of a single crash, run the TimeMachine Processor, then show the change in HB & PG.  The result shows that the TimeMachine Processor successfully pushes the crash's 'date_processed' eight hours into the past, correcting the date.

ES doesn't have a command line interface that I know of, so I verified the change in the webapp and found it to be successful.


Action Plan:  pending approval of interested parties, I will configure this to run in production using a command line like this:

    $ socorro processor --admin.conf=k.ini --new_crash_source.crash_id_query="select uuid from reports_20141020 where uuid like '%141022' and date_processed > '2014-10-23';"

This will take all the crashes with the bad 'date_processed' and push them back into the past to the correct day.  

When it completes, we can run the backfill to correct the 'reports_clean...' tables and then the materialized views.
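That follow-up pass would cover both affected days, roughly along the lines of this sketch (the two-day window and the backfill_matviews argument list are assumptions; verify against the stored procedure definitions first):

    -- Sketch: rebuild reports_clean for both affected days, then the matviews.
    -- The call signatures and the two-day window are assumptions; verify before running.
    SELECT backfill_reports_clean('2014-10-22 00:00:00', '2014-10-23 23:00:00');
    SELECT backfill_matviews('2014-10-22', '2014-10-23');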
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/1fdf0e3fd278942062464db7bffec0c6a9ccfc2b
Merge pull request #2453 from twobraids/time-machine

associated with Bug 1087414 - added time machine class
We've tested the "time machine" processing rule on stage and everything looks good. Currently running on production.

All modified crashes have "DateProcessedTimeMachine" in their processor_notes, so we'll know which were modified.

This process will probably take about another hour, and then we'll backfill (which will take at least a few hours). I'd anticipate it being finished this evening PDT.
Attached file timemachine.stdout.txt
log from production "timemachine" run
(In reply to Robert Helmer [:rhelmer] from comment #22)
> Created attachment 8511412 [details]
> timemachine.stdout.txt
> 
> log from production "timemachine" run

lars - there are 10 crashes left over; presumably they are the ones that failed with error messages in the log.
Flags: needinfo?(chris.lonnen) → needinfo?(lars)
Backfilling 2014-10-22 through 2014-10-23 in Postgres now.
I see the errant 10 crashes; all are degenerate in some manner: the processed crash is irretrievable from HB.  The raw_crash and dumps look OK, so I queued them for reprocessing.  They all fail processing in peculiar ways, taking extraordinarily long to process in the stackwalker and then seeming to hang the processor thread.  Eventually the threads finish and say they're successful, but then don't manage to save a processed crash.

These crashes warrant more study.  Both the HybridProcessor and Processor2015 fail to process these crashes, though Processor2015 doesn't spend as long not saving them as the HybridProcessor does.
Flags: needinfo?(lars)
ES shows 19,900 results for "DateProcessedTimeMachine":

https://crash-stats.mozilla.com/search/?processor_notes=DateProcessedTimeMachine&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform

And it looks like ES, HBase and PG were all reprocessed; based on spot checks so far, the date_processed "time travel" appears to have worked.
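For a PG cross-check of the same figure, something like this should work (a sketch; the assumption is that processor_notes lives on the weekly reports partition):

    -- Hedged sketch: count time-machined crashes in PG to cross-check the
    -- ~19,900 figure from ES. The processor_notes column location is an assumption.
    SELECT count(*)
      FROM reports_20141020
     WHERE processor_notes LIKE '%DateProcessedTimeMachine%';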
This was completed on 2014-10-25 and looks fine; I was out yesterday, which is why I haven't closed this yet.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED