Closed Bug 1314775 Opened 8 years ago Closed 7 years ago

Process crashes that haven't been processed between Oct 12 and Oct 21

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: marco, Unassigned)

Details

There was a bug in Socorro that prevented some crashes from being processed (bug 1311697).

It isn't clear how many crashes haven't been processed in that timespan.

The missing data could make it more difficult to find regression ranges; could make people spend time finding non-existing regressions (as happened in bug 1279293); and could make crash volume comparisons between different Firefox versions more difficult.

The downside of the reprocessing is that it might be an expensive operation.
The challenge with post-processing is that we'd need to download every raw crash and look at its `legacy_processing` to see if it was supposed to have been processed or not. At roughly 1.5M raw crashes per day, that's a lot to download, so it'd need to be done in a production environment (i.e. in AWS, not on a dev's laptop). Once that is done, it wouldn't be horribly slow to do the actual processing.
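For illustration only, the check described above could be sketched roughly as follows. This is not Socorro code. The `uuid` key, the helper names, and the assumption that a `legacy_processing` value of 0 means "accepted for processing" are all hypothetical and would need to be checked against the collector's actual throttling configuration.

```python
from typing import Iterable, Iterator, Mapping, Set

# ASSUMPTION: 0 is the throttle result meaning "accept and process".
# Verify this against the real collector configuration before relying on it.
ACCEPT = 0


def should_have_been_processed(raw_crash: Mapping) -> bool:
    """Return True if the raw crash was marked for processing."""
    value = raw_crash.get("legacy_processing")
    try:
        return int(value) == ACCEPT
    except (TypeError, ValueError):
        # Missing or malformed value: treat as "not meant to be processed".
        return False


def find_unprocessed(
    raw_crashes: Iterable[Mapping], processed_ids: Set[str]
) -> Iterator[str]:
    """Yield ids of crashes that should have been processed but were not.

    `raw_crashes` stands in for whatever iterates over downloaded raw
    crashes; `processed_ids` is the set of ids that already have a
    processed crash. Both are hypothetical inputs.
    """
    for raw in raw_crashes:
        crash_id = raw.get("uuid")  # hypothetical key name for the crash id
        if not crash_id or crash_id in processed_ids:
            continue
        if should_have_been_processed(raw):
            yield crash_id
```

The expensive part is producing `raw_crashes` at all (the download), not the filter itself, which matches the point above that the actual processing would not be the slow step.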
Peter, do you have an eta for this? We are in RC week and we would like to be able to trust the results.
Thanks
Flags: needinfo?(peterbe)
(In reply to Sylvestre Ledru [:sylvestre] from comment #2)
> Peter, do you have an eta for this? We are in RC week and we would like to
> be able to trust the results.
> Thanks

No ETA. Right now we haven't planned the actual work. The sense that it'd take a while is purely a gut feeling, in that we first need to write the code, then review it, then deploy it and then run it.
Flags: needinfo?(peterbe)
I want to clarify some things. This requires engineering work, ops work and it's going to take a long time just to run. We don't have solid estimates on those, but I'm guessing wildly at 2 days to implement and 1 week to run.
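As a rough back-of-the-envelope check on those numbers, the sketch below combines the 1.5M raw crashes per day from comment 1 with the Oct 12 to Oct 21 window. It assumes the window is inclusive (10 days) and that the one-week run is dominated by scanning raw crashes; neither assumption is confirmed in this bug.

```python
RAW_CRASHES_PER_DAY = 1_500_000  # rough figure from comment 1
AFFECTED_DAYS = 10               # ASSUMPTION: Oct 12 through Oct 21, inclusive
RUN_DAYS = 7                     # the "1 week to run" wild guess

# Total raw crashes that would have to be downloaded and inspected.
total_raw_crashes = RAW_CRASHES_PER_DAY * AFFECTED_DAYS

# Sustained rate needed to finish the scan within the guessed run time.
seconds_available = RUN_DAYS * 24 * 60 * 60
required_rate = total_raw_crashes / seconds_available

print(f"raw crashes to scan: {total_raw_crashes:,}")
print(f"required rate: {required_rate:.1f} raw crashes per second")
```

Under these assumptions the job would have to look at about 15M raw crashes in total, at a sustained rate of roughly 25 per second.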

Before putting other high priority things down to pick this up, I wanted to gauge tangible impact of doing this work. So far, I've seen "well, it could cause problems in things". That's totally valid, but that's a statement of impact.

Will this hold up a release? Does this prevent crash analysis work that needs to happen now? Will we care about the lack of data here in a month?

I'm wary about getting into a situation where we put down priority work that affects you and has real impact to do a bunch of work that we're unsure about and possibly by the point it's done, no one cares anymore.

Does that make sense? Is it clearer what we're looking for in terms of statement of impact?
Oops... This:

> That's totally valid, but that's a statement of impact.

should have been this:

> That's totally valid, but that's not a statement of impact.
(In reply to Will Kahn-Greene [:willkg] from comment #4)
> I want to clarify some things. This requires engineering work, ops work and
> it's going to take a long time just to run. We don't have solid estimates on
> those, but I'm guessing wildly at 2 days to implement and 1 week to run.

Could we have a clearer idea of how expensive this is? Would the implementation that Peter described in comment 1 take this long to run? Would it take 2 days to implement? I was under the impression that this was easier to implement and shorter to run, but perhaps I misunderstood.

> Before putting other high priority things down to pick this up, I wanted to
> gauge tangible impact of doing this work. So far, I've seen "well, it could
> cause problems in things". That's totally valid, but that's a statement of
> impact.
> 
> Will this hold up a release? Does this prevent crash analysis work that
> needs to happen now? Will we care about the lack of data here in a month?

The missing data:
- could make it more difficult (or impossible) to find regression ranges;
- could make people spend time analyzing non-existing regressions (as happened in bug 1279293);
- could make crash volume comparisons between different Firefox versions more difficult (e.g. in the channel meeting someone raised the issue of whether the improved crash rates we were seeing in 50b vs 49b were real or an artefact).

Since we don't know what is actually missing, it's hard to tell if these 'could make's might become 'make's and how severely.
> Will this hold up a release? Does this prevent crash analysis work that needs to happen now? Will we care about the lack of data here in a month?

Well, the issue is that I don't know what I don't know. In these 1.5M crashes, we might have something which would cause us to do a dot release in 50. 
For now, I don't have a way to know if this is going to have an impact or not.

In a month, we will still care about that because I guess nightly and aurora crashes are in that list of unprocessed crashes.
I can't do a better estimate because I've never built this before. I'm guessing 2 days with the hope I can reuse some code. It's a very off-the-cuff estimate--could take longer.

I get that it's nice to have complete sets of data and that we don't know what we missed, but we can't do everything. In cases where things are hand-wavey and theoretical, I'm inclined to defer them until they become less hand-wavey.

Was there any analysis done on the crashes that we know weren't processed because of the Socorro bug? Do those crashes share properties such that we could extrapolate what the kinds of crashes we're missing look like? That would be a good way to clarify the impact.
(In reply to Will Kahn-Greene [:willkg] from comment #8)
> I get that it's nice to have complete sets of data and that we don't know
> what we missed, but we can't do everything. In cases where things are
> hand-wavey and theoretical, I'm inclined to defer them until they become
> less hand-wavey.

There's already an example where this did bite us (bug 1279293).
Unfortunately, given the nature of the problem, you're asking for something impossible :)

> Was there any analysis done on the crashes that we know weren't processed
> because of the Socorro bug? Do those crashes share properties such that we
> could extrapolate what the kinds of crashes we're missing look like? That
> would be a good way to clarify the impact.

I think Peter did this and concluded that it wasn't possible to explain which failed or why. Peter, correct me if I'm wrong.
(In reply to Marco Castelluccio [:marco] from comment #9)
> (In reply to Will Kahn-Greene [:willkg] from comment #8)
> > I get that it's nice to have complete sets of data and that we don't know
> > what we missed, but we can't do everything. In cases where things are
> > hand-wavey and theoretical, I'm inclined to defer them until they become
> > less hand-wavey.
> 
> There's already an example where this did bite us (bug 1279293).
> Unfortunately, given the nature of the problem, you're asking for something
> impossible :)
> 
> > Was there any analysis done on the crashes that we know weren't processed
> > because of the Socorro bug? Do those crashes share properties such that we
> > could extrapolate what the kinds of crashes we're missing look like? That
> > would be a good way to clarify the impact.
> 
> I think Peter did this and concluded that it wasn't possible to explain
> which failed or why. Peter, correct me if I'm wrong.

Short-term memory is already fading, but when we (unfortunately) introduced the bug, crashes were still coming through and we didn't notice the dip.

I don't know what was specific about the crashes that didn't work. I know what the bug was but can't explain why it didn't happen to every crash. Sorry.
It was unfortunate but has since gone way beyond our retention period. We're working hard on avoiding this happening again.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID