Closed Bug 643201 Opened 13 years ago Closed 13 years ago

Some Fennec 4.0b5 crash reports have still not been reprocessed

Categories

(Socorro :: General, task, P1)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jdm, Assigned: rhelmer)

Attachments

(2 files)

4.0b5 crash reports are still filled with numerous nonsensical crashes that require reprocessing.  Take, for example, https://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks&date=2011-03-19%2016%3A00%3A00&signature=nsTableFrame%3A%3AInsertCol&version=Fennec%3A4.0b5, which contains crashes going back to March 7 that look like prime fodder for reprocessing.  Is the cron job still working correctly?  The majority of named signatures in the 4.0b5 crashes (i.e. not libc.so or libdvm.so crashes) are completely bogus, which makes triaging them quite difficult.
(In reply to comment #0)
> 4.0b5 crashes reports are still filled with numerous nonsensical crashes that
> require reprocessing.  Take, for example,
> https://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks&date=2011-03-19%2016%3A00%3A00&signature=nsTableFrame%3A%3AInsertCol&version=Fennec%3A4.0b5,
> which contains crashes going back to March 7 which look like prime fodder for
> reprocessing.  Is the cron job still working correctly?  The majority of named
> signatures in the 4.0b5 crashes (ie. not libc.so or libdvm.so crash) are
> completely bogus, which makes triaging them quite difficult.

I just checked, and according to the logs the cron job is running correctly. I will dig deeper tomorrow.
Assignee: nobody → rhelmer
Priority: -- → P1
Looking at the first one in the list
https://crash-stats.mozilla.com/report/index/d68f1c02-5d79-419a-96d4-179502110319

I see this one has not been fixed. If specifically applying the fix on this crash doesn't work, maybe some of the assumptions of the fix don't work here, in which case it'd be helpful to get the corresponding minidump.
(In reply to comment #2)
> Looking at the first one in the list
> https://crash-stats.mozilla.com/report/index/d68f1c02-5d79-419a-96d4-179502110319
> 
> I see this one has not been fixed. If specifically applying the fix on this
> crash doesn't work, maybe some of the assumptions of the fix don't work here,
> in which case it'd be helpful to get the corresponding minidump.

From the cron logs, I don't think the fix was applied, and from the processor notes I don't think these were submitted for re-processing.

I've been going through the logic for the "fixBrokenDumps" cron job, and don't yet see how we could be missing these:

--
t1 = last_processed_date
"""
SELECT uuid,date_processed FROM reports WHERE product = 'Fennec'
  AND version = '4.0b5'
  AND date_processed > t1
  AND date_processed < (now() - INTERVAL '30 minutes')
"""
for each row:
  fix crash dump
  re-insert into hbase
  mark for re-processing
  update last_processed_date
save last_processed_date
--
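
The selection window in the pseudocode above can be sketched in Python. This is only an illustration, not the actual fixBrokenDumps code: the function name `select_candidates` and the in-memory row tuples are assumptions standing in for the SQL query against the reports table.

```python
from datetime import datetime, timedelta


def select_candidates(rows, last_processed_date, now,
                      product="Fennec", version="4.0b5"):
    """Mimic the WHERE clause of the query (hypothetical helper).

    rows is a list of (uuid, product, version, date_processed) tuples.
    A row qualifies when it is strictly newer than the checkpoint and
    at least 30 minutes older than `now`.
    """
    cutoff = now - timedelta(minutes=30)
    return [
        (uuid, date_processed)
        for (uuid, row_product, row_version, date_processed) in rows
        if row_product == product
        and row_version == version
        and last_processed_date < date_processed < cutoff
    ]
```

Any crash whose date_processed is at or before the checkpoint value is never selected, so a checkpoint that is too recent silently excludes older crashes.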

Looking at the logs, I don't see d68f1c02-5d79-419a-96d4-179502110319 (each step described above is logged).

I am going to add a bit more debug logging and get that into prod, since I don't have enough info now to reconstruct the query after the fact (specifically the value of last_date_processed).
Status: NEW → ASSIGNED
Added debug statements for the "update last_processed_date" and "save last_processed_date":

Committed revision 3012.

Filed bug 643483 to get that in production.

Continuing to go over the logic here in the meantime.
One thing I have noticed:

(In reply to comment #3)
> """
> SELECT uuid,date_processed FROM reports WHERE product = 'Fennec'
>   AND version = '4.0b5'
>   AND date_processed > t1
>   AND date_processed < (now() - INTERVAL '30 minutes')
> """

should probably be using "ORDER BY date_processed", since it looks like that table isn't perfectly ordered in its natural order. However, if anything this would cause needless re-processing; it shouldn't cause any records to be skipped.
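
A rough sketch of why an unordered result set leads to extra work rather than skipped records, assuming the checkpoint is simply overwritten after each row as in the pseudocode. The helper name `checkpoint_after_run` is hypothetical.

```python
from datetime import datetime


def checkpoint_after_run(dates_in_fetch_order):
    """Return the checkpoint left behind when it is overwritten per row."""
    checkpoint = None
    for date_processed in dates_in_fetch_order:
        checkpoint = date_processed  # "update last_processed_date"
    return checkpoint


# Rows come back out of order: the newest row is not the last one fetched.
unordered = [
    datetime(2011, 3, 19, 10, 0),
    datetime(2011, 3, 19, 12, 0),
    datetime(2011, 3, 19, 11, 0),
]

saved = checkpoint_after_run(unordered)
saved_sorted = checkpoint_after_run(sorted(unordered))
```

Here `saved` ends up at 11:00 while the newest row is 12:00, so the next run would select the 12:00 row again. With sorted input the checkpoint lands on the newest row. Either way the saved value is never later than the newest row in the batch, so no row from that batch is skipped.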
Digging into one day (2011-03-19), I can see that we logged fixing 60 crashes, and there are 60 records which contain the string 'replacement' in the processor_notes field. 

However, I see 1266 for that day which *do not* contain that string.

This leads me to suspect that the query isn't returning what we are expecting (perhaps related to the time at which it's being executed?).
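
The comparison above amounts to a tally like the following sketch. The function name `tally_fixed` is hypothetical, and treating a missing processor_notes value as an empty string is an assumption.

```python
def tally_fixed(processor_notes, marker="replacement"):
    """Count notes that contain the marker and notes that do not."""
    fixed = sum(1 for note in processor_notes if marker in (note or ""))
    return fixed, len(processor_notes) - fixed
```

With the figures quoted for 2011-03-19, 60 of the 1326 Fennec 4.0b5 records for that day carry the marker.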
Committed revision 3014.
Here's the problem: each call to fixBrokenDumps.fix() stores the last_date_processed in the persistent file, and we call it twice (first for Firefox Linux, then for Fennec), so Fennec gets a much-too-recent last_date_processed.

Committed revision 3015.

I've tested this (read-only) against production, and it now matches what I expect when running the SQL queries by hand.
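
The failure mode described above can be reproduced with a small sketch. Everything here is hypothetical (the class names, the `fix` signature, the in-memory rows), and keeping one checkpoint per product is just one way to avoid the problem; the thread does not say how revision 3015 actually changed the code.

```python
from datetime import datetime


class SharedCheckpoint:
    """One persisted value shared by every product (the buggy arrangement)."""

    def __init__(self, start):
        self.value = start

    def load(self, product):
        return self.value

    def save(self, product, value):
        self.value = value


class PerProductCheckpoint:
    """A separate persisted value for each product."""

    def __init__(self, start):
        self.start = start
        self.values = {}

    def load(self, product):
        return self.values.get(product, self.start)

    def save(self, product, value):
        self.values[product] = value


def fix(store, product, rows):
    """Select rows newer than the checkpoint, then save the newest date seen."""
    t1 = store.load(product)
    selected = sorted(d for (p, d) in rows if p == product and d > t1)
    if selected:
        store.save(product, selected[-1])
    return selected


start = datetime(2011, 3, 19, 9, 0)
rows = [
    ("Firefox", datetime(2011, 3, 19, 12, 0)),
    ("Fennec", datetime(2011, 3, 19, 10, 0)),
]

shared = SharedCheckpoint(start)
fix(shared, "Firefox", rows)               # moves the shared checkpoint to 12:00
fennec_shared = fix(shared, "Fennec", rows)

separate = PerProductCheckpoint(start)
fix(separate, "Firefox", rows)
fennec_separate = fix(separate, "Fennec", rows)
```

With the shared checkpoint, the Firefox run advances the saved date past the 10:00 Fennec crash, so `fennec_shared` is empty. With separate checkpoints the same crash is selected.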
Depends on: 643594
Filed bug 643594 to get these corrections into production.

I'll schedule a time to rebuild the top crashers list table and correct the Fennec crashes since the cron job went live (everything before that should be ok, the bug here is only in how we store the date for hourly cron purposes).
No longer depends on: 643594
Incoming crashes are now being fixed correctly (bug 643594; I'll do some further testing to verify it tomorrow), and we're working on scheduling a time to reprocess the window where we were missing most Fennec crashes (2011-03-07 through 2011-03-21) in bug 643599.

This will require some downtime to rebuild the top crashers list. We don't want to do that tonight since we're shipping Fx4 tomorrow, but I'll make sure it gets done as soon as is feasible.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Depends on: 643599
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro