Closed Bug 867342 Opened 7 years ago Closed 7 years ago

crash reason missing since 2013-04-09

Categories

(Socorro :: General, task, critical)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: rhelmer)

Attachments

(1 file, 2 obsolete files)

Blocks: 867345
Scanning through the reports table in production, I see no lack of 'reason'. There are some records that have no value for 'reason', but those are all directly associated with failures in MDSW. Where are you seeing 'reason' missing?
(In reply to K Lars Lohn [:lars] [:klohn] from comment #1)
> scanning through the reports table in production, I see no lack of 'reason'.
> There are some records that have no value for 'reason', but those  are all
> directly associated with failures in MDSW.    Where are you seeing 'reason'
> missing?

I had the same experience looking at the reports table. I suspect that a missing reason is what's happening because of this output from debugging I've added to the correlation script:

Traceback (most recent call last):
  File "./crash-data-tools/per-crash-interesting-modules.py", line 122, in <module> 
    signame = signame + "|" + crash["reason"]
TypeError: coercing to Unicode: need string or buffer, NoneType found
[2013-04-20 19:01:54] per-crash-interesting-modules.py > /tmp/tmp.5BDsxS1n4l/20130420_Firefox_20.0.1-interesting-modules-with-versions.txt
rhelmer debug: crash["reason"] is None

I'll go ahead and get it to print out the crash ID as well; it may be that only a subset of crashes is affected. In any case, this only started appearing on 2013-04-09.
Here is an example of a crash with an empty reason field:

https://crash-stats.mozilla.com/report/index/67fd2ffd-8d97-415b-a4d2-0a4a32130414
Notice that MDSW failed on that crash. The processor cannot supply a 'reason' if MDSW failed.
Also note that this problem apparently begins two days _before_ the new processor went into production. What changed on the 9th?
I've compared the same crash signature from the first week of April with the second week of April. MDSW succeeds in the first week and fails in the second week on the same crash.

Did we get a new MDSW on the 9th?
failing MDSW: 5daa2a7e-666d-4d4e-97b7-31e032130409
succeeding MDSW: c91b981d-a453-458b-b69f-290962130403
Assignee: lars → rhelmer
Component: Backend → General
Severity: normal → critical
I don't recall what action we decided on here, but I see this is assigned to me so I am going to make the correlation scripts ignore a blank crash reason.
Status: NEW → ASSIGNED
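For illustration, here is a minimal sketch of ignoring a blank crash reason when building the signature key. It is not the patch that landed: the helper name `build_signame` and the shape of the `crash` dictionary are assumptions based only on the traceback earlier in this bug.

```python
def build_signame(signame, crash):
    """Append the crash reason to the signature, skipping a blank one.

    Hypothetical helper: `crash` is assumed to be a dict in which
    "reason" may be missing, None, or an empty string.
    """
    reason = crash.get("reason")
    if not reason:
        # No reason available (for example when MDSW failed): leave the
        # signature unchanged instead of raising a TypeError.
        return signame
    return signame + "|" + reason
```

With this guard, a crash whose reason is None keeps its plain signature, so the script no longer fails on the string concatenation.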
Testing a fix with 20130510 now.
https://crash-analysis.mozilla.com/crash_analysis/20130510/ looks better, but a little odd:

* I notice this at the top of the reports, where OS should go (which is new) in e.g. https://crash-analysis.mozilla.com/crash_analysis/20130510/20130510_Firefox_21.0-interesting-addons-with-versions.txt.gz:

None
  EMPTY: no crashing thread identified; corrupt dump (5796 crashes)

* The .txt version seems to be 0-byte for https://crash-analysis.mozilla.com/crash_analysis/20130510/20130510_Firefox_21.0-interesting-addons-with-versions.txt but the .gz is fine

The second point is really odd; I am not sure why we produce both of these anyway. I think the first point is an artifact of the client-side bug suspected of causing this.

Ted, do you happen to know the bug# of that client-side bug? ^
Flags: needinfo?(ted)
OK, I have removed the existing 0-byte files and am backfilling for April and May 2013 (backwards from May 9th).

I will work on getting this patch into the upstream crash data tools and deployed.
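A rough sketch of how the 0-byte report files could be located before removing them. The function name `find_empty_files` and the directory layout are assumptions for illustration; this is not the script that was actually used.

```python
import os


def find_empty_files(root, suffix=".txt"):
    """Return sorted paths of zero-byte files under `root` ending in `suffix`.

    Hypothetical helper: `root` stands in for a local directory that holds
    the generated correlation reports.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only report files with the wanted suffix and no content.
            if name.endswith(suffix) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Listing the files first, rather than deleting them in the same pass, makes it possible to check which dates need to be backfilled.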
(In reply to Robert Helmer [:rhelmer] from comment #11)
> None
>   EMPTY: no crashing thread identified; corrupt dump (5796 crashes)


That's expected. Most of those have empty dumps, and the OS information is actually in the dump, so we don't know the OS there right now (Ted has filed a bug against himself to help improve that).
(In reply to Robert Helmer [:rhelmer] from comment #11)
> * I notice this at the top of the reports, where OS should go (which is new)
> in e.g.
> https://crash-analysis.mozilla.com/crash_analysis/20130510/
> 20130510_Firefox_21.0-interesting-addons-with-versions.txt.gz:
> None
>   EMPTY: no crashing thread identified; corrupt dump (5796 crashes)
There were no crash correlations for that signature (with no crash reason) previously.
Backfilled files give a 403 Forbidden error.
(In reply to Scoobidiver from comment #15)
> Backfilled files give a 403 Forbidden error.

I've corrected this for the files done so far.
That's bug 870165.
Flags: needinfo?(ted)
(In reply to Robert Helmer [:rhelmer] from comment #12)
> I will work on getting this patch into the upstream crash data tools and
> deployed.
Do you have a deadline for that because the release of a new version demands up-to-date crash correlations?
(In reply to Scoobidiver from comment #18)
> (In reply to Robert Helmer [:rhelmer] from comment #12)
> > I will work on getting this patch into the upstream crash data tools and
> > deployed.
> Do you have a deadline for that because the release of a new version demands
> up-to-date crash correlations?

The previous backfill completed and looks OK; I'll put up the patch shortly.

Just kicked off 201305{15..11} right now too.
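The notation 201305{15..11} reads like a shell brace expansion counting down. As a sketch, assuming it means every day from the 15th back to the 11th inclusive, the same list of dates can be generated as follows. The name `backfill_dates` is hypothetical and not part of the actual backfill script.

```python
from datetime import date, timedelta


def backfill_dates(newest, oldest):
    """Yield YYYYMMDD strings from `newest` back to `oldest`, inclusive."""
    current = newest
    while current >= oldest:
        yield current.strftime("%Y%m%d")
        current -= timedelta(days=1)


# The range written as 201305{15..11} in the comment above.
dates = list(backfill_dates(date(2013, 5, 15), date(2013, 5, 11)))
```

Working backwards means the most recent reports, which are the ones needed for a new release, are regenerated first.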
Attached patch ignore empty crash reason (obsolete)
Bug 870165 is causing crash reason to be empty for some crashes.
Attachment #749942 - Flags: review?(dbaron)
Comment on attachment 749942
ignore empty crash reason

Seems reasonable, though please use diff -u.

(Not sure what the added "import sys" is for, though it shouldn't do any harm either.)
Attachment #749942 - Flags: review?(dbaron) → review+
Attached patch ignore empty crash reason v2 (obsolete)
Bug 870165 is causing crash reason to be empty for some crashes (same as the previous patch, but with debugging removed and made to apply cleanly to an hg checkout).
Attachment #749942 - Attachment is obsolete: true
Attachment #749944 - Flags: review?(dbaron)
Attachment #749944 - Attachment is obsolete: true
Attachment #749944 - Flags: review?(dbaron)
(In reply to Robert Helmer [:rhelmer] from comment #23)
> Created attachment 749951
> ignore empty crash reason (as landed)

Landed as http://hg.mozilla.org/users/dbaron_mozilla.com/crash-data-tools/rev/0d9be01ab7ce (accidentally pushed a commit for bug 788055 originally, so backed it out and pushed this instead)
There's still the forbidden permission for recently backfilled files.

(In reply to Robert Helmer [:rhelmer] from comment #24)
> Landed as
> http://hg.mozilla.org/users/dbaron_mozilla.com/crash-data-tools/rev/
> 0d9be01ab7ce (accidentally pushed a commit for bug 788055 originally, so
> backed it out and pushed this instead)
Is it in prod, or should we wait for Socorro 47?
(In reply to Scoobidiver from comment #25)
> There's still the forbidden permission for recently backfilled files.

Fixed this and continued backfilling. I think this is a bug in the cron job: it depends on a side effect of another job to set the permissions correctly. I'll file a separate bug for this.

> (In reply to Robert Helmer [:rhelmer] from comment #24)
> > Landed as
> > http://hg.mozilla.org/users/dbaron_mozilla.com/crash-data-tools/rev/
> > 0d9be01ab7ce (accidentally pushed a commit for bug 788055 originally, so
> > backed it out and pushed this instead)
> Is it in prod, or should we wait for Socorro 47?

I filed bug 874113 to deploy this; it doesn't need to wait for a Socorro release.
Depends on: 874113
Forbidden permission on May 23 and 25.

Bug 870029 hasn't been taken into account.
(In reply to Scoobidiver from comment #27)
> Forbidden permission on May 23 and 25.

Fixed this, I'll make it part of the backfill script if this needs to be done again.

> Bug 870029 hasn't been taken into account.

I had thought bug 874113 would be closed by now :/ I will poke on that one.
(In reply to Robert Helmer [:rhelmer] from comment #28)
> (In reply to Scoobidiver from comment #27)
> > Bug 870029 hasn't been taken into account.
> I had thought bug 874113 would be closed by now :/ I will poke on that one.
Bug 870029 was dependent on a Socorro release while bug 874113 isn't so it's odd they are related.
(In reply to Scoobidiver from comment #29)
> (In reply to Robert Helmer [:rhelmer] from comment #28)
> > (In reply to Scoobidiver from comment #27)
> > > Bug 870029 hasn't been taken into account.
> > I had thought bug 874113 would be closed by now :/ I will poke on that one.
> Bug 870029 was dependent on a Socorro release while bug 874113 isn't so it's
> odd they are related.

The script I am using to backfill is outside of Socorro, which is why bug 870029 was not picked up automatically. Bug 874113 should fix the real underlying problem here which is why I lamented it :)
I have moved the overrides from bug 870029 into the backfill script (sorry for missing that before!) and have also put in a fix for the "permission denied" problem. Backfill is running for 2013-05-{25..15} now.

BTW, one of the reasons I broke this backfill into a separate script is so that it does not interfere with normal runs. When bug 874113 is resolved I expect this to be fixed, but the backfill will not step on the normal nightly run (or vice versa). Missing things like manual overrides is an unfortunate side effect though.
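This bug does not say what the "permission denied" fix was. As a sketch under the assumption that the 403 errors came from backfilled files that were not readable by the web server, a backfill script could set the mode explicitly after writing each file. The helper name `make_world_readable` is hypothetical.

```python
import os
import stat


def make_world_readable(path):
    """Add read permission for user, group, and others to `path`.

    Assumption: the web server needs the "other" read bit to serve the
    file. Existing permission bits are kept as they are.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```

Setting the mode in the script itself avoids relying on another job to correct the permissions later.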
(In reply to Robert Helmer [:rhelmer] from comment #31)
> I have moved the overrides from bug 870029 into the backfill script (sorry
> for missing that before!) and have also put in a fix for the "permission
> denied" problem. Backfill is running for 2013-05-{25..15} now.

Oops that should read {28..15}
Also, for anyone interested, we are tracking the replacement for this version of the correlations system in bug 875990. The current system has many moving parts that aren't really necessary anymore, and (obviously) has not been as well-maintained as we would like.
This should be fixed now (!)

I will continue to monitor, please do let me know if I've missed anything.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED