Closed Bug 1360873 Opened 4 years ago Closed 4 years ago

Crash in nsMsgDatabase::GetUint32Property

Categories

(Thunderbird :: General, defect)

55 Branch
Unspecified
Windows 10
defect
Not set
critical

Tracking

(thunderbird_esr5254+ fixed, thunderbird53 wontfix, thunderbird54 fixed, thunderbird55 fixed)

VERIFIED FIXED
Thunderbird 55.0
Tracking Status
thunderbird_esr52 54+ fixed
thunderbird53 --- wontfix
thunderbird54 --- fixed
thunderbird55 --- fixed

People

(Reporter: wsmwk, Assigned: mkmelin)

References

Details

(Keywords: crash, regression, topcrash-thunderbird, Whiteboard: [regression:TB55])

Crash Data

Attachments

(2 files)

New #1 crash for nightly builds, avg 15 per day, starting with 20170422030207 bp-b70bed9e-e467-4ba8-97d7-45a310170427. 
most are this user who I am attempting to contact bp-4bbd7cdc-0bc1-4e7f-9de1-d5b160170429

All crash comments:
bp-f4b846fb-2d07-4266-bad2-898e50170424  Was activating an auto compacting fllowing message deletion.
bp-c61732f1-fc68-43b1-93d5-682b70170424  I just renamed a folder. It moved accordingly to its new position, as alphabetically determined. I then noticed another folder I hadn't peered into in a while. The app crashed when I clicked bp-b5444f5c-2d8e-4887-bcc1-fdfff0170423  Now I was reading emails in Local email sent to my POP3 address, pfd@torfree.net
bp-65aeb6ce-2732-460e-81f2-334e30170423  I was clicking on and briefly examing the contents of IMAP gmail folders. 

Anything in https://hg.mozilla.org/comm-central/pushloghtml?startdate=2017-04-20+03%3A05%3A00&enddate=2017-04-22+03%3A05%3A00 ?

All the Windows crashes I examined have ucrtbase.dll on stack.  But there is one linux crash bp-73c9bcb4-7d30-4ae0-b66c-d167e0170428
You can't win. We've just fixed the #1 top crash, IMAP compaction, so here's another one. I suspect this might be caused by bug 853881 where we reworked the hash table in nsMsgDatabase.cpp. That landed on 2017-04-22. It's not in the range above, but if you extend it by a few hours, you'll see it, so:
https://hg.mozilla.org/comm-central/pushloghtml?startdate=2017-04-20+03%3A05%3A00&enddate=2017-04-22+10%3A05%3A00
Kent, any insight here, since we both broke this together with Joshua in absentia :-(
Flags: needinfo?(rkent)
Yeah, looking at build times, it's pretty clear they are starting later than 03:00
Blocks: 853881
Keywords: regression
Whiteboard: [regression:TB55]
Don't know the actual cause, but this should help
Assignee: nobody → mkmelin+mozilla
Status: NEW → ASSIGNED
Attachment #8863223 - Flags: review?(jorgk)
How about instead of papering over the problem, we make an attempt to see where it comes from. We let this run in the wild for a while, then back it out and land your patch instead with or without addressing the cause.
Attachment #8863264 - Flags: review?(mkmelin+mozilla)
Landed the diagnostic patch which should be backed out in a few days:
https://hg.mozilla.org/comm-central/rev/6f64b91e9b8ca636e98f9d05998e8d22e420f75b

Wayne, the crash signature will disappear now and be replaced by another one coming from a programmed crash. Should be easy to recognise as coming from nsMsgDatabase.cpp:3799. Can you please keep an eye out for it and let me know.
Flags: needinfo?(vseerror)
There is code in nsMsgDatabase, ForceClosed(), which nulls m_mdbStore. Any time there is code such as that, it is important that we always check for null m_mdbStore before access. At least two of the comments concern situations where I would expect ForceClosed() to be called (compact and rename).

So I do not consider the checks in Magnus' patch to be papering over issues, but necessary. After all any addon can keep the database alive easily enough.

The whole mork/msgDatabase model is really inadequate, so checks such as these are the only approach we can take. If you can identify a particular path that causes the issues, fine fix that, but the checks are needed anyway.
Flags: needinfo?(rkent)
Yes, the intention is to land Magnus' patch. However, this will then return an error to a caller which might not be expecting it, so I'd like to know who that caller is. All the crash reports indicate that this is called more or less directly from JS, so I added some code to dump out the JS call stack. I tested it locally and I did get the stack.
(In reply to Jorg K (GMT+2) from comment #8)
> Y However, this will then return an error to a caller which might not be expecting it

Callers should always be expecting it :)

Anyway, since this is a top crash, I don't know if it's very appropriate to have users crash for one cycle just to gather some info (which we partly know already).
"One cycle"? I'd like to collect the information for a few days, then land your patch. With 15 crashes per day we should be done here mid-week.
I agree experimentation is worthwhile. It's a topcrash but based on crash rate [1] and comments it is not debilitating [2]. 

[1] https://crash-stats.mozilla.com/topcrashers/?product=Thunderbird&version=55.0a1

[2] https://crash-stats.mozilla.com/signature/?signature=nsMsgDatabase%3A%3AGetUint32Property&date=%3E%3D2017-04-24T19%3A23%3A00.000Z&date=%3C2017-05-01T19%3A23%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#comments


(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #0)
> New #1 crash for nightly builds, avg 15 per day, starting with
> 20170422030207 bp-b70bed9e-e467-4ba8-97d7-45a310170427. 
> most are this user who I am attempting to contact
> bp-4bbd7cdc-0bc1-4e7f-9de1-d5b160170429

This same user sandpk crashed with today's build bp-444ddfaa-cd89-4fd9-b9d3-404cc0170501
Flags: needinfo?(vseerror) → needinfo?(jorgk)
Comment on attachment 8863264 [details] [diff] [review]
1360873-diagnostic.patch

Review of attachment 8863264 [details] [diff] [review]:
-----------------------------------------------------------------

Ok!
Attachment #8863264 - Flags: review?(mkmelin+mozilla) → review+
Comment on attachment 8863223 [details] [diff] [review]
bug1360873_Crash_in_nsMsgDatabase___GetUint32Property.patch

Already landed ;-)
Attachment #8863223 - Flags: review?(jorgk) → review+
https://hg.mozilla.org/comm-central/rev/6f64b91e9b8ca636e98f9d05998e8d22e420f75b (diagnostic patch, comment #6)
https://hg.mozilla.org/comm-central/rev/866a0a77cc0bef57a6cb43ec246d1f5a310cea61 (backout of diagnostic patch)
https://hg.mozilla.org/comm-central/rev/95a8c829e1d69b5c729f6fa6d0318a29693be982 (real fix)

So sad, my diagnostic patch was really nice ;-)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 55.0
Comment on attachment 8863223 [details] [diff] [review]
bug1360873_Crash_in_nsMsgDatabase___GetUint32Property.patch

Simple patch to avoid crashes by dereferencing null pointers. Good to have on other branches, too, although we don't have crash reports there.
Attachment #8863223 - Flags: approval-comm-esr52?
Attachment #8863223 - Flags: approval-comm-beta+
(In reply to Jorg K (GMT+2) from comment #12)
> ...
> 
> All show the very same cause:
> MOZ_CRASH Reason - gloda-id: 0 PendingCommitTracker_commitCallback()
> ["resource:///modules/gloda/index_msg.js":165]
> 
> https://dxr.mozilla.org/comm-central/rev/
> fa94fa8799570d1cad75f2f44a667ff3f514779c/mailnews/db/gloda/modules/index_msg.
> js#165

I looked at a few more crashes submitted today and they all show the same reason.
Enough info to file a new bug?

Also, no crashes reported so far with today's nightly build 20170502, so v.fixed
Status: RESOLVED → VERIFIED
(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #18)
> I looked at a few more crashes submitted today and they all show the same
> reason.
> Enough info to file a new bug?
I don't understand.

The diagnostic patch that produced the reason stated above was live for one day in the Daily of May 1st. So the only binary ever to report this has now been replaced with one that doesn't crash any more, as you confirmed.

So which bug would you like to report now? Or give me a crash ID and I'll look into it.
The premise of your diagnostic patch was to learn the "actual cause". And you pointed to the code in comment 12. No?
We found the cause, somewhat. Gloda accesses a stale header, and there's a long comment explaining that:
https://dxr.mozilla.org/comm-central/rev/fa94fa8799570d1cad75f2f44a667ff3f514779c/mailnews/db/gloda/modules/index_msg.js#158
The JS code in question is wrapped in a try/catch, so now we return an error from the C++ code instead of crashing and Gloda deals with it. It will forever remain a mystery how bug 853881 slightly changed code paths so now that potential hibernating crash cause was exposed. There's nothing else we can do, we're done here. For more bad news, re-read the last paragraph of Kent's comment #7.
We can remove tracking esr52?
Flags: needinfo?(jorgk)
(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #22)
> We can remove tracking esr52?
Does it bother you? It's not an issue in TB 2, but I'd like to uplift it eventually for good house-keeping.
Flags: needinfo?(jorgk)
Attachment #8863223 - Flags: approval-comm-esr52? → approval-comm-esr52+
You need to log in before you can comment on or make changes to this bug.