1368786 - Crash in morkRow::GetCell via XPTC__InvokebyIndex

Reporter

Description

•

7 years ago

#1 crash for nightly 55.0a1. First seen for build 20170422030207 bp-5975a332-7eeb-4708-8ea5-4e72b0170422.
0 	xul.dll	morkRow::GetCell(morkEnv*, unsigned int, int*)	C:/builds/moz2_slave/tb-c-cen-w64-ntly-000000000000/build/db/mork/src/morkRow.cpp:516
1 	xul.dll	XPTC__InvokebyIndex	xpcom/reflect/xptcall/md/win32/xptcinvoke_asm_x86_64.asm:58
2 		@0x20bf405d7ff	
3 	xul.dll	XPTC__InvokebyIndex	xpcom/reflect/xptcall/md/win32/xptcinvoke_asm_x86_64.asm:97
4 		@0xe3d7ff7187	
5 	xul.dll	XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)	js/xpconnect/src/XPCWrappedNative.cpp:1296
6 	xul.dll	XPC_WN_GetterSetter(JSContext*, unsigned int, JS::Value*)	js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1019
7 	xul.dll	js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)	js/src/vm/Interpreter.cpp:470
8 	xul.dll	js::CallGetter(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::MutableHandle<JS::Value>)	js/src/vm/Interpreter.cpp:649
9 	xul.dll	CallGetter	js/src/vm/NativeObject.cpp:1788
10 	xul.dll	js::NativeGetProperty(JSContext*, JS::Handle<js::NativeObject*>, JS::Handle<JS::Value>, JS::Handle<jsid>, JS::MutableHandle<JS::Value>)	js/src/vm/NativeObject.cpp:2101
11 	xul.dll	js::GetProperty(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, js::PropertyName*, JS::MutableHandle<JS::Value>)	js/src/jsobj.h:845
12 	xul.dll	js::GetProperty(JSContext*, JS::Handle<JS::Value>, JS::Handle<js::PropertyName*>, JS::MutableHandle<JS::Value>)	js/src/vm/Interpreter.cpp:4399
13 	xul.dll	GetPropertyOperation	js/src/vm/Interpreter.cpp:193
14 	xul.dll	Interpret	js/src/vm/Interpreter.cpp:2740
15 	xul.dll	js::RunScript(JSContext*, js::RunState&)	js/src/vm/Interpreter.cpp:410
16 	xul.dll	js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)	js/src/vm/Interpreter.cpp:488
17 	xul.dll	js::jit::DoCallFallback	js/src/jit/BaselineIC.cpp:2452
18 		@0x1349838c169	

Does not appear to be add-on related. 

Some of the users see other crash signatures. like  morkRow::GetRowSpaceStore bp-5d1ea52f-52c2-47e4-9652-9048e0170516 - but that is not a new signature

Wayne Mery (:wsmwk)

Reporter

Comment 1

•

7 years ago

> #1 crash for nightly 55.0a1.

#1 crash for 55.0b2. And still happening in recent nightlies, like bp-c45fd99f-0883-41bd-b8dc-9a0430170718

Most user are reporting delete related problems. Reporter of bp-277c6aa4-a7d0-488c-bc9a-8a0b40170720 tells me repairing trash folder solved the problem

> First seen for build 20170422030207...

At rate of several crashes per day [1], so I'd say the regression checkin occurred in the prior 1-2 days https://hg.mozilla.org/comm-central/pushloghtml?startdate=2017-04-18+03%3A05%3A00&enddate=2017-04-22+03%3A05%3A00


[1] https://crash-stats.mozilla.com/signature/?release_channel=nightly&signature=morkRow%3A%3AGetCell&date=%3E%3D2017-01-21T09%3A23%3A04.000Z&date=%3C2017-04-25T09%3A23%3A00.000Z&_columns=date&_columns=build_id&_columns=user_comments&_sort=-date&page=1

status-thunderbird55: --- → affected

OS: Windows 10 → All

Wayne Mery (:wsmwk)

Reporter

Comment 2

•

7 years ago

bug 1358341 ?

Jorg K (CEST = GMT+2)

Assignee

Comment 3

•

7 years ago

No way.

I don't see anything in the range you gave in comment #1. I looked at bug 1358444, in particular
https://hg.mozilla.org/comm-central/rev/2bf4d1d2bf58#l1.12
but that looks right, too. Kent, can you confirm, please.

https://crash-stats.mozilla.com/report/index/277c6aa4-a7d0-488c-bc9a-8a0b40170720 has a little more call stack.

It doesn't need to be a C-C change, something can have changed slightly in M-C exposing a memory mishandling which now results in a crash.

Flags: needinfo?(rkent)

Jorg K (CEST = GMT+2)

Assignee

Comment 4

•

7 years ago

On average, less than one crash per day:
https://crash-stats.mozilla.com/signature/?release_channel=nightly&signature=morkRow%3A%3AGetCell&date=%3E%3D2017-07-07T11%3A38%3A24.000Z&date=%3C2017-07-21T11%3A38%3A24.000Z&_columns=date&_columns=build_id&_columns=user_comments&_sort=-date&page=1

Some interesting and varied call stacks in there.

Kent James (:rkent)

Comment 5

•

7 years ago

> I looked at bug 1358444, in particular https://hg.mozilla.org/comm-central/rev/2bf4d1d2bf58#l1.12
but that looks right, too. Kent, can you confirm, please.

It looks OK to me too.

I looked over this bug, and could not get any good ideas on the cause. I'm not sure I really understand the crash itself. Like, how could we crash on line #510:

morkCell* cells = mRow_Cells;

Flags: needinfo?(rkent)

Jorg K (CEST = GMT+2)

Assignee

Comment 6

•

7 years ago

(In reply to Kent James (:rkent) from comment #5)
> Like, how could we crash on line #510:
> morkCell* cells = mRow_Cells;
Agreed, however, this one https://crash-stats.mozilla.com/report/index/e11b8c61-cc19-4f54-b858-642550170709 claims to crash on line #516:
mork_column col = cells->GetColumn();
where cells was assigned on line #510:
morkCell* cells = mRow_Cells;

Wayne Mery (:wsmwk)

Reporter

Comment 7

•

7 years ago

FWIW, the user of these crashes also crashed with bug 1377277 signature, which is also a topcrash for 55.0b2

bp-1b07b684-18d3-4af1-b3b3-cc33b0170726 
bp-43e0cb0b-0518-40da-abf4-511a00170725

Flags: needinfo?(m_kato)

Makoto Kato [:m_kato]

Comment 8

•

7 years ago

(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #7)
> FWIW, the user of these crashes also crashed with bug 1377277 signature,
> which is also a topcrash for 55.0b2
> 
> bp-1b07b684-18d3-4af1-b3b3-cc33b0170726 

comment #6.  But I don't know why cells is invalid address.

> bp-43e0cb0b-0518-40da-abf4-511a00170725

simply shutdown hang like bug 1377277.  But 2 sqlite3 thread tries accessing database without deadlock, so this might be slow on user PC/disk.

Flags: needinfo?(m_kato)

Wayne Mery (:wsmwk)

Reporter

Comment 9

•

7 years ago

Still #1 crash for 55.0b2 and exists for 56.0a1, so expecting to see this in 56.0b1.

This signature might be coming in part from users who had different crash signatures prior to 55.0b2. For example bug 1255903

status-thunderbird56: --- → affected

tracking-thunderbird56: --- → ?

Wayne Mery (:wsmwk)

Reporter

Comment 10

•

7 years ago

#3 crash for 56.0b2 (but only because it's behind *new* topcrashes).

Might this be relalated to gcc6?

Edmund Wong (:ewong)

Comment 11

•

7 years ago

it's been a while since I've last looked at this part of the code (if ever), and I have
zero idea what's going on.  that said, I was wondering if someone can clarify:  

  MORK_USED_1(ev);
  morkCell* cells = mRow_Cells;
  if ( cells )
  {
    morkCell* end = cells + mRow_Length;
    while ( cells < end )
    {
      mork_column col = cells->GetColumn();


As I understand this (which is extremely limited), class->GetColumn() calls :
https://hg.mozilla.org/comm-central/file/tip/db/mork/src/morkCell.h#l48
     mork_column  GetColumn() const { return morkDelta_Column(mCell_Delta); }

Which morkDelta_Column is defined in line 20:

  #define morkDelta_Column(self) ((self) >> morkDelta_kShift)

So just wondering if:

1) mRow_Cells is defined.  If so, what is it defined to?

I'm looking at the crash reason "EXCEPTION_ACCESS_VIOLATION_READ" and I'm wondering
if cells is defined.

Looking at https://dxr.mozilla.org/comm-central/source/db/mork/src/morkRow.cpp#187,

mRow_Cells = 0 (at line 199) only if (ioSpace && iOPool && inOid && inLength <= morkRow_kMaxLength &&
inOid->mOid_Id != morkRow_kMinusOneRid ) are all true.    Assuming they are all true, and 
in line 212,

  if (inLength)  is true

(from https://dxr.mozilla.org/comm-central/source/db/mork/src/morkRow.cpp#213)

does ioPool->NewCells(ev, inLength, zone);  return something valid?


Again, I have no clue, but just asking since the easiest for me is to figure out
where mRow_Cells is defined and work from there.  I apologize if it's irrelevant
to the issue.

Kent James (:rkent)

Comment 12

•

7 years ago

Edmund, I wish I could answer your questions, but I cannot. The mork code was written in the earliest days of Mozilla, and adapted in a non-standard way to XPCOM. None of us really understand it, or want to understand it. Replacing Mork has been a high priority for a very long time with no traction.

Edmund Wong (:ewong)

Comment 13

•

7 years ago

Has anyone else encountered this crash?
(In reply to Jorg K (GMT+2) from comment #6)
> (In reply to Kent James (:rkent) from comment #5)
> > Like, how could we crash on line #510:
> > morkCell* cells = mRow_Cells;
> Agreed, however, this one
> https://crash-stats.mozilla.com/report/index/e11b8c61-cc19-4f54-b858-
> 642550170709 claims to crash on line #516:
> mork_column col = cells->GetColumn();
> where cells was assigned on line #510:
> morkCell* cells = mRow_Cells;

So if cells == 0,  would it crash?

Jorg K (CEST = GMT+2)

Assignee

Comment 14

•

7 years ago

To answer the question, let's look at the code again:

morkCell*
morkRow::GetCell(morkEnv* ev, mdb_column inColumn, mork_pos* outPos) const
{
  MORK_USED_1(ev);
  morkCell* cells = mRow_Cells;
  if ( cells )
  {
    morkCell* end = cells + mRow_Length;
    while ( cells < end )
    {
516:  mork_column col = cells->GetColumn();

If cells==0 then line 516 is not executed. It will most likely crash is cells contains an invalid value it obtained from the assignment a few lines above.

Wayne Mery (:wsmwk)

Reporter

Comment 15

•

7 years ago

what's the conclusion then - msf corrupted, by something else much earlier, and we should not avoid the crash but rather find the regression range causing the corruption?

Jorg K (CEST = GMT+2)

Assignee

Comment 17

•

7 years ago

Triggered by the new bug 1398402 I've looked at this again. That bug has some very consistent crash reports.

If you look at them you see what happens: Some JS part of Thunderbird uses messageHeader.messageId and that invokes nsMsgHdr::GetMessageId() at mailnews/db/msgdb/src/nsMsgHdr.cpp:580. There is plenty of JS that does this:
https://dxr.mozilla.org/comm-central/search?q=.messageId&redirect=false

The regressing bug is bug 853881 which landed on 22nd April 2017 in two changesets, bug 853881 comment #34 and bug 853881 comment #42.

In bug 853881 Joshua replaced the "old technology" PLDHashTable with a more modern one. The bug was abandoned, Aceman and later Kent had the initiative to get it landed. So I spent a considerable amount of time on it. In the end, there was some doubt left re. reference counting, see for example bug 853881 comment #34.

So I believe what happens is that the new caching mechanism somehow hangs on to a stale row. When we come to access that header again, the new hash table hands back the stale row and when accessing it ... Crash!

Jorg K (CEST = GMT+2)

Assignee

Comment 18

•

7 years ago

I have the impression that what I described above is related to the code that got removed here:
https://hg.mozilla.org/comm-central/rev/62bfaab2731fc47ee7ed26aefa001e63c4f2c371#l2.259

Jorg K (CEST = GMT+2)

Assignee

Comment 19

•

7 years ago

Attached patch 1368786-mork-crash.patch (obsolete) — Details — Splinter Review

Joshua, looks like your hash table replacement has done some damage, in fact, caused the #1 top crash :-(

What do you think of this proposal?

Attachment #8906332 - Flags: feedback?(Pidgeot18)

Jorg K (CEST = GMT+2)

Assignee

Comment 20

•

7 years ago

Comment on attachment 8906332 [details] [diff] [review]
1368786-mork-crash.patch

Maybe Kent passes by before Joshua comments. I have a keen user in bug 1398402 eager to try this out, but sadly C-C try builds don't work (bug 1398554), so I can't get a binary to him.

Attachment #8906332 - Flags: review?(rkent)

Jorg K (CEST = GMT+2)

Assignee

Comment 21

•

7 years ago

Kent, user "doug2" over in bug 1398402 has been running a try build with this patch for a few days and his crashes have gone away.

Repeating: The patch restores part of the code that was removed here:
https://hg.mozilla.org/comm-central/rev/62bfaab2731fc47ee7ed26aefa001e63c4f2c371#l2.259
to avoid stale handles to Mork rows, if we can believe the comment.

Wayne Mery (:wsmwk)

Reporter

Updated

•

7 years ago

Blocks: 853881

Keywords: regressionwindow-wanted

doug2

Comment 22

•

7 years ago

20:37 no crashes today.  A couple of instances of disappearing message windows (like the window crashed, but not the app).  Otherwise stable. I'll stay on daily until I hear otherwise.

Jorg K (CEST = GMT+2)

Assignee

Comment 23

•

7 years ago

Kent, this user has been crash-free ever since he started using a version containing this patch, so for a few days now. Before he crashed a few times a day.

Doug2, the Daily version you're using is pretty stable, I'm using it myself. Please stay on it.

Kent James (:rkent)

Comment 24

•

7 years ago

Comment on attachment 8906332 [details] [diff] [review]
1368786-mork-crash.patch

Review of attachment 8906332 [details] [diff] [review]:
-----------------------------------------------------------------

I spent many hours looking at this, trying to understand the underlying memory model a little better. I'm still not comfortable that I understand it, but really that relates more to the changes in bug 853881 than this one.

As I understand the memory model pre bug 853881, the COM reference to the nsIMdbRow of m_mdbRow in nsMsgHdr belongs to nsMsgHdr, such that any nulling of m_mdbRow required NS_RELEASE(m_mdbRow). In the implementation of bug 853881, you went over this issue quite a bit, and ultimately removed the call to NS_RELEASE(m_mdbRow) in nsMsgHdr::~nsMsgHdr(). You requested many times that I investigate that, but I don't really understand the mork memory model, and it is non-trivial (as in days or weeks) to try to really understand it.

So I'm really not very happy with patch https://hg.mozilla.org/comm-central/rev/5c3191edbbd4fe552450b582d520ba9be4cd431b from bug 853881. That patch seems to me to be incorrect. You believed that the destructor code in morkNode::CutWeakRef and morkNode::ZapOld (which calls the destructor explicitly!) was removing the object, but I'm not sure that is correct. It seems to me that there is no longer anything that releases m_mdbRow, which would result in a fairly substantial memory leak.

So prior to that patch, I would have required here in the current patch NS_IF_RELEASE(m_mdbRow) (actually of iter.Data()->m_mdbRow) prior to nulling, but that assumes that the last patch in bug 853881 is invalid. I'm really not sure what to do here because I don't really like the previous patch.

Can you explain where you believe that the m_mdbRow is being deleted? Do you believe it is in mork somewhere? Yet prior to bug 853881 there were clearly NS_RELEASE calls that would have resulted in it being deleted. What has changed?

Really the problem here is that the technical debt of the mork code is catching up with us. We all hope that someone else will figure this out, when the reality is that the code is hopelessly obsolete and none of us is going to be able to figure it out.

If you can justify the last patch in bug 853881, then this patch is fine. But really what we should do is revert that previous patch, and try to really understand what is going on there.

Or you can just leave the current patch as is, and we can accept the massive leaks that we may get from these two changes. Or prove why they don't occur.

Jorg K (CEST = GMT+2)

Assignee

Comment 25

•

7 years ago

(In reply to Kent James (:rkent) from comment #24)
> So I'm really not very happy with patch
> https://hg.mozilla.org/comm-central/rev/
> 5c3191edbbd4fe552450b582d520ba9be4cd431b from bug 853881.

Hi Kent, thanks for taking the time to look at this thoroughly. As you recall, Joshua's work was abandoned, I refreshed the patches, addressed compile errors, and did a "pick-and-chose" from Joshua's patch not using some hunks.

The original code at that spot can be seen here:
https://hg.mozilla.org/comm-central/rev/62bfaab2731fc47ee7ed26aefa001e63c4f2c371#l3.46
and in a second patch, I changed it to what you're not happy about:
https://hg.mozilla.org/comm-central/rev/5c3191edbbd4fe552450b582d520ba9be4cd431b#l1.20

Why did I do that? Two reasons:
The first version crashed, due to double free, see bug 853881 comment #39 and bug 853881 comment #41. Back then, I looked at it with the debugger, and clearly, things got double free'd. I came into Mork and it tried to free a location that contained garbage.

Since https://hg.mozilla.org/comm-central/rev/62bfaab2731fc47ee7ed26aefa001e63c4f2c371#l3.46 was my own picking and choosing from Joshua's patch, I then went to his original hunk, attachment 728244 [details] [diff] [review]:
 nsMsgHdr::~nsMsgHdr()
 {
-  if (m_mdbRow)
-  {
-    if (m_mdb)
-    {
-      NS_RELEASE(m_mdbRow);
-      m_mdb->RemoveHdrFromUseCache((nsIMsgDBHdr *) this, m_messageKey);
-    }
-  }
+  if (m_mdb)
+    m_mdb->RemoveHdrFromUseCache(this, m_messageKey);
   NS_IF_RELEASE(m_mdb);
 }
and the problem went away. I trusted that he and the author and you as the reviewer had inspected that original patch. In the review, no one complained about that hunk, so I assumed it was right.

Sorry, that's the practical answer. You know I have multiple roles here, sheriff, developer and jack-of-all-trades. So this is the sheriff's answer. I'll post this comment in case you're still awake, and then I'll look into the issue a little more.

Jorg K (CEST = GMT+2)

Assignee

Comment 26

•

7 years ago

Attached patch 1368786-mork-crash-take2.patch (obsolete) — Details — Splinter Review

OK, if I understood your comment #24 correctly, you want to undo two hunks of Joshua's original patch. Note that releasing with the macro also nulls the pointer.

I started the application, clicked through a few folders and no crash. Would you like me to give this to our test user to try? Sadly debug builds are currently busted (as of just now, bug 1399746), so they crash anyway and I cannot try this on try before fixing that issue.

Attachment #8907961 - Flags: feedback?(rkent)

doug2

Comment 27

•

7 years ago

As the test user, I would be happy to help when you can get me a daily or test build. 
I wish I could help understand the larger issue (memory model), but it's probably over my head.

Jorg K (CEST = GMT+2)

Assignee

Comment 28

•

7 years ago

Thanks, Doug, a new test version is on the way. I'll advise you here.

Jorg K (CEST = GMT+2)

Assignee

Comment 29

•

7 years ago

Kent, I noticed that the crashes from bug 1399746 are Mac only for some reason. So here's a try with the new patch on Windows:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=ecb54ea446274101ea438d0a2353ad82d2c14f51

doug2

Comment 30

•

7 years ago

This is the test user.
1. I would be happy to test the modified code when you can create a "daily" or try for me to test.
2. Being a little radical here, am I the only user who cares about this configuration: MS Windows + POP + new messages in windows-not-tabs?  Don't know if POP (vs MAPI) has anything to do with it.  If I'm the only user in that config, maybe I should move to some other mail reader?  (Not my preference, but several people pulling hair out to fix one person's problem is not effective use of your time.)
3. I'd love to try to help understand the memory model, but it would probably take more of your time than you have to get me started.

Jorg K (CEST = GMT+2)

Assignee

Comment 31

•

7 years ago

(In reply to doug2 from comment #30)
> This is the test user.
I know, thanks for helping us.

> 1. I would be happy to test the modified code when you can create a "daily"
> or try for me to test.
It will be ready soon.

> 2. Being a little radical here, am I the only user who cares about this
> configuration: MS Windows + POP + new messages in windows-not-tabs?  Don't
> know if POP (vs MAPI) has anything to do with it.  If I'm the only user in
> that config, maybe I should move to some other mail reader?  (Not my
> preference, but several people pulling hair out to fix one person's problem
> is not effective use of your time.)
I use Windows and POP, but not "message in window". Anyway, that should be supported and with your help, we can sort out the problems. First we need to address the crash since you're not the only person crashing.

> 3. I'd love to try to help understand the memory model, but it would
> probably take more of your time than you have to get me started.
Please understand that we have some legacy code dating back to the Netscape days in 2000. The original developers have left and even the current head developers tear their hair out, see comment #24.

I'll let you know when the new binary becomes available.

Jorg K (CEST = GMT+2)

Assignee

Comment 32

•

7 years ago

Kent, the try is green apart from the "usual" failures, so proceed to review ;-)

Doug, your new binary is here, it should perform just like the previous one, although there are internal changes:
https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-ecb54ea446274101ea438d0a2353ad82d2c14f51/try-comm-central-win32/thunderbird-57.0a1.en-US.win32.installer.exe

Kent James (:rkent)

Comment 33

•

7 years ago

Comment on attachment 8907961 [details] [diff] [review]
1368786-mork-crash-take2.patch

Review of attachment 8907961 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
@@ +653,5 @@
>  {
> +  // Clear out m_mdbRow member variable - the db is going away, which means
> +  // that this member variable might very well point to a mork db that is gone.
> +  for (auto iter = m_headersInUse.ConstIter(); !iter.Done(); iter.Next()) {
> +    NS_IF_RELEASE(iter.Data()->m_mdbRow);

Actually I believe this line is likely the root cause of the double release. We are releasing m_mdbRow here - but not nulling m_mdbRow in the process.

In nsMsgHdr::~nsMsgHdr, which will still get called, we avoid the m_mdbRow release if m_mdbRow is null - but where is the nulling here? Won't the destructor release it again?

So actually I believe that you need to replace that with:

if (iter.Data()->m_mdbRow)
{
  NS_RELEASE(iter.Data()->m_mdbRow);
  iter.Data()->m_mdbRow = null;
}

Or so I think today and part of yesterday. I get so easily confused by this that my opinion changes regularly. I would be more comfortable if we could reproduce the crash from bug 853881, and convince ourselves that this change fixes it.

Attachment #8907961 - Flags: feedback?(rkent) → feedback+

Jorg K (CEST = GMT+2)

Assignee

Comment 34

•

7 years ago

The crash is easily reproduced by applying:

 nsMsgHdr::~nsMsgHdr()
 {
-  if (m_mdb)
-    m_mdb->RemoveHdrFromUseCache(this, m_messageKey);
+  if (m_mdbRow)
+  {
+    if (m_mdb)
+    {
+      NS_RELEASE(m_mdbRow);
+      m_mdb->RemoveHdrFromUseCache(this, m_messageKey);
+    }
+  }
   NS_IF_RELEASE(m_mdb);
 }

Then I can see whether the other hunk fixes it, which it does, since I ran TB with both patches applied, and the try is green. When it crashes, the try was red. But I can repeat the exercise with the two patches separately.

As for:
  NS_RELEASE(iter.Data()->m_mdbRow);
  iter.Data()->m_mdbRow = null;

Well, the NS_IF_RELEASE() is safer since it won't dereference a nullptr, and as I said, it nulls the pointer:http://searchfox.org/mozilla-central/source/xpcom/base/nsISupportsUtils.h#96

I'll get back to you in 30 minutes, when I made it crash, OK?

Kent James (:rkent)

Comment 35

•

7 years ago

I see what you mean about NS_IF_RELEASE, I was not aware of that.

I'm a little confused though about what set of changes crashes or not. Is there a variation that does not crash that includes a release of m_mdbRow in ~nsMsgHdr?

Jorg K (CEST = GMT+2)

Assignee

Comment 36

•

7 years ago

OK, this code crashes:

nsMsgHdr::~nsMsgHdr()
{
  if (m_mdbRow)
  {
    if (m_mdb)
    {
      NS_RELEASE(m_mdbRow);
      m_mdb->RemoveHdrFromUseCache(this, m_messageKey);
    }
  }
  NS_IF_RELEASE(m_mdb);
}

That's the original code *before* Joshua got to it.

The NS_IF_RELEASE() in the loop will fix it.

Jorg K (CEST = GMT+2)

Assignee

Comment 37

•

7 years ago

OK, as I said, restoring that code above crashes at application shutdown time after visiting a few folders in TB and clicking a few messages, very reproducible. Stack:

>	xul.dll!morkRowObject::CloseRowObject(morkEnv * ev) Line 572	C++
 	xul.dll!morkRowObject::~morkRowObject() Line 61	C++
 	[External Code]	
 	xul.dll!morkObject::Release() Line 35	C++
 	xul.dll!morkRowObject::Release() Line 88	C++
 	xul.dll!nsMsgHdr::~nsMsgHdr() Line 111	C++
We're accessing junk.

Now, the new loop:
void nsMsgDatabase::ClearUseHdrCache()
{
  // Clear out m_mdbRow member variable - the db is going away, which means
  // that this member variable might very well point to a mork db that is gone.
  for (auto iter = m_headersInUse.ConstIter(); !iter.Done(); iter.Next()) {
    NS_IF_RELEASE(iter.Data()->m_mdbRow);
  }
  m_headersInUse.Clear();
}

makes the crash go away. That's the confirmation you asked for, no?

We can wait for Doug's report on the binary I gave him.

doug2

Comment 38

•

7 years ago

Running the "daily" from comment #32 now.  Have not had a crash yet. Definitely still has the "new message window disappears" issue, however.  Let me know when you want me to run something else.

Jorg K (CEST = GMT+2)

Assignee

Comment 39

•

7 years ago

While we're here, why is this so twisted? Why do I only call NS_RELEASE(m_mdbRow) if m_mdb != null?

nsMsgHdr::~nsMsgHdr()
{
  if (m_mdbRow)
  {
    if (m_mdb)
    {
      NS_RELEASE(m_mdbRow);
      m_mdb->RemoveHdrFromUseCache(this, m_messageKey);
    }
  }
  NS_IF_RELEASE(m_mdb);
}

Why can't it just be:

nsMsgHdr::~nsMsgHdr()
{
  NS_IF_RELEASE(m_mdbRow); <== Do we need this?
  if (m_mdb)
    m_mdb->RemoveHdrFromUseCache(this, m_messageKey);
  NS_IF_RELEASE(m_mdb);
}

Kent James (:rkent)

Comment 40

•

7 years ago

(In reply to Jorg K (GMT+2) from comment #39)
> While we're here, why is this so twisted? ...
> 
> Why can't it just be:
> 
> nsMsgHdr::~nsMsgHdr()
> {
>   NS_IF_RELEASE(m_mdbRow); <== Do we need this?
>   if (m_mdb)
>     m_mdb->RemoveHdrFromUseCache(this, m_messageKey);
>   NS_IF_RELEASE(m_mdb);
> }

That seems like it should work.

Jorg K (CEST = GMT+2)

Assignee

Comment 41

•

7 years ago

Attached patch 1368786-mork-crash-take2.patch (v2) — Details — Splinter Review

Doesn't crash. Can you enlighten me why you think we need to release in the DTOR? All those got released when we tossed the cache away, no?

Attachment #8906332 - Attachment is obsolete: true

Attachment #8907961 - Attachment is obsolete: true

Attachment #8906332 - Flags: review?(rkent)

Attachment #8906332 - Flags: feedback?(Pidgeot18)

Attachment #8908213 - Flags: review?(rkent)

Kent James (:rkent)

Comment 42

•

7 years ago

(In reply to Jorg K (GMT+2) from comment #41)
> Created attachment 8908213 [details] [diff] [review]
> 1368786-mork-crash-take2.patch (v2)
> 
> Doesn't crash. Can you enlighten me why you think we need to release in the
> DTOR? All those got released when we tossed the cache away, no?

Yes, but blowing away the use cache is not the typical way that a nsMsgHdr gets released. It is those other cases that need to release in the DTOR.

Kent James (:rkent)

Comment 43

•

7 years ago

Comment on attachment 8908213 [details] [diff] [review]
1368786-mork-crash-take2.patch (v2)

Review of attachment 8908213 [details] [diff] [review]:
-----------------------------------------------------------------

I'm still a little confused about what crashed and what does not. You say this is not crashing now, but isn't this what crashed previously?

But if this seems to solve the issues in these bugs, then the code looks correct to me.

Attachment #8908213 - Flags: review?(rkent) → review+

Jorg K (CEST = GMT+2)

Assignee

Comment 44

•

7 years ago

Maybe the confusion comes from the fact that there are two crashes.

In the current state, no patch, TB crashes for some users, in fact, it's topcrash #1 on beta right now. I'm using Daily and it never crashed for me. So this is a "rare" crash.

Adding the loop in nsMsgDatabase.cpp makes that "rare" crash go away, as confirmed by Doug over the last few days.

There is a second very reliable crash obtained by releasing the row in the DTOR, see comment #36. That's what crashed "previously", hence my change with which you weren't happy at all since it leaked.

This second very reliable crash is also fixed by that loop.

I hope it's clear now. I'm doing a build for Doug to try with the final version before we release this to the masses ;-)

Jorg K (CEST = GMT+2)

Assignee

Comment 45

•

7 years ago

Doug, thanks for your patience, this would be the final version:
https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-82c7d73473a74b459884c56fddeb29dd23ea770a/try-comm-central-win32/thunderbird-57.0a1.en-US.win32.installer.exe

Once we're done with this bug here, I promise to look at bug 711204 for you.

doug2

Comment 46

•

7 years ago

Running fine so far.  Thanks.  711204 is a minor pain, not an app crash.

Pulsebot

Comment 47

•

7 years ago

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/a0bd8f2b70be
release m_mdbRow in hash table entry and also in DTOR. r=rkent

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Jorg K (CEST = GMT+2)

Assignee

Updated

•

7 years ago

Assignee: nobody → jorgk

status-thunderbird55: affected → wontfix

status-thunderbird57: --- → fixed

tracking-thunderbird56: ? → ---

Target Milestone: --- → Thunderbird 57.0

Jorg K (CEST = GMT+2)

Assignee

Comment 48

•

7 years ago

Comment on attachment 8908213 [details] [diff] [review]
1368786-mork-crash-take2.patch (v2)

Let's uplift this if we don't get negative feedback from Doug.

Attachment #8908213 - Flags: approval-comm-beta+

Wayne Mery (:wsmwk)

Reporter

Comment 49

•

7 years ago

We're so close to merge I'm thinking we won't do another beta 56, so uplift should not be needed.  As long as Tom's watershed/56.0b3 test can be done using 57.0b

Jorg K (CEST = GMT+2)

Assignee

Comment 50

•

7 years ago

OK, but what about the things we need for TB 52.4 ESR?

doug2

Comment 51

•

7 years ago

No crashes since I installed the latest try. (See comment #45)

Jorg K (CEST = GMT+2)

Assignee

Comment 52

•

7 years ago

Thanks so much for your help. Since the changes are now integrated into the code base, you could switch to an official Daily from here:https://archive.mozilla.org/pub/thunderbird/tinderbox-builds/comm-central-win32/1505428582/thunderbird-57.0a1.en-US.win32.installer.exe
I suggest to switch automatic updates off since some Dailies don't work, that's life and the course of the development. Then at the end of the month, you switch to TB 57 beta when it's released. Sounds like a plan?

doug2

Comment 53

•

7 years ago

You are very welcome. Glad to help anytime. Thank you for all the time you spent on this and that you spend creating & updating Thunderbird. 
Got the official daily and running with it.

Jorg K (CEST = GMT+2)

Assignee

Comment 54

•

7 years ago

Beta (TB 56):
https://hg.mozilla.org/releases/comm-beta/rev/8cb3077054c23c5590469eafd25f1e5c78db2f4c

status-thunderbird56: affected → fixed

status-thunderbird_esr52: --- → unaffected

doug2

Comment 55

•

7 years ago

Beta 56.0b3  2 crashes today.  Both while reading new mail. 1st one does not show in my crash reports.  Crash file contents "1507845030"   Was on daily (57.0a1, 9/22/17) until yesterday. See comment #52 above.  ?? when I went to the Beta Channel, I got 56 instead of 57, so I guess it is not released yet.  That means crashes are probably just same as before.  I'll go back to daily.

Jorg K (CEST = GMT+2)

Assignee

Comment 56

•

7 years ago

It would be in TB 56 beta 4. We've been struggling with that release, it's not ready so far. TB 57 beta will come after that.

Wayne Mery (:wsmwk)

Reporter

Comment 57

•

7 years ago

graph looks good and with >50% uptake so far there are no crashes with 56.0b4

https://crash-stats.mozilla.com/signature/?release_channel=beta&signature=morkRow%3A%3AGetCell&date=%3E%3D2017-07-23T19%3A15%3A09.000Z&date=%3C2017-10-23T19%3A15%3A09.000Z#graphs

Status: RESOLVED → VERIFIED

doug2

Comment 58

•

7 years ago

Thanks.  Just installed it.  Will monitor.

1368786-mork-crash.patch 7 years ago Jorg K (CEST = GMT+2) 1023 bytes, patch		Details \| Diff \| Splinter Review
1368786-mork-crash-take2.patch 7 years ago Jorg K (CEST = GMT+2) 1.67 KB, patch	rkent : feedback+	Details \| Diff \| Splinter Review
1368786-mork-crash-take2.patch (v2) 7 years ago Jorg K (CEST = GMT+2) 1.53 KB, patch	rkent : review+ jorgk-bmo : approval-comm-beta+	Details \| Diff \| Splinter Review