Closed Bug 1250723 Opened 4 years ago Closed 4 years ago

startup CRASH: C-C TB ASSERTION: can't be your own parent: 'false', ... mailnews/db/msgdb/src/nsMsgHdr.cpp, then CRASH in morkTable::AddRow

Categories

(MailNews Core :: Backend, defect, critical)

Unspecified
All
defect
Not set
critical

Tracking

(thunderbird45+, thunderbird46 fixed, thunderbird47 fixed, thunderbird48 fixed, thunderbird_esr38 unaffected, thunderbird_esr45 fixed)

VERIFIED FIXED
Thunderbird 48.0
Tracking Status
thunderbird45 + ---
thunderbird46 --- fixed
thunderbird47 --- fixed
thunderbird48 --- fixed
thunderbird_esr38 --- unaffected
thunderbird_esr45 --- fixed

People

(Reporter: ishikawa, Assigned: rkent)

References

Details

(Keywords: crash, regression, topcrash-thunderbird, Whiteboard: [regression:TB45][startupcrash])

Crash Data

Attachments

(6 files, 7 obsolete files)

13.72 KB, application/gzip
Details
17.96 KB, application/gzip
Details
34.98 KB, text/plain
Details
124.70 KB, text/plain
Details
42.63 KB, application/gzip
Details
16.69 KB, patch
rkent
: review+
Details | Diff | Splinter Review
(How come there is no entry for imap or other mail protocol in the component list ? We should have one IMHO)

(I am reporting this for C-C TB, but this bug could be in ESR too. I
could not figure out the reliable test procedure to trigger the bug
with ESR.  Now I can reproduce this bug with C-C TB fairly easily.)

My test environment.
Debian GNU/Linux 64-bit.
IMAP server on the local PC: dovecot-imapd (this serves both as pop3
and imap daemon).
I don't know franckly about the subtle interaction of imap protocol. I
have no idea if this bug I am reporting is dovecot-specific or not,
for example. 

Repeat-by:

Pre-requisite:

1. Install thunderbird.

1-a: Create the default account (pop3). [below it is ishikawa, my address
on a local PC. I doubt if this matters, though.]

1-b - create a folder in which a lot of duplicate messages exist.
   Let us call it folder-C.
   A sample folder content will be uploaded [1].
   (I think this duplicated/triplicated/... messages may be the key.
   C-C TB seems to get confused by the existance of same messages.)

   Such a folder content can be made by copying existing messages to a
   DIFFERENT folder twice or three times..  Then copying them to yet
   another DIFFERENT folder twice, three times, etc.  (I happened to
   create such a folder when I tested my local patch set for Bug
   1121842 - [META] RFC: C-C Thunderbird - Cleaning of incorrect
   Close, unchecked Flush, Write etc. in nsPop3Sink.cpp and friends.

   My copying/moving back then was rather random, but the current
   content of the folder serves our purpose of triggering the bug
   rather reliably.  I am attaching the compressed folder-C locally
   used for testing and reproducing the bug.

1-c: Create ANOTHER account (imap) [mtest3, another address on a local PC.]
       (I think use of imap is very crucial to reproduce this bug.
       Probably pop3 account does not have this issue.)
       
       I don't know if "mtest3" is important here. Maybe if you are
       going to use the sample folder content which I upload[2]. Maybe
       when a message is copied into folder in imap account, imap
       daemon may do something based on the |To| address of the
       message especially when it matches the account's e-mail
       address. But I am not sure.)

1-d: create a local folder under this imap account by the name of folder-C.
   Copy a bunch of messages from folder-C above (from the default
   account)
   Maybe 40 messages at a time. Repeat this a few times.
   Again, I am attaching a compressed folder for reproducing the
   bug on the reader's machine[2].
   (But note that probably you have to copy the content of this
   uploaded content into your local folder first,
   and THEN copy them to the created imap
   account's folder-C since copying the files probably has some
   interaction with the imap daemon. Again I am not sure.)

So now you have, in the folder pane: 
  default account (ishikawa, pop3)
  mtest3 account   (mtest3, imap)
  Local Folders
  
To reproduce the bug. Do this.

(1) Select messages at the top of the message list pane.
(listed in ascending order of time: latest mesasges last)
Say 50 messages. (I experienced the bug when I copy 30~51 messages.)
MOVE them from folder-C of default account to
folder-C of the additional imap account (mtest2).
No CRASH at this stage yet.

(2) THEN MOVE BACK from the folder-C of the additional imap account
to the folder-C of default account, this time, somewhat larger number
of messages from the top of the message list pane
(listed in ascending order of time: latest mesasges last).
Say, 100-180 would do.

Then C-C TB crash.

Note: 
IN both steps above MOVE is the key, I think. When I COPY, it is hard to reproduce the bug.

Assertion and stack dump.

Just before the crash I see messages in the tty where TB was started

[21414] ###!!! ASSERTION: can't be your own parent: 'false', file /NREF-COMM-CENTRAL/comm-central/mailnews/db/msgdb/src/nsMsgHdr.cpp, line 638
[21414] WARNING: NS_ENSURE_TRUE(hdrRow) failed: file /NREF-COMM-CENTRAL/comm-central/mailnews/db/msgdb/src/nsMsgDatabase.cpp, line 3558

Basically a nullptr (0x0) is passed to a function where it is deferenced.
(morkTable::AddRow is the function.)

The following is the gdb stack trace when this happened.
(I copied 51 messages from folder-C of default account(pop3) and
then copied back 138 messages from the folder-C of the additional imap
account(mtest2) to the folder-C of default account when TB crashes.
The counter that showed the copying process stopped at
1/138. But these numbers do not have to be followed exactly. MOVE
40-120, and then MOVE back 100-150 ought to do.
(MOVE is the key, I think as I said above.)

Stacktrace:

#0  0x00007ff20ca2e70d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:81
#1  0x00007ff20ca2e5a4 in __sleep (seconds=0)
    at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x00007ff1ff3fcfb7 in ah_crap_handler (signum=11)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/xre/nsSigHandlers.cpp:103
#3  0x00007ff1ff3e3a47 in nsProfileLock::FatalSignalHandler (signo=11, 
    info=0x7ffe2a201e30, context=0x7ffe2a201d00)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/profile/nsProfileLock.cpp:191
#4  0x00007ff20015d2c6 in AsmJSFaultHandler (signum=11, info=0x7ffe2a201e30, 
    context=0x7ffe2a201d00)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/js/src/asmjs/WasmSignalHandlers.cpp:1157
#5  <signal handler called>
#6  0x00007ff1fb13f7ba in morkTable::AddRow (this=0x201f200, mev=0xed6f18, 
    ioRow=0x0)
    at /NREF-COMM-CENTRAL/comm-central/db/mork/src/morkTable.cpp:628
#7  0x00007ff1fb3fc2b5 in nsMsgThread::AddChild (this=0x4dae8a0, child=
    0x3a6f480, inReplyTo=0x0, threadInThread=false, announcer=0x338af60)
    at /NREF-COMM-CENTRAL/comm-central/mailnews/db/msgdb/src/nsMsgThread.cpp:245
#8  0x00007ff1fb3ec68c in nsMsgDatabase::AddToThread (this=0x338af60, 
    newHdr=0x3a6f480, thread=0x4dae8a0, inReplyTo=0x0, threadInThread=false)
    at /NREF-COMM-CENTRAL/comm-central/mailnews/db/msgdb/src/nsMsgDatabase.cpp:4542
#9  0x00007ff1fb3ed3f3 in nsMsgDatabase::AddNewThread (this=0x338af60, 
    msgHdr=0x3a6f480)
    at /NREF-COMM-CENTRAL/comm-central/mailnews/db/msgdb/src/nsMsgDatabase.cpp:4776
#10 0x00007ff1fb3ec5d5 in nsMsgDatabase::ThreadNewHdr (this=0x338af60, 
    newHdr=0x3a6f480, newThread=@0x7ffe2a2026af: false)
    at /NREF-COMM-CENTRAL/comm-central/mailnews/db/msgdb/src/nsMsgDatabase.cpp:4527
#11 0x00007ff1fb3e8a3a in nsMsgDatabase::AddNewHdrToDB (this=0x338af60, 
    newHdr=0x3a6f480, notify=true)
    at /NREF-COMM-CENTRAL/comm-central/mailnews/db/msgdb/src/nsMsgDatabase.cpp:3464
#12 0x00007ff1fb50fb93 in nsMsgLocalMailFolder::EndMessage (this=0x2f00860, 
    key=585)
    at /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsLocalMailFolder.cpp:2725
#13 0x00007ff1fb1ded3b in nsCopyMessageStreamListener::EndMessage (
    this=0x384aa10, key=585)
    at /NREF-COMM-CENTRAL/comm-central/mailnews/base/src/nsCopyMessageStreamListener.cpp:77
#14 0x00007ff1fb470c9f in nsImapMailFolder::EndMessage (this=0x2f2ad30, aUrl=
    0x2ea1338, uidOfMessage=585)
    at /NREF-COMM-CENTRAL/comm-central/mailnews/imap/src/nsImapMailFolder.cpp:5693
#15 0x00007ff1fb4fb603 in (anonymous namespace)::SyncRunnable2<nsIImapMailFolderSink, nsIMsgMailNewsUrl*, unsigned int>::Run (this=0x7ff17c00e360)
    at /NREF-COMM-CENTRAL/comm-central/mailnews/imap/src/nsSyncRunnableHelpers.cpp:146
#16 0x00007ff1fb7206c9 in nsThread::ProcessNextEvent (this=0x53b840, 
    aMayWait=false, aResult=0x7ffe2a202a6f)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/xpcom/threads/nsThread.cpp:1018
#17 0x00007ff1fb7741d4 in NS_ProcessNextEvent (aThread=0x53b840, 
    aMayWait=false)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/xpcom/glue/nsThreadUtils.cpp:297
#18 0x00007ff1fbc2bb87 in (anonymous namespace)::ipc::MessagePump::Run (
    this=0x53c280, aDelegate=0x53ba20)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/ipc/glue/MessagePump.cpp:95
#19 0x00007ff1fbbd296b in MessageLoop::RunInternal (this=0x53ba20)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/ipc/chromium/src/base/message_loop.cc:234
#20 0x00007ff1fbbd28fe in MessageLoop::RunHandler (this=0x53ba20)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/ipc/chromium/src/base/message_loop.cc:227
#21 0x00007ff1fbbd28d7 in MessageLoop::Run (this=0x53ba20)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/ipc/chromium/src/base/message_loop.cc:201
#22 0x00007ff1fe54b4b4 in nsBaseAppShell::Run (this=0xbcd9b0)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/widget/nsBaseAppShell.cpp:156
#23 0x00007ff1ff36c87d in nsAppStartup::Run (this=0xc9f500)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/components/startup/nsAppStartup.cpp:281
#24 0x00007ff1ff3f68dc in XREMain::XRE_mainRun (this=0x7ffe2a202f70)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/xre/nsAppRunner.cpp:4284
#25 0x00007ff1ff3f6cb0 in XREMain::XRE_main (this=0x7ffe2a202f70, argc=1, 
    argv=0x7ffe2a204418, aAppData=0x7ffe2a203130)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/xre/nsAppRunner.cpp:4381
#26 0x00007ff1ff3f6f5a in XRE_main (argc=1, argv=0x7ffe2a204418, 
    aAppData=0x7ffe2a203130, aFlags=0)
    at /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/xre/nsAppRunner.cpp:4483
#27 0x0000000000404f0c in do_main (argc=1, argv=0x7ffe2a204418, xreDirectory=
    0x4aadd0) at /NREF-COMM-CENTRAL/comm-central/mail/app/nsMailApp.cpp:195
#28 0x0000000000405275 in main (argc=1, argv=0x7ffe2a204418)
    at /NREF-COMM-CENTRAL/comm-central/mail/app/nsMailApp.cpp:332
(gdb)

When the crash occurs, the exactly the same stack trace is observed.

Observation.

Assertion is triggered in the following function.

  634	NS_IMETHODIMP nsMsgHdr::SetThreadParent(nsMsgKey inKey)
  635	{
  636	  m_threadParent = inKey;
  637	  if (inKey == m_messageKey)
  638	    NS_ASSERTION(false, "can't be your own parent");
  639	  SetUInt32Column(m_threadParent, m_mdb->m_threadParentColumnToken);
  640	  m_initedValues |= THREAD_PARENT_INITED;
  641	  return NS_OK;
  642	}

I added the dump of inKey inside nsMsgHdr::SetThreadParent(nsMsgKey
inKey) quoted above, in the if-clause where the assertion is triggered
in my locally patched version of C-C TB and when the error occurred
inKey was 0xffffffff.  But this means m_messageKey is also 0xffffffff.
Very strange situation (and buggy, I suppose).

[IMAP code is buggy?]

If the additional account to which the messages are copied and then
from which copied back uses pop3 instead of imap, it is very unlikely
that this bug happens.  I tried the same procedures a few times with
additional pop3 account and did not see this bug.

So imap account seems to be the key here.

Re pop3 is unlikely to cause the bug.

During the testing of Bug 1121842, I had an additional pop3 account to
testing the copying of messages between folders, but did not realize
this bug.

Only within the last month or so, having learned that there seem to be
an issue of TB when one has multiple imap accounts, I created such imap
accounts and see if I can experience the same bug reported in
Bug 1203977 - Thunderbird crashes in nsImapMailFolder::ParseMsgHdrs.
No.

But, instead, I hit the current new bug which I am reporting.  

So I think it is easist to check up on this bug with an additional
imap account first where it is likely to happen. pop3 may not have
this issue or the probability of occurence is very log :-(

I wanted to make sure that the bug is not caused by my patch set for
bug 1121842 and tried to see if ESR TB produces it. I think it did.  
But I never figured out how to reproduce this reliably until yesterday.

Now I have a way to produce the bug with stock C-C TB (without my patch).

So C-C TB tree has this issue intrinsically without my patch.  A sigh
of relief in that sense, but we need to fix THIS dataloss bug ASAP, I think.

TIA

Attachment
[1] folder-C for default account (ishikawa,pop3)
[2] folder-C for additional account (mtest3, imap)

    I am not sure if the to address in the folder content for
    additional mtest3 account may have a bearing when the reader of
    this bugzilla tries to reproduce the bug. It is possible that it
    may, especially regarding the messages with "to" address that
    matches the address of imap account.

PS: If my copying some messages from pop3 and other folders into imap
folders which should be under the control of imap daemon is illegal,
such action should be prohibited and should not be doable in the UI of
TB.
This is [2].
I am afraid I referred to mtest2 instead of mtest3 in my original report.
mtest3 is the address used in the imap account. (This may not matter.)

For the folder-C in default account, it is too big to upload even after compression. (There are some artificially large e-mail messages for testing the buffering issues of TB.)
Any preferred place to upload such a big file?

TIA
This is shrunk version of [1]

I figured that since only the first two hundred or so messages in folder-C of the default account seem to be important, maybe I could reproduce the bug with only three hundred or so messages of folder-C.
So I created a smaller folder-B that contains 300+ messages only.

Voila, I could reproduce the bug easily.
So copy the messages of the attached folder data into folder-C of default and repeat the procedure.
Hopefully you can reproduce the bug.

TIA
Confirmation.

Following the steps mentioned in the previous posts,
I could reproduce the bug on a totally different PC (Debian GNU/Linux 64-bit).
I have to create a new profile to reproduce the bug as described.
(Trying to re-use the existing accounts by shuffling around did not work ???)

On the same machine, ESR 38.6, did not produce this particular bug.
It did not crash.

But C-C TB I locally created did crash.
The nightly also did crash.
So it is possibly a regression (?)
TIA
This is the crash signature I obtained from the nightly.
https://crash-stats.mozilla.com/report/index/75a85525-ce52-45da-827e-dbed72160224
(In reply to ISHIKAWA, Chiaki from comment #4)
> This is the crash signature I obtained from the nightly.
> https://crash-stats.mozilla.com/report/index/75a85525-ce52-45da-827e-
> dbed72160224

Oops, as I said, before the crash happens, we see


[21414] ###!!! ASSERTION: can't be your own parent: 'false', file /NREF-COMM-CENTRAL/comm-central/mailnews/db/msgdb/src/nsMsgHdr.cpp, line 638
[21414] WARNING: NS_ENSURE_TRUE(hdrRow) failed: file /NREF-COMM-CENTRAL/comm-central/mailnews/db/msgdb/src/nsMsgDatabase.cpp, line 3558

and I found that
--- begin quote
Assertion is triggered in the following function.

  634	NS_IMETHODIMP nsMsgHdr::SetThreadParent(nsMsgKey inKey)
  635	{
  636	  m_threadParent = inKey;
  637	  if (inKey == m_messageKey)
  638	    NS_ASSERTION(false, "can't be your own parent");
  639	  SetUInt32Column(m_threadParent, m_mdb->m_threadParentColumnToken);
  640	  m_initedValues |= THREAD_PARENT_INITED;
  641	  return NS_OK;
  642	}

I added the dump of inKey inside nsMsgHdr::SetThreadParent(nsMsgKey
inKey) quoted above, in the if-clause where the assertion is triggered
in my locally patched version of C-C TB and when the error occurred
inKey was 0xffffffff.  But this means m_messageKey is also 0xffffffff.
Very strange situation (and buggy, I suppose).

--- end quote

so the internal database is either corrupted by the time this happens, or the
message threading handling does not consider this possibility somehow, I think.

TIA
I think the regression occurred on Dec 5th.
Tested under debian gnu/linux 64-bit. Tested X86_64 binaries.

https://ftp.mozilla.org/pub/thunderbird/nightly/2015/12/2015-12-05-03-03-46-comm-central/
Dec 5th version: OK  Could not crash it after several tries.

https://ftp.mozilla.org/pub/thunderbird/nightly/2015/12/2015-12-06-03-05-45-comm-central/
Dec 6th version: NG Crashed after a trial or two.

TIA
It looks that the following patch has changed the message key handling a little bit, and
message threading code probably did not follow the change properly.
http://hg.mozilla.org/comm-central/rev/4d8d5d6eecfb
https://bugzilla.mozilla.org/show_bug.cgi?id=1183490

That I saw 0xFFFFFFFF (== nsMsgKey_None) in NS_ASSERTION() rings a bell.
(The value is used, for example, in 
http://hg.mozilla.org/comm-central/diff/4d8d5d6eecfb/mailnews/local/src/nsParseMailbox.cpp
)

Is my guess correct (needinfo from rkent)?

TIA
Flags: needinfo?(rkent)
I was literally going to file this bug report just now, and I came across yours.  Thanks!
 
We must have this fixed for version 45 - it's virtually tied as #1 crash
  https://crash-stats.mozilla.com/topcrasher/products/Thunderbird/versions/45.0b1
Mac signature is [@ non-virtual thunk to morkTable::AddRow ]


(In reply to ISHIKAWA, Chiaki from comment #0)
> (How come there is no entry for imap or other mail protocol in the component
> list ? We should have one IMHO)

https://bugzilla.mozilla.org/enter_bug.cgi?product=Mailnews%20Core  ;)

I agree bug 1183490 is the likely cause - the two earliest 45.0a1 buildid crashes are  20151206030545 and 
20151207030219
Blocks: 1183490
Severity: major → critical
Crash Signature: [@ morkTable::AddRow ] [@ non-virtual thunk to morkTable::AddRow ]
Component: Untriaged → Backend
Flags: needinfo?(acelists)
OS: Linux → All
Product: Thunderbird → MailNews Core
Summary: CRASH: C-C TB ASSERTION: can't be your own parent: 'false', ... mailnews/db/msgdb/src/nsMsgHdr.cpp, then CRASH! → startup CRASH: C-C TB ASSERTION: can't be your own parent: 'false', ... mailnews/db/msgdb/src/nsMsgHdr.cpp, then CRASH!
Whiteboard: [startupcrash]
OK, so it is enough to have 1 imap account (not multipe ones) ?
The conclusion is that the crash happens when moving FROM imap to Local folders? Not the other way round?
Flags: needinfo?(acelists)
(In reply to :aceman from comment #10)
> OK, so it is enough to have 1 imap account (not multipe ones) ?
> The conclusion is that the crash happens when moving FROM imap to Local
> folders? Not the other way round?

Right.
I only created additional IMAP account on top of the default POP3 account.

And the crash happens when moving from the second IMAP account to a
folder of the default POP3 account. 
That latter is, in my case, it is folder-C right next to Inbox of the default account.
(I think the use of "Local folders" is rather ambiguous and confusing.)

Before the crash, there is strange assertion as I noted and so the
handling of message keys got a little screwed up in the scenario I mentioned.

TIA
(In reply to Wayne Mery (:wsmwk, use Needinfo for questions) from comment #9)
> I was literally going to file this bug report just now, and I came across
> yours.  Thanks!
>  
> We must have this fixed for version 45 - it's virtually tied as #1 crash
>  
> https://crash-stats.mozilla.com/topcrasher/products/Thunderbird/versions/45.
> 0b1
> Mac signature is [@ non-virtual thunk to morkTable::AddRow ]
> 

Wow, I did not know this.

> 
> (In reply to ISHIKAWA, Chiaki from comment #0)
> > (How come there is no entry for imap or other mail protocol in the component
> > list ? We should have one IMHO)
> 
> https://bugzilla.mozilla.org/enter_bug.cgi?product=Mailnews%20Core  ;)
>

OK, I will try to remember. But 
mozilla bugzilla component list not the most user-friendly classification IMHO.
 
> I agree bug 1183490 is the likely cause - the two earliest 45.0a1 buildid
> crashes are  20151206030545 and 
> 20151207030219

Yes, my bisection and the list of patches I checked over Dec 4 - 6 suggests that
the bug 1183490 is the most likely culprit. I think we need to tweak the rest of the C-C tree a littleeto take advantage of the changes introduced in the patch.
(Sorry, I am not familiar with the threading code at all. Maybe I can insert value dumps to figure out what is going on...)
Flags: needinfo?(rkent)
The crashes reported here are different from the TB 45 top crash. TB45 is during POP3 download, these are during IMAP->Local copy
I've put a few hours into looking into this. There are many different paths to the nsMsgDatabase::AddNewHdrToDB most of which seem to come from locations that should not be calling that (like nsLocalFile::Release()) But I suspect they probably all have the same route cause, including the stack here. Still not sure what that might be though. It looks like the new header does not get created properly, with a null row. We could probably error out earlier and prevent the crash, but that may or may not actually help.
(In reply to Kent James (:rkent) from comment #13)
> The crashes reported here are different from the TB 45 top crash. TB45 is
> during POP3 download, these are during IMAP->Local copy

I see.
As a matter of fact, I saw a few DIFFERENT error patterns
during tests with my patches (I see warning messages 
that say expected header lines (X-MOZILLA-...) is not found, etc.

I was worried that my patch corrupted the file, but
it seems that the internal summary file or something became incorrect
imap -> local (pop3) account folder.


(In reply to Kent James (:rkent) from comment #14)
> I've put a few hours into looking into this. There are many different paths
> to the nsMsgDatabase::AddNewHdrToDB most of which seem to come from
> locations that should not be calling that (like nsLocalFile::Release()) But
> I suspect they probably all have the same route cause, including the stack
> here. Still not sure what that might be though. It looks like the new header
> does not get created properly, with a null row. We could probably error out
> earlier and prevent the crash, but that may or may not actually help.

Thank you for taking the time for analysis when you are very busy with the new council business.

I think I need to learn the life cycle of the header(s).
(For that matter, anyone working on this bug needs to !)

One of its field or fields may not be created (filled in) properly.
I will add some print statements to learn how they get changed and
checks for 0xFFFFFFFF (MsKey_None) before some actions are taken to see if it helps.
My current theory of this crash is that, with the changes in bug 1183490, we are using much more frequently the code path that generates a new key from the DB rather than assigning directly. This is code that is not as well tested, in particular error paths. I suspect we are getting db errors that are not well handled.

One particular path is morkStore::NewRow where a null row can be returned with NS_OK but downstream code only checks for failure in the error code.
(In reply to Kent James (:rkent) from comment #16)
> My current theory of this crash is that, with the changes in bug 1183490, we
> are using much more frequently the code path that generates a new key from
> the DB rather than assigning directly. This is code that is not as well
> tested, in particular error paths. I suspect we are getting db errors that
> are not well handled.
> 
> One particular path is morkStore::NewRow where a null row can be returned
> with NS_OK but downstream code only checks for failure in the error code.

I added a few checks and found that the crash signature changes.
I will upload the used patch, and stack trace.

TIA
Here is the patch that added the necessary checks.

With this patch, and the reported steps, I now hit a different assert.
See the excerpt below.

recreating the error condition.
After the patch, a different assertion triggers.
(Because of the patch, crash via nullptr referernce is avoided, but
we hit into another assertion.)

Excerpt from a log printed on the tty console which invoked C-C TB.
Note the error line: So NewRow() does return with nullptr (!)
{debug} nsMsgDatabase::CreateNewHdr returning ERROR. Missed Before.
{debug} nsMsgKey = -1


Excerpt:
 ...

--DOMWINDOW == 24 (0x3cd5cd0) [pid = 31834] [serial = 49] [outer = (nil)] [url = about:blank]
++DOMWINDOW == 25 (0x5ce27b0) [pid = 31834] [serial = 54] [outer = 0x44c0f80]
[31834] WARNING: Failed to retarget HTML data delivery to the parser thread.: file /NREF-COMM-CENTRAL/comm-central/mozilla/parser/html/nsHtml5StreamParser.cpp, line 967
++DOMWINDOW == 26 (0x2e88000) [pid = 31834] [serial = 55] [outer = 0x44c0f80]
++DOMWINDOW == 27 (0x31d0f20) [pid = 31834] [serial = 56] [outer = 0x44bed50]
GetDiskSpaceAvailable returned: 4773371904 bytes
[31834] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsLocalMailFolder.cpp, line 2125
[31834] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /NREF-COMM-CENTRAL/comm-central/mailnews/base/src/nsCopyMessageStreamListener.cpp, line 103
{debug} nsMsgDatabase::CreateNewHdr returning ERROR. Missed Before.
{debug} nsMsgKey = -1
{debug} nsMsgDatabase::CreateNewHdr returning ERROR. Missed Before.
{debug} nsMsgKey = -1
[31834] ###!!! ASSERTION: error creating message header: 'false', file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 1678
[31834] ###!!! ASSERTION: error finalizing headers parsing mailbox: 'NS_SUCCEEDED(rv)', file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 680
[31834] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8007000E: file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 681
{debug} nsMsgDatabase::CreateNewHdr returning ERROR. Missed Before.
{debug} nsMsgKey = -1
[31834] ###!!! ASSERTION: error creating message header: 'false', file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 1678
[31834] ###!!! ASSERTION: error finalizing headers parsing mailbox: 'NS_SUCCEEDED(rv)', file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 680
[31834] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8007000E: file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 681
{debug} nsMsgDatabase::CreateNewHdr returning ERROR. Missed Before.
{debug} nsMsgKey = -1
[31834] ###!!! ASSERTION: error creating message header: 'false', file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 1678
[31834] ###!!! ASSERTION: error finalizing headers parsing mailbox: 'NS_SUCCEEDED(rv)', file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 680
[31834] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8007000E: file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 681
Assertion failure: buf_length > 1 && (buf[buf_length - 1] == '\r' || buf[buf_length - 1] == '\n') (Header text should always end in a newline), at /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp:941
#01: ???[/NREF-COMM-CENTRAL/objdir-tb3/dist/bin/libxul.so +0x110d3e9]
#02: ???[/NREF-COMM-CENTRAL/objdir-tb3/dist/bin/libxul.so +0x110cb54]
#03: ???[/NREF-COMM-CENTRAL/objdir-tb3/dist/bin/libxul.so +0x110caf9]

     ...

#22: ???[../objdir-tb3/dist/bin/thunderbird +0x4409]
#23: ??? (???:???)

Program ../objdir-tb3/dist/bin/thunderbird (pid = 31834) received signal 11.
Stack:
#01: ???[/NREF-COMM-CENTRAL/objdir-tb3/dist/bin/libxul.so +0x5d3aeb0]
#02: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0xf8d0]
#03: ???[/NREF-COMM-CENTRAL/objdir-tb3/dist/bin/libxul.so +0x110d3ee]

     ...

#23: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x21b45]
#24: ???[../objdir-tb3/dist/bin/thunderbird +0x4409]
#25: ??? (???:???)

Sleeping for 300 seconds.
Type 'gdb ../objdir-tb3/dist/bin/thunderbird 31834' to attach your debugger to this thread.
Done sleeping...


Stack back trace is in another attached file.

I think that, by the time the final assertion that triggers the crash is asserted, some internal database or the so called cache became corrupted.

I have no idea why NewRow() returns, though.
I have plenty of memory. 

TIA
Stack trace from GDB.

Somehow re-reading the message from the local "cache" gets confused?

(I added a comment about strange lineLength being 1 when CRLF "\r\n" is passed
to a routine. It may  or may not be related to the issue.)

TIA
Sorry uploaded a patch before the final |hg qrefresh|.
Updated one.
Attachment #8724386 - Attachment is obsolete: true
Hi,

I think we need to figure out WHY NewRow() fails in the first place (mork).

I am uploading an enhanced patch that address a few issues of failing to catch the error [nullptr being returned despite NS_OK] (and one of them seems to be very important: others would have tripped TB (crash) if the path was taken. There could have been silent error.)

With the new patch, I again see some random behavior.
I am using locally created DEBUG BUILD.
1. Either I see a flurry of CreateNewHdr failed with key = -1) error as before
 (and in this case, the error return is taken due to the proper error value, but
  the invoking UI is not quite prepared for the error and NO error message is
  shown to UI and error comparison, AND so I got stuck with the bunch of messages selected in the folder-C of non-default account that uses IMAP, and wondered
what is going on until I check the dump log messages printed to the invoking tty console window where the thunderbird program is invoked.
  and in another case, I noticed a flurry of strange error messages in the error console of TB which is the same as in case 2 although I did not notice the big key # as in case 2 below.)

2. In another case, I don't see any error messages in the tty console at all. *BUT*
nothing happens as far as MOVE of message is concerned, and TB becomes very sluggish
and I realize something is working on the folder-C of the default account that uses pop3. This is because when I try to repair/compact the folder-C I got the error dialog
to try operation later since something else is using it (!) so try later. Obviously another task has
a lock on it.
And in this case, several  minutes later as I write this message, I noticed compaction suddenly starts to work and a flurry of messages are shown in the error console: the early ones flew by. But I see these messages repeated.
(With increasing key # following "skethcy key:". I wonder what is "sketchy key:", that is).
But note the msg key printed for the last few entries!!!

2016-02-28 07:09:23	gloda.index_msg	WARN	Observed header that claims to be gloda indexed but that gloda has never heard of during compaction. In folder: mailbox://ishikawa@localhost/folder-C sketchy key: 1242 subject: Cron <ishikawa@ip030> /home/ishikawa/bin/cron-daily.sh

   ... many repetitions ...



2016-02-28 07:09:30	gloda.index_msg	WARN	Observed header that claims to be gloda indexed but that gloda has never heard of during compaction. In folder: mailbox://ishikawa@localhost/folder-C sketchy key: 1540 subject: Cron <root@ip030> /usr/sbin/ntpdate ntp.nict.jp

2016-02-28 07:09:30	gloda.index_msg	WARN	Observed header that claims to be gloda indexed but that gloda has never heard of during compaction. In folder: mailbox://ishikawa@localhost/folder-C sketchy key: 1541 subject: Cron <root@ip030> /usr/sbin/ntpdate ntp.nict.jp

2016-02-28 07:09:30	gloda.index_msg	WARN	Observed header that claims to be gloda indexed but that gloda has never heard of during compaction. In folder: mailbox://ishikawa@localhost/folder-C sketchy key: 1542 subject: Cron <root@ip030> /usr/sbin/ntpdate ntp.nict.jp

(This is where things get interestng. Notice the big #.)

2016-02-28 07:09:30	gloda.index_msg	WARN	Observed header that claims to be gloda indexed but that gloda has never heard of during compaction. In folder: mailbox://ishikawa@localhost/folder-C sketchy key: 4294967285 subject: Cron <ishikawa@ip030> /home/ishikawa/bin/cron-daily.sh

2016-02-28 07:09:30	gloda.index_msg	WARN	Observed header that claims to be gloda indexed but that gloda has never heard of during compaction. In folder: mailbox://ishikawa@localhost/folder-C sketchy key: 4294967286 subject: Cron <ishikawa@ip030> /home/ishikawa/bin/cron-daily.sh

  ... again repeated with key # incremented ...

2016-02-28 07:09:30	gloda.index_msg	WARN	Observed header that claims to be gloda indexed but that gloda has never heard of during compaction. In folder: mailbox://ishikawa@localhost/folder-C sketchy key: 4294967293 subject: Cron <ishikawa@ip030> /home/ishikawa/bin/cron-daily.sh

2016-02-28 07:09:30	gloda.index_msg	WARN	Observed header that claims to be gloda indexed but that gloda has never heard of during compaction. In folder: mailbox://ishikawa@localhost/folder-C sketchy key: 4294967294 subject: Cron <ishikawa@ip030> /home/ishikawa/bin/cron-daily.sh

This is the end of mesasge in the console...
Ooops, can it be that this folder has reached 4GB limit!? No way.
It has 1536 messages and mere 85.1MB on the disk.
Worse, I thought the compaction finished, but the folder-C is still locked.
When I tried to invoke property->repair, the operation was denied.
So maybe TB is busily creating a new OID or something (!?)
I did not see any suspicious system calls using strace.
So it is doing something in the user space and is holding the lock on folder-C, still.

I thought msgkey starts at 1? the property dialog states that there are 1536 messages and so I am not sure where the large key message # comes from (OK, maybe it was the old key msg left before compact worked.)

I am not sure what is going on. But we need to investigate both

1. - why NewRow() fails IN THE FIRST PLACE. Above messages suggests there is a logic flaw
  in the way newkey or whatever is assigned (or memory corruption?), 

2. Catch the errors and make TB more robust (I think the uploaded patch does catch them somehow), and 

3. REPORT such errors to users in a error dialog (it is not done!)

TIA
After quitting TB, which seemed to take time also, I restarted it.

I repaired folder-C of the default account which I could not compact.
This time, though, I got the following message in the error console of TB.

Timestamp: 2016年02月28日 07時41分02秒
Error: 2016-02-28 07:41:02	gloda.index_msg	ERROR	Exception while attempting to mark message with gloda state afterdb commit [Exception... "Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMsgDBHdr.getUint32Property]"  nsresult: "0x80070057 (NS_ERROR_ILLEGAL_VALUE)"  location: "JS frame :: resource:///modules/gloda/index_msg.js :: PendingCommitTracker_commitCallback :: line 161"  data: no]

Source File: resource:///modules/gloda/log4moz.js
Line: 691

There are some subtle interactions.
Note the following existing comment: "// ### probably should be an error if hdrRow is NULL..."
in the following function in
mailnews/db/msgdb/src/nsMsgDatabase.cpp


nsresult nsMsgDatabase::RowCellColumnToUInt32(nsIMdbRow *hdrRow, mdb_token columnToken, uint32_t *uint32Result, uint32_t defaultValue)
{
  nsresult  err = NS_OK;

  if (uint32Result)
    *uint32Result = defaultValue;
  if (hdrRow)  // ### probably should be an error if hdrRow is NULL...
  {
    struct mdbYarn yarn;
    err = hdrRow->AliasCellYarn(GetEnv(), columnToken, &yarn);
    if (NS_SUCCEEDED(err))
      YarnToUInt32(&yarn, uint32Result);
  }
  else
  {
#if DEBUG
    fprintf(stderr,"{debug} ERROR!? hdrRow was nullptr in "
            "nsMsgDatabase::RowCellColumnToUInt32()\n");
#endif
  }
  return err;
}

Can it be the failure/bug of NewRow() generated key that is larger than 32-bit value?
Wait, the largest key # reported in the log (see my previous comment)
4294967294


According to my calculation, that number is precariously close to 2^32...

ishikawa@ip030:/NREF-COMM-CENTRAL/comm-central$ bc
bc 1.06.95
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
2^32
4294967296
2^32-1
4294967295
t=4294967294
2^32 - t
2
quit

Any thought?
(In reply to Kent James (:rkent) from comment #16)
> My current theory of this crash is that, with the changes in bug 1183490, we
> are using much more frequently the code path that generates a new key from
> the DB rather than assigning directly. This is code that is not as well
> tested, in particular error paths. I suspect we are getting db errors that
> are not well handled.
> 

I have a question regarding the key.

Is the message key unique  within the scope of 
1 - only each folder, 
2 - many folders of each account in one's profile, or
3 - across one's entire profile?

In the case of 1, two messages in different folders
can have the same message key.
In the case of 2, the messages in one's account across many folders have unique keys.
But a message in another account may have an overlapping key.
In the case of 3, the entire messages in many folders of plural accounts in one's profile
have unique keys.

What is the current framework?

I am not sure how what "mork" database handles generation of unique keys.
But usually, a database for a folder would only generate a unique key within the scope of the opened file.
So I suspect case-1 is the current one. Am I correct?

I have no idea where to look for NewRow() returning nullptr, but
at the same time,

BTW, I am looking at the patch in 
Bug 1202105 - After fix of bug 854798 in Tb 38, "mail which was not removed by Compact of BerkleyStore_LocalMailFolder" can be replaced by newly added/copied mail at Thread pane 

and realize that
sorry, hit [save changes] by mistake.

> and realize that

the patch in question is NOT pushed to C-C TB tree yet.
(Maybe because of the build bustage later December kept rkent from pushing it.)
I don't see the code in MXR (???)

But I suspect that when "the code path that generates a new key from
> the DB rather than assigning directly" was mentioned, this patch was probably
which rkent had in mind.

No, worse than that, I realize now that
something is amiss very much.
I don't even see the supposedly old lines that the patch in
Bug 1202105 is supposed to replace.

    // if mbox is close to 4GB, auto-assign the msg key.
    nsMsgKey key = nsMsgKey_None;
    nsMsgKey key = filePos > 0xFFFFFF00 ? nsMsgKey_None : (nsMsgKey) filePos;

The current C-C TB tree is:
http://mxr.mozilla.org/comm-central/source/mailnews/local/src/nsMsgBrkMBoxStore.cpp#653

653   int64_t filePos;
654   seekable->Tell(&filePos);
655 
656   if (db && !*aNewMsgHdr)
657   {
658     db->CreateNewHdr(nsMsgKey_None, aNewMsgHdr);
659   }
660 
661   if (*aNewMsgHdr)
662   {
663     char storeToken[100];
664     PR_snprintf(storeToken, sizeof(storeToken), "%lld", filePos);
665     (*aNewMsgHdr)->SetMessageOffset(filePos);
666     (*aNewMsgHdr)->SetStringProperty("storeToken", storeToken);
667   }
668   return rv;
669 }
670 

The above is the code in my local tree and that is what I tried to tighten up by checking
the return code of CreateNewHdr().

Maybe something is wrong with the C-C TB tree?
(Some patches that should have landed in unison to handle 4GB folders
have not landed at all or the landing may have missed a piece or two)?

TIA
Flags: needinfo?(rkent)
Flags: needinfo?(acelists)
A few comments:

1) The mork code seems to routinely follow a pattern that can return a null ptr but NS_OK due to errors. I am reluctant to fix that in one spot without changing the pattern everywhere, which I don't want to do this close to TB 45, so the fix I am proposing at https://hg.mozilla.org/try-comm-central/rev/ef37ccaff2b0 just adds some error checks downstream of Mork.

2) The bug key warnings in comment 21 makes me wonder how the code will respond when the key gets too large. If a folder previously (pre TB45) was near the size limit, got compacted down, then run under TB 45 with its constantly increasing keys, I can imagine that we could hit a max size key. I'll try to do a test case of this where I artificially set a large key value.
Flags: needinfo?(rkent)
I artificially forced new message keys to be greater than 0xfffffffa  Then copying messages to that folder, when I reach the limit I crash with our signature. Here is an example:

http://crash-stats.mozilla.com/report/index/bp-227048dc-8623-4476-92ce-6d0ee2160229

What is odd is that after the crash, coming back to the folder, what I see is that the message keys are no longer the large values. Something is resetting those keys when the limit is reached - a compact? Let me investigate further.
When the key hits the ffffffff limit, mork is correctly detecting the error, but this code in nsMsgBrkMBoxStore.cpp is not checking the error returned by CreateNewHdr:


  if (db && !*aNewMsgHdr)
  {
    db->CreateNewHdr(nsMsgKey_None, aNewMsgHdr);
  }

  if (*aNewMsgHdr)

So at this point, the new msg hdr exists, but it was created with an error and a null row.
My current issue involves a test case, where I copy from an IMAP folder to a local folder, and the key on the local folder hits the limit. The problem is, with the crash, the db seems to end up invalid, and when I restart after the crash, the db is noticed as invalid, and the local folder is parsed when opened, repairing the problem.

If I just detect the crash and prevent it, then the db never gets marked invalid, and so permanently and silently refuses to accept new messages. I tried to force the db invalid, but this code in nsLocalMailFolder.cpp is resetting it as valid:

      // passing true because the messages that have been successfully
      // copied have their corresponding hdrs in place. The message that has
      // failed has been truncated so the msf file and berkeley mailbox
      // are in sync.
      (void) OnCopyCompleted(mCopyState->m_srcSupport, true);
(In reply to Kent James (:rkent) from comment #28)
> My current issue involves a test case, where I copy from an IMAP folder to a
> local folder, and the key on the local folder hits the limit. The problem
> is, with the crash, the db seems to end up invalid, and when I restart after
> the crash, the db is noticed as invalid, and the local folder is parsed when
> opened, repairing the problem.
> 
> If I just detect the crash and prevent it, then the db never gets marked
> invalid, and so permanently and silently refuses to accept new messages. I
> tried to force the db invalid, but this code in nsLocalMailFolder.cpp is
> resetting it as valid:
> 
>       // passing true because the messages that have been successfully
>       // copied have their corresponding hdrs in place. The message that has
>       // failed has been truncated so the msf file and berkeley mailbox
>       // are in sync.
>       (void) OnCopyCompleted(mCopyState->m_srcSupport, true);

Thank you for the analysis. I will try to follow the patch you checked in and the
few comments. TB certainly looks very complicated. It is a wonder tht it works at all.
Attached patch WIP v1 (obsolete) — Splinter Review
This is a WIP of a fix.
I see from the try server test failures that my WIP makes an incorrect assumption. Apparently existing code handles the case of adding messages to a store with an invalid database by adding to the store without adding to the database, which will be fixed when the store is reparsed. In my WIP I prevented that, but I suppose I should change it back to keep the existing behavior.
How about adding to a store with a *missing* database? That's the root of the problem in bug 1216914. The MSF file goes missing, no idea why, and then drafts get added and superseded and when the database is finally rebuilt, they all come back. That is reproducible. Just delete the MSF of the drafts folder, save some drafts and then visit the drafts folder.
Flags: needinfo?(rkent)
My understanding of the existing design intent, after looking at test_copyToInvalidDB.js, is that the new drafts get saved, and are then restored when the folder is rebuilt. That sounds like the same result as bug 1216914, but it is expected behavior if the msf file is invalid.
Flags: needinfo?(rkent)
And missing counts as invalid?
(In reply to Jorg K (GMT+1) from comment #34)
> And missing counts as invalid?

I'm not sure what you mean. See nsMsgBrkMBoxStore::IsSummaryFileValid for the process to check validity of mbox files.
(Sorry, I'm in Spain, we don't have internet here, the neighbours have switched off their router and I'm on a volume-paid 3G plan right now.)

Why is my question so hard to understand. You delete the MSF file while TB is running. You need to find the right time so Windows hasn't locked the file. So the database is missing now. Does this run through the same code as if it were invalid, IOW: Does "missing" count/get treated as "invalid"?
Summary: startup CRASH: C-C TB ASSERTION: can't be your own parent: 'false', ... mailnews/db/msgdb/src/nsMsgHdr.cpp, then CRASH! → startup CRASH: C-C TB ASSERTION: can't be your own parent: 'false', ... mailnews/db/msgdb/src/nsMsgHdr.cpp, then CRASH in morkTable::AddRow
Assignee: nobody → rkent
Preventing the crash was not hard, what was hard was convincing the database to reindex. SetSummaryValid(false) does not work because there are too many places that set it to true, so I ended up having to create a new dbFolderInfo variable to flag the need to reindex.

I started on a test, but when you hit the key limits lots of assertions get hit, and I was uncomfortable removing them all. So this was all tested using UI with a patched version of TB that created msgHdrs with a large key value when you copied a message onto a folder.

This is blocking he next TB 45 beta so this is important. Not easy to review or test though.
Attachment #8724388 - Attachment is obsolete: true
Attachment #8724472 - Attachment is obsolete: true
Attachment #8725619 - Attachment is obsolete: true
Attachment #8729280 - Flags: review?(acelists)
Attachment #8729280 - Flags: review?(Pidgeot18)
Comment on attachment 8729280 [details] [diff] [review]
use forceReparse folderInfo variable

Review of attachment 8729280 [details] [diff] [review]:
-----------------------------------------------------------------

These are some of my comments.

Not much code left, but this represents about 3 days of work. It would be good if jcranmer reviewed this, but aceman would be a backup reviewer.

::: db/mork/src/morkRowSpace.cpp
@@ +379,5 @@
>    mRowSpace_NextTableId = id + 1;
>    return outTid;
>  }
>  
> +#define MAX_AUTO_ID (morkRow_kMinusOneRid - 2)

Previously mork had no protection from overflow of the key value, it would just roll over to 0. The values near max are used for special purposes, so I am -2 from the limit.

::: mailnews/local/src/nsMsgBrkMBoxStore.cpp
@@ +822,5 @@
>  
> +  rv = mailboxService->ParseMailbox(aMsgWindow, pathFile, parser, aListener,
> +                                    nullptr);
> +  if (NS_SUCCEEDED(rv))
> +    ResetForceReparse(aMsgDB);

This ResetForceReparse code is not really needed for this to work, but I wanted to make sure that the forceReparse got reset in case people add other paths to reindex in the future.
(In reply to Kent James (:rkent) from comment #39)
> BTW see try server jobs (relative to aurora):
> 
> https://treeherder.mozilla.org/#/jobs?repo=try-comm-
> central&revision=f98a916cf0e3

I have a bad news to tell you.
I checked the patch and produced my own copy of debug version of C-C TB locally,
and I also copy a version from your tryserver submission.

Both now hit the assertion mentioned in comment 18
 file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 1678
[31834] ###!!! ASSERTION: error finalizing headers parsing mailbox: 'NS_SUCCEEDED(rv)', file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 680
[31834] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8007000E: file /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp, line 681
Assertion failure: buf_length > 1 && (buf[buf_length - 1] == '\r' || buf[buf_length - 1] == '\n') (Header text should always end in a newline), at /NREF-COMM-CENTRAL/comm-central/mailnews/local/src/nsParseMailbox.cpp:941

I am attaching the dump from the tryserver version that led to the crash.
Attached file t-dump-tryserver.lst
This is the dump from debug version available from the tryserver run.

Again, this happened after I moved 80-100 mails from a folder in my default account (pop3) to a folder of an additional account (imap), then  moved back slghtly larger # of messages back to the original. I repeated this step a couple of times when the assertion triggered.

I am going to upload another dump from my local debug version.
It prints values of some variables and very lengthy.
This is the dump output from my local debug version leading to the same assertion.

I notice that failing to create a newhdr properly occurred fairly early.
Maybe the new patch attempted the reparsing after the early failure, but
eventually somehow, when we moved on to the fatal message something went wrong
and we hit the assertion.

I will try to put a dump statement to show if the new patch is working well.
(I.e, the setting of reparsing required, etc.)

TIA
I purposely did not remove assertions that get hit in this, so what you are showing me are expected results.
My main concern in this is whether there are conditions in which creating of the msgHdr fails (specifically creating of header row), but are somehow normal operation conditions that should not require a reparse.

I did not remove the assertions because I felt it was too risky this late in the game. Really they should not be there in the first place, since as far as I can tell the design expection is that Thunderbird continues to operate even with an unusable database, as long as it can successfully put messages in the message store.

All of this internal processing is incredibly fragile, and to a large extent we have lost any reasonable sense of design discipline. Very unpleasant to work on, and for that reason I am reluctant to do any but the smallest changes. All that I an really trying to accomplish here is to stop a top crash, and be able to continue operating after the crash detection. Without some of the changes here, I can prevent the crash, but then the folder is in a permanent state of being unable to add new messages until a reparse.
> I did not remove the assertions because I felt it was too risky this late in
> the game.

I understand. 
In this particular case, the folder content remains intact. so I can restart TB and
continue.

We can come back to this issue once 45 is released.

BTW, I could not build C-C TB on tryserver due to the recent M-C chnages, but
I stole your
--mozilla-rev=FIREFOX_AURORA_47_BASE
trick as in https://hg.mozilla.org/try-comm-central/rev/f98a916cf0e3
This trick probably should me mentioned in
https://wiki.mozilla.org/ReleaseEngineering/ThunderbirdTryServer
as a last resort in a crunch time before imminent release deadline.
Right now it says:
Tips and Tricks

    Pushing a try build with try-comm-central will always pick the latest m-c changeset.

Of course, abusing this trick will hinder the motivation to cope with the latest m-c changes head on. So it should be the LAST resort.


TIA
Whiteboard: [startupcrash] → [regression:TB45][startupcrash]
Have we actually found out why there is no new row created in the presented test case?

With old POP3/local folders keys could be exhausted when nearing 4GB in size. In current code, where keys are incremented by one, this should not happen even in that scenario. So I understand that hitting key exhaustion (due to key=offset for a ton of old messages), then reparsing the folder (rebuild?) will actually solve the problem as new keys for those same messages will now be incremented by one so they can't be exhausted anymore (assuming 2^32 keys is enough for now).

But what is the problem when copying the messages as in the presented scenario? Is the account being IMAP type the difference? In what way do you hit the key limit there without having a 4GB folder?
(In reply to :aceman from comment #46)
> Have we actually found out why there is no new row created in the presented
> test case?

I have not actually tried to reproduce the crash from the test files given by Chiaki. Looking at his output, it appeared to me that the crash might be related to large values of keys, and I was able to reproduce easily by setting the key large. I would be good to see if his case could be reproduced and tested with the fix, but I am not convinced that his test case is representative of the majority of the crashes that we have.

> 
> With old POP3/local folders keys could be exhausted when nearing 4GB in
> size. In current code, where keys are incremented by one, this should not
> happen even in that scenario.

But it could happen when somebody has a folder that previously was large and hence had big keys, but perhaps got compacted (which now preserves keys) and is incrementing keys for new key values.

But I think it is likely that there is something else that we do not understand that is resulting in large key values. I've seen one pattern that I am suspicious of, that when a new msgHdr is created but then found in the cache, the old msgHdr is discarded, but mork does not reset the nextKey value.

The proposed patch should stop the crash in those cases. What is less certain is 1) how will it behave post-crash, and 2) how frequently the checks in this code will result in forced reparse. The crash, while frequent enough to be a blocker, is still relatively rare in the absolute sense. I can show that the behavior is reasonable in the case I can easily reproduce. I think it is unlikely that the crash recovery will result in behavior that is worse than the existing crash in other cases.

It is actually fairly unusual that we can reproduce a crash like this. Most of the time we are just making guesses, and fix our guess.

> But what is the problem when copying the messages as in the presented
> scenario? Is the account being IMAP type the difference? In what way do you
> hit the key limit there without having a 4GB folder?

Don't know.
Chiaki, when you test this, can you please enable the 'Order received' column in the folders so that you can see what kind of message keys you get. We need to know if they are properly increaing by 1 or if you get some wildly high numbers or what.

(In reply to ISHIKAWA, Chiaki from comment #23)
> Is the message key unique  within the scope of 
> 1 - only each folder, 
> 2 - many folders of each account in one's profile, or
> 3 - across one's entire profile?
> 
> What is the current framework?
> 
> I am not sure how what "mork" database handles generation of unique keys.
> But usually, a database for a folder would only generate a unique key within
> the scope of the opened file.
> So I suspect case-1 is the current one. Am I correct?

Yes, case 1 is the current operation.
Flags: needinfo?(acelists)
(In reply to :aceman from comment #48)
> Chiaki, when you test this, can you please enable the 'Order received'
> column in the folders so that you can see what kind of message keys you get.
> We need to know if they are properly increaing by 1 or if you get some
> wildly high numbers or what.
> 
Will do.

> (In reply to ISHIKAWA, Chiaki from comment #23)
> > Is the message key unique  within the scope of 
> > 1 - only each folder, 
> > 2 - many folders of each account in one's profile, or
> > 3 - across one's entire profile?
> > 
> > What is the current framework?
> > 
> > I am not sure how what "mork" database handles generation of unique keys.
> > But usually, a database for a folder would only generate a unique key within
> > the scope of the opened file.
> > So I suspect case-1 is the current one. Am I correct?
> 
> Yes, case 1 is the current operation.

I see.

I have not used imap for most of my experience with TB for the last dozen years or so, but I noticed there *is* a strange interaction with imap server when I move messages.
(I *think*, when I copy/move new messages into a folder under the control of imap, 
TB does the operation locally then update the information on the server side (or does it simultaneously like copying the moved/copied message) to the remote imap server, THEN
with dovecot imap server [I am not sure if there is a behavior that is particular to an implemention of an IMAP-server], the server seems to initiate the DOWNLOAD of the messaeges to the client (essentially just overwriting or updating the information of the messages in the local folder). It is hard to explain, but looking at the count of UNREAD messages in the touched folder, I see rather strange update of the number while testing. Don't know if it is related to the bug. I presume that the msgKey value can be solely
selected by local TB alone and not dictated by the remote Imap server, but still there seems to be a redundant copying or updating of messages themselves or meta data.
I tried to reproduce the conditions of Chiaki's crash. This is associated with the the use of pseudoKey in nsImapUndoTxn, which creates a key of 0xfffffff0 in the local folder during a move from the local folder to the imap folder. We'll need to handle this case.
Comment on attachment 8729280 [details] [diff] [review]
use forceReparse folderInfo variable

I'll modify the patch to deal with Chiaki's case.
Attachment #8729280 - Flags: review?(acelists)
Attachment #8729280 - Flags: review?(Pidgeot18)
Here is a simplified STR of Chiaki's case using Thunderbird 45.0b2:

1) Create a new profile, and add an IMAP account that has at least 20 messages in the inbox using all default settings.

2) Create an empty folder in Local Folders, "testFolder".

3) Copy one message from the IMAP Inbox to testFolder.

4) Move the one message in testFolder back to the IMAP inbox.

5) Select 20 messages in the IMAP Inbox, and copy them to testFolder

=> Crash bp-9eb717b6-6419-4589-a533-0676b2160315
Analysis:

The move from testFolder to the IMAP folder creates an undo transaction in nsImapMoveCopyMsgTxn::Init, getting a key from GetNextPseudoMsgKey from the local db. This gets a key of kFirstPseudoKey which is 0xfffffff0. That sets the value of the maximum key in the local database to 0xfffffff0. Then when you try to copy 20 messages into testFolder, each message increments the next message key until you reach the limit after 16 messages, resulting in the crashes described in comments 26, 27, and 28.
The pseudo key is no longer needed. We can just use nsMsgKey_None (since the key is never added to the database). I'm less confident of the second usage, in nsIMsgOfflineImapOperation.

Try run https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=035bc21184e3
Attachment #8729280 - Attachment is obsolete: true
Hi, I tested the linux64 binary availalble at
http://archive.mozilla.org/pub/thunderbird/try-builds/kent@caspia.com-035bc21184e303df267c9052a2688af43a501376/try-comm-central-linux64-debug/
from rkent's job mentioned in comment 54, and I am happy to report I no longer see the crash which I reported any more!
(In reply to Kent James (:rkent) from comment #54)
> Created attachment 8730536 [details] [diff] [review]
...
> I'm less confident of the second usage,
> in nsIMsgOfflineImapOperation.

> 
I will see if I can go off-line with my imap setup and see if there is any adverse 
behavior during copying with your patch (the binary available from your run: I still can't build 
my local C-C TB after the configure problem introduced a few days earlier.)
(In reply to Kent James (:rkent) from comment #53)
> Analysis:
> 
> The move from testFolder to the IMAP folder creates an undo transaction in
> nsImapMoveCopyMsgTxn::Init, getting a key from GetNextPseudoMsgKey from the
> local db. This gets a key of kFirstPseudoKey which is 0xfffffff0. That sets
> the value of the maximum key in the local database to 0xfffffff0. Then when
> you try to copy 20 messages into testFolder, each message increments the
> next message key until you reach the limit after 16 messages, resulting in
> the crashes described in comments 26, 27, and 28.

Ok, so the problem was caused by bug 1183490. Before it, even if we created the pseudokey of 0xfffffff0, new real messages got real keys equalling their offset, so below 0xfffffff0.
After that bug, if 0xfffffff0 got into the database, all new keys started incrementing from it and hit 0xffffffff eventually.

If you drop pseudokeys support, we get a normal key (increment by 1) for the undo operation. If we put new real messages on top of that, they get successive keys. If that undo message is eventually removed from the db, we get a hole in the keys. But that is fine, that is the same as when we delete a normal message.

Is that right?
Status: NEW → ASSIGNED
(In reply to ISHIKAWA, Chiaki from comment #55)
> Hi, I tested the linux64 binary availalble at
> http://archive.mozilla.org/pub/thunderbird/try-builds/kent@caspia.com-
> 035bc21184e303df267c9052a2688af43a501376/try-comm-central-linux64-debug/
> from rkent's job mentioned in comment 54, and I am happy to report I no
> longer see the crash which I reported any more!

Hi, have you looked at the keys generated in the test? With this new patch you only get very low key numbers? Without the patch you get the high key numbers (above the pseudokey constant) ?

With the patch, do you get the "Header text should always end in a newline" warning output? You hit that spot in your tests previously (e.g. comment 18). Is that a separate issue or was that actually a sideeffect of the key overflow?
(In reply to :aceman from comment #58)
> (In reply to ISHIKAWA, Chiaki from comment #55)
> > Hi, I tested the linux64 binary availalble at
> > http://archive.mozilla.org/pub/thunderbird/try-builds/kent@caspia.com-
> > 035bc21184e303df267c9052a2688af43a501376/try-comm-central-linux64-debug/
> > from rkent's job mentioned in comment 54, and I am happy to report I no
> > longer see the crash which I reported any more!
> 
> Hi, have you looked at the keys generated in the test? With this new patch
> you only get very low key numbers? Without the patch you get the high key
> numbers (above the pseudokey constant) ?
> 
> With the patch, do you get the "Header text should always end in a newline"
> warning output? You hit that spot in your tests previously (e.g. comment
> 18). Is that a separate issue or was that actually a sideeffect of the key
> overflow?

Let me sort this out. Funny thing was that I thought I did not see these high numbered keys at all even when the latest patch landed. Maybe I was looking at a wrong place.
(At least Order Received showed only the few thousand range.)
But rkent was talking about the key value passed to Mork API, so I should dump that.
One thing got in the way, though.  The latest change in M-C part of the tree broke the build/configure scripts and I can no longer create a local binary. I could with a trick to use a certain revision in the tryserver build. But I am afraid I tarnished the local copy when I tinkered with the local tree to revert back the M-C portion to a previous revision :-(
So I saved my local patches and restarted from a new tree fetched from the mozilla server, but still it does not build yet.

TIA
Why can't you test again with rkent's try build and look at the keys? Or you mean the test WITHOUT the patch? We have a page somewhere that lists which m-c and c-c revisions you should revert to to get a working build.
Flags: needinfo?(vseerror)
The pseudo key headers are initially created without adding the hdr to the database, so even though the max key value is set, you would not see those big keys in the db. You have to first do the local->imap copy to set the large value, then add hdrs to the local folder to see large key values.

As for "I'm less confident of the second usage, in nsIMsgOfflineImapOperation." I now believe that I have to modify that code to behave differently in the imap and non-imap case. I hate kludges like that, but they are a consequence of the fact that we still have conflated key with UUID values in imap, so we cannot allow simple auto increment key creation in IMAP as we will overrun key values that are needed to map to UUID. But we can use the large values in IMAP because we never use auto-increment keys.
(In reply to :aceman from comment #60)
> Why can't you test again with rkent's try build and look at the keys? Or you
> mean the test WITHOUT the patch? We have a page somewhere that lists which
> m-c and c-c revisions you should revert to to get a working build.

Chiaki, the page is at
https://public.etherpad-mozilla.org/p/thunderbird-tree-status
Flags: needinfo?(vseerror)
I wanted to lower the risk of forcing a reparse for reasons that we do not understand, so I added a scan of the keys to make sure that we have a big key before we force the reparse.

I've started a try run for this: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=e93385fdfe37
Attachment #8730536 - Attachment is obsolete: true
Attachment #8731055 - Flags: review?(acelists)
(In reply to :aceman from comment #62)
> (In reply to :aceman from comment #60)
> > Why can't you test again with rkent's try build and look at the keys? Or you
> > mean the test WITHOUT the patch? We have a page somewhere that lists which
> > m-c and c-c revisions you should revert to to get a working build.
> 
> Chiaki, the page is at
> https://public.etherpad-mozilla.org/p/thunderbird-tree-status

Thanks. I will also wait for rkent's new patch to build a binary for me test later.

Right now, my PC is on fire, so to speak. So it may take a day or two before I can report back.
(A disk is dying. So I had to copy disk images to a different disk and start over: I am using virtualbox to run linux guest on windows10, and recreating new virtual machine should be easy, but that erratic behavior of the dying disk happened while I was testing UPS's behavior when power is turned off [to make sure it can safely shut down the system] caused no end of confusion until I figured out that it was not the fault of UPS utility. I got a hung system due to failing disk, but I was thinking it was UPS utility under test that caused a hung system. Good that I did not tarnished the dying disk completely. I could save the virtual machine images to a new disk completely (up to then used image as of last night)... But for the last 18 hours or something I could not touch anything in the guest OS where my development takes place yet. I should be able to rebuild the new virtual machine tomorrow. PC is fun when we have plenty of time, but when time is short, I wish I had a sysadmin at home...
Comment on attachment 8731055 [details] [diff] [review]
Scan keys before setting forceReparse

Review of attachment 8731055 [details] [diff] [review]:
-----------------------------------------------------------------

OK, I think I understand what this is doing now.

So to summarize our IRC discussion, the result of comment 61 is that pseudo keys are still needed, but only in case of IMAP. They must not be used for local/POP3 and stored in the database.

::: mailnews/imap/src/nsImapUndoTxn.cpp
@@ +506,5 @@
> +      // Imap protocols have conflated key/UUID so we cannot use
> +      // auto key with them.
> +      nsCString protocolType;
> +      srcFolder->GetURI(protocolType);
> +      protocolType.SetLength(protocolType.FindChar(':')); 

Trailing space.
Attachment #8731055 - Flags: review?(acelists) → review+
Attachment #8731055 - Attachment is obsolete: true
Attachment #8731336 - Flags: review+
Comment on attachment 8731336 [details] [diff] [review]
Patch to checkin (with trailing space removed)

http://hg.mozilla.org/releases/comm-aurora/rev/a71a52dce14d
http://hg.mozilla.org/releases/comm-beta/rev/4cc13f811904
Attachment #8731336 - Flags: approval-comm-esr45?
Attachment #8731336 - Flags: approval-comm-beta+
Attachment #8731336 - Flags: approval-comm-aurora+
Whiteboard: [regression:TB45][startupcrash] → [regression:TB45][startupcrash][checkin-needed comm-central]
Comment on attachment 8731336 [details] [diff] [review]
Patch to checkin (with trailing space removed)

http://hg.mozilla.org/releases/comm-esr45/rev/8594f62a889f
Attachment #8731336 - Flags: approval-comm-esr45? → approval-comm-esr45+
Keywords: checkin-needed
Whiteboard: [regression:TB45][startupcrash][checkin-needed comm-central] → [regression:TB45][startupcrash]
HI,

I did not see the crash any more with rkent's build at
http://archive.mozilla.org/pub/thunderbird/try-builds/kent@caspia.com-e93385fdfe37c84e2f35349c979c4cb639e6209e/try-comm-central-linux64-debug/
with my random testing of copying/moving messages between folders of local and imap accounts.


I have a suspicion that the file folders which I have been testing *may* have
corrupt lines (despite my repairing the folders before each test).
I recreated the folder from scratch using valid e-mail messages and
repeated the test and I did not see the warning regarding each header line should end
with NEWLINE warning, etc. at all.

I will report if anything suspicious pop up during my tests of local patches for enabling buffering, etc.

TIA
I am puzzled a little bit.
The patch has not been pushed to C-C tree yet?
I could rebuild my virtual machine and built my local version.
But then I realize the current C-C still does not have the patch yet.
Is this intentional?
C-C is closed to to bustage. Use this query to see bugs that need landing:
https://bugzilla.mozilla.org/buglist.cgi?cmdtype=dorem&remaction=run&namedcmd=comm-central%20checkin-needed&sharer_id=227224&list_id=12916683

Some patches have already been landed on aurora, beta and esr45 branches since we urgently need them.
(In reply to Jorg K (GMT+1) from comment #71)
> C-C is closed to to bustage. Use this query to see bugs that need landing:
> https://bugzilla.mozilla.org/buglist.
> cgi?cmdtype=dorem&remaction=run&namedcmd=comm-central%20checkin-
> needed&sharer_id=227224&list_id=12916683
> 
> Some patches have already been landed on aurora, beta and esr45 branches
> since we urgently need them.

I see. Let's hope the build issues will be sorted out real soon now!
https://hg.mozilla.org/comm-central/rev/bb1fbfbdb47f
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 48.0
Folks, 

I am not sure if the latest nightly (after a few weeks)  has solved the failure in AddRow() completely.
https://bugzilla.mozilla.org/show_bug.cgi?id=1250723
I can constantly crash TB.
Crash report id:
bp-a8442442-3129-4fd9-8374-079e82160328
bp-8d860e04-b36a-41fc-97a7-4c2172160328
bp-25fc54ab-2c94-4f2e-947e-f52972160328
bp-1418f627-b641-4c7e-a1e2-4d9fb2160328
bp-63e5c71a-dfac-4203-83cf-8650c2160328
bp-f0cbb6af-f0a8-475e-a644-904902160328
bp-ffb2aef7-af5d-4bf7-ac31-a2df62160328

It is possible that handling of the thread (as in mail thread) or something 
got confused since my folder had many identical messages after a quite a few move/copy operation, but to avoid this I inserted about a hundre new e-mail messages for testing purposes.

Anyway, in the above error report I mentioned something about compaction may play a role, but I can possibly rule that out.

Simply moving a batch of messages from a folder in main account (pop3)
to a folder in a secondary imap account, and THEN
selecting a batch of messages from the BOTTOM part of the 
list of messages in the folder of the imap account and try moving them  to the original folder in the main accont (pop3) caused TB to crash.

TIA
Chiaki, the fixes only just landed in comm-central.
(In reply to Kent James (:rkent) from comment #75)
> Chiaki, the fixes only just landed in comm-central.

I should have known. I will re-check with the newer nightly, then.
Still, having a nightly is great!

TIA
You need to log in before you can comment on or make changes to this bug.