Closed Bug 1887532 Opened 2 years ago Closed 1 year ago

C-C TB crashes when an I/O error is simulated.

Categories

(MailNews Core :: Database, defect)

Thunderbird 126
x86_64
Linux
defect

Tracking

(thunderbird_esr115 wontfix)

RESOLVED FIXED
126 Branch
Tracking Status
thunderbird_esr115 --- wontfix

People

(Reporter: ishikawa, Assigned: mkmelin)

References

Details

(Keywords: crash, leave-open)

Attachments

(2 files)

I am checking the behavior of locally patched C-C TB DEBUG build under Debian GNU/Linux.
The local patch is basically for bug 1242030.
I am thinking of submitting to the fabricator the latest patch that has been merged with latest C-C
tree AFTER some extensive tests for I/O error handling (mostly write
handling).
Such tests revealed many problems in C-C TB code about 6-7 years ago
and so I had to insert many error checks in the code.
(As a matter of fact, the crude transactional handling of mail
incoropration end incorporation or abort incoropration did not work at
all until I invesitgated.)

This time around, I am simulating the I/O error by storing the mails
in a remote cifs server, and unplugging the network cable to the
server and seeing the I/O error occur and handled by C-C TB.
E.g., TB may say there was an error, etc. with a
modal error dialog.
Then I have to re-plug the network cable to continue.
[There are additional issues in that TB's writing also take place for
meta database, etc. and TB needs to handle the timeouts from these
network I/O operations, and I won't be able to progress much. So I have to
re-plug the network cable to continue.

Now that said, unbelievably this is only the second try to simulate an
error, and I get a core dump.
Seems I need to do a lot of investigation until I will feel comfotable
to upload the patch for bug 1242030.

But this time it is NOT my patches, but I think it is mork code which
does not deal with I/O error gracefully.

I know it is IMPOSSIBLE to do something perfect in an I/O error
situation, but I would like TB to

    • cope with the error gracefully, and
    • show a proper error message, AND
    • NOT crash.

This one fails the "Do not crash" criteria above.

crash stack:

...
[Parent 320173, Main Thread] WARNING: (debug) We are enabling buffering for m_outFileStream in nsPop3Sink::IncorporateBegin in nsPop3Sink.cpp.: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsPop3Sink.cpp:481
(debug) 494: m_outFileStream->Flush() returned 0x0
{debug} creating seekable stream failed: /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsPop3Sink.cpp:511, rv = 80004002
GetDiskSpaceAvailable returned: 471678976 bytes
(debug) nsMsgBrkMBoxStore::GetMsgInputStream: NS_NewLocalFileInputStream() returned mboxFile.get()=mailbox://ishikawa@127.0.0.1/Inbox
{deubg} LINE=1042, seekable->Seek(PR_SEEK_SET, offset) =  0x0
NS_NewBufferedOutputStream: outputStream (= std::move(aOutputputStream)) =0x563eea842448
(debug) Creating buffered output stream to mboxFile=<<mailbox://ishikawa@127.0.0.1/folder-E>> in nsMsgBrkMBoxStore::GetNewMsgOutputStream in nsMsgBrkMBoxStore.cpp;
Fatal error: glibc detected an invalid stdio handle

Program /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin (pid = 320173) received signal 6.
Stack:
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa0be7da]
#02: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3c510]
#03: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x8a16c]
#04: gsignal[/lib/x86_64-linux-gnu/libc.so.6 +0x3c472]
#05: abort[/lib/x86_64-linux-gnu/libc.so.6 +0x264b2]
#06: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x271ed]
#07: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x7e055]
#08: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x7e885]
#09: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x75e66]
#10: ftell[/lib/x86_64-linux-gnu/libc.so.6 +0x747fd]
#11: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2d581f]
#12: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2d5362]
#13: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2e3583]
#14: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2e3625]
#15: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa3009cc]
#16: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa484d9d]
#17: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2f6756]
#18: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4aa24b]
#19: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4acf83]
#20: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa51f1bc]
#21: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4a38e0]
#22: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4a967f]
#23: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4b1df7]
#24: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x45846f2]
#25: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x504eb20]
#26: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x504ee52]
#27: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x50548cd]
#28: ??? (???:???)
Sleeping for 300 seconds.
Type 'gdb /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin 320173' to attach your debugger to this thread.


gdb dump
---


[New LWP 323941]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f6e6a0bfaa5 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, 
    req=req@entry=0x7ffe47c9cb90, rem=rem@entry=0x7ffe47c9cb90)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
48	../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
Redefine command "prun"? (y or n) [answered Y; input not from terminal]
Redefine command "pu"? (y or n) [answered Y; input not from terminal]
Redefine command "ps"? (y or n) [answered Y; input not from terminal]
Loading JavaScript value pretty-printers; see js/src/gdb/README.
If they cause trouble, type: disable pretty-printer .* SpiderMonkey
SpiderMonkey unwinder is disabled by default, to enable it type:
	enable unwinder .* SpiderMonkey
(gdb) where
#0  0x00007f6e6a0bfaa5 in __GI___clock_nanosleep
    (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffe47c9cb90, rem=rem@entry=0x7ffe47c9cb90) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007f6e6a0c4473 in __GI___nanosleep
    (req=req@entry=0x7ffe47c9cb90, rem=rem@entry=0x7ffe47c9cb90)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007f6e6a0c43aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007f6e644e82e7 in common_crap_handler(int, void const*)
    (signum=6, aFirstFramePC=0x7f6e644be7da <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+186>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:100
#4  0x00007f6e644e830d in ah_crap_handler(int) (signum=<optimized out>)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:108
#5  0x00007f6e644be7da in nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)
    (signo=6, info=0x7ffe47c9cdf0, context=0x7ffe47c9ccc0)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/profile/nsProfileLock.cpp:183
#6  0x00007f6e6a02d510 in <signal handler called> () at /lib/x86_64-linux-gnu/libc.so.6
#7  __pthread_kill_implementation
    (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#8  0x00007f6e6a07b1cf in __pthread_kill_internal (signo=6, threadid=<optimized out>)
    at ./nptl/pthread_kill.c:78
#9  0x00007f6e6a02d472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x00007f6e6a0174b2 in __GI_abort () at ./stdlib/abort.c:79
#11 0x00007f6e6a0181ed in __libc_message (fmt=fmt@entry=0x7f6e6a18a3b6 "%s")
    at ../sysdeps/posix/libc_fatal.c:150
#12 0x00007f6e6a06f055 in __GI___libc_fatal (message=<optimized out>)
    at ../sysdeps/posix/libc_fatal.c:159
#13 0x00007f6e6a06f885 in _IO_vtable_check () at ./libio/vtables.c:67
#14 0x00007f6e6a066e66 in IO_validate_vtable (vtable=0xaa000005ce) at ../libio/libioP.h:945
#15 _IO_seekoff_unlocked (fp=<optimized out>, offset=0, dir=1, mode=0) at ./libio/ioseekoff.c:59
#16 0x00007f6e6a0657fd in __GI__IO_ftell (fp=0x563ee9c27b30) at ./libio/ioftell.c:38
#17 0x00007f6e646d581f in morkStdioFile::Length(morkEnv*) const
    (this=0x563ee7e31110, ev=0x563ee6046d70)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.cpp:454
#18 0x00007f6e646d5362 in non-virtual thunk to morkFile::Eof(nsIMdbEnv*, int*) ()
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.h:197
#19 0x00007f6e646e3583 in morkStore::DoPreferLargeOverCompressCommit(morkEnv*)
    (this=this@entry=0x563ee6d64490, ev=ev@entry=0x563ee6046d70)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkStore.cpp:241
#20 0x00007f6e646e3625 in morkStore::LargeCommit(nsIMdbEnv*, nsIMdbThumb**)
    (this=0x563ee6d64490, mev=<optimized out>, acqThumb=0x7ffe47c9d658)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkStore.cpp:1900
#21 0x00007f6e647009cc in nsMsgDatabase::Commit(int) (this=0x563eeb8013e0, commitType=<optimized out>)
    at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/nsMsgDatabase.h:127
#22 0x00007f6e64884d9d in nsMsgBrkMBoxStore::SetSummaryFileValid(nsIMsgFolder*, nsIMsgDatabase*, bool)
    (this=0x563ee7077ca0, aFolder=0x563ee7c328f0, aDB=0x563eeb8013e0, aValid=true)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsMsgBrkMBoxStore.cpp:462
#23 0x00007f6e646f6756 in nsMailDatabase::SetSummaryValid(bool) (this=0x563eeb8013e0, aValid=true)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMailDatabase.cpp:131
#24 0x00007f6e648aa24b in nsMsgMailboxParser::UpdateDBFolderInfo(nsIMsgDatabase*)
    (mailDB=<optimized out>, this=0x563eeaa29fc0)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:330
#25 nsParseNewMailState::MoveIncorporatedMessage(nsIMsgDBHdr*, nsIMsgDatabase*, nsIMsgFolder*, nsIMsgFilter*, nsIMsgWindow*)
    (this=0x563eeaa29fc0, mailHdr=0x563eeb17a3a0, sourceDB=0x563ee9c35330, destIFolder=0x563ee7c328f0, filter=<optimized out>, msgWindow=<optimized out>)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:2410
#26 0x00007f6e648acf83 in nsParseNewMailState::ApplyFilterHit(nsIMsgFilter*, nsIMsgWindow*, bool*)
    (this=0x563eeaa29fc0, filter=0x563ee9105de0, msgWindow=0x563ee844c7a0, applyMore=0x7ffe47c9dc1f)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:1868
#27 0x00007f6e6491f1bc in nsMsgFilterList::ApplyFiltersToHdr(int, nsIMsgDBHdr*, nsIMsgFolder*, nsIMsgDatabase*, nsTSubstring<char> const&, nsIMsgFilterHitNotify*, nsIMsgWindow*)
    (this=this@entry=0x563ee9d80630, filterType=filterType@entry=1, msgHdr=0x563eeb17a3a0, folder=0x563ee859db50, db=0x563ee9c35330, headers="X-Account-Key: account1", listener=0x563eeaa2a278, msgWindow=0x563ee844c7a0) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/search/src/nsMsgFilterList.cpp:349
#28 0x00007f6e648a38e0 in nsParseNewMailState::ApplyFilters(bool*, nsIMsgWindow*)
    (this=this@entry=0x563eeaa29fc0, pMoved=pMoved@entry=0x7ffe47c9de32, msgWindow=msgWindow@entry=0x563ee844c7a0) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:1730
#29 0x00007f6e648a967f in nsParseNewMailState::PublishMsgHeader(nsIMsgWindow*)
    (this=0x563eeaa29fc0, msgWindow=0x563ee844c7a0)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:1672
#30 0x00007f6e648b1df7 in nsPop3Sink::IncorporateComplete(nsIMsgWindow*, int)
    (this=0x563eea9cc620, aMsgWindow=0x563ee844c7a0, aSize=0)
    at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/RefPtr.h:341
#31 0x00007f6e5e9846f2 in NS_InvokeByIndex ()
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/reflect/xptcall/md/unix/xptcinvoke_asm_x86_64_unix.S:101
#32 0x00007f6e5f44eb20 in CallMethodHelper::Invoke() (this=0x7ffe47c9e1d8)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/xpconnect/src/XPCWrappedNative.cpp:1621
#33 CallMethodHelper::Call() (this=this@entry=0x7ffe47c9e1d8)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/xpconnect/src/XPCWrappedNative.cpp:1174
#34 0x00007f6e5f44ee52 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)
    (ccx=..., mode=mode@entry=XPCWrappedNative::CALL_METHOD)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/xpconnect/src/XPCWrappedNative.cpp:1120
#35 0x00007f6e5f4548cd in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*)
    (cx=<optimized out>, argc=<optimized out>, vp=0x7ffe47c9e4e0)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:966
#36 0x0000161a3a8754a6 in  ()
#37 0x00007ffe47c9e538 in  ()
#38 0x00007ffe47c9e4b8 in  ()
#39 0x0000000000000000 in  ()
(gdb) 

/NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.cpp:454
is

mork_pos morkStdioFile::Length(morkEnv* ev) const {
  mork_pos outPos = 0;

  if (this->IsOpenAndActiveFile()) {
    FILE* file = (FILE*)mStdioFile_File;
    if (file) {
      long start = MORK_FILETELL(file); <--- line 454
      if (start >= 0) {
        long fore = MORK_FILESEEK(file, 0, SEEK_END);
        if (fore >= 0) {
          long eof = MORK_FILETELL(file);
          if (eof >= 0) {
            long back = MORK_FILESEEK(file, start, SEEK_SET);
            if (back >= 0)
              outPos = eof;
            else
              this->new_stdio_file_fault(ev);
          } else
            this->new_stdio_file_fault(ev);
        } else
          this->new_stdio_file_fault(ev);
      } else
        this->new_stdio_file_fault(ev);
    } else if (mFile_Thief)
      mFile_Thief->Eof(ev->AsMdbEnv(), &outPos);
    else
      this->NewMissingIoError(ev);
  } else
    this->NewFileDownError(ev);

  return outPos;
}

So *file contains invalid fileno?

(gdb) up
#1  0x00007f6e6a0c4473 in __GI___nanosleep (req=req@entry=0x7ffe47c9cb90, 
    rem=rem@entry=0x7ffe47c9cb90) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
25	../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) up
#2  0x00007f6e6a0c43aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
55	../sysdeps/posix/sleep.c: No such file or directory.
(gdb) up
#3  0x00007f6e644e82e7 in common_crap_handler (signum=6, 
    aFirstFramePC=0x7f6e644be7da <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+186>)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:100
100	  sleep(_gdb_sleep_duration);
(gdb) up
#4  0x00007f6e644e830d in ah_crap_handler (signum=<optimized out>)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:108
108	  common_crap_handler(signum, CallerPC());
(gdb) up
#5  0x00007f6e644be7da in nsProfileLock::FatalSignalHandler (signo=6, info=0x7ffe47c9cdf0, 
    context=0x7ffe47c9ccc0)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/profile/nsProfileLock.cpp:183
183	      oldact->sa_handler(signo);
(gdb) up
#6  <signal handler called>
(gdb) up
#7  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, 
    no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) up
#8  0x00007f6e6a07b1cf in __pthread_kill_internal (signo=6, threadid=<optimized out>)
    at ./nptl/pthread_kill.c:78
78	in ./nptl/pthread_kill.c
(gdb) up
#9  0x00007f6e6a02d472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
26	../sysdeps/posix/raise.c: No such file or directory.
(gdb) up
#10 0x00007f6e6a0174b2 in __GI_abort () at ./stdlib/abort.c:79
79	./stdlib/abort.c: No such file or directory.
(gdb) up
#11 0x00007f6e6a0181ed in __libc_message (fmt=fmt@entry=0x7f6e6a18a3b6 "%s")
    at ../sysdeps/posix/libc_fatal.c:150
150	../sysdeps/posix/libc_fatal.c: No such file or directory.
(gdb) up
#12 0x00007f6e6a06f055 in __GI___libc_fatal (message=<optimized out>)
    at ../sysdeps/posix/libc_fatal.c:159
159	in ../sysdeps/posix/libc_fatal.c
(gdb) up
#13 0x00007f6e6a06f885 in _IO_vtable_check () at ./libio/vtables.c:67
67	./libio/vtables.c: No such file or directory.
(gdb) up
#14 0x00007f6e6a066e66 in IO_validate_vtable (vtable=0xaa000005ce) at ../libio/libioP.h:945
945	../libio/libioP.h: No such file or directory.
(gdb) up
#15 _IO_seekoff_unlocked (fp=<optimized out>, offset=0, dir=1, mode=0) at ./libio/ioseekoff.c:59
59	./libio/ioseekoff.c: No such file or directory.
(gdb) up
#16 0x00007f6e6a0657fd in __GI__IO_ftell (fp=0x563ee9c27b30) at ./libio/ioftell.c:38
38	./libio/ioftell.c: No such file or directory.
(gdb) up
#17 0x00007f6e646d581f in morkStdioFile::Length (this=0x563ee7e31110, ev=0x563ee6046d70)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.cpp:454
454	      long start = MORK_FILETELL(file);
(gdb) print file
$1 = (FILE *) 0x563ee9c27b30
(gdb) print *file
$2 = {_flags = 878570760, _IO_read_ptr = 0x0, _IO_read_end = 0x0, 
  _IO_read_base = 0x4000003d9 <error: Cannot access memory at address 0x4000003d9>, 
  _IO_write_base = 0x4000000971 <error: Cannot access memory at address 0x4000000971>, 
  _IO_write_ptr = 0x1d800000040 <error: Cannot access memory at address 0x1d800000040>, 
  _IO_write_end = 0x1d8000001d8 <error: Cannot access memory at address 0x1d8000001d8>, 
  _IO_buf_base = 0x0, 
  _IO_buf_end = 0x30000000000001fc <error: Cannot access memory at address 0x30000000000001fc>, 
  _IO_save_base = 0x2000000000000388 <error: Cannot access memory at address 0x2000000000000388>, 
  _IO_backup_base = 0x20000000000003d9 <error: Cannot access memory at address 0x20000000000003d9>, 
  _IO_save_end = 0x6000003f2 <error: Cannot access memory at address 0x6000003f2>, 
  _markers = 0x100000040a, _chain = 0x150000041b, _fileno = 1069, _flags2 = 23, 
  _old_offset = 150323856454, _cur_column = 1111, _vtable_offset = 0 '\000', _shortbuf = "", 
  _lock = 0x3400000472, _offset = 249108104337, _codecvt = 0x1000004a000004ba, 
  _wide_data = 0x5c000004ee, _freeres_list = 0x6700000512, _freeres_buf = 0x7a0000055d, 
  __pad5 = 545460847978, _mode = 1399, 
  _unused2 = "\204\000\000\000\242\005\000\000\231\000\000\000\263\005\000\000\237\000\000"}
(gdb) 

fileno is 1069, and
some memory buffer areas are invalid addresses.

The maximum file descriptors open in one process under my linux is:
$ ulimit -n
1024
$ ulimit -Sn
1024

It seems fileno, 1069, is a bit too large IMHO.

I am afraid need to insert more printf dumps, etc.
to figure out what is going on.

(gdb) list
449	  mork_pos outPos = 0;
450	
451	  if (this->IsOpenAndActiveFile()) {
452	    FILE* file = (FILE*)mStdioFile_File;
453	    if (file) {
454	      long start = MORK_FILETELL(file);
455	      if (start >= 0) {
456	        long fore = MORK_FILESEEK(file, 0, SEEK_END);
457	        if (fore >= 0) {
458	          long eof = MORK_FILETELL(file);
(gdb) up
#18 0x00007f6e646d5362 in non-virtual thunk to morkFile::Eof(nsIMdbEnv*, int*) ()
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.h:197
197	  NS_IMETHOD Eof(nsIMdbEnv* ev, mdb_pos* outPos) override;
(gdb) list
192	 public:
193	  virtual mork_pos Length(morkEnv* ev) const = 0;  // eof
194	  // nsIMdbFile methods
195	  NS_IMETHOD Tell(nsIMdbEnv* ev, mdb_pos* outPos) const override = 0;
196	  NS_IMETHOD Seek(nsIMdbEnv* ev, mdb_pos inPos, mdb_pos* outPos) override = 0;
197	  NS_IMETHOD Eof(nsIMdbEnv* ev, mdb_pos* outPos) override;
198	  // } ----- end pos methods -----
199	
200	  // { ----- begin read methods -----
201	  NS_IMETHOD Read(nsIMdbEnv* ev, void* outBuf, mdb_size inSize,
(gdb) print ev
No symbol "ev" in current context.
(gdb) print *outPos
No symbol "outPos" in current context.
(gdb) print ev
No symbol "ev" in current context.
(gdb) up
#19 0x00007f6e646e3583 in morkStore::DoPreferLargeOverCompressCommit (this=this@entry=0x563ee6d64490, 
    ev=ev@entry=0x563ee6046d70)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkStore.cpp:241
241	    file->Eof(ev->AsMdbEnv(), &fileEof);
(gdb) list
236	// true when mStore_CanWriteIncremental && store has file large enough
237	{
238	  nsIMdbFile* file = mStore_File;
239	  if (file && mStore_CanWriteIncremental) {
240	    mdb_pos fileEof = 0;
241	    file->Eof(ev->AsMdbEnv(), &fileEof);
242	    if (ev->Good() && fileEof > 128) return morkBool_kTrue;
243	  }
244	  return morkBool_kFalse;
245	}
(gdb) print file
$3 = <optimized out>
(gdb) print *file
value has been optimized out
(gdb) print fileEof
$4 = 0
(gdb) print ev
$5 = (morkEnv *) 0x563ee6046d70
(gdb) print *ev
$6 = (morkEnv) {<morkObject> = {<morkBead> = {<morkNode> = {
        _vptr.morkNode = 0x7f6e67142cb0 <vtable for morkEnv+16>, mNode_Heap = 0x563eea8dd3e0, 
        mNode_Base = 20068, mNode_Derived = 17782, mNode_Access = 111 'o', mNode_Usage = 104 'h', 
        mNode_Mutable = 85 'U', mNode_Load = 34 '"', mNode_Uses = 1, mNode_Refs = 1}, 
      mBead_Color = 0}, <nsIMdbObject> = {<nsISupports> = {
        _vptr.nsISupports = 0x7f6e67142dc8 <vtable for morkEnv+296>}, <No data fields>}, 
    mObject_Handle = 0x0, mMorkEnv = 0x0, mRefCnt = {static isThreadSafe = false, mValue = 1}, 
    _mOwningThread = 320173}, <nsIMdbEnv> = {<nsISupports> = {
      _vptr.nsISupports = 0x7f6e67142e40 <vtable for morkEnv+416>}, <No data fields>}, 
  mEnv_Factory = 0x563ee8596ab0, mEnv_Heap = 0x563eea8dd3e0, mEnv_SelfAsMdbEnv = 0x563ee6046db8, 
  mEnv_ErrorHook = 0x0, mEnv_HandlePool = 0x563ee969d0b0, mEnv_ErrorCount = 0, mEnv_WarningCount = 0, 
  mEnv_ErrorCode = nsresult::NS_OK, mEnv_DoTrace = 0 '\000', mEnv_AutoClear = 85 'U', 
  mEnv_ShouldAbort = 0 '\000', mEnv_BeVerbose = 0 '\000', mEnv_OwnsHeap = 1 '\001'}
(gdb) call ev->AsbMdbEnv()
Couldn't find method morkEnv::AsbMdbEnv
(gdb) call ev->AsMdbEnv()
Cannot evaluate function -- may be inlined
(gdb) print *file
value has been optimized out
(gdb) 

I think I need to modify the GCC option so that variables would remain
in the code. Also I need to insert more printf statement to trace the
entering of functions, etc.

It seems that the crash left some files in a locked state and the next startup of thunderbird is not proceeding...
I think I need to clean up the storage of mails and start from scratch...

Surprise, I even restarted X window and got the following dialog. I obviously could not kill thunderbird cleanly...

Thunderbird is already running, but is not responding. To use Thunderbird, you must first close the existing Thunderbird process, restart your device, or use a different profile.

It has been quite a while since I last met a zombie process that could not be culled by kill command.
I am restarting linux itself.
I will post updates on this and other mishaps of I/O error handling by C-C TB.

I think TB got stuck in CIFS subsystem. Hmm... Uncovering an issue of CIFS client code.
systemd-shutdown tries killing the hung process and I saw the current stack dump, but it is still waiting for the process to disappear.
Finally reboot...
This is not going to be a smooth sailing unlike several years ago.

Keywords: crash
Version: unspecified → Thunderbird 126

I filed a bug in samba bugzilla, too.

https://bugzilla.samba.org/show_bug.cgi?id=15614

I hope this usage scenario, storing mail messages on remote CIFS server and experiencing network outage temporarily won't crash thunderbird easily after a suitable fix is discovered.

I find that thunderbird crashes often during my crude attempt to simulate I/O error by unplugging/plugging network cable to the remote CIFS server, or shutting down the network interface to the server.
And the crash position is the ftell mentioned in comment 1.

      long start = MORK_FILETELL(file); <--- line 454

So now my take on the situation is that the raw message buffered write handles the I/O error more or less. The C-C TB code have absorbed some portion of my patches over the years, and many other improvements have been made.
But the mork code that handles meta-data for messages does not seem to cope with I/O error very well.
Tough.

I don't think I saw so many crashes about several years ago when I tested the C-C TB code for I/O error handling.
But it could be that I was more concerned with message data handling, and may not have been able to proceed to the point that meta data handling became an issue.
But I do remember that I tested the mail delivery abort operation and TB continued to work.
So something must have changed to make the above MORK_FILETELL(file) more vulnerable.

And one more data point.
During the heavy testing cycle of several years ago, I experienced a very strange ftell crash.
The low-level routine such as fseek, fell, fwrite, etc. are wrapped and
error conditions are propagated through the wrappers.
But there was single failure of ftell which caused thunderbird crash.
I was very surprised at the crash because it SHOULD NOT HAPPEN.
That is somehow the wrapper failed to contain the error condition (error throw by low-level code)
and that signal caused C-C TB to crash.
I am afraid that there may be a deficiency of the wrapper used by C-C TB for ftell.

Wait, mork uses ftell without a wrapper.

https://searchfox.org/comm-central/source/mailnews/db/mork/morkConfig.h#51

#   define MORK_FILETELL(file) ftell(file)

C-C TB's messaage code uses |stream->Tell()|, and
I recall the following Tell() caused a strange segfault back when I was testing the I/O error handling extensively.
Problem was that the Tell() wrapper ought to return an error code when the error happened, but
somehow it passed signal up the call chain and TB crashed.
https://searchfox.org/comm-central/source/mailnews/local/src/nsMsgBrkMBoxStore.cpp#838

So I looked at mork code re Write() and am quite confused.
Say, https://searchfox.org/comm-central/source/mailnews/db/mork/morkFile.cpp#217

mork_size morkFile::WriteNewlines(morkEnv* ev, mork_count inNewlines)
// WriteNewlines() returns the number of bytes written.
{
  mork_size outSize = 0;
  while (inNewlines && ev->Good())  // more newlines to write?
  {
    mork_u4 quantum = inNewlines;
    if (quantum > morkFile_kNewlinesCount) quantum = morkFile_kNewlinesCount;

    mork_size quantumSize = quantum * mork_kNewlineSize;
    mdb_size bytesWritten;
    this->Write(ev->AsMdbEnv(), morkFile_kNewlines, quantumSize, &bytesWritten);  <--- 214. No error check?
    outSize += quantumSize;
    inNewlines -= quantum;
  }
  return outSize;
}

You can see that there is no check for the error return from Write().
This can happen if an underlying file system is broken, file system fills up, network experiences a transient error and the data is written to a remote file server.

Does this line of mork intentionally ignore the error from |Write()|?
I am afraid that mork does not handle I/O errors gracefully...

Wayne, who is the mork expert to ask?

Flags: needinfo?(vseerror)
Flags: needinfo?(vseerror)

(In reply to ISHIKAWA, Chiaki from comment #7)

Wayne, who is the mork expert to ask?

I'm afraid there isn't really anyone who properly understands mork at the moment...
In the past few years, there's always been the plan to move to another database (Bug 453975), so nobody has really invested the time to get to know the mork code well.
On the plus side, the database migration work is getting started now, so it looks like it will finally happen! Work on Bug 1572000 is getting underway.

(In reply to Ben Campbell from comment #8)

(In reply to ISHIKAWA, Chiaki from comment #7)

Wayne, who is the mork expert to ask?

I'm afraid there isn't really anyone who properly understands mork at the moment...
In the past few years, there's always been the plan to move to another database (Bug 453975), so nobody has really invested the time to get to know the mork code well.
On the plus side, the database migration work is getting started now, so it looks like it will finally happen! Work on Bug 1572000 is getting underway.

I am tied with paper work at this moment for the next week or so.
But based on my observation, the crash at an I/O error is for real. (Strange, C-C TB did not crash like this in my test cycle a few years ago.)
Obviously, mork code faces a write error and it did not recover well. (I have seen C-C TB continues downloading messages after the unplugged network cable is plugged again, but after half a dozen or so messages, it suddenly crashes at the same ftell code line. This is not what I remember.)
Something has changed. Most likely the call patterns.

Anyway, in this state of things, my I/O error simulation won't cover C-C TB's resilience much.
It simply crashes now unlike several years ago when C-C TB would work after the cable was plugged again.

Do you think I should investigate the crash issue and that effor would not be in vain?
I mean the transition to other DB Bug 1572000 is finished within this year ?

I have a hunch that the low-level I/O error code is not propagated within mork very well and
whatever to make this-isOpenAndActiveFile() false is not properly set after an I/O error.

450	
451	  if (this->IsOpenAndActiveFile()) {  <-- if this is  false the problematic ftell is not called.
452	    FILE* file = (FILE*)mStdioFile_File;  <--- Or this mStdioFile_File was not set to nullptr as it should have
453	    if (file) {                                                    when an I/O error occurred and mork gave up.
454	      long start = MORK_FILETELL(file);
455	      if (start >= 0) {

But leading to the root cause of the behavior alone might take a week of time or so...

FYI we've pretty much exhausted the developers who might be able to comment on mork.

(In reply to Wayne Mery (:wsmwk) from comment #10)

FYI we've pretty much exhausted the developers who might be able to comment on mork.

Yeah, I have a feeling that this is a case.
I will place a few debug print statement to see if I can find the easy fix for the crash problem.
But otherwise, C-C TB is in a poorer shape than it was several years ago as far as the robustness in the face of I/O errors.
Sad, but true.

Cool.
I think I have found a reference after free issue.

Before that I found a mundane fix for a irritating problem.
I found a fix for C-C TB's reporting unexpected character NUL '\0' character inside mork (prresumably left when the I/O error simulated by network error to CIFS sever left a file with a block of '\0' in it and mork) reader/parser did not like it.
C-C TB reported it until it reads all the NUL charcters which could mean a long series sometimes. (maybe 512 times?)

With that patched, I could test a few more test runs,
1 download a series of e-mal messages.
2 unplug network cable.
3 See C-C TB shows an error dialog.
4 plug network cable.
Repeat this process.
After the initial error, C-C TB used to stumble on this long series of '\0' of unexpected char warnings, but
now it has been removed.
I could repeat the above process a few times, when I saw the crash and the following dump from gdb.

 up 17
#17 0x00007fd357500def in morkStdioFile::Length (this=0x559500d76960, ev=0x559500d74920)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.cpp:454
454	      long start = MORK_FILETELL(file);
(gdb) print *file
$1 = {_flags = 25028816, _IO_read_ptr = 0x7fd35d1f2380 <main_arena+1792> "p#\037]\323\177", 
  _IO_read_end = 0x5595017de8d0 "ZZZZZZZZ\361 ", _IO_read_base = 0x5595017de8d0 "ZZZZZZZZ\361 ", 
  _IO_write_base = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, 
  _IO_write_ptr = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, 
  _IO_write_end = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, 
  _IO_buf_base = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, 
  _IO_buf_end = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, 
  _IO_save_base = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, 
  _IO_backup_base = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, 
  _IO_save_end = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, 
  _markers = 0x5a5a5a5a5a5a5a5a, _chain = 0x5a5a5a5a5a5a5a5a, _fileno = 1515870810, _flags2 = 1515870810, 
  _old_offset = 6510615555426900570, _cur_column = 23130, _vtable_offset = 90 'Z', _shortbuf = "Z", 
  _lock = 0x5a5a5a5a5a5a5a5a, _offset = 6510615555426900570, _codecvt = 0x5a5a5a5a5a5a5a5a, 
  _wide_data = 0x5a5a5a5a5a5a5a5a, _freeres_list = 0x5a5a5a5a5a5a5a5a, _freeres_buf = 0x5a5a5a5a5a5a5a5a, 
  __pad5 = 6510615555426900570, _mode = 1515870810, _unused2 = 'Z' <repeats 20 times>}
(gdb) 

The 0x5a patterns you see above is that it is uninitialized (either freed or malloced but not initialized yet).
ASCII 'Z' is 0x5a, too.
Everything in the file structure is filled with 0x5a.
For example, _fileno 1515870810 and other values are concatenation of 0x5a octet.

ishikawa@ip030:/NREF-COMM-CENTRAL/mozilla$ bc
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
obase=16
1515870810
5A5A5A5A
6510615555426900570
5A5A5A5A5A5A5A5A
23130
5A5A

I am using gnu malloc iwth debug feature.
I call mallopt() with 0x5a value in my local source.
See below.

I set the environment variable thusly:

MOZILLA_MALLOPT=0X5A
MALLOC_PERTURB_=0x5A

So C-C TB printed out the following.

Calling mallopt for malloc/free debug. with 0x5a value.

My local change.

# HG changeset patch
# User ISHIKAWA, Chiaki <ishikawa@yk.rim.or.jp>
# Parent  15b2e82875d420d82ed576137b91c9915fb6594f
call malloc_opt()

diff --git a/mail/app/nsMailApp.cpp b/mail/app/nsMailApp.cpp
--- a/mail/app/nsMailApp.cpp
+++ b/mail/app/nsMailApp.cpp
@@ -27,16 +27,21 @@
 #elif defined(XP_UNIX)
 #  include <sys/resource.h>
 #  include <unistd.h>
 #endif
 
 #include <stdio.h>
 #include <stdarg.h>
 #include <time.h>
+#if __linux__
+// for mallopt under linux if possible
+#include <malloc.h>
+#endif
+
 
 #include "nsCOMPtr.h"
 #include "nsIFile.h"
 
 #ifdef XP_WIN
 #  include "mozilla/mscom/ProcessRuntime.h"
 #  include "mozilla/WindowsDllBlocklist.h"
 #  include "mozilla/WindowsDpiInitialization.h"
@@ -157,16 +162,38 @@ static bool IsArg(const char* arg, const
 #endif
 
   return false;
 }
 
 Bootstrap::UniquePtr gBootstrap;
 
 static int do_main(int argc, char* argv[], char* envp[]) {
+
+  // I think the following needs to go into |XRE_main()|, too,
+  // if that spawns another proces.
+
+#if defined(XP_UNIX)
+#if __linux__
+  /* actually depends on gnu libc version. But linux most likely uses
+     a version of GNU libc, and later versions supports mallopt() */
+  {
+    char *cp = getenv("MOZILLA_MALLOPT");
+    if (cp) {
+      int val = 0xC3;
+      if(*cp)
+        sscanf(cp, "%x", &val);
+      Output("Calling mallopt for malloc/free debug. with 0x%02x value.\n", val & 0xFF);
+      mallopt(M_PERTURB, val & 0xFF);
+    }
+  }
+#endif
+#endif
+
+
   // Allow thunderbird.exe to launch XULRunner apps via -app <application.ini>
   // Note that -app must be the *first* argument.
   const char* appDataFile = getenv("XUL_APP_FILE");
   if ((!appDataFile || !*appDataFile) && (argc > 1 && IsArg(argv[1], "app"))) {
     if (argc == 2) {
       Output("Incorrect number of arguments passed to -app");
       return 255;
     }

Given that *file would be initialized to a sane value on allocation, I think there is
a use after free problem in the code.

Anyone has an idea on this?

One problem I have now is that I thought ASAN build would be handy. (valgrind is too slow often if I can run it.)
But I could not build ASAN version of C-C TB. I clearly recall I could build it a couple of years ago, something has changed in the build configuration that makes it hard to build an ASAN version locally. (I used clang back then, but even clang version could not be built due to strange library mismatch or whatever. So I tried GCC after careful anaysis of my local bash script and mozconfig for ASAN, but to no avail too.)

Unfortunately even in the days of Mozilla Messaging, I'm not sure we really had anyone who understood Mork deeply. I would suspect things like this have just been lying in the code for ages - checking the change logs should help there.

Unless there's active crash/bug reports of users hitting these, it is possibly not worth fixing given the current work to move away from Mork.

(In reply to Mark Banner (:standard8) from comment #13)

Unfortunately even in the days of Mozilla Messaging, I'm not sure we really had anyone who understood Mork deeply. I would suspect things like this have just been lying in the code for ages - checking the change logs should help there.

Unless there's active crash/bug reports of users hitting these, it is possibly not worth fixing given the current work to move away from Mork.

Thank you for the overview of the state of the mork code.

I will spend a few more days to test the behavior of C-C TB code (hoefully if I can make ASAN build locally with my patch, that would be great) in the face of I/O errors.

Problem is that C-C TB segfaults when I simulate I/O error. So this is a regression of a sort from what I observed during similar tests several years ago. C-C TB would continue to work once the I/O error introduced by disconnecting network cable was removed (by plugging the cable again).

Anyway, I will report what I find in the next couple of days.

I found double free caused by calling an fclose to already closed stream using my locally built ASAN C-C TB with my patch.

This is inside MORK code.
I think I found a cause of this bug and testing the patch locally.
Obviously, mork code was not tested extensively against I/O error.

The local patch looks to be working in the sense that C-C TB does not crash anymore after a simulated I/O error by disabling the network interface to a remote NFS server on which the downloaded messages are written.
I gave up on CIFS for the testing purposes because C-C TB often gets stuck in the kernel cifs module.
(But before the local patch, C-C TB was calling the local I/O stdio code with bogus fileno such as -1 or for a file number which had already been closed previously.)

Whether we are handling messages during the simulated I/O error, i.e., not losing a message or two is another matter.
It will be a couple of days before I can say for sure.
Before the crash patch was applied, I saw many messages lost (probably due to them marked as deleted by a bug in mork code bug or higherlevel pop3client JS code. I will investigate this now that C-C TB does not crash after I/O error simulation.)

I think I now spot a serious error in transactional processing of POP3 message download.

Thanks to my patch to mork, C-C TB no longer crashes in the face of an I/O error, but
I notice a place we need to improve. I wonder how the old C++/C pop3client handled this.
This is because I think C-C TB behaved sanely after simulated errors several years ago.

My observation is based on error handling sequence.
At the end, I show the debug dump.
When a message download begins (= Incorporate message begin),
my simulated I/O error happened.
Eventually, the error was noticed after the network timeout,
mail message delivery aborted. (= Abort mail message delivery.)
So this last message was not incorporated into the mail folder.
So far so good.

HOWEVER, note the lines marked with "*" in the debug dump.

 * CloseStdio:file FILECLOSE failed
 * morkStdio::new_stdio_file_fault: file->_fileno=-1, this=0x5652933e97b0, ev=0x56528ff717a0,
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf

Mork failed to write the .msf file for a folder (folder-E) due to an I/O error.

BUT this mork error HAPPENED AFTER the "Abort mail message delivery line"
AND after the JavaScript pop3client printed the following error.

 * JavaScript error: , line 0: UnknownError: Could not write to `/home/ishikawa/REMOTE-NFS/Maildir/popstate.dat': failed to open file for writing (NS_ERROR_FAILURE)

BAD!

I think toplevel pop3 client loop should do
something like the following.

Start a sequence of mail delivery

  if no more message then goto success

  before starting the next message download
  save mork file (.msf) if it is not done by mork itself (not sure if it is done by mork automagically?)

!  setup so that mork UPDATES A COPY the mork file (.msf) during download instead of the real .msf file
! that should persist before and after the download.  This may fail.  If so,  goto abort.
  start download
    if failure of download or filtering, etc. goto  abort.
!  if download and filtering, etc. succeeded,
!  then replace old .msf file with the updated mork file.
!  if replacement fails goto abort

  if OK, then repeat for the next message download.
  
abort: abort mail delivery altogether
       exit
success: download of all the messages succeeded.

I think the current behavior shows that the lines marked with"!" probably does not happen
because the error of mork .msf updates are printed AFTER JS pop3client reported a write error???
Top-level POP3client should monitor the mork db update UNTIL it reports success after the
message is incorporated into a folder after a move by filter rule, etc.

Someone needs to check the JS pop3client code.
It does not seem to be tested well for error handling.
Again I wonder what the old C++/C pop3client did.

Debug dump:

= Incorporate message begin:
  uidl string: 0001cde3533eeb2c
  {debug} mboxFile existed and so did not create.
  {streamdebug} value: mboxFile = 0x56528ffbf7a0 'mboxFile being passed to NS_NewLocalFileOutput(): I suspect this is nil under certain circumstances.' in InternalGetNewMsgOutputStream	at line 867 of /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsMsgBrkMBoxStore.cpp
  (debug) Creating buffered output stream to mboxFile=<<mailbox://ishikawa@localhost/Inbox>> in nsMsgBrkMBoxStore::GetNewMsgOutputStream in nsMsgBrkMBoxStore.cpp;
  (debug) MsgNewBufferedFileOutputStream(mboxFile, aResult, PR_WRONLY, 00600););returns error rv=0x80004005
  [Parent 513704, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005 (NS_ERROR_FAILURE): file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsMsgBrkMBoxStore.cpp:918
  [Parent 513704, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005 (NS_ERROR_FAILURE): file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsMsgBrkMBoxStore.cpp:781
  (debug) IncorporateBegin: m_msgStore->GetNewMsgOutputStream(....) failed. rv = 0x80004005
  [Parent 513704, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005 (NS_ERROR_FAILURE): file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsPop3Sink.cpp:463
  [Parent 513704, Main Thread] WARNING: NS_ENSURE_TRUE(uri) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/caps/BasePrincipal.cpp:1494
  [Parent 513704, Main Thread] WARNING: NS_ENSURE_TRUE(root) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/layout/base/nsDocumentViewer.cpp:2585
  [WARN  webrender::renderer::init] asking to enable_gpu_markers but no supporting extension was found
  {streamdebug} closing: m_outFileStream (= (nil) )->Close() 'before possible Close()' in AbortMailDelivery  at line 379 of /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsPop3Sink.cpp
= Abort mail message delivery.
  Crash Annotation GraphicsCriticalError: |[0][GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt (t=270.198) [GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt
  [Parent 513704, Compositor] WARNING: Possibly dropping task posted to updater thread: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/gfx/layers/apz/src/APZUpdater.cpp:395
* JavaScript error: , line 0: UnknownError: Could not write to `/home/ishikawa/REMOTE-NFS/Maildir/popstate.dat': failed to open file for writing (NS_ERROR_FAILURE)
  morkStdioFile::Length:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
  morkStdioFile::Length:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
  morkStdioFile::Length:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
  morkStdioFile::Length:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
  ~nsMsgDatabase(): closing database    /home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
  calling RemoveFromCache(this)
  CloseStdio:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0,
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
* CloseStdio:file FILECLOSE failed
* morkStdio::new_stdio_file_fault: file->_fileno=-1, this=0x5652933e97b0, ev=0x56528ff717a0,
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
  [Parent 513704, Main Thread] ###!!! ASSERTION: morkBool_kFalse: 'Error', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkConfig.cpp:20
  morkStdioFile::Length:file->_fileno=39, this=0x565292747850, ev=0x5652904eeac0
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/Inbox.msf
  morkStdioFile::Length:file->_fileno=39, this=0x565292747850, ev=0x5652904eeac0
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/Inbox.msf
  morkStdioFile::Length:file->_fileno=39, this=0x565292747850, ev=0x5652904eeac0
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/Inbox.msf
  morkStdioFile::Length:file->_fileno=39, this=0x565292747850, ev=0x5652904eeac0
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/Inbox.msf
  morkStdioFile::Length:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
  morkStdioFile::Length:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
  morkStdioFile::Length:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
  morkStdioFile::Length:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
  ~nsMsgDatabase(): closing database    /home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
  calling RemoveFromCache(this)
  CloseStdio:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30,
  this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf <-- updating .msf file in place!
  CloseStdio:file FILECLOSE failed
* morkStdio::new_stdio_file_fault: file->_fileno=-1, this=0x5652934190e0, ev=0x56528e9b9d30,
* this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
  [Parent 513704, Main Thread] ###!!! ASSERTION: morkBool_kFalse: 'Error', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkConfig.cpp:20
  morkStdioFile::Length:file->_fileno=72, this=0x565290e56b60, ev=0x565290e56940
  this->mFile_Name=/home/ishikawa/.thunderbird/acggbycn.Default User/Mail/Local Folders/Unsent Messages.msf
  morkStdioFile::Length:file->_fileno=72, this=0x565290e56b60, ev=0x565290e56940
  this->mFile_Name=/home/ishikawa/.thunderbird/acggbycn.Default User/Mail/Local Folders/Unsent Messages.msf
  ~nsMsgDatabase(): closing database    /home/ishikawa/.thunderbird/acggbycn.Default User/Mail/Local Folders/Unsent Messages.msf
  calling RemoveFromCache(this)
  CloseStdio:file->_fileno=72, this=0x565290e56b60, ev=0x565290e56940,

I will file a bugzilla if with my patch to avoid crash, C-C TB still loses e-mails after the simulated I/O errors.
Before my patch, it did.

C-C TB does lose messages with the simple simulated I/O error. It did not happen several years ago.
it seems to pretend that the downloading after the I/O error handling succeeds. But it does not actually.
I do suspect the deficiency of pop3client code now done in JS.
This is all for today.

I added some IO error handling for writing to popstate.dat.
How the other reported errors hang together or not, I can't say. The client doesn't directly
interact with the the message db (mork) - but that is handled by the sink.

Assignee: nobody → mkmelin+mozilla
Status: NEW → ASSIGNED

(In reply to Magnus Melin [:mkmelin] from comment #18)

Created attachment 9395497 [details]
Bug 1887532 - Handle gracefully failed popstate.dat writes. r=gds,Chiaki

I added some IO error handling for writing to popstate.dat.
How the other reported errors hang together or not, I can't say. The client doesn't directly
interact with the the message db (mork) - but that is handled by the sink.

Thank you Magnus. This is great.
Yeah, I was thinking that popstate.dat might be the next suspicion, so this certainly helps.
I am going to run a few tryserver runs and will post a WIP patch if they build on tryserver, esp. on Windows and MacOSX.

At least, the incorrect double free is now gone and mork does not crash.
Whether they behave sanely is another matter.

Oops, I forgot to apply this patch locally before the tryserver jobs were submitted, but the avoid double-free code will be checked by the running treeserver runs.
After shufflling the patches, I think there was a syntax error somehow.
So I am going to resubmit the jobs after applying this patch locally.

(In reply to Magnus Melin [:mkmelin] from comment #18)

Created attachment 9395497 [details]
Bug 1887532 - Handle gracefully failed popstate.dat writes. r=gds,Chiaki

I added some IO error handling for writing to popstate.dat.
How the other reported errors hang together or not, I can't say. The client doesn't directly
interact with the the message db (mork) - but that is handled by the sink.

I found Magnus's patch is effective for simple dowload of messages (without filter move.)

The following is the description of test that finally showed the
behavior I saw several years ago. A duplicate download of a single
message or two is tolerable considering what happened.

  • A test to download messages only (without filter move/copy).

I sent many e-mails in a particular order shown below.

The network is down, and the message "Downloading message 49 of 216" got
stuck while the network disruption continued. I simulated the I/O
error by disabling the network interface to the NFS server where the
mail folders are stored.

    1	/var/tmp/12C is sent
    2	/var/tmp/6C is sent
    3	/var/tmp/11C is sent
    4	/var/tmp/5C is sent
    5	/var/tmp/10C is sent
    6	/var/tmp/4C is sent
    7	/var/tmp/9C is sent
    8	/var/tmp/3C is sent
    9	/var/tmp/8C is sent
   10	/var/tmp/2C is sent
   11	/var/tmp/7C is sent
   12	/var/tmp/1C is sent
   13	/var/tmp/12C is sent
   14	/var/tmp/6C is sent
   15	/var/tmp/11C is sent
   16	/var/tmp/5C is sent
   17	/var/tmp/10C is sent
   18	/var/tmp/4C is sent
   19	/var/tmp/9C is sent
   20	/var/tmp/3C is sent
   21	/var/tmp/8C is sent
   22	/var/tmp/2C is sent
   23	/var/tmp/7C is sent
   24	/var/tmp/1C is sent
   25	/var/tmp/12C is sent
   26	/var/tmp/6C is sent
   27	/var/tmp/11C is sent
   28	/var/tmp/5C is sent
   29	/var/tmp/10C is sent
   30	/var/tmp/4C is sent
   31	/var/tmp/9C is sent
   32	/var/tmp/3C is sent
   33	/var/tmp/8C is sent
   34	/var/tmp/2C is sent
   35	/var/tmp/7C is sent
   36	/var/tmp/1C is sent
   37	/var/tmp/12C is sent
   38	/var/tmp/6C is sent
   39	/var/tmp/11C is sent
   40 /var/tmp/5C is sent
   41	/var/tmp/10C is sent
   42	/var/tmp/4C is sent
   43	/var/tmp/9C is sent
   44	/var/tmp/3C is sent
   45	/var/tmp/8C is sent
   46	/var/tmp/2C is sent
   47	/var/tmp/7C is sent
   48 /var/tmp/1C is sent
   49	/var/tmp/12C is sent <<<< An error was introduced.
   50	/var/tmp/6C is sent  <<<< The internal error processing persisted?
   51	/var/tmp/11C is sent
   52	/var/tmp/5C is sent
   53	/var/tmp/10C is sent
   54	/var/tmp/4C is sent
   55	/var/tmp/9C is sent
   56	/var/tmp/3C is sent
    ...
    ... actually many more continues.
    ... different messages were sent the same number  of times.

Message"Downloading message 49 of 216" got stuck while the network disruption continued.

After the network error caused TCP timeout, CC-TB recognized it due to
the error return from the low-level I/O routine (Write()) and error dialog appeared.
Then I re-enabled the network interface, and for a brief moment, I see message
"Downloading message 50 of 216" which disappeared very quickly. (This is something I need to investigate. I/O handling is done by the main thread only, and thus while the main thread got stuck due to
TCP timeout, many other threads got stuck too(?) and many things happen at the same time after the error message dialog is thrown away by my hitting OK.
Now the original download is aborted (abort mail delivery) midway.

NOTE: Until a few days ago, C-C TB used to get stuck in the kernel or crashed in itself
(double free, etc) quite often after this step.
So I could not continue from this point on reliably.
But this is now taken care of by a patch I am about to post in Bug
1887532 (C-C TB crashes when an I/O error is simulated). Without the
patch, I could not proceed very reliably from this point on. I wonder
why the bug in mork surfaced lately. Maybe the subtle timing difference
between now and several years ago where I checked the I/O error
extensively?

Anyway, THEN, I manually continued download (Please recall the network
interface to the NFS server had been enabled.).

After all the downloaded finished, I counted the # of different
messages (with different timestamps in the subject line when it was
inserted at the sent time)). They ought to match.

I found ONLY THE MESSAGES for which C-C experienced the network error
(and after-effect of error processing?) are duped.

(I checked messages in subfolders also, but they had no messages
because filter move/copy is not invoked in this test.)

Message appears how many times...

checking /home/ishikawa/REMOTE-NFS/Maildir
============================================= folder folder-C
/home/ishikawa/REMOTE-NFS/Maildir/folder-C
============================================= folder folder-D
/home/ishikawa/REMOTE-NFS/Maildir/folder-D
============================================= folder folder-E
/home/ishikawa/REMOTE-NFS/Maildir/folder-E
============================================= folder folder-F
/home/ishikawa/REMOTE-NFS/Maildir/folder-F
============================================= folder Inbox
/home/ishikawa/REMOTE-NFS/Maildir/Inbox
=  10C 22
=  11C 22
=  12C 23 *   Dup ! GREAT the one that got stuck!  (msg # 49)
=  1C  22
=  2C  22
=  3C  22
=  4C  22
=  5C  22
=  6C  23 *   Dup !  The one immediately after the network was re-enabled, (msg # 50)
              I think because of the way error dialog appears in the middle of
              error, I think the error processing was still underway internally.
=  7C  22
=  8C  22
=  9C  22

So two messages that were affected by network I/O errors were downloaded twice.
But this is much better than losing many or so many dupes.

OK, GREAT. Downloading without filter move works great with Magnus's patch.

(This is with my local buffered write patch.)

Now what about the mixture of message download mixed with filters move.
Here I found really mysterious loss of e-mails in this case.
I have no idea of the error pattern yet.

This may have something to do with
Bug 1890253 (Compact does not work for C-C TB POP3. Inbox does not get compacted although some messages were moved during download.)
That is, even under normal conditions, C-C TB leaves the space of
message moved from Inbox to the other subfolders.
Something is not quite right in the code right now.

I will re-visit the mixed download case (simple download of messages and
messages that are moved by filters) after Bug 1890253 is fixed.

So based on my observation in comment 20, the Magnus's patch should go in now IMHO.

Please note that I tested Magnus's patch with my other patches, and I
have not tested it agains the current pristine C-C TB. However, the
code is rather straightforward and under nomal conditions, it should
not cause any harm. I have already submitted a few treeserver test
jobs and it did not seem to cause any problems even without my
buffered write patches. So I don't see the problem.

BTW, with Magnus's patch,
I notice a very subtle bug in my local patches which
caused the message in https://searchfox.org/comm-central/source/mailnews/base/src/MboxMsgOutputStream.cpp#303

NS_WARNING("Failed to roll back mbox file");
to print.
After seeing this, I was able to fix a very subtle dormant problem in my patch locally.
I think the bug was here for a year or two.
So many patches have begun to affect each other and produce a long standing
fix for the I/O error handling in a nice way.

But we have a long way to go including the mysterious loss of messages when
some messages are moved by message filters to other subfolders when an error hits.

Flags: needinfo?(mkmelin+mozilla)

My take on this right now is based on my comment
https://bugzilla.mozilla.org/show_bug.cgi?id=1887532#c16

I modified the pseudo code slightly:

Problem with the current C-C code is that checkpointing and rollback of important meta data are not
explicitly mentioned IMHO.
At least for POP3 download, the transactional handling of the message body ONLY is done (but buggy before).
I cannot say for other meta data such as the one handled by MORK.

The following SHOULD BE the POP3 download transaction overview
(Maildir may be similar I don't know much about IMAP case. )

I suspect that when messages are moved, some meta data are not handled properly
in the face of error condition.

Start a sequence of mail delivery

 if no more message then goto success

 while there is message to download

  // Checkpoint step
  before starting the next message download

  Save mork file (.msf) if it is not done by mork itself (not sure if
  it is done by mork automagically?)

  Save other meta information (header database is in .msf (correct?)
  popstate.dat is now saved (and restored) by Magnus's patch. 

  if each of the save step fails
  then
        display error, and goto abort
  endif

  Not sure what is done regarding the transaction rollback.
  Is there a rollback feature in MORK?

  ! Considering the file system full case, I think it is best to
  ! handle that possibility to let Mork modifies a new copy of .msf files.

  ! Set up so that mork UPDATES A COPY the mork file (.msf) during
  ! download (and possible move/copy by filtering) instead of the real
  ! .msf file that should persist before and after the download
  ! failure.
  ! This preparation step may fail.  If so, goto abort.
  (Not sure if this is done anywhere inside MORK.
  nsPop3Sink.cpp should be doing something like this according to Magnus.
  Someone needs to investigate.)

  Start download

  if failure of download or filtering, etc. then goto  abort.

  if download and filtering, etc. succeeded,
  then
    replace old .msf file with the updated mork file.
    do the same for other old meta data including popstate.dat with the
      updated meta data file.
    if replacement step fails goto abort 
  endif

  if all of the above steps OK, then repeat for the next message download. endif
 endwhile

abort: abort mail delivery altogether, and
       make sure that the meta data files are all in the checkpointed state now
       exit

success: download of all the messages succeeded.
       exit

How to mix the message body handling itself in the above process is also important.
We have a long way to go, but the simple case is now handled more or less satisfactory manner.

Mark the structure that contains the FILE structure properly inactive
and not open even if the fclose() operation itself failed.
This avoids the double free and segfaults.

Also make the reading of unexpected characters an error, not a warning.
This happens so often in a corrupted mork data left over from an I/O
error such as artificially introduced network error to the remote CIFS
and/or NFS server where the message folders are stored during POP3
error test. Having to cope with warning messages (maybe 512 times?)
was not very productive, and the incorrect data IS SIGNALING AN ERROR
after all.

Added a few debug statements where error from Write() and other
low-level I/O function is not checked. Can Mork get away from checking
the low-level I/O errors? I don't think so.

Attachment #9396048 - Attachment description: Bug 1887532 - Fix crash due to re-use closed FILE* and quit early on incorrect. r=mkmelin,BenC → Bug 1887532 - Fix crash due to re-use closed FILE* and quit early on incorrect read. r=mkmelin,BenC
Flags: needinfo?(mkmelin+mozilla)
Target Milestone: --- → 126 Branch

Pushed by john@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/1d090eae9aa5
Handle gracefully failed popstate.dat writes. r=gds,Chiaki

Attachment #9396048 - Attachment description: Bug 1887532 - Fix crash due to re-use closed FILE* and quit early on incorrect read. r=mkmelin,BenC → Bug 1887532 - Fix crash due to re-use closed FILE* and quit early on incorrect input. r=mkmelin,BenC

I am checking the legacy code segfaulting in mork code when an I/O error is simulated by disabling the network interface that talks to a remote NFS server where the messages is stored.
I had to fix the code in Bug 1890253 because I use an external program to scan relevant folders to see
if all the e-mails I sent are received and the content is as it should be.
If Inbox has logically unused space that contains the image of an old message, that would be counted falsely by the check program. (I don't try to parse the .msf mork file at all.)

Now, the download to Inbox only is handled rather gracefully in the face of a simulated error, but the
multiple downloads of messages some of which are moved to other folders from Inbox resulted in
loss of many messages. some times.

Now I have begun to suspect that the mork gets quite confused (?) and maybe we are discarding/truncating folders
too eagerly (by looking at the offset of previously downloaded message (many messages apart) and thus
we lose the messages downloaded thus far. <--- This is a pure conjecture, but otherwise I cannot explain the wholesale deletion of messages which I SOMETIMES observe.

The symptom is obviously different based on the timing.
I observed another strange issue that the download could not continue after an I/O error.
This happened relatively few times, but once this happens, I think I have to quit C-C TB and repair folders and
only then I could continue.
(I have an additional patch for this, but since it is very much timing dependent, I need more testing
to say, the additional patch fixed the problem.)

Due to the way the network is disabled externally from a shell prompt currently, I cannot micromanage the
disabling of network interface deterministically and so reproducing the condition which forces C-C TB to lose many messages has been difficult.

I am now tempted to either

  • insert a code in C-C TB locally for debugging purposes to
    tweak the network interface up/down at the entry of key functions, or even
  • insert a code to do so between each line of related functions to
    simulate the error at key positions.

Otherwise, I don't think I can re-create the condition where many messages are lost in one error easily.
I am thinking over it now.
The idea goes like this.
A macro inserted is expanded to a call to system() function that executes a bash script with a
filename and line number of its location.
In the shell script, it looks at an environment variable to see if the file matches an externally specified filename, and if the line number equals to an externally specified number.
If so, the network is disabled, and thus from the next operation to read/write to the NFS server will return an error after TCP timeout. Thus I can manage to create an error condition at the resolution of either at a function entry or
line by line execution.
From what I have observed, the many message loss situation seems to occur when a message is moved to a folder.
So I suspect I should insert such macros in the related functions in the source files and simulate an error...

As I mentioned previously, simple download only to Inbox is now handled rather well. Maybe a duplication a message or two, that were not properly put into the Inbox folder. It is better than losing a message or segfaulting.

See Also: → 1760925

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/870b6cedf710
Fix crash due to re-use closed FILE* and quit early on incorrect input. r=mkmelin,BenC

Gently suggesting we should close this very soon so the bug doesn't cross esr boundaries. (it currently has a target milestone of 126)

Is there more to do for this crash?

Flags: needinfo?(mkmelin+mozilla)
See Also: → 1897667

Yep, let's continue in other bugs if needed.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(mkmelin+mozilla)
Resolution: --- → FIXED

(In reply to Magnus Melin [:mkmelin] from comment #29)

Yep, let's continue in other bugs if needed.

Thank you for closing this.
I have been trying to test the I/O errors using sikulix to simulate human operations via menu/mouse when the
simulated (transient NFS) I/O error is introduced at function entries. It has taken me almost three weeks to get things straightened out, but I feel the end is near. I think we no longer see the crash in the following scenario.
Download four e-mails. One of them is deleted (moved to trash) by a filter rule.
The other threes are copied to different folders by filter rules.
Without the patch, we saw crashes. But so far in about 400- test scenario cases, I have not seen crashes.
OTOH, I have noticed very strange error messages that may confuse users, but I will report them in different bugzilla.
The 400- tests have not finished yet (I think there are still 30 cases or more. I let it run in the background.)

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: