C-C TB crashes when an I/O error is simulated.
Categories
(MailNews Core :: Database, defect)
Tracking
(thunderbird_esr115 wontfix)
| Tracking | Status | |
|---|---|---|
| thunderbird_esr115 | --- | wontfix |
People
(Reporter: ishikawa, Assigned: mkmelin)
References
Details
(Keywords: crash, leave-open)
Attachments
(2 files)
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
Bug 1887532 - Fix crash due to re-use closed FILE* and quit early on incorrect input. r=mkmelin,BenC
48 bytes,
text/x-phabricator-request
|
Details | Review |
I am checking the behavior of locally patched C-C TB DEBUG build under Debian GNU/Linux.
The local patch is basically for bug 1242030.
I am thinking of submitting to the fabricator the latest patch that has been merged with latest C-C
tree AFTER some extensive tests for I/O error handling (mostly write
handling).
Such tests revealed many problems in C-C TB code about 6-7 years ago
and so I had to insert many error checks in the code.
(As a matter of fact, the crude transactional handling of mail
incoropration end incorporation or abort incoropration did not work at
all until I invesitgated.)
This time around, I am simulating the I/O error by storing the mails
in a remote cifs server, and unplugging the network cable to the
server and seeing the I/O error occur and handled by C-C TB.
E.g., TB may say there was an error, etc. with a
modal error dialog.
Then I have to re-plug the network cable to continue.
[There are additional issues in that TB's writing also take place for
meta database, etc. and TB needs to handle the timeouts from these
network I/O operations, and I won't be able to progress much. So I have to
re-plug the network cable to continue.
Now that said, unbelievably this is only the second try to simulate an
error, and I get a core dump.
Seems I need to do a lot of investigation until I will feel comfotable
to upload the patch for bug 1242030.
But this time it is NOT my patches, but I think it is mork code which
does not deal with I/O error gracefully.
I know it is IMPOSSIBLE to do something perfect in an I/O error
situation, but I would like TB to
-
- cope with the error gracefully, and
-
- show a proper error message, AND
-
- NOT crash.
This one fails the "Do not crash" criteria above.
crash stack:
...
[Parent 320173, Main Thread] WARNING: (debug) We are enabling buffering for m_outFileStream in nsPop3Sink::IncorporateBegin in nsPop3Sink.cpp.: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsPop3Sink.cpp:481
(debug) 494: m_outFileStream->Flush() returned 0x0
{debug} creating seekable stream failed: /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsPop3Sink.cpp:511, rv = 80004002
GetDiskSpaceAvailable returned: 471678976 bytes
(debug) nsMsgBrkMBoxStore::GetMsgInputStream: NS_NewLocalFileInputStream() returned mboxFile.get()=mailbox://ishikawa@127.0.0.1/Inbox
{deubg} LINE=1042, seekable->Seek(PR_SEEK_SET, offset) = 0x0
NS_NewBufferedOutputStream: outputStream (= std::move(aOutputputStream)) =0x563eea842448
(debug) Creating buffered output stream to mboxFile=<<mailbox://ishikawa@127.0.0.1/folder-E>> in nsMsgBrkMBoxStore::GetNewMsgOutputStream in nsMsgBrkMBoxStore.cpp;
Fatal error: glibc detected an invalid stdio handle
Program /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin (pid = 320173) received signal 6.
Stack:
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa0be7da]
#02: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3c510]
#03: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x8a16c]
#04: gsignal[/lib/x86_64-linux-gnu/libc.so.6 +0x3c472]
#05: abort[/lib/x86_64-linux-gnu/libc.so.6 +0x264b2]
#06: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x271ed]
#07: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x7e055]
#08: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x7e885]
#09: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x75e66]
#10: ftell[/lib/x86_64-linux-gnu/libc.so.6 +0x747fd]
#11: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2d581f]
#12: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2d5362]
#13: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2e3583]
#14: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2e3625]
#15: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa3009cc]
#16: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa484d9d]
#17: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa2f6756]
#18: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4aa24b]
#19: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4acf83]
#20: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa51f1bc]
#21: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4a38e0]
#22: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4a967f]
#23: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0xa4b1df7]
#24: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x45846f2]
#25: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x504eb20]
#26: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x504ee52]
#27: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x50548cd]
#28: ??? (???:???)
Sleeping for 300 seconds.
Type 'gdb /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin 320173' to attach your debugger to this thread.
gdb dump
---
[New LWP 323941]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f6e6a0bfaa5 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
req=req@entry=0x7ffe47c9cb90, rem=rem@entry=0x7ffe47c9cb90)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
48 ../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
Redefine command "prun"? (y or n) [answered Y; input not from terminal]
Redefine command "pu"? (y or n) [answered Y; input not from terminal]
Redefine command "ps"? (y or n) [answered Y; input not from terminal]
Loading JavaScript value pretty-printers; see js/src/gdb/README.
If they cause trouble, type: disable pretty-printer .* SpiderMonkey
SpiderMonkey unwinder is disabled by default, to enable it type:
enable unwinder .* SpiderMonkey
(gdb) where
#0 0x00007f6e6a0bfaa5 in __GI___clock_nanosleep
(clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffe47c9cb90, rem=rem@entry=0x7ffe47c9cb90) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1 0x00007f6e6a0c4473 in __GI___nanosleep
(req=req@entry=0x7ffe47c9cb90, rem=rem@entry=0x7ffe47c9cb90)
at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2 0x00007f6e6a0c43aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3 0x00007f6e644e82e7 in common_crap_handler(int, void const*)
(signum=6, aFirstFramePC=0x7f6e644be7da <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+186>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:100
#4 0x00007f6e644e830d in ah_crap_handler(int) (signum=<optimized out>)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:108
#5 0x00007f6e644be7da in nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)
(signo=6, info=0x7ffe47c9cdf0, context=0x7ffe47c9ccc0)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/profile/nsProfileLock.cpp:183
#6 0x00007f6e6a02d510 in <signal handler called> () at /lib/x86_64-linux-gnu/libc.so.6
#7 __pthread_kill_implementation
(threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#8 0x00007f6e6a07b1cf in __pthread_kill_internal (signo=6, threadid=<optimized out>)
at ./nptl/pthread_kill.c:78
#9 0x00007f6e6a02d472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x00007f6e6a0174b2 in __GI_abort () at ./stdlib/abort.c:79
#11 0x00007f6e6a0181ed in __libc_message (fmt=fmt@entry=0x7f6e6a18a3b6 "%s")
at ../sysdeps/posix/libc_fatal.c:150
#12 0x00007f6e6a06f055 in __GI___libc_fatal (message=<optimized out>)
at ../sysdeps/posix/libc_fatal.c:159
#13 0x00007f6e6a06f885 in _IO_vtable_check () at ./libio/vtables.c:67
#14 0x00007f6e6a066e66 in IO_validate_vtable (vtable=0xaa000005ce) at ../libio/libioP.h:945
#15 _IO_seekoff_unlocked (fp=<optimized out>, offset=0, dir=1, mode=0) at ./libio/ioseekoff.c:59
#16 0x00007f6e6a0657fd in __GI__IO_ftell (fp=0x563ee9c27b30) at ./libio/ioftell.c:38
#17 0x00007f6e646d581f in morkStdioFile::Length(morkEnv*) const
(this=0x563ee7e31110, ev=0x563ee6046d70)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.cpp:454
#18 0x00007f6e646d5362 in non-virtual thunk to morkFile::Eof(nsIMdbEnv*, int*) ()
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.h:197
#19 0x00007f6e646e3583 in morkStore::DoPreferLargeOverCompressCommit(morkEnv*)
(this=this@entry=0x563ee6d64490, ev=ev@entry=0x563ee6046d70)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkStore.cpp:241
#20 0x00007f6e646e3625 in morkStore::LargeCommit(nsIMdbEnv*, nsIMdbThumb**)
(this=0x563ee6d64490, mev=<optimized out>, acqThumb=0x7ffe47c9d658)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkStore.cpp:1900
#21 0x00007f6e647009cc in nsMsgDatabase::Commit(int) (this=0x563eeb8013e0, commitType=<optimized out>)
at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/nsMsgDatabase.h:127
#22 0x00007f6e64884d9d in nsMsgBrkMBoxStore::SetSummaryFileValid(nsIMsgFolder*, nsIMsgDatabase*, bool)
(this=0x563ee7077ca0, aFolder=0x563ee7c328f0, aDB=0x563eeb8013e0, aValid=true)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsMsgBrkMBoxStore.cpp:462
#23 0x00007f6e646f6756 in nsMailDatabase::SetSummaryValid(bool) (this=0x563eeb8013e0, aValid=true)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMailDatabase.cpp:131
#24 0x00007f6e648aa24b in nsMsgMailboxParser::UpdateDBFolderInfo(nsIMsgDatabase*)
(mailDB=<optimized out>, this=0x563eeaa29fc0)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:330
#25 nsParseNewMailState::MoveIncorporatedMessage(nsIMsgDBHdr*, nsIMsgDatabase*, nsIMsgFolder*, nsIMsgFilter*, nsIMsgWindow*)
(this=0x563eeaa29fc0, mailHdr=0x563eeb17a3a0, sourceDB=0x563ee9c35330, destIFolder=0x563ee7c328f0, filter=<optimized out>, msgWindow=<optimized out>)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:2410
#26 0x00007f6e648acf83 in nsParseNewMailState::ApplyFilterHit(nsIMsgFilter*, nsIMsgWindow*, bool*)
(this=0x563eeaa29fc0, filter=0x563ee9105de0, msgWindow=0x563ee844c7a0, applyMore=0x7ffe47c9dc1f)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:1868
#27 0x00007f6e6491f1bc in nsMsgFilterList::ApplyFiltersToHdr(int, nsIMsgDBHdr*, nsIMsgFolder*, nsIMsgDatabase*, nsTSubstring<char> const&, nsIMsgFilterHitNotify*, nsIMsgWindow*)
(this=this@entry=0x563ee9d80630, filterType=filterType@entry=1, msgHdr=0x563eeb17a3a0, folder=0x563ee859db50, db=0x563ee9c35330, headers="X-Account-Key: account1", listener=0x563eeaa2a278, msgWindow=0x563ee844c7a0) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/search/src/nsMsgFilterList.cpp:349
#28 0x00007f6e648a38e0 in nsParseNewMailState::ApplyFilters(bool*, nsIMsgWindow*)
(this=this@entry=0x563eeaa29fc0, pMoved=pMoved@entry=0x7ffe47c9de32, msgWindow=msgWindow@entry=0x563ee844c7a0) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:1730
#29 0x00007f6e648a967f in nsParseNewMailState::PublishMsgHeader(nsIMsgWindow*)
(this=0x563eeaa29fc0, msgWindow=0x563ee844c7a0)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsParseMailbox.cpp:1672
#30 0x00007f6e648b1df7 in nsPop3Sink::IncorporateComplete(nsIMsgWindow*, int)
(this=0x563eea9cc620, aMsgWindow=0x563ee844c7a0, aSize=0)
at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/RefPtr.h:341
#31 0x00007f6e5e9846f2 in NS_InvokeByIndex ()
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/reflect/xptcall/md/unix/xptcinvoke_asm_x86_64_unix.S:101
#32 0x00007f6e5f44eb20 in CallMethodHelper::Invoke() (this=0x7ffe47c9e1d8)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/xpconnect/src/XPCWrappedNative.cpp:1621
#33 CallMethodHelper::Call() (this=this@entry=0x7ffe47c9e1d8)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/xpconnect/src/XPCWrappedNative.cpp:1174
#34 0x00007f6e5f44ee52 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)
(ccx=..., mode=mode@entry=XPCWrappedNative::CALL_METHOD)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/xpconnect/src/XPCWrappedNative.cpp:1120
#35 0x00007f6e5f4548cd in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*)
(cx=<optimized out>, argc=<optimized out>, vp=0x7ffe47c9e4e0)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:966
#36 0x0000161a3a8754a6 in ()
#37 0x00007ffe47c9e538 in ()
#38 0x00007ffe47c9e4b8 in ()
#39 0x0000000000000000 in ()
(gdb)
| Reporter | ||
Comment 1•2 years ago
|
||
/NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.cpp:454
is
mork_pos morkStdioFile::Length(morkEnv* ev) const {
mork_pos outPos = 0;
if (this->IsOpenAndActiveFile()) {
FILE* file = (FILE*)mStdioFile_File;
if (file) {
long start = MORK_FILETELL(file); <--- line 454
if (start >= 0) {
long fore = MORK_FILESEEK(file, 0, SEEK_END);
if (fore >= 0) {
long eof = MORK_FILETELL(file);
if (eof >= 0) {
long back = MORK_FILESEEK(file, start, SEEK_SET);
if (back >= 0)
outPos = eof;
else
this->new_stdio_file_fault(ev);
} else
this->new_stdio_file_fault(ev);
} else
this->new_stdio_file_fault(ev);
} else
this->new_stdio_file_fault(ev);
} else if (mFile_Thief)
mFile_Thief->Eof(ev->AsMdbEnv(), &outPos);
else
this->NewMissingIoError(ev);
} else
this->NewFileDownError(ev);
return outPos;
}
So *file contains invalid fileno?
(gdb) up
#1 0x00007f6e6a0c4473 in __GI___nanosleep (req=req@entry=0x7ffe47c9cb90,
rem=rem@entry=0x7ffe47c9cb90) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
25 ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) up
#2 0x00007f6e6a0c43aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
55 ../sysdeps/posix/sleep.c: No such file or directory.
(gdb) up
#3 0x00007f6e644e82e7 in common_crap_handler (signum=6,
aFirstFramePC=0x7f6e644be7da <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+186>)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:100
100 sleep(_gdb_sleep_duration);
(gdb) up
#4 0x00007f6e644e830d in ah_crap_handler (signum=<optimized out>)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:108
108 common_crap_handler(signum, CallerPC());
(gdb) up
#5 0x00007f6e644be7da in nsProfileLock::FatalSignalHandler (signo=6, info=0x7ffe47c9cdf0,
context=0x7ffe47c9ccc0)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/profile/nsProfileLock.cpp:183
183 oldact->sa_handler(signo);
(gdb) up
#6 <signal handler called>
(gdb) up
#7 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) up
#8 0x00007f6e6a07b1cf in __pthread_kill_internal (signo=6, threadid=<optimized out>)
at ./nptl/pthread_kill.c:78
78 in ./nptl/pthread_kill.c
(gdb) up
#9 0x00007f6e6a02d472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
26 ../sysdeps/posix/raise.c: No such file or directory.
(gdb) up
#10 0x00007f6e6a0174b2 in __GI_abort () at ./stdlib/abort.c:79
79 ./stdlib/abort.c: No such file or directory.
(gdb) up
#11 0x00007f6e6a0181ed in __libc_message (fmt=fmt@entry=0x7f6e6a18a3b6 "%s")
at ../sysdeps/posix/libc_fatal.c:150
150 ../sysdeps/posix/libc_fatal.c: No such file or directory.
(gdb) up
#12 0x00007f6e6a06f055 in __GI___libc_fatal (message=<optimized out>)
at ../sysdeps/posix/libc_fatal.c:159
159 in ../sysdeps/posix/libc_fatal.c
(gdb) up
#13 0x00007f6e6a06f885 in _IO_vtable_check () at ./libio/vtables.c:67
67 ./libio/vtables.c: No such file or directory.
(gdb) up
#14 0x00007f6e6a066e66 in IO_validate_vtable (vtable=0xaa000005ce) at ../libio/libioP.h:945
945 ../libio/libioP.h: No such file or directory.
(gdb) up
#15 _IO_seekoff_unlocked (fp=<optimized out>, offset=0, dir=1, mode=0) at ./libio/ioseekoff.c:59
59 ./libio/ioseekoff.c: No such file or directory.
(gdb) up
#16 0x00007f6e6a0657fd in __GI__IO_ftell (fp=0x563ee9c27b30) at ./libio/ioftell.c:38
38 ./libio/ioftell.c: No such file or directory.
(gdb) up
#17 0x00007f6e646d581f in morkStdioFile::Length (this=0x563ee7e31110, ev=0x563ee6046d70)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.cpp:454
454 long start = MORK_FILETELL(file);
(gdb) print file
$1 = (FILE *) 0x563ee9c27b30
(gdb) print *file
$2 = {_flags = 878570760, _IO_read_ptr = 0x0, _IO_read_end = 0x0,
_IO_read_base = 0x4000003d9 <error: Cannot access memory at address 0x4000003d9>,
_IO_write_base = 0x4000000971 <error: Cannot access memory at address 0x4000000971>,
_IO_write_ptr = 0x1d800000040 <error: Cannot access memory at address 0x1d800000040>,
_IO_write_end = 0x1d8000001d8 <error: Cannot access memory at address 0x1d8000001d8>,
_IO_buf_base = 0x0,
_IO_buf_end = 0x30000000000001fc <error: Cannot access memory at address 0x30000000000001fc>,
_IO_save_base = 0x2000000000000388 <error: Cannot access memory at address 0x2000000000000388>,
_IO_backup_base = 0x20000000000003d9 <error: Cannot access memory at address 0x20000000000003d9>,
_IO_save_end = 0x6000003f2 <error: Cannot access memory at address 0x6000003f2>,
_markers = 0x100000040a, _chain = 0x150000041b, _fileno = 1069, _flags2 = 23,
_old_offset = 150323856454, _cur_column = 1111, _vtable_offset = 0 '\000', _shortbuf = "",
_lock = 0x3400000472, _offset = 249108104337, _codecvt = 0x1000004a000004ba,
_wide_data = 0x5c000004ee, _freeres_list = 0x6700000512, _freeres_buf = 0x7a0000055d,
__pad5 = 545460847978, _mode = 1399,
_unused2 = "\204\000\000\000\242\005\000\000\231\000\000\000\263\005\000\000\237\000\000"}
(gdb)
fileno is 1069, and
some memory buffer areas are invalid addresses.
The maximum file descriptors open in one process under my linux is:
$ ulimit -n
1024
$ ulimit -Sn
1024
It seems fileno, 1069, is a bit too large IMHO.
I am afraid need to insert more printf dumps, etc.
to figure out what is going on.
(gdb) list
449 mork_pos outPos = 0;
450
451 if (this->IsOpenAndActiveFile()) {
452 FILE* file = (FILE*)mStdioFile_File;
453 if (file) {
454 long start = MORK_FILETELL(file);
455 if (start >= 0) {
456 long fore = MORK_FILESEEK(file, 0, SEEK_END);
457 if (fore >= 0) {
458 long eof = MORK_FILETELL(file);
(gdb) up
#18 0x00007f6e646d5362 in non-virtual thunk to morkFile::Eof(nsIMdbEnv*, int*) ()
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.h:197
197 NS_IMETHOD Eof(nsIMdbEnv* ev, mdb_pos* outPos) override;
(gdb) list
192 public:
193 virtual mork_pos Length(morkEnv* ev) const = 0; // eof
194 // nsIMdbFile methods
195 NS_IMETHOD Tell(nsIMdbEnv* ev, mdb_pos* outPos) const override = 0;
196 NS_IMETHOD Seek(nsIMdbEnv* ev, mdb_pos inPos, mdb_pos* outPos) override = 0;
197 NS_IMETHOD Eof(nsIMdbEnv* ev, mdb_pos* outPos) override;
198 // } ----- end pos methods -----
199
200 // { ----- begin read methods -----
201 NS_IMETHOD Read(nsIMdbEnv* ev, void* outBuf, mdb_size inSize,
(gdb) print ev
No symbol "ev" in current context.
(gdb) print *outPos
No symbol "outPos" in current context.
(gdb) print ev
No symbol "ev" in current context.
(gdb) up
#19 0x00007f6e646e3583 in morkStore::DoPreferLargeOverCompressCommit (this=this@entry=0x563ee6d64490,
ev=ev@entry=0x563ee6046d70)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkStore.cpp:241
241 file->Eof(ev->AsMdbEnv(), &fileEof);
(gdb) list
236 // true when mStore_CanWriteIncremental && store has file large enough
237 {
238 nsIMdbFile* file = mStore_File;
239 if (file && mStore_CanWriteIncremental) {
240 mdb_pos fileEof = 0;
241 file->Eof(ev->AsMdbEnv(), &fileEof);
242 if (ev->Good() && fileEof > 128) return morkBool_kTrue;
243 }
244 return morkBool_kFalse;
245 }
(gdb) print file
$3 = <optimized out>
(gdb) print *file
value has been optimized out
(gdb) print fileEof
$4 = 0
(gdb) print ev
$5 = (morkEnv *) 0x563ee6046d70
(gdb) print *ev
$6 = (morkEnv) {<morkObject> = {<morkBead> = {<morkNode> = {
_vptr.morkNode = 0x7f6e67142cb0 <vtable for morkEnv+16>, mNode_Heap = 0x563eea8dd3e0,
mNode_Base = 20068, mNode_Derived = 17782, mNode_Access = 111 'o', mNode_Usage = 104 'h',
mNode_Mutable = 85 'U', mNode_Load = 34 '"', mNode_Uses = 1, mNode_Refs = 1},
mBead_Color = 0}, <nsIMdbObject> = {<nsISupports> = {
_vptr.nsISupports = 0x7f6e67142dc8 <vtable for morkEnv+296>}, <No data fields>},
mObject_Handle = 0x0, mMorkEnv = 0x0, mRefCnt = {static isThreadSafe = false, mValue = 1},
_mOwningThread = 320173}, <nsIMdbEnv> = {<nsISupports> = {
_vptr.nsISupports = 0x7f6e67142e40 <vtable for morkEnv+416>}, <No data fields>},
mEnv_Factory = 0x563ee8596ab0, mEnv_Heap = 0x563eea8dd3e0, mEnv_SelfAsMdbEnv = 0x563ee6046db8,
mEnv_ErrorHook = 0x0, mEnv_HandlePool = 0x563ee969d0b0, mEnv_ErrorCount = 0, mEnv_WarningCount = 0,
mEnv_ErrorCode = nsresult::NS_OK, mEnv_DoTrace = 0 '\000', mEnv_AutoClear = 85 'U',
mEnv_ShouldAbort = 0 '\000', mEnv_BeVerbose = 0 '\000', mEnv_OwnsHeap = 1 '\001'}
(gdb) call ev->AsbMdbEnv()
Couldn't find method morkEnv::AsbMdbEnv
(gdb) call ev->AsMdbEnv()
Cannot evaluate function -- may be inlined
(gdb) print *file
value has been optimized out
(gdb)
I think I need to modify the GCC option so that variables would remain
in the code. Also I need to insert more printf statement to trace the
entering of functions, etc.
| Reporter | ||
Comment 2•2 years ago
|
||
It seems that the crash left some files in a locked state and the next startup of thunderbird is not proceeding...
I think I need to clean up the storage of mails and start from scratch...
Surprise, I even restarted X window and got the following dialog. I obviously could not kill thunderbird cleanly...
Thunderbird is already running, but is not responding. To use Thunderbird, you must first close the existing Thunderbird process, restart your device, or use a different profile.
It has been quite a while since I last met a zombie process that could not be culled by kill command.
I am restarting linux itself.
I will post updates on this and other mishaps of I/O error handling by C-C TB.
| Reporter | ||
Comment 3•2 years ago
|
||
I think TB got stuck in CIFS subsystem. Hmm... Uncovering an issue of CIFS client code.
systemd-shutdown tries killing the hung process and I saw the current stack dump, but it is still waiting for the process to disappear.
Finally reboot...
This is not going to be a smooth sailing unlike several years ago.
| Reporter | ||
Comment 4•2 years ago
|
||
I filed a bug in samba bugzilla, too.
https://bugzilla.samba.org/show_bug.cgi?id=15614
I hope this usage scenario, storing mail messages on remote CIFS server and experiencing network outage temporarily won't crash thunderbird easily after a suitable fix is discovered.
| Reporter | ||
Comment 5•2 years ago
|
||
I find that thunderbird crashes often during my crude attempt to simulate I/O error by unplugging/plugging network cable to the remote CIFS server, or shutting down the network interface to the server.
And the crash position is the ftell mentioned in comment 1.
long start = MORK_FILETELL(file); <--- line 454
So now my take on the situation is that the raw message buffered write handles the I/O error more or less. The C-C TB code have absorbed some portion of my patches over the years, and many other improvements have been made.
But the mork code that handles meta-data for messages does not seem to cope with I/O error very well.
Tough.
I don't think I saw so many crashes about several years ago when I tested the C-C TB code for I/O error handling.
But it could be that I was more concerned with message data handling, and may not have been able to proceed to the point that meta data handling became an issue.
But I do remember that I tested the mail delivery abort operation and TB continued to work.
So something must have changed to make the above MORK_FILETELL(file) more vulnerable.
And one more data point.
During the heavy testing cycle of several years ago, I experienced a very strange ftell crash.
The low-level routine such as fseek, fell, fwrite, etc. are wrapped and
error conditions are propagated through the wrappers.
But there was single failure of ftell which caused thunderbird crash.
I was very surprised at the crash because it SHOULD NOT HAPPEN.
That is somehow the wrapper failed to contain the error condition (error throw by low-level code)
and that signal caused C-C TB to crash.
I am afraid that there may be a deficiency of the wrapper used by C-C TB for ftell.
| Reporter | ||
Comment 6•2 years ago
|
||
Wait, mork uses ftell without a wrapper.
https://searchfox.org/comm-central/source/mailnews/db/mork/morkConfig.h#51
# define MORK_FILETELL(file) ftell(file)
C-C TB's messaage code uses |stream->Tell()|, and
I recall the following Tell() caused a strange segfault back when I was testing the I/O error handling extensively.
Problem was that the Tell() wrapper ought to return an error code when the error happened, but
somehow it passed signal up the call chain and TB crashed.
https://searchfox.org/comm-central/source/mailnews/local/src/nsMsgBrkMBoxStore.cpp#838
| Reporter | ||
Comment 7•2 years ago
|
||
So I looked at mork code re Write() and am quite confused.
Say, https://searchfox.org/comm-central/source/mailnews/db/mork/morkFile.cpp#217
mork_size morkFile::WriteNewlines(morkEnv* ev, mork_count inNewlines)
// WriteNewlines() returns the number of bytes written.
{
mork_size outSize = 0;
while (inNewlines && ev->Good()) // more newlines to write?
{
mork_u4 quantum = inNewlines;
if (quantum > morkFile_kNewlinesCount) quantum = morkFile_kNewlinesCount;
mork_size quantumSize = quantum * mork_kNewlineSize;
mdb_size bytesWritten;
this->Write(ev->AsMdbEnv(), morkFile_kNewlines, quantumSize, &bytesWritten); <--- 214. No error check?
outSize += quantumSize;
inNewlines -= quantum;
}
return outSize;
}
You can see that there is no check for the error return from Write().
This can happen if an underlying file system is broken, file system fills up, network experiences a transient error and the data is written to a remote file server.
Does this line of mork intentionally ignore the error from |Write()|?
I am afraid that mork does not handle I/O errors gracefully...
Wayne, who is the mork expert to ask?
| Reporter | ||
Updated•2 years ago
|
Updated•2 years ago
|
Comment 8•2 years ago
|
||
(In reply to ISHIKAWA, Chiaki from comment #7)
Wayne, who is the mork expert to ask?
I'm afraid there isn't really anyone who properly understands mork at the moment...
In the past few years, there's always been the plan to move to another database (Bug 453975), so nobody has really invested the time to get to know the mork code well.
On the plus side, the database migration work is getting started now, so it looks like it will finally happen! Work on Bug 1572000 is getting underway.
| Reporter | ||
Comment 9•2 years ago
|
||
(In reply to Ben Campbell from comment #8)
(In reply to ISHIKAWA, Chiaki from comment #7)
Wayne, who is the mork expert to ask?
I'm afraid there isn't really anyone who properly understands mork at the moment...
In the past few years, there's always been the plan to move to another database (Bug 453975), so nobody has really invested the time to get to know the mork code well.
On the plus side, the database migration work is getting started now, so it looks like it will finally happen! Work on Bug 1572000 is getting underway.
I am tied with paper work at this moment for the next week or so.
But based on my observation, the crash at an I/O error is for real. (Strange, C-C TB did not crash like this in my test cycle a few years ago.)
Obviously, mork code faces a write error and it did not recover well. (I have seen C-C TB continues downloading messages after the unplugged network cable is plugged again, but after half a dozen or so messages, it suddenly crashes at the same ftell code line. This is not what I remember.)
Something has changed. Most likely the call patterns.
Anyway, in this state of things, my I/O error simulation won't cover C-C TB's resilience much.
It simply crashes now unlike several years ago when C-C TB would work after the cable was plugged again.
Do you think I should investigate the crash issue and that effor would not be in vain?
I mean the transition to other DB Bug 1572000 is finished within this year ?
I have a hunch that the low-level I/O error code is not propagated within mork very well and
whatever to make this-isOpenAndActiveFile() false is not properly set after an I/O error.
450
451 if (this->IsOpenAndActiveFile()) { <-- if this is false the problematic ftell is not called.
452 FILE* file = (FILE*)mStdioFile_File; <--- Or this mStdioFile_File was not set to nullptr as it should have
453 if (file) { when an I/O error occurred and mork gave up.
454 long start = MORK_FILETELL(file);
455 if (start >= 0) {
But leading to the root cause of the behavior alone might take a week of time or so...
Comment 10•2 years ago
|
||
FYI we've pretty much exhausted the developers who might be able to comment on mork.
| Reporter | ||
Comment 11•2 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #10)
FYI we've pretty much exhausted the developers who might be able to comment on mork.
Yeah, I have a feeling that this is a case.
I will place a few debug print statement to see if I can find the easy fix for the crash problem.
But otherwise, C-C TB is in a poorer shape than it was several years ago as far as the robustness in the face of I/O errors.
Sad, but true.
| Reporter | ||
Comment 12•2 years ago
|
||
Cool.
I think I have found a reference after free issue.
Before that I found a mundane fix for a irritating problem.
I found a fix for C-C TB's reporting unexpected character NUL '\0' character inside mork (prresumably left when the I/O error simulated by network error to CIFS sever left a file with a block of '\0' in it and mork) reader/parser did not like it.
C-C TB reported it until it reads all the NUL charcters which could mean a long series sometimes. (maybe 512 times?)
With that patched, I could test a few more test runs,
1 download a series of e-mal messages.
2 unplug network cable.
3 See C-C TB shows an error dialog.
4 plug network cable.
Repeat this process.
After the initial error, C-C TB used to stumble on this long series of '\0' of unexpected char warnings, but
now it has been removed.
I could repeat the above process a few times, when I saw the crash and the following dump from gdb.
up 17
#17 0x00007fd357500def in morkStdioFile::Length (this=0x559500d76960, ev=0x559500d74920)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkFile.cpp:454
454 long start = MORK_FILETELL(file);
(gdb) print *file
$1 = {_flags = 25028816, _IO_read_ptr = 0x7fd35d1f2380 <main_arena+1792> "p#\037]\323\177",
_IO_read_end = 0x5595017de8d0 "ZZZZZZZZ\361 ", _IO_read_base = 0x5595017de8d0 "ZZZZZZZZ\361 ",
_IO_write_base = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>,
_IO_write_ptr = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>,
_IO_write_end = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>,
_IO_buf_base = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>,
_IO_buf_end = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>,
_IO_save_base = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>,
_IO_backup_base = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>,
_IO_save_end = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>,
_markers = 0x5a5a5a5a5a5a5a5a, _chain = 0x5a5a5a5a5a5a5a5a, _fileno = 1515870810, _flags2 = 1515870810,
_old_offset = 6510615555426900570, _cur_column = 23130, _vtable_offset = 90 'Z', _shortbuf = "Z",
_lock = 0x5a5a5a5a5a5a5a5a, _offset = 6510615555426900570, _codecvt = 0x5a5a5a5a5a5a5a5a,
_wide_data = 0x5a5a5a5a5a5a5a5a, _freeres_list = 0x5a5a5a5a5a5a5a5a, _freeres_buf = 0x5a5a5a5a5a5a5a5a,
__pad5 = 6510615555426900570, _mode = 1515870810, _unused2 = 'Z' <repeats 20 times>}
(gdb)
The 0x5a patterns you see above is that it is uninitialized (either freed or malloced but not initialized yet).
ASCII 'Z' is 0x5a, too.
Everything in the file structure is filled with 0x5a.
For example, _fileno 1515870810 and other values are concatenation of 0x5a octet.
ishikawa@ip030:/NREF-COMM-CENTRAL/mozilla$ bc
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
obase=16
1515870810
5A5A5A5A
6510615555426900570
5A5A5A5A5A5A5A5A
23130
5A5A
I am using gnu malloc iwth debug feature.
I call mallopt() with 0x5a value in my local source.
See below.
I set the environment variable thusly:
MOZILLA_MALLOPT=0X5A
MALLOC_PERTURB_=0x5A
So C-C TB printed out the following.
Calling mallopt for malloc/free debug. with 0x5a value.
My local change.
# HG changeset patch
# User ISHIKAWA, Chiaki <ishikawa@yk.rim.or.jp>
# Parent 15b2e82875d420d82ed576137b91c9915fb6594f
call malloc_opt()
diff --git a/mail/app/nsMailApp.cpp b/mail/app/nsMailApp.cpp
--- a/mail/app/nsMailApp.cpp
+++ b/mail/app/nsMailApp.cpp
@@ -27,16 +27,21 @@
#elif defined(XP_UNIX)
# include <sys/resource.h>
# include <unistd.h>
#endif
#include <stdio.h>
#include <stdarg.h>
#include <time.h>
+#if __linux__
+// for mallopt under linux if possible
+#include <malloc.h>
+#endif
+
#include "nsCOMPtr.h"
#include "nsIFile.h"
#ifdef XP_WIN
# include "mozilla/mscom/ProcessRuntime.h"
# include "mozilla/WindowsDllBlocklist.h"
# include "mozilla/WindowsDpiInitialization.h"
@@ -157,16 +162,38 @@ static bool IsArg(const char* arg, const
#endif
return false;
}
Bootstrap::UniquePtr gBootstrap;
static int do_main(int argc, char* argv[], char* envp[]) {
+
+ // I think the following needs to go into |XRE_main()|, too,
+ // if that spawns another proces.
+
+#if defined(XP_UNIX)
+#if __linux__
+ /* actually depends on gnu libc version. But linux most likely uses
+ a version of GNU libc, and later versions supports mallopt() */
+ {
+ char *cp = getenv("MOZILLA_MALLOPT");
+ if (cp) {
+ int val = 0xC3;
+ if(*cp)
+ sscanf(cp, "%x", &val);
+ Output("Calling mallopt for malloc/free debug. with 0x%02x value.\n", val & 0xFF);
+ mallopt(M_PERTURB, val & 0xFF);
+ }
+ }
+#endif
+#endif
+
+
// Allow thunderbird.exe to launch XULRunner apps via -app <application.ini>
// Note that -app must be the *first* argument.
const char* appDataFile = getenv("XUL_APP_FILE");
if ((!appDataFile || !*appDataFile) && (argc > 1 && IsArg(argv[1], "app"))) {
if (argc == 2) {
Output("Incorrect number of arguments passed to -app");
return 255;
}
Given that *file would be initialized to a sane value on allocation, I think there is
a use after free problem in the code.
Anyone has an idea on this?
One problem I have now is that I thought ASAN build would be handy. (valgrind is too slow often if I can run it.)
But I could not build ASAN version of C-C TB. I clearly recall I could build it a couple of years ago, something has changed in the build configuration that makes it hard to build an ASAN version locally. (I used clang back then, but even clang version could not be built due to strange library mismatch or whatever. So I tried GCC after careful anaysis of my local bash script and mozconfig for ASAN, but to no avail too.)
Comment 13•2 years ago
|
||
Unfortunately even in the days of Mozilla Messaging, I'm not sure we really had anyone who understood Mork deeply. I would suspect things like this have just been lying in the code for ages - checking the change logs should help there.
Unless there's active crash/bug reports of users hitting these, it is possibly not worth fixing given the current work to move away from Mork.
| Reporter | ||
Comment 14•2 years ago
|
||
(In reply to Mark Banner (:standard8) from comment #13)
Unfortunately even in the days of Mozilla Messaging, I'm not sure we really had anyone who understood Mork deeply. I would suspect things like this have just been lying in the code for ages - checking the change logs should help there.
Unless there's active crash/bug reports of users hitting these, it is possibly not worth fixing given the current work to move away from Mork.
Thank you for the overview of the state of the mork code.
I will spend a few more days to test the behavior of C-C TB code (hoefully if I can make ASAN build locally with my patch, that would be great) in the face of I/O errors.
Problem is that C-C TB segfaults when I simulate I/O error. So this is a regression of a sort from what I observed during similar tests several years ago. C-C TB would continue to work once the I/O error introduced by disconnecting network cable was removed (by plugging the cable again).
Anyway, I will report what I find in the next couple of days.
| Reporter | ||
Comment 15•2 years ago
|
||
I found double free caused by calling an fclose to already closed stream using my locally built ASAN C-C TB with my patch.
This is inside MORK code.
I think I found a cause of this bug and testing the patch locally.
Obviously, mork code was not tested extensively against I/O error.
The local patch looks to be working in the sense that C-C TB does not crash anymore after a simulated I/O error by disabling the network interface to a remote NFS server on which the downloaded messages are written.
I gave up on CIFS for the testing purposes because C-C TB often gets stuck in the kernel cifs module.
(But before the local patch, C-C TB was calling the local I/O stdio code with bogus fileno such as -1 or for a file number which had already been closed previously.)
Whether we are handling messages during the simulated I/O error, i.e., not losing a message or two is another matter.
It will be a couple of days before I can say for sure.
Before the crash patch was applied, I saw many messages lost (probably due to them marked as deleted by a bug in mork code bug or higherlevel pop3client JS code. I will investigate this now that C-C TB does not crash after I/O error simulation.)
| Reporter | ||
Comment 16•2 years ago
|
||
I think I now spot a serious error in transactional processing of POP3 message download.
Thanks to my patch to mork, C-C TB no longer crashes in the face of an I/O error, but
I notice a place we need to improve. I wonder how the old C++/C pop3client handled this.
This is because I think C-C TB behaved sanely after simulated errors several years ago.
My observation is based on error handling sequence.
At the end, I show the debug dump.
When a message download begins (= Incorporate message begin),
my simulated I/O error happened.
Eventually, the error was noticed after the network timeout,
mail message delivery aborted. (= Abort mail message delivery.)
So this last message was not incorporated into the mail folder.
So far so good.
HOWEVER, note the lines marked with "*" in the debug dump.
* CloseStdio:file FILECLOSE failed
* morkStdio::new_stdio_file_fault: file->_fileno=-1, this=0x5652933e97b0, ev=0x56528ff717a0,
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
Mork failed to write the .msf file for a folder (folder-E) due to an I/O error.
BUT this mork error HAPPENED AFTER the "Abort mail message delivery line"
AND after the JavaScript pop3client printed the following error.
* JavaScript error: , line 0: UnknownError: Could not write to `/home/ishikawa/REMOTE-NFS/Maildir/popstate.dat': failed to open file for writing (NS_ERROR_FAILURE)
BAD!
I think toplevel pop3 client loop should do
something like the following.
Start a sequence of mail delivery
if no more message then goto success
before starting the next message download
save mork file (.msf) if it is not done by mork itself (not sure if it is done by mork automagically?)
! setup so that mork UPDATES A COPY the mork file (.msf) during download instead of the real .msf file
! that should persist before and after the download. This may fail. If so, goto abort.
start download
if failure of download or filtering, etc. goto abort.
! if download and filtering, etc. succeeded,
! then replace old .msf file with the updated mork file.
! if replacement fails goto abort
if OK, then repeat for the next message download.
abort: abort mail delivery altogether
exit
success: download of all the messages succeeded.
I think the current behavior shows that the lines marked with"!" probably does not happen
because the error of mork .msf updates are printed AFTER JS pop3client reported a write error???
Top-level POP3client should monitor the mork db update UNTIL it reports success after the
message is incorporated into a folder after a move by filter rule, etc.
Someone needs to check the JS pop3client code.
It does not seem to be tested well for error handling.
Again I wonder what the old C++/C pop3client did.
Debug dump:
= Incorporate message begin:
uidl string: 0001cde3533eeb2c
{debug} mboxFile existed and so did not create.
{streamdebug} value: mboxFile = 0x56528ffbf7a0 'mboxFile being passed to NS_NewLocalFileOutput(): I suspect this is nil under certain circumstances.' in InternalGetNewMsgOutputStream at line 867 of /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsMsgBrkMBoxStore.cpp
(debug) Creating buffered output stream to mboxFile=<<mailbox://ishikawa@localhost/Inbox>> in nsMsgBrkMBoxStore::GetNewMsgOutputStream in nsMsgBrkMBoxStore.cpp;
(debug) MsgNewBufferedFileOutputStream(mboxFile, aResult, PR_WRONLY, 00600););returns error rv=0x80004005
[Parent 513704, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005 (NS_ERROR_FAILURE): file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsMsgBrkMBoxStore.cpp:918
[Parent 513704, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005 (NS_ERROR_FAILURE): file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsMsgBrkMBoxStore.cpp:781
(debug) IncorporateBegin: m_msgStore->GetNewMsgOutputStream(....) failed. rv = 0x80004005
[Parent 513704, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005 (NS_ERROR_FAILURE): file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsPop3Sink.cpp:463
[Parent 513704, Main Thread] WARNING: NS_ENSURE_TRUE(uri) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/caps/BasePrincipal.cpp:1494
[Parent 513704, Main Thread] WARNING: NS_ENSURE_TRUE(root) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/layout/base/nsDocumentViewer.cpp:2585
[WARN webrender::renderer::init] asking to enable_gpu_markers but no supporting extension was found
{streamdebug} closing: m_outFileStream (= (nil) )->Close() 'before possible Close()' in AbortMailDelivery at line 379 of /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/local/src/nsPop3Sink.cpp
= Abort mail message delivery.
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt (t=270.198) [GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt
[Parent 513704, Compositor] WARNING: Possibly dropping task posted to updater thread: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/gfx/layers/apz/src/APZUpdater.cpp:395
* JavaScript error: , line 0: UnknownError: Could not write to `/home/ishikawa/REMOTE-NFS/Maildir/popstate.dat': failed to open file for writing (NS_ERROR_FAILURE)
morkStdioFile::Length:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
morkStdioFile::Length:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
morkStdioFile::Length:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
morkStdioFile::Length:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
~nsMsgDatabase(): closing database /home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
calling RemoveFromCache(this)
CloseStdio:file->_fileno=90, this=0x5652933e97b0, ev=0x56528ff717a0,
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
* CloseStdio:file FILECLOSE failed
* morkStdio::new_stdio_file_fault: file->_fileno=-1, this=0x5652933e97b0, ev=0x56528ff717a0,
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-E.msf
[Parent 513704, Main Thread] ###!!! ASSERTION: morkBool_kFalse: 'Error', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkConfig.cpp:20
morkStdioFile::Length:file->_fileno=39, this=0x565292747850, ev=0x5652904eeac0
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/Inbox.msf
morkStdioFile::Length:file->_fileno=39, this=0x565292747850, ev=0x5652904eeac0
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/Inbox.msf
morkStdioFile::Length:file->_fileno=39, this=0x565292747850, ev=0x5652904eeac0
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/Inbox.msf
morkStdioFile::Length:file->_fileno=39, this=0x565292747850, ev=0x5652904eeac0
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/Inbox.msf
morkStdioFile::Length:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
morkStdioFile::Length:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
morkStdioFile::Length:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
morkStdioFile::Length:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
~nsMsgDatabase(): closing database /home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
calling RemoveFromCache(this)
CloseStdio:file->_fileno=89, this=0x5652934190e0, ev=0x56528e9b9d30,
this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf <-- updating .msf file in place!
CloseStdio:file FILECLOSE failed
* morkStdio::new_stdio_file_fault: file->_fileno=-1, this=0x5652934190e0, ev=0x56528e9b9d30,
* this->mFile_Name=/home/ishikawa/REMOTE-NFS/Maildir/folder-D.msf
[Parent 513704, Main Thread] ###!!! ASSERTION: morkBool_kFalse: 'Error', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/mork/morkConfig.cpp:20
morkStdioFile::Length:file->_fileno=72, this=0x565290e56b60, ev=0x565290e56940
this->mFile_Name=/home/ishikawa/.thunderbird/acggbycn.Default User/Mail/Local Folders/Unsent Messages.msf
morkStdioFile::Length:file->_fileno=72, this=0x565290e56b60, ev=0x565290e56940
this->mFile_Name=/home/ishikawa/.thunderbird/acggbycn.Default User/Mail/Local Folders/Unsent Messages.msf
~nsMsgDatabase(): closing database /home/ishikawa/.thunderbird/acggbycn.Default User/Mail/Local Folders/Unsent Messages.msf
calling RemoveFromCache(this)
CloseStdio:file->_fileno=72, this=0x565290e56b60, ev=0x565290e56940,
I will file a bugzilla if with my patch to avoid crash, C-C TB still loses e-mails after the simulated I/O errors.
Before my patch, it did.
| Reporter | ||
Comment 17•2 years ago
|
||
C-C TB does lose messages with the simple simulated I/O error. It did not happen several years ago.
it seems to pretend that the downloading after the I/O error handling succeeds. But it does not actually.
I do suspect the deficiency of pop3client code now done in JS.
This is all for today.
| Assignee | ||
Comment 18•2 years ago
|
||
I added some IO error handling for writing to popstate.dat.
How the other reported errors hang together or not, I can't say. The client doesn't directly
interact with the the message db (mork) - but that is handled by the sink.
Updated•2 years ago
|
| Reporter | ||
Comment 19•2 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #18)
Created attachment 9395497 [details]
Bug 1887532 - Handle gracefully failed popstate.dat writes. r=gds,ChiakiI added some IO error handling for writing to popstate.dat.
How the other reported errors hang together or not, I can't say. The client doesn't directly
interact with the the message db (mork) - but that is handled by the sink.
Thank you Magnus. This is great.
Yeah, I was thinking that popstate.dat might be the next suspicion, so this certainly helps.
I am going to run a few tryserver runs and will post a WIP patch if they build on tryserver, esp. on Windows and MacOSX.
At least, the incorrect double free is now gone and mork does not crash.
Whether they behave sanely is another matter.
Oops, I forgot to apply this patch locally before the tryserver jobs were submitted, but the avoid double-free code will be checked by the running treeserver runs.
After shufflling the patches, I think there was a syntax error somehow.
So I am going to resubmit the jobs after applying this patch locally.
| Reporter | ||
Comment 20•2 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #18)
Created attachment 9395497 [details]
Bug 1887532 - Handle gracefully failed popstate.dat writes. r=gds,ChiakiI added some IO error handling for writing to popstate.dat.
How the other reported errors hang together or not, I can't say. The client doesn't directly
interact with the the message db (mork) - but that is handled by the sink.
I found Magnus's patch is effective for simple dowload of messages (without filter move.)
The following is the description of test that finally showed the
behavior I saw several years ago. A duplicate download of a single
message or two is tolerable considering what happened.
- A test to download messages only (without filter move/copy).
I sent many e-mails in a particular order shown below.
The network is down, and the message "Downloading message 49 of 216" got
stuck while the network disruption continued. I simulated the I/O
error by disabling the network interface to the NFS server where the
mail folders are stored.
1 /var/tmp/12C is sent
2 /var/tmp/6C is sent
3 /var/tmp/11C is sent
4 /var/tmp/5C is sent
5 /var/tmp/10C is sent
6 /var/tmp/4C is sent
7 /var/tmp/9C is sent
8 /var/tmp/3C is sent
9 /var/tmp/8C is sent
10 /var/tmp/2C is sent
11 /var/tmp/7C is sent
12 /var/tmp/1C is sent
13 /var/tmp/12C is sent
14 /var/tmp/6C is sent
15 /var/tmp/11C is sent
16 /var/tmp/5C is sent
17 /var/tmp/10C is sent
18 /var/tmp/4C is sent
19 /var/tmp/9C is sent
20 /var/tmp/3C is sent
21 /var/tmp/8C is sent
22 /var/tmp/2C is sent
23 /var/tmp/7C is sent
24 /var/tmp/1C is sent
25 /var/tmp/12C is sent
26 /var/tmp/6C is sent
27 /var/tmp/11C is sent
28 /var/tmp/5C is sent
29 /var/tmp/10C is sent
30 /var/tmp/4C is sent
31 /var/tmp/9C is sent
32 /var/tmp/3C is sent
33 /var/tmp/8C is sent
34 /var/tmp/2C is sent
35 /var/tmp/7C is sent
36 /var/tmp/1C is sent
37 /var/tmp/12C is sent
38 /var/tmp/6C is sent
39 /var/tmp/11C is sent
40 /var/tmp/5C is sent
41 /var/tmp/10C is sent
42 /var/tmp/4C is sent
43 /var/tmp/9C is sent
44 /var/tmp/3C is sent
45 /var/tmp/8C is sent
46 /var/tmp/2C is sent
47 /var/tmp/7C is sent
48 /var/tmp/1C is sent
49 /var/tmp/12C is sent <<<< An error was introduced.
50 /var/tmp/6C is sent <<<< The internal error processing persisted?
51 /var/tmp/11C is sent
52 /var/tmp/5C is sent
53 /var/tmp/10C is sent
54 /var/tmp/4C is sent
55 /var/tmp/9C is sent
56 /var/tmp/3C is sent
...
... actually many more continues.
... different messages were sent the same number of times.
Message"Downloading message 49 of 216" got stuck while the network disruption continued.
After the network error caused TCP timeout, CC-TB recognized it due to
the error return from the low-level I/O routine (Write()) and error dialog appeared.
Then I re-enabled the network interface, and for a brief moment, I see message
"Downloading message 50 of 216" which disappeared very quickly. (This is something I need to investigate. I/O handling is done by the main thread only, and thus while the main thread got stuck due to
TCP timeout, many other threads got stuck too(?) and many things happen at the same time after the error message dialog is thrown away by my hitting OK.
Now the original download is aborted (abort mail delivery) midway.
NOTE: Until a few days ago, C-C TB used to get stuck in the kernel or crashed in itself
(double free, etc) quite often after this step.
So I could not continue from this point on reliably.
But this is now taken care of by a patch I am about to post in Bug
1887532 (C-C TB crashes when an I/O error is simulated). Without the
patch, I could not proceed very reliably from this point on. I wonder
why the bug in mork surfaced lately. Maybe the subtle timing difference
between now and several years ago where I checked the I/O error
extensively?
Anyway, THEN, I manually continued download (Please recall the network
interface to the NFS server had been enabled.).
After all the downloaded finished, I counted the # of different
messages (with different timestamps in the subject line when it was
inserted at the sent time)). They ought to match.
I found ONLY THE MESSAGES for which C-C experienced the network error
(and after-effect of error processing?) are duped.
(I checked messages in subfolders also, but they had no messages
because filter move/copy is not invoked in this test.)
Message appears how many times...
checking /home/ishikawa/REMOTE-NFS/Maildir
============================================= folder folder-C
/home/ishikawa/REMOTE-NFS/Maildir/folder-C
============================================= folder folder-D
/home/ishikawa/REMOTE-NFS/Maildir/folder-D
============================================= folder folder-E
/home/ishikawa/REMOTE-NFS/Maildir/folder-E
============================================= folder folder-F
/home/ishikawa/REMOTE-NFS/Maildir/folder-F
============================================= folder Inbox
/home/ishikawa/REMOTE-NFS/Maildir/Inbox
= 10C 22
= 11C 22
= 12C 23 * Dup ! GREAT the one that got stuck! (msg # 49)
= 1C 22
= 2C 22
= 3C 22
= 4C 22
= 5C 22
= 6C 23 * Dup ! The one immediately after the network was re-enabled, (msg # 50)
I think because of the way error dialog appears in the middle of
error, I think the error processing was still underway internally.
= 7C 22
= 8C 22
= 9C 22
So two messages that were affected by network I/O errors were downloaded twice.
But this is much better than losing many or so many dupes.
OK, GREAT. Downloading without filter move works great with Magnus's patch.
(This is with my local buffered write patch.)
| Reporter | ||
Comment 21•2 years ago
|
||
Now what about the mixture of message download mixed with filters move.
Here I found really mysterious loss of e-mails in this case.
I have no idea of the error pattern yet.
This may have something to do with
Bug 1890253 (Compact does not work for C-C TB POP3. Inbox does not get compacted although some messages were moved during download.)
That is, even under normal conditions, C-C TB leaves the space of
message moved from Inbox to the other subfolders.
Something is not quite right in the code right now.
I will re-visit the mixed download case (simple download of messages and
messages that are moved by filters) after Bug 1890253 is fixed.
| Reporter | ||
Comment 22•2 years ago
|
||
So based on my observation in comment 20, the Magnus's patch should go in now IMHO.
Please note that I tested Magnus's patch with my other patches, and I
have not tested it agains the current pristine C-C TB. However, the
code is rather straightforward and under nomal conditions, it should
not cause any harm. I have already submitted a few treeserver test
jobs and it did not seem to cause any problems even without my
buffered write patches. So I don't see the problem.
BTW, with Magnus's patch,
I notice a very subtle bug in my local patches which
caused the message in https://searchfox.org/comm-central/source/mailnews/base/src/MboxMsgOutputStream.cpp#303
NS_WARNING("Failed to roll back mbox file");
to print.
After seeing this, I was able to fix a very subtle dormant problem in my patch locally.
I think the bug was here for a year or two.
So many patches have begun to affect each other and produce a long standing
fix for the I/O error handling in a nice way.
But we have a long way to go including the mysterious loss of messages when
some messages are moved by message filters to other subfolders when an error hits.
| Reporter | ||
Comment 23•2 years ago
|
||
My take on this right now is based on my comment
https://bugzilla.mozilla.org/show_bug.cgi?id=1887532#c16
I modified the pseudo code slightly:
Problem with the current C-C code is that checkpointing and rollback of important meta data are not
explicitly mentioned IMHO.
At least for POP3 download, the transactional handling of the message body ONLY is done (but buggy before).
I cannot say for other meta data such as the one handled by MORK.
The following SHOULD BE the POP3 download transaction overview
(Maildir may be similar I don't know much about IMAP case. )
I suspect that when messages are moved, some meta data are not handled properly
in the face of error condition.
Start a sequence of mail delivery
if no more message then goto success
while there is message to download
// Checkpoint step
before starting the next message download
Save mork file (.msf) if it is not done by mork itself (not sure if
it is done by mork automagically?)
Save other meta information (header database is in .msf (correct?)
popstate.dat is now saved (and restored) by Magnus's patch.
if each of the save step fails
then
display error, and goto abort
endif
Not sure what is done regarding the transaction rollback.
Is there a rollback feature in MORK?
! Considering the file system full case, I think it is best to
! handle that possibility to let Mork modifies a new copy of .msf files.
! Set up so that mork UPDATES A COPY the mork file (.msf) during
! download (and possible move/copy by filtering) instead of the real
! .msf file that should persist before and after the download
! failure.
! This preparation step may fail. If so, goto abort.
(Not sure if this is done anywhere inside MORK.
nsPop3Sink.cpp should be doing something like this according to Magnus.
Someone needs to investigate.)
Start download
if failure of download or filtering, etc. then goto abort.
if download and filtering, etc. succeeded,
then
replace old .msf file with the updated mork file.
do the same for other old meta data including popstate.dat with the
updated meta data file.
if replacement step fails goto abort
endif
if all of the above steps OK, then repeat for the next message download. endif
endwhile
abort: abort mail delivery altogether, and
make sure that the meta data files are all in the checkpointed state now
exit
success: download of all the messages succeeded.
exit
How to mix the message body handling itself in the above process is also important.
We have a long way to go, but the simple case is now handled more or less satisfactory manner.
| Reporter | ||
Comment 24•2 years ago
|
||
Mark the structure that contains the FILE structure properly inactive
and not open even if the fclose() operation itself failed.
This avoids the double free and segfaults.
Also make the reading of unexpected characters an error, not a warning.
This happens so often in a corrupted mork data left over from an I/O
error such as artificially introduced network error to the remote CIFS
and/or NFS server where the message folders are stored during POP3
error test. Having to cope with warning messages (maybe 512 times?)
was not very productive, and the incorrect data IS SIGNALING AN ERROR
after all.
Added a few debug statements where error from Write() and other
low-level I/O function is not checked. Can Mork get away from checking
the low-level I/O errors? I don't think so.
Updated•2 years ago
|
| Assignee | ||
Updated•2 years ago
|
Comment 25•2 years ago
|
||
Pushed by john@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/1d090eae9aa5
Handle gracefully failed popstate.dat writes. r=gds,Chiaki
Updated•2 years ago
|
| Reporter | ||
Comment 26•2 years ago
|
||
I am checking the legacy code segfaulting in mork code when an I/O error is simulated by disabling the network interface that talks to a remote NFS server where the messages is stored.
I had to fix the code in Bug 1890253 because I use an external program to scan relevant folders to see
if all the e-mails I sent are received and the content is as it should be.
If Inbox has logically unused space that contains the image of an old message, that would be counted falsely by the check program. (I don't try to parse the .msf mork file at all.)
Now, the download to Inbox only is handled rather gracefully in the face of a simulated error, but the
multiple downloads of messages some of which are moved to other folders from Inbox resulted in
loss of many messages. some times.
Now I have begun to suspect that the mork gets quite confused (?) and maybe we are discarding/truncating folders
too eagerly (by looking at the offset of previously downloaded message (many messages apart) and thus
we lose the messages downloaded thus far. <--- This is a pure conjecture, but otherwise I cannot explain the wholesale deletion of messages which I SOMETIMES observe.
The symptom is obviously different based on the timing.
I observed another strange issue that the download could not continue after an I/O error.
This happened relatively few times, but once this happens, I think I have to quit C-C TB and repair folders and
only then I could continue.
(I have an additional patch for this, but since it is very much timing dependent, I need more testing
to say, the additional patch fixed the problem.)
Due to the way the network is disabled externally from a shell prompt currently, I cannot micromanage the
disabling of network interface deterministically and so reproducing the condition which forces C-C TB to lose many messages has been difficult.
I am now tempted to either
- insert a code in C-C TB locally for debugging purposes to
tweak the network interface up/down at the entry of key functions, or even - insert a code to do so between each line of related functions to
simulate the error at key positions.
Otherwise, I don't think I can re-create the condition where many messages are lost in one error easily.
I am thinking over it now.
The idea goes like this.
A macro inserted is expanded to a call to system() function that executes a bash script with a
filename and line number of its location.
In the shell script, it looks at an environment variable to see if the file matches an externally specified filename, and if the line number equals to an externally specified number.
If so, the network is disabled, and thus from the next operation to read/write to the NFS server will return an error after TCP timeout. Thus I can manage to create an error condition at the resolution of either at a function entry or
line by line execution.
From what I have observed, the many message loss situation seems to occur when a message is moved to a folder.
So I suspect I should insert such macros in the related functions in the source files and simulate an error...
As I mentioned previously, simple download only to Inbox is now handled rather well. Maybe a duplication a message or two, that were not properly put into the Inbox folder. It is better than losing a message or segfaulting.
Updated•1 year ago
|
Comment 27•1 year ago
|
||
Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/870b6cedf710
Fix crash due to re-use closed FILE* and quit early on incorrect input. r=mkmelin,BenC
Comment 28•1 year ago
|
||
Gently suggesting we should close this very soon so the bug doesn't cross esr boundaries. (it currently has a target milestone of 126)
Is there more to do for this crash?
| Assignee | ||
Comment 29•1 year ago
|
||
Yep, let's continue in other bugs if needed.
| Reporter | ||
Comment 30•1 year ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #29)
Yep, let's continue in other bugs if needed.
Thank you for closing this.
I have been trying to test the I/O errors using sikulix to simulate human operations via menu/mouse when the
simulated (transient NFS) I/O error is introduced at function entries. It has taken me almost three weeks to get things straightened out, but I feel the end is near. I think we no longer see the crash in the following scenario.
Download four e-mails. One of them is deleted (moved to trash) by a filter rule.
The other threes are copied to different folders by filter rules.
Without the patch, we saw crashes. But so far in about 400- test scenario cases, I have not seen crashes.
OTOH, I have noticed very strange error messages that may confuse users, but I will report them in different bugzilla.
The 400- tests have not finished yet (I think there are still 30 cases or more. I let it run in the background.)
Description
•