Closed Bug 1355350 Opened 3 years ago Closed 3 years ago

thunderbird 52: big attachments opened/saved incompletely when stored in IMAP folder not synchronised for offline use

Categories

(MailNews Core :: Networking: IMAP, defect, critical)


Tracking

(thunderbird_esr52 53+ fixed, thunderbird53 wontfix, thunderbird54 fixed, thunderbird55 fixed)

RESOLVED FIXED
Thunderbird 55.0

People

(Reporter: jmarc.thiault+bugzilla, Assigned: jorgk)

References

Details

(Keywords: dataloss, regression)

Attachments

(6 files, 3 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
Build ID: 20170323105023

Steps to reproduce:

Thunderbird 52 stable release under Windows 10 64-bit (1607 or 1703 releases):
when you open a large attachment, it is truncated to a fixed size of 2985 KB and is then unreadable.

If you save the attachment instead, the size is correct and it opens successfully.
Very annoying bug.


Actual results:

Large attachments are truncated to 2985 KB when opened directly.


Expected results:

Attachment is not truncated and opens successfully.
Component: Mail Window Front End → Attachments
Product: Thunderbird → MailNews Core
Version: 52 Branch → 52
Looks like you're using an IMAP account for the received message.

So I've sent an image of 7 MB to an IMAP account. If the containing folder is not synchronised for offline use, the picture indeed opens only partially and is also saved only partially.

In a synchronised folder everything is fine. What do you mean by "if you ask for download attachment"?

Looks like a regression from bug 1021843 or bug 629738. I tested it back then, but perhaps not with very big attachments.
Severity: normal → critical
Status: UNCONFIRMED → NEW
Ever confirmed: true
Yes, I use an IMAP account, but not synchronized for offline use.

When a mail is displayed, you have some options: either open the attachment directly or store it somewhere.

When opening it directly, the bug arises.
When storing it, then going to the storage directory and double-clicking on the saved file, the size is correct and it opens successfully.

This bug does not arise on Thunderbird 45.8.

best regards
Thanks for confirming. I had a case where the attachment was also saved incompletely. I'll look into it.
Keywords: dataloss
Summary: thunderbird 52 big attachments stripped when opened → thunderbird 52: big attachments opened/saved incompletely when stored IMAP folder not synchronised for offline use
Summary: thunderbird 52: big attachments opened/saved incompletely when stored IMAP folder not synchronised for offline use → thunderbird 52: big attachments opened/saved incompletely when stored in IMAP folder not synchronised for offline use
Debug patch, let's see where the problem is.
OK, here comes the debugging on a 7 MB picture which, base64-encoded, grows to 10 MB:

First access to the message: the message is not in the cache, so it is retrieved from the server and written into the cache.
=== nsImapMockChannel::OpenCacheEntry |imap://info%40hostalpancheta%2Ees@mail.ho
=== Calling AsyncOpenURI() on entire message
=== nsImapMockChannel::OnCacheEntryAvailable, part (0)
=== |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E
=== nsImapMockChannel::OnCacheEntryAvailable: WRITE END

Second access to the cache. Surprise: the message is 9.7 MB in size, but the cache entry only has 4 MB, so the cache entry is rejected.
=== nsImapMockChannel::OpenCacheEntry |imap://info%40hostalpancheta%2Ees@mail.ho
=== Calling AsyncOpenURI() on entire message to get part
=== nsImapMockChannel::OnCacheEntryAvailable, part (1)
=== |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E
=== nsImapMockChannel::OnCacheEntryAvailable: READ, size=4187812
=== reading full message, should use cache entry: 1
=== annotation: |Not Modified|
=== messageSize 9791526, entrySize 4187812, rejecting entry
=== End separate processing: should use cache entry: 0
=== After header check: should use cache entry: 0
=== Cache entry rejected
=== nsImapMockChannel::ReadFromMemCache: Returning 80004005
=== nsImapMockChannel::OnCacheEntryAvailable: READ not successful

Next access to the cache, to open or save the attachment.
Since the cache entry was too small on second access, the entry got removed from the cache
and needs to be retrieved a second time from the server.
=== nsImapMockChannel::OpenCacheEntry |imap://info%40hostalpancheta%2Ees@mail.ho
=== Calling AsyncOpenURI() on part, will write, first time
=== nsImapMockChannel::OnCacheEntryAvailable, part (0)
=== |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E
=== nsImapMockChannel::OnCacheEntryAvailable: WRITE END

While it is being retrieved, it is concurrently read. And something seems to
go wrong there, as the message part is now truncated.
=== nsImapMockChannel::OpenCacheEntry |imap://info%40hostalpancheta%2Ees@mail.ho
=== Calling AsyncOpenURI() on existing part, will read
=== nsImapMockChannel::OnCacheEntryAvailable, part (0)
=== |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E
=== nsImapMockChannel::OnCacheEntryAvailable: concurrent read, no size available
=== nsImapMockChannel::OnCacheEntryAvailable: READ, size=0
=== reading part, should use cache entry
=== End separate processing: should use cache entry: 1
=== After header check: should use cache entry: 1
=== Cache entry accepted

So two problems here:
1) Cache entry is too short.
2) Concurrent write/read doesn't seem to work, or perhaps too little is written.
As far as I remember my testing from summer 2016, there was never a case where the cache entry read back was too short.

I remember various cases where the cache was concurrently written and read, and that worked.

Honza, just two very silly questions:
1) There is no size limitation of what you can write into the cache, right?
2) The cache is designed for concurrent access, so while someone is writing to it, someone else
   can already read the data already written and when writing stops, everything can be read.

I guess there is an application error somewhere where we don't write sufficient data into the cache. That would explain why the entry is short, and also why it's too short upon concurrent write/read, where we have no check (since we can't get the size of something being written) and trust to get the full length.
Flags: needinfo?(honzab.moz)
(In reply to Jorg K (GMT+2) from comment #6)

https://dxr.mozilla.org/comm-central/rev/5a9983b82898ad03268862080953083e0b626be0/mailnews/imap/src/nsImapProtocol.cpp#9421

is definitely wrong, you must read asynchronously.  but read on, there's more info that you may find useful.


> As far as I remember my testing from summer 2016, there was never a case
> where the cache entry read back was too short.
> 
> I remember various cases where the cache was concurrently written and read,
> and that worked.
> 
> Honza, just two very silly questions:
> 1) There is no size limitation of what you can write into the cache, right?

- There is a 4GB limit per file, this is an implementation limitation

- There are also preferences to limit size per entry.  50MB for a single disk file, 5MB when stored only in memory - probably your case?

  controlled by browser.cache.disk.max_entry_size and browser.cache.memory.max_entry_size

- And there is an overall limit that should be 350MB by default.

  controlled by browser.cache.disk.capacity

If any of the above limits is overrun, we don't 'cut' entries.  We delete them completely.


- But there is also the write backlog limit.  If there is too much data (>10MB by default) scheduled for write and the disk is too slow to store them, we start throwing the pending data away.  You must handle errors returned by cache entry's output stream Write()!  It can throw NS_ERROR_OUT_OF_MEMORY in this case.  And as I'm looking into our code, we don't throw away the entry for you and leave it broken, I believe (no intention to fix this, tho).  You have to handle it by dooming the entry on write errors.

  controlled by browser.cache.disk.max_priority_chunks_memory_usage and browser.cache.disk.max_chunks_memory_usage


> 2) The cache is designed for concurrent access, so while someone is writing
> to it, someone else
>    can already read the data already written and when writing stops,
> everything can be read.

Exactly.  Just a note that when you want to read from the cache entry's input stream, the output stream has to be already opened or (after data were written) closed.  Otherwise, the input stream will report 0 size, as if no data had been written.

And, more importantly, you always MUST read the input stream asynchronously (see nsIAsyncInputStream).  When the write is still in progress, the input stream may return NS_ERROR_WOULD_BLOCK when read.  This is by contract in general for any asynchronous input stream in Gecko, not just limited to cache input stream impl.  You have to implement nsIInputStreamCallback, or you can use nsInputStreamPump to read the stream.

> 
> I guess there is an application error somewhere where we don't write
> sufficient stuff into the cache. That would explain why the entry is short
> and also why it's too short upon concurrent write/read where we have no
> check (since we can't get the size of something being written) and trust to
> get the full length.



In the log from the debug patch I can see the size == 0 is in the NS_ERROR_IN_PROGRESS branch.  It means the output stream of the entry is still open.
Flags: needinfo?(honzab.moz)
Thanks for the quick and detailed answer, much appreciated. There's plenty to investigate. At first sight, we're hitting the 5 MB limit and then don't handle the error well. A workaround would be to increase this limit to a reasonable message size.

Also note that we use an nsIStreamListenerTee to put the data into the cache and into the current channel listener (that's what the comment says).

For now just one thing:
https://dxr.mozilla.org/comm-central/rev/5a9983b82898ad03268862080953083e0b626be0/mailnews/imap/src/nsImapProtocol.cpp#9421
> is definitely wrong, you must read asynchronously.
    // Note: This will not work for a cache2 disk cache.
    // (see bug 1302422 comment #4)
    rv = in->Read(firstBlock, sizeof(firstBlock), &readCount);
    NS_ENSURE_SUCCESS(rv, rv);
Is this really wrong? We had this conversation in bug 1302422 comment #4. The Read() doesn't work if it's a disk cache, but it should be OK for a memory cache, right?

(Sorry, putting NI again since I don't know how closely you follow other people's bugs).
Flags: needinfo?(honzab.moz)
Oops, I wrote "The input stream on disk-backed cache entries" but it applies to the mem cache as well.  If the write is in progress, how would you expect the stream to be synchronously read in one operation?  The goal of the cache2 project was to make the cache completely non-blocking (jank-free).  Hence input must always be read asynchronously.
Flags: needinfo?(honzab.moz)
We read 100 bytes synchronously, so I hope that doesn't break the bank ;-)
The call is rejected outright for a disk-backed cache, but succeeds for a memory-based cache.
(In reply to Jorg K (GMT+2) from comment #10)
> We read 100 bytes synchronously, so I hope that doesn't break the bank ;-)
> The call is rejected outright for a disk-backed cache, but succeeds for a
> memory-based cache.

No, you always have to read asynchronously to be completely risk-free, sorry if that was not clear.  When entry->GetDataSize fails, the output stream is still open and you can get NS_ERROR_WOULD_BLOCK result from Read() and you also don't know the size of the actual data of the entry.
OK, but I have bigger problems than the "unsafe" read right now. There seem to be logic errors somewhere so the message gets truncated.

I've just tested with a 2 MB attachment and everything works fine. My debug shows when retrieving the data from the cache:

=== nsImapMockChannel::OnCacheEntryAvailable: READ, size=2465510
=== reading full message, should use cache entry: 1
=== annotation: |Not Modified|
=== messageSize 2465510
=== End separate processing: should use cache entry: 1
=== After header check: should use cache entry: 1
=== Cache entry accepted
=== nsImapCacheStreamListener::OnDataAvailable: Count = 2465510

2.5 MB handled everywhere in one chunk.

I haven't found the spot where we write to the cache; as I said in comment #8, we use an nsIStreamListenerTee.
Hello,

very often the direct open is buggy; storing the attachment and then opening it is good

but in my last test with an mp4 attachment of 3.6 MB:
    direct open is good, VLC reads it and the size in the temp directory is 3.6 MB,
    but when storing it, the size is truncated to 89 bytes?????

Best regards
(In reply to Jorg K (GMT+2) from comment #12)
> OK, but I have bigger problems than the "unsafe" read right now. There seem
> to be logic errors somewhere so the message gets truncated.
> 
> I've just tested with a 2 MB attachment and everything works fine. My debug
> shows when retrieving the data from the cache:
> 
> === nsImapMockChannel::OnCacheEntryAvailable: READ, size=2465510
> === reading full message, should use cache entry: 1
> === annotation: |Not Modified|
> === messageSize 2465510
> === End separate processing: should use cache entry: 1
> === After header check: should use cache entry: 1
> === Cache entry accepted
> === nsImapCacheStreamListener::OnDataAvailable: Count = 2465510
> 
> 2.5 MB handled everywhere in one chunk.
> 
> I haven't found the spot were we write to the cache, as I said in comment
> #8, we use a nsIStreamListenerTee.

Have a log for 'cache2:5' module and grep for CacheFileOutputStream::Write.
Component: Attachments → Networking: IMAP
Thanks Honza.

First I noticed:
[5344] WARNING: Write failed (non-fatal): file c:/mozilla-source/comm-central/mozilla/xpcom/io/nsInputStreamTee.cpp, line 179

Now the cache debugging shows:
  CacheFileOutputStream::Write() - Wrote 15984
259 times, that's 4139856, so approx. what was reported in comment #5 (size=4187812).
Then I see:
  CacheFileOutputStream::Write() - Entry is too big, failing and dooming the entry.
as we already expected.
https://dxr.mozilla.org/mozilla-central/rev/ce69b6e1773e9e0d0a190ce899f34b1658e66ca4/netwerk/cache2/CacheFileOutputStream.cpp#102

So the Tee then *silently* drops any further writing there, so we don't even notice that something went wrong.

I don't see that the entry was really doomed, since when we request it later, it gets delivered to OnCacheEntryAvailable(), but with the shorter size of 4187812 bytes.

So is it really doomed, Honza?

Increasing both browser.cache.memory.max_entry_size and browser.cache.disk.max_chunks_memory_usage to 20000 doesn't make any difference, it still gives up after about 4 MB, not even increasing to 50000 makes any difference.

So so far this raises two questions:
1) Is the entry that fails to be written because it's becoming too big really doomed?
   If so, why do I get it delivered to OnCacheEntryAvailable()?
2) Why can't I increase the threshold? OK, I could study the cache2 code ;-)

So now we at least understand that retrieving the message part the first time around doesn't place it into the cache properly. It's unfortunate that the Tee silently ignores that. We can increase the cache capacity to handle bigger attachments.

However, we detect the fact that the cache entry is no good and proceed to reading the message again from the IMAP server. We also doom the entry.

Then, when someone wants to open the attachment, we test for existence of the complete message in the cache. The answer is "no" since we've just doomed that entry. So now we go for the part, as can be seen in the debug:
=== Calling AsyncOpenURI() on part, will write, first time
That's been fetched from the IMAP server and placed into the cache, but somehow, only about 65 KB are written, I see:
"Wrote 15984 bytes" four times, followed by "Wrote 1596 bytes" once, so 65532 bytes.

Then a second request for the part comes in; the cache entry now exists, in fact, it's still in the process of being written. We detect that and concurrently read it, but get an even shorter entry of 65 KB, which is accepted since, due to the concurrent read, the size can't be compared.

So while I'll dig into this second problem deeper, perhaps Honza can answer the questions above. I'll attach the log.
Flags: needinfo?(honzab.moz)
Attached file Cache log
Coming back to the concurrent write/read. That is triggered when a message contains an inline attachment (image) and a non-inline attachment (like a PDF) (or only inline attachments but showing them is turned off) on a server that supports part fetching. This was tested and documented in bug 629738 comment #31.

The debug looks like this in that case:
Fetching entire message from cache first:
=== nsImapMockChannel::OpenCacheEntry |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E.INBOX.AAA%20No%20local%20storage%3E2/;section=2?part=1.2&filename=test.pdf|
=== Calling AsyncOpenURI() on entire message to get part
=== nsImapMockChannel::OnCacheEntryAvailable, part (1)
=== |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E.INBOX.AAA%20No%20local%20storage%3E2|
=== Entry is not complete and cannot be used

Entry no good, so fetching part:
Write first:
=== nsImapMockChannel::OpenCacheEntry |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E.INBOX.AAA%20No%20local%20storage%3E2/;section=2?part=1.2&filename=test.pdf|
=== Calling AsyncOpenURI() on part, will write, second time
=== nsImapMockChannel::OnCacheEntryAvailable, part (0)
=== |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E.INBOX.AAA%20No%20local%20storage%3E2?part=1.2&filename=test.pdf|
=== nsImapMockChannel::OnCacheEntryAvailable: WRITE END

Concurrent read:
=== nsImapMockChannel::OpenCacheEntry |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E.INBOX.AAA%20No%20local%20storage%3E2?section=2?part=1.2&filename=test.pdf&type=application/pdf&filename=test.pdf|
=== Calling AsyncOpenURI() on existing part, will read
=== nsImapMockChannel::OnCacheEntryAvailable, part (0)
=== |imap://info%40hostalpancheta%2Ees@mail.hostalpancheta.es:993/fetch%3EUID%3E.INBOX.AAA%20No%20local%20storage%3E2?part=1.2&filename=test.pdf|
=== nsImapMockChannel::OnCacheEntryAvailable: concurrent read, no size available
=== nsImapMockChannel::OnCacheEntryAvailable: READ, size=0
=== reading part, should use cache entry
=== End separate processing: should use cache entry: 1
=== After header check: should use cache entry: 1
=== Cache entry accepted

So that worked back in September 2016 and still works, but now let's try this with a >5 MB attachment, when the cache will reject the entry. Coming up in the next comment.
OK, with a 7 MB attachment this happens.

Part is requested twice, first request fills the cache, second request reads from it, I see:
WARNING: Write failed (non-fatal): file c:/mozilla-source/comm-central/mozilla/xpcom/io/nsInputStreamTee.cpp, line 179
and
CacheFileOutputStream::Write() - Entry is too big, failing and dooming the entry.
I don't see any read error.

My PDF reader complains about a broken document; that's pretty clear since it got truncated.

The second time I open the attachment from the message I get:

=== nsImapMockChannel::OpenCacheEntry |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E9/;section=3?part=1.3&filename=Steeplechase%20CDcatalog.pdf|
=== Calling AsyncOpenURI() on existing part, will read
=== nsImapMockChannel::OnCacheEntryAvailable, part (0)
=== |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E9?part=1.3&filename=Steeplechase%20CDcatalog.pdf|
=== nsImapMockChannel::OnCacheEntryAvailable: READ, size=4187799
=== reading part, should use cache entry
=== End separate processing: should use cache entry: 1
[9852] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8052000F: file c:/mozilla-source/comm-central/mozilla/netwerk/cache2/CacheEntry.cpp, line 1183
[9852] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8052000F: file c:/mozilla-source/comm-central/mailnews/imap/src/nsImapProtocol.cpp, line 9447
=== nsImapMockChannel::OnCacheEntryAvailable: READ not successful
So the read didn't work, although the truncated entry was presented to OpenCacheEntry(). So we doom the entry and read the data again from the IMAP server.

The second request now doesn't find the entry in the cache, so it also reads it again from the server at the same time putting it into the cache, which fails again due to the size.
=== nsImapMockChannel::OpenCacheEntry |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E9?section=3?part=1.3&filename=Steeplechase%20CDcatalog.pdf&type=application/pdf&filename=Steeplechase%20CDcatalog.pdf|
=== Calling AsyncOpenURI() on entire message to get part
=== nsImapMockChannel::OnCacheEntryAvailable, part (1)
=== |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E9|
=== Entry is not complete and cannot be used
=== nsImapMockChannel::OpenCacheEntry |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E9?section=3?part=1.3&filename=Steeplechase%20CDcatalog.pdf&type=application/pdf&filename=Steeplechase%20CDcatalog.pdf|
=== Calling AsyncOpenURI() on part, will write, second time
=== nsImapMockChannel::OnCacheEntryAvailable, part (0)
=== |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E9?part=1.3&filename=Steeplechase%20CDcatalog.pdf|
=== nsImapMockChannel::OnCacheEntryAvailable: WRITE END

So there are various problems here:
1) How can the cache entry size be increased?
   Increasing browser.cache.memory.max_entry_size and browser.cache.disk.max_chunks_memory_usage
   doesn't seem to have an effect.
2) After an entry is rejected due to its size, what state is it left in?
   The log says it's doomed, but requesting the entry again it seems to be presented to
   OnCacheEntryAvailable() as existing with the shorter size.
3) Concurrent read doesn't seem to be a great idea since when reading, I don't
   know that the entry will ever be written successfully, it might fail due to excessive size.

3) is the part of the code I can control. I can't control 1) and I also don't see the cache write error since it's eaten silently in the Tee.

I'll run this again with some better debugging.
OK, slightly changed the debug.

Completely fetch a message from the server, try to cache it, which fails due to size, then retrieve it from the cache, gives:
=== nsImapMockChannel::OpenCacheEntry |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E2?part=1.2&type=image/jpeg&filename=Stephanie.jpg|
=== Calling AsyncOpenURI() on entire message to get part
=== nsImapMockChannel::OnCacheEntryAvailable, new=0 <==== Not new but truncated!
=== nsImapMockChannel::OnCacheEntryAvailable, get part from entire message=1
=== |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E2|
=== nsImapMockChannel::OnCacheEntryAvailable: READ, size=4187812
=== reading full message, should use cache entry: 1
=== annotation: |Not Modified|
=== messageSize 9791526, entrySize 4187812, rejecting entry

So as I noticed before and can now confirm by dumping out the 'new' parameter:
The "defective" entry *was* made available as existing/non-new.

That's point 2) from the previous comment.

Reading back from your comment #7 (quote):
===
You must handle errors returned by cache entry's output stream Write()! ... And as I'm looking into our code, we don't throw away the entry for you and leave it broken, I believe (no intention to fix this, tho). You have to handle it by dooming the entry on write errors.
===

On one hand, you should make your logging less confusing: "failing and dooming the entry" is just not correct.

On the other hand, we have a problem: We don't see the write error, it's (not) handled in the Tee:
  Write failed (non-fatal): file c:/mozilla-source/comm-central/mozilla/xpcom/io/nsInputStreamTee.cpp

So I'm sorry about repeating my results over and over again, here - hopefully - for the last time:

1) We need to increase the max. entry size.
   Increasing browser.cache.memory.max_entry_size and browser.cache.disk.max_chunks_memory_usage
   doesn't seem to have an effect.
2) After an entry is rejected due to its size, it is left broken in the cache.
   When requested again, it gets delivered as existing, but truncated.
   Our problem: Write failure happens in the Tee, we don't notice it.
   With luck, when reading the entire message, we notice it later based on the size mismatch.
3) Concurrent read doesn't seem to be a great idea since when reading, I don't
   know that the entry will ever be written successfully, it might fail due to excessive size.

Let me try to address point 3) now.
Sorry, I have to repeat this again:
2) After an entry is rejected due to its size, it is left broken in the cache.
   When requested again, it gets delivered as existing, but truncated.
   Our problem: Write failure happens in the Tee, we don't notice it.
   With luck, when reading the entire message, we notice it later based on the size mismatch.

We seem to have more luck. Even though the entry is presented as non-new, we can't read from it; we get
[9440] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8052000F: file c:/mozilla-source/comm-central/mozilla/netwerk/cache2/CacheEntry.cpp, line 1183
So you're not dooming it, but somehow it's doomed anyway, since it can't be read.

So, I'll attach a patch for item 3) and if we could then increase the size as per item 1) we're done here ;-)
Honza, is this the best way to stop concurrent read?

With this little patch alone, all the problems go away since half-written cache entries are also detected when read later and then doomed.
It would still be nice to increase the size of the entries.
Damn, another repetition of what I already said ;-)

Reporter, would you be interested to test a Daily version of Thunderbird to check that the issue has been fixed?
Flags: needinfo?(petitlou60.interne05)
Attachment #8858898 - Flags: feedback?(honzab.moz)
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
Comment on attachment 8858898 [details] [diff] [review]
1355350-stop-concurrent-read.patch (v1).

Review of attachment 8858898 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/imap/src/nsImapProtocol.cpp
@@ +9188,5 @@
> +      // return NS_ERROR_IN_PROGRESS.
> +      int64_t size = 0;
> +      rv = entry->GetDataSize(&size);
> +      if (NS_FAILED(rv))
> +        break;

better is to call this method from OnCacheEntryCheck.  If it fails, just let the result of OnCacheEntryCheck be RECHECK_AFTER_WRITE_FINISHED.  That will make OnCacheEntryCheck be called again when write is done and delay OnCacheEntryAvailable until that.  If you don't want to wait up until write is finished, then this patch is probably correct.

Both are tho just quick fixes.  I'll try to find time to look at the logs and give more advises, but not sure I'll manage this week.
Attachment #8858898 - Flags: feedback?(honzab.moz)
(In reply to Jorg K (GMT+2) from comment #19)
> OK, slightly changed the debug.
> 
> Completely fetch a message from the server, try to cache it, which fails due
> to size, then retrieve it from the cache, gives:
> === nsImapMockChannel::OpenCacheEntry
> |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.
> No%20local%20storage%3E2?part=1.2&type=image/jpeg&filename=Stephanie.jpg|
> === Calling AsyncOpenURI() on entire message to get part
> === nsImapMockChannel::OnCacheEntryAvailable, new=0 <==== Not new but
> truncated!
> === nsImapMockChannel::OnCacheEntryAvailable, get part from entire message=1
> ===
> |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.
> No%20local%20storage%3E2|
> === nsImapMockChannel::OnCacheEntryAvailable: READ, size=4187812
> === reading full message, should use cache entry: 1
> === annotation: |Not Modified|
> === messageSize 9791526, entrySize 4187812, rejecting entry
> 
> So as I noticed before and can now confirm by dumping out the 'new'
> parameter:
> The "defective" entry *was* made available as existing/non-new.
> 
> That's point 2) from the previous comment.
> 
> Reading back from your comment #7 (quote):
> ===
> You must handle errors returned by cache entry's output stream Write()! ...
> And as I'm looking into our code, we don't throw away the entry for you and
> leave it broken, I believe (no intention to fix this, tho). You have to
> handle it by dooming the entry on write errors.
> ===
> 
> On one hand, you should make your logging less confusing: "failing and
> dooming the entry" is just not correct.

OK, if the entry is not doomed, then the log is incorrect, yes.

> 
> On the other hand, we have a problem: We don't see the write error, it's
> (not) handled in the Tee:
>   Write failed (non-fatal): file
> c:/mozilla-source/comm-central/mozilla/xpcom/io/nsInputStreamTee.cpp
> 
> So I'm sorry about repeating my results over and over again, here -
> hopefully - for the last time:
> 
> 1) We need to increase the max. entry size.
>    Increasing browser.cache.memory.max_entry_size and
> browser.cache.disk.max_chunks_memory_usage
>    don't seem to have an effect.

This is weird.  Sounds like a bug to me.  Have you tried both disk and memory variants of the pref? 

> 2) After an entry is rejected due to its size, it is left broken in the
> cache.
>    When requested again, it gets delivered as existing, but truncated.
>    Our problem: Write failure happens in the Tee, we don't notice it.
>    With luck, when reading the entire message, we notice it later based on
> the size mismatch.

This should be fixed, a broken entry, where we may not know the expected size every time, should not be left in the cache.

> 3) Concurrent read doesn't seem to be a great idea since when reading, I
> don't
>    know that the entry will ever be written successfully, it might fail due
> to excessive size.

Yes, I had to solve that as well for HTTP.  If the first channel writing the entry is interrupted, there is a code that readers have to attempt to load the rest from the server via a range request.  The thing is that the cache is primarily intended, designed and optimized for the HTTP usage.

> 
> Let me try to address point 3) now.
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #22)
> better is to call this method from OnCacheEntryCheck.  If it fails, just let
> the result of OnCacheEntryCheck be RECHECK_AFTER_WRITE_FINISHED.  That will
> make OnCacheEntryCheck be called again when write is done and delay
> OnCacheEntryAvailable until that.  If you don't want to wait up until write
> is finished, then this patch is probably correct.
Of course we can wait since we'd get our data from the server otherwise anyway.

However, waiting only makes sense if we're guaranteed to get the entry, which we already know we don't in case of the size violation.
So let's say we do that and the write fails in the end. OnCacheEntryAvailable() will then be called with new==true since the entry didn't get written successfully in the first place? So we're back to square one?

At least that would avoid a second fetch from the server on the second request if the first actually succeeds.

I'll try it.
Fixed comment.
Attachment #8858898 - Attachment is obsolete: true
The key is:
https://dxr.mozilla.org/mozilla-central/rev/c697e756f738ce37abc56f31bfbc48f55625d617/netwerk/cache2/CacheObserver.cpp#477

I stepped through it with the debugger; when not setting browser.cache.memory.capacity I get
sAutoMemoryCacheCapacity==33554432, and divided by eight I'm just above 4 MB. There you go.

Setting browser.cache.memory.capacity to 50000 allows cache entries up to about 6 MB.
Increasing to 100000 will finally let me handle my 10 MB attachment.
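As a quick sanity check of the arithmetic above (assuming, per the debugging session, that a single memory-cache entry may use at most 1/8 of the total capacity, and that browser.cache.memory.capacity is given in KB):

```python
# Sketch of the limit arithmetic; the 1/8 ratio is an assumption taken
# from stepping through CacheObserver.cpp, not a documented contract.
def max_entry_bytes(capacity_kb):
    return capacity_kb * 1024 // 8

# Default auto-sized capacity observed in the debugger: 33554432 bytes.
assert 33554432 // 8 == 4 * 1024 * 1024        # 4 MB default entry limit
assert max_entry_bytes(50000) == 6_400_000     # ~6 MB with capacity=50000
assert max_entry_bytes(200000) == 25_600_000   # ~25 MB with capacity=200000
```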

So now we know how to tweak the system to accept larger cache entries; in fact one can use this:
browser.cache.memory.capacity 200000 (so max. entry 25 MB) and
browser.cache.memory.max_entry_size -1 (unlimited).

So the reporter can set these values to fix the "very annoying bug" for now.
That's the other option. The read gets delayed until the write has finished.

With that patch I still observe a heavily truncated part:

First request: Writes to cache:
=== nsImapMockChannel::OpenCacheEntry |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E2?part=1.2&filename=Stephanie.jpg|
=== Calling AsyncOpenURI() on part, will write, first time
=== nsImapMockChannel::OnCacheEntryAvailable, new=1
=== nsImapMockChannel::OnCacheEntryAvailable, get part from entire message=0
=== |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E2?part=1.2&filename=Stephanie.jpg|
=== nsImapMockChannel::OnCacheEntryAvailable: WRITE END

Second request reads, now not concurrent any more:
=== nsImapMockChannel::OpenCacheEntry |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E2?part=1.2&type=image/jpeg&filename=Stephanie.jpg|
=== Calling AsyncOpenURI() on existing part, will read
=== nsImapMockChannel::OnCacheEntryAvailable, new=0
=== nsImapMockChannel::OnCacheEntryAvailable, get part from entire message=0
=== |imap://jorgk%40jorgk%2Ecom@mail.your-server.de:143/fetch%3EUID%3E.INBOX.No%20local%20storage%3E2?part=1.2&filename=Stephanie.jpg|
=== nsImapMockChannel::OnCacheEntryAvailable: READ, size=65536
=== reading part, should use cache entry
=== End separate processing: should use cache entry: 1
=== After header check: should use cache entry: 1
=== Cache entry accepted

However, only 65 KB of the part got cached, and they also successfully get read back, so it wasn't the case of a semi-doomed entry from which one can't read. Looks like something wrong in the application logic.
I meant to add: That effect was already described at the end of comment #15. So whether concurrent or not, after one failed request for the main message, the part requests seem to be successful but return truncated data.
Update release notes with:

Workaround: Set preference browser.cache.memory.max_entry_size to the value of -1 (unlimited) and create integer preference browser.cache.memory.capacity with the value of 200000 (200 MB, 25 MB per message).
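For reference, the workaround in user.js syntax (standard user_pref format, values as given above):

```js
// Workaround prefs from this bug: lift the per-entry size limit and
// enlarge the memory cache (capacity is in KB).
user_pref("browser.cache.memory.max_entry_size", -1);     // unlimited entry size
user_pref("browser.cache.memory.capacity", 200000);        // 200 MB, ~25 MB per message
```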
Jorg, looks like you found a solution.  You can always make this a default for comm-central if it suits your needs.

One question is, do you need to use the HTTP cache for this at all?  I assume the convenience is that requests to the same URL are coalesced by the cache API?  Otherwise, maybe the HTTP cache is not what you want to use here, it's optimized for HTTP code usage, not for general usage.  Did you think of rather using a simple hashtable of storage streams?

Also, if you need some changes to the limit handling (like new preferences or so) in the network cache code, you are welcome to write them.  I will gladly review them.  The only requirement is not to break the existing behavior.
(In reply to Honza Bambas (:mayhemer) from comment #31)
> Jorg, looks like you found a solution.
Far from it. I found how to set the preferences so we can handle sizes to 25 MB for now. Yes, the "stop" patch also seems to fix all visible issues.

However, I still need to land a patch to handle the concurrent read better, that is, improving the variant where the read is delayed rather than stopped. In this case I still seem to have an application issue where we only write 65 KB to the cache after some previous error.

Last but not least, you were going to fix some things:
Fix the confusing log message "Entry is too big, failing and dooming the entry.", or in fact really doom the entry, as per your comment #23: a broken entry, where we may not know the expected size every time, should not be left in the cache.

As I detailed above, we can detect such a broken entry for now in two ways:
1) By a size check: it's shorter than we expect.
2) When we have no size check, accept the entry and read it, we get:
   WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8052000F:
   file c:/mozilla-source/comm-central/mozilla/netwerk/cache2/CacheEntry.cpp, line 1183
   https://dxr.mozilla.org/mozilla-central/source/netwerk/cache2/CacheEntry.cpp#1182
So it looks like your internal structures get confused when writing to an entry is aborted due to a size overrun.

After all, when you reject further writes to the entry, you do some sort of dooming here:
https://dxr.mozilla.org/mozilla-central/rev/ce69b6e1773e9e0d0a190ce899f34b1658e66ca4/netwerk/cache2/CacheFileOutputStream.cpp#102

I can't answer the question as to why we use the HTTP cache. Traditionally TB used cache(1) and I switched it to cache2. This seemed to work until someone started pounding it with a lot of data, triggering unforeseen code paths.
(In reply to Jorg K (GMT+2) from comment #32)
> As I detailed above, we can detect such a broken entry for now in two ways:
> 1) By a size check, it's shorter then we expect.
> 2) When we have no size check, accept the entry and read it, we get:
>    WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8052000F:
>    file
> c:/mozilla-source/comm-central/mozilla/netwerk/cache2/CacheEntry.cpp, line
> 1183
>   
> https://dxr.mozilla.org/mozilla-central/source/netwerk/cache2/CacheEntry.cpp#1182

Honza, I've done a bit more digging:
https://dxr.mozilla.org/mozilla-central/rev/722fdbff1efc308a22060e75b603311d23541bb5/netwerk/cache2/CacheEntry.cpp#1182
calls CacheFile::OpenInputStream()
https://dxr.mozilla.org/mozilla-central/rev/722fdbff1efc308a22060e75b603311d23541bb5/netwerk/cache2/CacheFile.cpp#727
and that intentionally fails on broken entries.
So we *can* always detect broken entries.

So thanks for your support, for the moment I need to work on our internal logic. Summarising the findings so far:

First:
We're hitting the browser.cache.memory.max_entry_size limit but even earlier the browser.cache.memory.capacity limit.
We can work around that by setting browser.cache.memory.max_entry_size to -1 (unlimited) and creating the integer preference browser.cache.memory.capacity with the value of 200000.

Second:
We write to the cache via a Tee. We don't notice write errors, the Tee simply gives up. However, we can always detect broken entries based on their size or due to a read error later.

Third:
The concurrent read is not safe since we don't know that the entry will ever be written successfully. To fix this, we can delay the read until the write is finished with the patch in attachment 8858957 [details] [diff] [review].

Fourth:
I've always talked about message parts being accessed twice; that's how the concurrent read is triggered. Debugging shows that the first access only fetches 64 KB of the part. If the part is larger than 64 KB, subsequent accesses use the shortened part that was placed in the cache, and no matter how I solve the concurrent read, the data is short.

I have yet to get to the bottom of this. By the looks of it the logic only requests the first chunk of the part from the IMAP server to get basic information, like part headers. A second request then reads the full part. A hacky way to fix this is the patch in attachment 8858930 [details] [diff] [review]. We detect the concurrent read of the second request and fetch directly from the server ignoring the short cache entry. But that will bite us later and is also against the spirit of using the cache.

So the fourth issue needs more work.
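The "Second" point above can be illustrated with a toy model (illustrative names, not the actual Tee implementation): the tee copies data to the cache as a side effect and swallows write errors, so the consumer still sees the full message while the cache entry silently comes up short, detectable afterwards only by a size comparison.

```python
class BrokenCacheStream:
    """Toy cache stream that stops accepting data after max_size bytes."""
    def __init__(self, max_size):
        self.buf = b""
        self.max_size = max_size

    def write(self, chunk):
        if len(self.buf) + len(chunk) > self.max_size:
            return  # error swallowed, like the Tee giving up silently
        self.buf += chunk

def deliver_with_tee(message, cache_stream):
    delivered = b""
    for i in range(0, len(message), 4):   # deliver in small chunks
        chunk = message[i:i + 4]
        cache_stream.write(chunk)         # side-effect write to the cache
        delivered += chunk                # the listener still gets everything
    return delivered

msg = b"x" * 100
cache = BrokenCacheStream(max_size=40)
assert deliver_with_tee(msg, cache) == msg  # consumer sees the full message
assert len(cache.buf) < len(msg)            # but the cache entry is short
# The only reliable check afterwards: compare entry size to expected size.
```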
Attachment #8858930 - Attachment is obsolete: true
It seems safer to set up the part extraction in OnCacheEntryAvailable() to avoid messing with m_channelListener too early in the process when a concurrent read request arrives. This patch has no noticeable benefit, but better safe than sorry.
Comment on attachment 8859511 [details] [diff] [review]
1355350-shift-setup-part-extract.patch (v1).

Nice try, but wasn't right and wasn't necessary either.
Attachment #8859511 - Attachment is obsolete: true
Attachment #8858957 - Flags: review?(rkent)
I don't want to tell you how many hours I spent to find this one line that needs to be added.

Sometimes only exactly 64KB of a message part get fetched and cached. If we don't detect those cases, we end up with cache entries which are way too short.

It's quite tricky to test. First you need to view HTML+inline attachment and have it fail on cache entry overrun. Then double click the attachment. That requests the part twice, but the first time, only 64KB are fetched. The second request gets the 64KB from the first and your image is truncated.

If fetching the whole message had worked, this code path wouldn't have run since it would have just extracted the part from the entire message - tricky!
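A minimal sketch of the idea behind such a guard (hypothetical names, not the actual one-line patch): a cached part is only trustworthy if its size matches the size expected for the part, which rules out entries containing only the first 64 KB chunk.

```python
CHUNK_SIZE = 65536  # size of the first partial fetch seen in the logs

# Hypothetical guard, not the actual patch: accept a cached part only
# when its size matches the expected part size.
def cache_entry_usable(cached_size, expected_size):
    return cached_size == expected_size

assert not cache_entry_usable(CHUNK_SIZE, 6_300_000)  # truncated 64 KB entry
assert cache_entry_usable(6_300_000, 6_300_000)       # fully written entry
```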
Attachment #8859736 - Flags: review?(rkent)
Finally, lift the entry size limit to 15 MB so that cache failures become unlikely.
Attachment #8859752 - Flags: review?(rkent)
Like always, my debug patch, so you can see this stuff at work.
If it makes the review easier:
For 1355350-delay-concurrent-read.patch read comment #22.
For 1355350-doom-on-inteterrupted.patch see comment #36 (oops typo in patch name).
And for the sizes, see comment #27.
Comment on attachment 8859752 [details] [diff] [review]
1355350-lift-entry-size.patch (v1).

Review of attachment 8859752 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8859752 - Flags: review?(rkent) → review+
Comment on attachment 8859736 [details] [diff] [review]
1355350-doom-on-inteterrupted.patch (v1).

Review of attachment 8859736 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8859736 - Flags: review?(rkent) → review+
Comment on attachment 8858957 [details] [diff] [review]
1355350-delay-concurrent-read.patch (v1).

Review of attachment 8858957 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8858957 - Flags: review?(rkent) → review+
OK, I looked at the patches and reviewed the discussion and the changes. I can't claim to fully understand the changes, but they seem reasonable and fairly low risk. I am comfortable that sufficient effort was put into understanding this by Jorg, and we should just move forward.
Comment on attachment 8859752 [details] [diff] [review]
1355350-lift-entry-size.patch (v1).

Uplift for all three patches.

[Approval Request Comment]
Regression caused by (bug #): Bug 1021843, bug 629738
Attachment #8859752 - Flags: approval-comm-esr52?
Attachment #8859752 - Flags: approval-comm-beta+
Duplicate of this bug: 1360938
Duplicate of this bug: 1361278
petitlou, telman, jank, max,

Is your problem gone, with no ill effects, if you use the current nightly build from https://archive.mozilla.org/pub/thunderbird/nightly/latest-comm-central/
thunderbird-55.0a1.en-US.win32.installer.exe windows
thunderbird-55.0a1.en-US.linux-x86_64.tar.bz2 linux
Flags: needinfo?(tilman)
Flags: needinfo?(petitlou60.interne05)
Flags: needinfo?(max)
Flags: needinfo?(jan.kudrik)
Attachment #8859752 - Flags: approval-comm-esr52? → approval-comm-esr52+
The problem appears to be gone with Thunderbird 55.0a1 win32.
However I can't get Enigmail to work with that build.
This limits my ability to test since we use PGP encryption extensively.
Tested with Enigmail 1.9.6.1 offered on https://addons.mozilla.org/en-US/thunderbird/extensions/.
(Project homepage www.enigmail.net offers the same version.)
Flags: needinfo?(tilman)
Tilman, thanks for testing. It is very helpful to have these first hand reports. I hope others will also respond.

You are not having problems with enigmail when using 52.x, correct?
Can you be available to test enigmail for us when we build the next 52.x?
Flags: needinfo?(tilman)
(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #49)
> petitlou, telman, jank, max,
> 
> Is your problem gone, with no ill effects, if you use the current nightly
> build from
> https://archive.mozilla.org/pub/thunderbird/nightly/latest-comm-central/
> thunderbird-55.0a1.en-US.win32.installer.exe windows
> thunderbird-55.0a1.en-US.linux-x86_64.tar.bz2 linux

I no longer have this problem with both of these 55.0a1 (2017-05-08) builds you linked. Tried it just now and haven't noticed any problems.
Yes, Enigmail is working fine with both ESR 52 and 45.
I can test it with new builds, except from 2017-05-17 to 2017-05-24 as I'll be traveling.
Flags: needinfo?(tilman)
(In reply to Tilman Schmidt from comment #51)
> The problem appears to be gone with Thunderbird 55.0a1 win32.
> However I can't get Enigmail to work with that build.
> This limits my ability to test since we use PGP encryption extensively.
> Tested with Enigmail 1.9.6.1 offered on
> https://addons.mozilla.org/en-US/thunderbird/extensions/.
> (Project homepage www.enigmail.net offers the same version.)

I also use Enigmail 1.9.6.1 and I see no problem running it (encrypting and decrypting messages) under these Daily builds (win32 and linux-x86_64).
Flags: needinfo?(jan.kudrik)
Works fine here.
I see this issue marked as being resolved.
However, I just encountered it in 52.3.0 (32-bit) on Windows 10.

Originally, I encountered it with the stock (default) settings for browser.cache.memory.*
browser.cache.memory.capacity was already set to 200000 (200 MB).
Then, per comment 30 here, I set "preference browser.cache.memory.max_entry_size to the value of -1 (unlimited)". That didn't change anything.

The message in question has 6 attachments, including 4 jpegs and 2 pdfs.

Just in case, here some information about their sizes (as reported by alpine):
    [ Part 2, ""  Image/JPEG (Name: "Work_sheet_1_p_3.jpg") 259 KB. ]
    [ Cannot display this part. Press "V" then "S" to save in a file. ]


    [ Part 3, ""  Image/JPEG (Name: "Work_sheet_1_p_5.jpg") 312 KB. ]
    [ Cannot display this part. Press "V" then "S" to save in a file. ]


    [ Part 4, ""  Image/JPEG (Name: "HW_p_1.jpg") 353 KB. ]
    [ Cannot display this part. Press "V" then "S" to save in a file. ]


    [ Part 5, ""  Image/JPEG (Name: "Puzzle_p12.jpg") 643 KB. ]
    [ Cannot display this part. Press "V" then "S" to save in a file. ]


    [ Part 6, ""  Application/PDF (Name: ]
    [ "Geometry_for_6-7_pp_1-2.pdf") 615 KB. ]
    [ Cannot display this part. Press "V" then "S" to save in a file. ]


    [ Part 7, ""  Application/PDF (Name: ]
    [ "Geometry_problems_pp_1_2.pdf") 2.7 MB. ]
    [ Cannot display this part. Press "V" then "S" to save in a file. ]

At the bottom (attachment status), Thunderbird says: 6 attachments   at least 187 KB.
When expanded, it shows 46.7 KB for each JPEG, and doesn't report sizes for the PDFs.
If I am saving attachments to the local drive, PDFs are correct, but JPEGs are 46.7 KB (47,827 bytes) [as reported by Windows Explorer] each.

Right-click on the attachment -> Open; the pop-up window shows 6.3 MB size for the jpeg. It gets truncated to the same ~47 KB size.
The only way to open the complete attachment is to start forwarding the message. Then Thunderbird downloads all the attachments, and they are shown with correct sizes (and content) in the attachments panel. At that point, it is possible to save each of them by dragging from the Attachments panel to a Windows Explorer window (saved as nsmail-#.jpg, not the name displayed in the attachments panel).

One of the same JPEG attachments was sent earlier individually, and it was downloaded correctly.

This has been observed with the same message (forwarded [in alpine, in Thunderbird], bounced [in alpine]) with two different servers: a private dovecot (IMAP) server and Gmail's IMAP server.
Yes, if images (inline) are followed by PDFs (non-inline), we have a problem, see bug 1395620. Should be fixed in TB 52.4. You can see it fixed in a Daily:
http://ftp.mozilla.org/pub/thunderbird/nightly/latest-comm-central/
Flags: needinfo?(petitlou60.interne05)
Flags: needinfo?(max)
Argh!...

If I were to install a daily, which is 57.x, - I wouldn't get informed by TB about 52.4 being released, would I?

BTW, https://bugzilla.mozilla.org/show_bug.cgi?id=1399609 explains how labeling where a particular bug is fixed is very confusing for those who are not closely following this project.
It would be great, if "Fixed in" would be showing parallel branches as well, e.g. 
"Fixed in: 55.Beta, 52.2" or "57.daily, 56.1.Beta, 52.5.stable" or something like that.
(In reply to Igor R. from comment #60)
> If I were to install a daily, which is 57.x, - I wouldn't get informed by TB
> about 52.4 being released, would I?
Not really. If all goes well, TB 52.4 will come out at the end of September, early October. Also, TB 57 is the development version; we know that some add-ons which still work in TB 52 don't work in TB 57 any more.

We have some indicators to tell insiders when fixes are released, but it's hard to follow.