Closed Bug 1658345 Opened 5 years ago Closed 5 years ago

In "Fetch headers only" mode, downloaded messages are duplicated in each subsequent receiving session for profile stored on the cifs based NAS (network storage)

Categories

(MailNews Core :: Networking, defect)

defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1612414

People

(Reporter: artem.semenov, Unassigned)

Details

(Keywords: regression, regressionwindow-wanted)

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0

Steps to reproduce:

Debian 10:
Enable "Fetch headers only" mode for the mailbox. Send a message to this mailbox. Try to receive mail several times (but don't download the received message body).

Actual results:

Messages whose bodies have not been downloaded are duplicated after each receive attempt.
A screenshot is attached.

Expected results:

There should be no duplicates.

Component: Untriaged → Networking: POP
Product: Thunderbird → MailNews Core
Summary: In "Fetch headers only" mode, unloaded messages are duplicated in each subsequent receiving session → In "Fetch headers only" mode, downloaded messages are duplicated in each subsequent receiving session

I've found the cause: it is the mount options. The profile is stored on a CIFS-based NAS. Only the differences are shown:

doesn't work:

nosuid,nodev,noexec,vers=1.0,uid=1000,forceuid,gid=1000,forcegid,rsize=130048,wsize=65536

works correctly:

vers=default,uid=0,noforceuid,gid=0,noforcegid,rsize=1048576,wsize=1048576

I don't know which option is responsible, but previous versions of Thunderbird (68.8 for sure, 68.9 maybe, 68.10 not certain) worked fine with the option set that doesn't work now.

That's pretty interesting

(In reply to artem.semenov from comment #1)

I've found the cause: it is the mount options. The profile is stored on a CIFS-based NAS. Only the differences are shown:

doesn't work:

nosuid,nodev,noexec,vers=1.0,uid=1000,forceuid,gid=1000,forcegid,rsize=130048,wsize=65536

works correctly:

vers=default,uid=0,noforceuid,gid=0,noforcegid,rsize=1048576,wsize=1048576

I don't know which option is responsible, but previous versions of Thunderbird (68.8 for sure, 68.9 maybe, 68.10 not certain) worked fine with the option set that doesn't work now.

I have known for a while that some code in mozilla-central assumes that a file read call issued from within mozilla code will return all the requested data at once, and never checks whether a shorter amount was returned, e.g. due to a network packet size limit, a remote file system limitation, etc.
A remote network file system is free to return only a short chunk and expects the client to re-read the file stream to obtain the rest.
A typical example was code that reads JSON from a file. I don't know whether it has been fixed since I noticed it 4-5 years ago.
I have been trying to fix this "short read" issue for the last few years. However, the fixes are still in my local patch queue.
I wanted to first fix this failure in the low-level code (such as the one I explained above) used by POP3, and by IMAP to some extent.
That is when I noticed this short read error while testing POP3 operation against a profile stored on a local CIFS mount.
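
For illustration: the core of the issue is that POSIX read() may legally return fewer bytes than requested, so any caller that needs the full range must loop. Below is a minimal, self-contained sketch of that idea in plain POSIX C++. It is not the actual C-C patch; the name ReadFully is made up for this example.

#include <unistd.h>
#include <cerrno>
#include <cstddef>

// Read exactly `len` bytes unless EOF is reached earlier.
// Returns the number of bytes actually read, or -1 on a real I/O error.
ssize_t ReadFully(int fd, char* buf, size_t len) {
  size_t total = 0;
  while (total < len) {
    ssize_t n = read(fd, buf + total, len - total);
    if (n < 0) {
      if (errno == EINTR) continue;   // interrupted by a signal: retry
      return -1;                      // genuine error
    }
    if (n == 0) break;                // EOF: the file is shorter than requested
    total += static_cast<size_t>(n);  // short read: loop to get the rest
  }
  return static_cast<ssize_t>(total);
}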

Now, I suspect that there IS data that needs to be read as a whole and is larger than 130048 octets (rsize in the old setting) but still shorter than 1048576 octets (rsize in the new setting).
In the current mozilla-central and comm-central code, such a read probably ends up getting only the first 130048 octets with the old setting, and since the code fails to read the rest of the data, something goes very wrong. That is my guess.

I am right now trying to fix my source tree to cope with the recent directory change.
Once it is fixed, I will submit a tryserver job.
Once a binary with the fix for short read is available, I may ask you to check the voluminous log my patch produces, to see whether my guess is correct and TB needs to re-read from the remote (network) file system to obtain the rest of the data it requested.
Hmm, the voluminous logging output is produced only on Linux or Apple macOS, if I am not mistaken. TB under Windows fails to produce output sent to "stderr".

Sorry, I just read your message - there were problems with the mail server.

(In reply to ISHIKAWA, Chiaki from comment #3)

Once a binary with the fix for short read is available, I may ask you to check the voluminous log my patch produces, to see whether my guess is correct and TB needs to re-read from the remote (network) file system to obtain the rest of the data it requested.

Ok.

Alex, Magnus, note comment 1, and comment 2: "Now, I suspect that there IS data that needs to be read as a whole and is larger than 130048 octets (rsize in the old setting) but still shorter than 1048576 octets (rsize in the new setting). In the current mozilla-central and comm-central code, such a read probably ends up getting only the first 130048 octets with the old setting, and since the code fails to read the rest of the data, something goes very wrong. That is my guess."

Status: UNCONFIRMED → NEW
Component: Networking: POP → Networking
Ever confirmed: true
Flags: needinfo?(mkmelin+mozilla)
Flags: needinfo?(alessandro)
Summary: In "Fetch headers only" mode, downloaded messages are duplicated in each subsequent receiving session → In "Fetch headers only" mode, downloaded messages are duplicated in each subsequent receiving session for profile stored on the cifs based NAS (network storage)

Chiaki, how is this going?

(In reply to ISHIKAWA, Chiaki from comment #3)

...
In the current mozilla-central and comm-central code, such a read probably ends up getting only the first 130048 octets with the old setting, and since the code fails to read the rest of the data, something goes very wrong. That is my guess.

I am right now trying to fix my source tree to cope with the recent directory change.
Once it is fixed, I will submit a tryserver job.
Once a binary with the fix for short read is available, I may ask you to check the voluminous log my patch produces, to see whether my guess is correct and TB needs to re-read from the remote (network) file system to obtain the rest of the data it requested.

Flags: needinfo?(ishikawa)

Alex could check if it has something to do with bug 1612414.

Flags: needinfo?(mkmelin+mozilla)
Flags: needinfo?(alessandro)

(In reply to Magnus Melin [:mkmelin] from comment #7)

Alex could check if it has something to do with bug 1612414.

Which is why I had NI set.

Flags: needinfo?(alessandro)

I had noted it in bug 1612414.

Flags: needinfo?(alessandro)

(In reply to Wayne Mery (:wsmwk) from comment #6)

Chiaki, how is this going?

(In reply to ISHIKAWA, Chiaki from comment #3)

...
In the current mozilla-central and comm-central code, such a read probably ends up getting only the first 130048 octets with the old setting, and since the code fails to read the rest of the data, something goes very wrong. That is my guess.

I am right now trying to fix my source tree to cope with the recent directory change.
Once it is fixed, I will submit a tryserver job.
Once a binary with the fix for short read is available, I may ask you to check the voluminous log my patch produces, to see whether my guess is correct and TB needs to re-read from the remote (network) file system to obtain the rest of the data it requested.

I was able to compile TB locally and submitted try-comm-central build/test jobs.
Unfortunately, there were some strange test errors on try-comm-central, and so I hesitated to ask for testing of whether my patches fix the issue.

However, come to think of it, even if my binary is ever so slightly buggy, it may be worth testing, since the issue is the duplication of the headers and not the body. In that sense, the local copy of the e-mails is already hosed, and the messages remain intact on the server. What would we lose?

The version with the short-read fix was built and tested on Sep 29.
Tue Sep 29 09:19:12 2020 +0000

https://hg.mozilla.org/try-comm-central/pushloghtml?changeset=5b03807938f47733d930d9cbcb19faf9f692b793

Now where is the binary?

I may have a broken setup; I see "Need to retrieve or renew Taskcluster credentials before action can be performed." when I tinker with Treeherder.

Flags: needinfo?(ishikawa)

(In reply to ISHIKAWA, Chiaki from comment #10)

...
https://hg.mozilla.org/try-comm-central/pushloghtml?changeset=5b03807938f47733d930d9cbcb19faf9f692b793

Now where is the binary?

I may have a broken setup; I see "Need to retrieve or renew Taskcluster credentials before action can be performed." when I tinker with Treeherder.

Dunno. I don't see a try build at https://treeherder.mozilla.org/#/jobs?repo=try-comm-central

(In reply to Wayne Mery (:wsmwk) from comment #11)

(In reply to ISHIKAWA, Chiaki from comment #10)

...
https://hg.mozilla.org/try-comm-central/pushloghtml?changeset=5b03807938f47733d930d9cbcb19faf9f692b793

Now where is the binary?

I may have a broken setup; I see "Need to retrieve or renew Taskcluster credentials before action can be performed." when I tinker with Treeherder.

Dunno. I don't see a try build at https://treeherder.mozilla.org/#/jobs?repo=try-comm-central

Maybe due to the flurry of errors, try-comm-central decided not to store the binary (perhaps there is a certain threshold of errors)?
(Oh, by the way, the job I mentioned is more than 2 weeks old. That might also be the reason.)

I have refreshed my local tree and am running a local build. Once it is successful, I will submit another try server run.

I see a few try builds submitted on September 29. https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&author=ishikawa@yk.rim.or.jp

Artifacts from try builds expire after 28 days.

(In reply to Rob Lemley [:rjl] from comment #14)

I see a few try builds submitted on September 29. https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&author=ishikawa@yk.rim.or.jp

Artifacts from try builds expire after 28 days.

So that means there ought to be artifacts from those jobs, correct? [Today is Oct 17 and so it is about 18 days since then.]

Anyway, I was hit by Bug 1671345
(Fix invalid C++ code w.r.t. parameter pack expansion in lambda capture in IDBResult.h)
and spent a day figuring out the cause of that bug (it is a combination of g++-9/g++-10 and dubious code in M-C), so it may take another day or two before I can submit new jobs.

I submitted a job that includes my patches for short-read issues.

https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&selectedTaskRun=fYgPXtc4SoGyEqFe_Fibew.0&revision=d7356d8050fd1740230c5692a97af670ff134ef0

The OSX compilation failed, maybe due to a renaming of header files or a directory layout change. I will investigate.

By clicking on the "B" character on linux64 debug line, you can get to the artifact and target.tar.bz2 file.

Run the thunderbird binary in it and see whether it improves the situation.

NOW, there IS a caveat.
My patches take care of the short read issues in the C-C portion.

Several years ago, I noticed a glaring offender regarding the short read issue in the M-C tree.
There is a routine that reads a JSON data file: it checks the whole size of the file by issuing a "stat" syscall and then uses a SINGLE read() operation to read the data in one go.
Obviously this may fail on a remote CIFS mount that limits the read size to a certain maximum, if the file size exceeds that threshold.
There is this function, and possibly OTHER offenders, in the M-C portion of the code.
My patches do not take care of them.
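
For clarity, here is a plain POSIX C++ sketch of the kind of pattern being described (not the actual M-C routine; the function name is invented for this illustration). The single read() embodies the short-read assumption:

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Naive "stat, then one read()" pattern (illustrative only).
bool ReadWholeFileNaively(const char* path, std::string& out) {
  struct stat st;
  if (stat(path, &st) != 0) return false;
  out.resize(static_cast<size_t>(st.st_size));
  int fd = open(path, O_RDONLY);
  if (fd < 0) return false;
  // The short-read assumption: read() may legally return fewer bytes than
  // st.st_size; a robust version must loop (as in the ReadFully sketch above).
  ssize_t n = read(fd, &out[0], out.size());
  close(fd);
  return n == static_cast<ssize_t>(out.size());
}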

Anyway, if you run the thunderbird binary from bash in a console window, you will see many warning/error messages. When the headers are downloaded, you may want to monitor or save the dumped messages one way or another. Running the |script| command and launching thunderbird inside it is one way.

YMMV.

Flags: needinfo?(artem.semenov)
Flags: needinfo?(vseerror)

Linux folks who want to test ... see https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/fYgPXtc4SoGyEqFe_Fibew/runs/0/artifacts/public/build/target.tar.bz2

You probably don't want to use this on your production profile

Flags: needinfo?(vseerror) → needinfo?(wls220spring)

(In reply to ISHIKAWA, Chiaki from comment #16)

Created attachment 9182170 [details]
My submitted job with short read patches and obtaining artifacts.

I submitted a job that includes my patches for short-read issues.

I need some time to test. A few days I think.

I don't have a cifs based NAS to test with, but when I tried using the target file I saw many errors in my terminal, and once I had created an email account and downloaded messages my computer barked at me each time I selected one to read it.

I think that is the default Gnome application error notification sound.

Flags: needinfo?(wls220spring)

(In reply to WaltS48 [:walts48] from comment #19)

I don't have a cifs based NAS to test with, but when I tried using the target file I saw many errors in my terminal, and once I had created an email account and downloaded messages my computer barked at me each time I selected one to read it.

I think that is the default Gnome application error notification sound.

So basically, you tested

  • which version of the TB program - I presume the test version of TB,

  • against which type of server - I presume a POP3 server (I misunderstood that this header-only mode is for IMAP, I was wrong. See next comment.)

  • where was your profile - some type of remote file system (but not CIFS)

  • does your setup involve a remote file system that imposes an upper limit on the size of a read/write system call?

under linux (?)

I need to know the above.

OK, I did NOT realize that even with a POP3 server we can have this "Fetch headers only" mode.
Or rather, this headers-only mode is actually available only for POP3 (!?).

Maybe I can even try testing locally, then (!) [I hate to set up an IMAP server with all those option details.]

I can create a local CIFS server that exports a file system with a read/write data size limit.
Then create a test profile on this restricted local CIFS mount. (A minimal probe sketch follows the test plan below.)

I can test
(1) whether I can reproduce the original issue, and if so,
(2) whether my short-read patch alleviates the issue somewhat (I can't say it solves the issue completely, because we need the change in the M-C portion as well).
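
As a side note, one cheap way to check whether a restricted mount actually surfaces short reads to userspace is a tiny standalone probe like the sketch below (illustrative only; it assumes a sufficiently large test file already exists on the mount):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <large-file-on-restricted-mount>\n", argv[0]);
    return 2;
  }
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) { std::perror("open"); return 1; }
  std::vector<char> buf(1048576);  // request 1 MiB in a single call
  ssize_t n = read(fd, buf.data(), buf.size());
  // A result smaller than the request (on a large enough file) is a short read.
  std::printf("requested %zu bytes, got %zd\n", buf.size(), n);
  close(fd);
  return 0;
}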

(In reply to ISHIKAWA, Chiaki from comment #20)

(In reply to WaltS48 [:walts48] from comment #19)

I don't have a cifs based NAS to test with, but when I tried using the target file I saw many errors in my terminal, and once I had created an email account and downloaded messages my computer barked at me each time I selected one to read it.

I think that is the default Gnome application error notification sound.

So basically, you tested

  • which version of the TB program - I presume the test version of TB,

I tested the target.tar.bz2 file which is a Daily 83.0a1(?) version.

  • against which type of server - I presume a POP3 server (I misunderstood that this header-only mode is for IMAP, I was wrong. See next comment.)

Using my default POP3 account.

  • where was your profile - some type of remote file system (but not CIFS)

Profile was in the default location, not remote.

  • does your setup involve a remote file system that imposes an upper limit on the size of a read/write system call?

No idea what that is.

under linux (?)

Ubuntu 18.04 LTS.

I need to know the above.

I also forgot to change the account to "Fetch headers only" mode before downloading the emails.

I was curious as to what the target.tar.bz2 file was.

(In reply to ISHIKAWA, Chiaki from comment #16)

Created attachment 9182170 [details]
My submitted job with short read patches and obtaining artifacts.

I submitted a job that includes my patches for short-read issues.

(In reply to Wayne Mery (:wsmwk) from comment #17)

Linux folks who want to test ... see https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/fYgPXtc4SoGyEqFe_Fibew/runs/0/artifacts/public/build/target.tar.bz2

I've tested. It seems the problem has been fixed.

Flags: needinfo?(artem.semenov)

(In reply to artem.semenov from comment #23)

(In reply to ISHIKAWA, Chiaki from comment #16)

Created attachment 9182170 [details]
My submitted job with short read patches and obtaining artifacts.

I submitted a job that includes my patches for short-read issues.

(In reply to Wayne Mery (:wsmwk) from comment #17)

Linux folks who want to test ... see https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/fYgPXtc4SoGyEqFe_Fibew/runs/0/artifacts/public/build/target.tar.bz2

I've tested. It seems the problem has been fixed.

Can you confirm that the problem still persists with the stock TB, i.e., not the patched version you tested?

If confirmed, this is very encouraging and shows that my patches are on the right track.

(In reply to WaltS48 [:walts48] from comment #22)

(In reply to ISHIKAWA, Chiaki from comment #20)

(In reply to WaltS48 [:walts48] from comment #19)

I don't have a cifs based NAS to test with, but when I tried using the target file I saw many errors in my terminal, and once I had created an email account and downloaded messages my computer barked at me each time I selected one to read it.

I think that is the default Gnome application error notification sound.

So basically, you tested

  • which version of the TB program - I presume the test version of TB,

I tested the target.tar.bz2 file which is a Daily 83.0a1(?) version.

  • against which type of server - I presume a POP3 server (I misunderstood that this header-only mode is for IMAP, I was wrong. See next comment.)

Using my default POP3 account.

  • where was your profile - some type of remote file system (but not CIFS)

Profile was in the default location, not remote.

  • does your setup involve a remote file system that imposes an upper limit on the size of a read/write system call?

No idea what that is.

under linux (?)

Ubuntu 18.04 LTS.

I need to know the above.

I also forgot to change the account to "Fetch headers only" mode before downloading the emails.

I was curious as to what the target.tar.bz2 file was.

The version Wayne mentioned contains my local patches that fix the issues associated with
"short read" from remote servers: usually a system call to read a block of data returns all of the data. This is an implicit assumption that permeates a lot of source code, including the M-C/C-C source tree.
However, this implicit assumption is broken with remote file systems.

This is the problem of "short read".

Depending on network conditions (often transient), configuration (as in the original poster's setup, which limits the size of the block that can go out in one read request), or various issues along the network path (some routes may fragment a large packet, while others try to consolidate, etc.), a read system call issued against a remote server (CIFS/SAMBA, NFSv3/v4, etc.) may not always return the requested number of octets; it may return a smaller amount, and the caller then needs to issue additional read system calls to obtain the rest of the data.
M-C (and C-C to some extent) does contain code to read such additional data when the connection is an EXPLICIT NETWORK CONNECTION.
However, when the read is issued against a file system, such care is not taken at all, unfortunately.

Why don't we see this problem often? The reason is simple: the MS Windows read system call (I can't recall the name offhand) performs this additional re-reading automagically, at least for CIFS/SAMBA [I am not sure about NFS, but that is rare, I think].
Thus Windows applications, including TB, do not have to worry about short reads from remote file systems.
(If the MS Windows API did not handle this re-reading automagically, the problem would have surfaced many years ago and been fixed.)

There is a set of patches I have created to fix the issues, but they have to wait for the buffered-write patches first, and the short-read patches are quite lengthy themselves. (The latest patches on my PC have not been uploaded there, but the bulk of the patches are the same; I just need to upload the patches that replace the bitrotten version.)

Bug 1170606
C-C T-B fixes to take care of short read (part of [META] Failure to deal with short read)

Re: Buffered Write
Bug 1242042
Enabling buffering for file stream to write message for C-C TB

A lot of things got in the way:
The MS C compiler suddenly failed to compile certain lines on the tryserver once, which was really confusing.
C++ formatting issues.
Directory layout changes.
The |make mozmill| to mochitest transition.

I intend to clear my local patches one by one so that I can land the patches for buffered write first.
However, I am not sure we have enough manpower to review the patches in detail, to be honest.

So AIUI, we don't need a patch in Mozilla core code, correct?

Bug 1242042 is quite small. But Bug 1170606 is huge which probably makes it impractical to get into version 78 any time soon. So is there a minimal patch of Bug 1170606 that will just resolve bug 1612414?

As noted in bug 1612414 comment 66, this issue takes on new importance because users who are stuck on old versions of Thunderbird to keep their network storage working won't be able to stay on them much longer because of Big Sur.

Flags: needinfo?(mkmelin+mozilla)
Flags: needinfo?(ishikawa)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE

(In reply to Wayne Mery (:wsmwk) from comment #26)

So AIUI, we don't need a patch in Mozilla core code, correct?

Bug 1242042 is quite small. But Bug 1170606 is huge which probably makes it impractical to get into version 78 any time soon. So is there a minimal patch of Bug 1170606 that will just resolve bug 1612414?

Wayne, bug 1242042 needs to come after the patch(es) in bug 1242030, which is not small (however, it has been reviewed once).
I needed to fix the bitrot and directory layout changes, which has been done, and I have workable patches. But I still need to clear review, I think, since the changes are not that small.

Sorry, bug 1170606 needs the buffered-write patches to be in place to begin with. Buffered write hardens the I/O calls by at least checking the error codes returned by write, close, etc. Bug 1170606 depends on the read-error handling from the earlier patch, bug 1242030.

As noted in bug 1612414 comment 66, this issue takes on new importance because users who are stuck on old versions of Thunderbird to keep their network storage working won't be able to stay on them much longer because of Big Sur.

Big Sur seems to be the new macOS version.
I can re-arrange the priorities and try to land the short-read patches first if Big Sur is causing big trouble.
But I distinctly recall that when I proposed a patch as a precursor for the M-C portion of the short-read problem, the macOS TB misbehaved.
Unfortunately, that happened when my PC's external hard disk enclosure got fried, so I could not remedy the macOS issue and the patch was backed out. I remember being quite puzzled why the macOS version misbehaved while the Linux and Windows versions were OK.
I think the issue was a simple use of "<" instead of "<=" or something to that effect. The macOS runtime library was that sensitive to it.

The original poster of this bugzilla used Linux when the problem was reported.
I am not sure if the short-read patch fixed the original poster's issue, BTW.
I sincerely hope it did.

I am saying this because, if so, I am encouraged to port the short-read patches faster.
OTOH, if the problem is not solved even with my short-read patch, THEN it is quite likely that the code reading the profile is in the M-C portion of the source code, and that requires landing the previously backed-out patch (with the one-line correction this time) and some more. (My patch proposed a function that reads from a file stream until the requested bytes are read or EOF is reached. One needs to replace the simple read() call with my enhanced version of read wherever the profile is read, and, if necessary, handle the loop constructs where such a read is used. Bug 1170606 looks huge, but it actually just does this replacement of read and the handling of loop constructs one by one. It is NOT very complicated, but it is repetitive, and given that read is called in many places it has become huge.)
So although the patches are huge, the review ought to be straightforward IMHO.
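
To make the shape of that change concrete, here is an illustrative (not actual C-C) example of how one loop-construct call site looks once short reads are handled, reusing the hypothetical ReadFully() helper from the earlier sketch; kRecordSize and ProcessRecord are invented for the example.

#include <unistd.h>
#include <cstddef>

ssize_t ReadFully(int fd, char* buf, size_t len);   // the helper sketched earlier
bool ProcessRecord(const char* buf, size_t len);    // assumed to exist elsewhere

constexpr size_t kRecordSize = 4096;

bool ReadAllRecords(int fd) {
  char buf[kRecordSize];
  for (;;) {
    ssize_t n = ReadFully(fd, buf, kRecordSize);    // loops internally on short reads
    if (n < 0) return false;                        // real I/O error
    if (n == 0) return true;                        // clean EOF
    if (!ProcessRecord(buf, static_cast<size_t>(n))) return false;
    if (static_cast<size_t>(n) < kRecordSize) return true;  // final partial record before EOF
  }
}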

Buffered write is more complicated because the code was complicated to begin with.

Given that my local patches were created in the order I mentioned, switching the patch landing order would be a major task, to be frank.
I have submitted tryserver runs in the current order many times before and have regularly run mochitest and xpcshell tests locally.
So I am more or less confident that the patches' functionality is OK, but code style, etc. obviously require proper review.

(In reply to artem.semenov from comment #23)

(In reply to ISHIKAWA, Chiaki from comment #16)

Created attachment 9182170 [details]
My submitted job with short read patches and obtaining artifacts.

I submitted a job that includes my patches for short-read issues.

(In reply to Wayne Mery (:wsmwk) from comment #17)

Linux folks who want to test ... see https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/fYgPXtc4SoGyEqFe_Fibew/runs/0/artifacts/public/build/target.tar.bz2

I've tested. It seems the problem has been fixed.

Oh, so at least the problem has been fixed with my patch. That sounds encouraging. We don't have to touch the M-C portion in the short term, then.

Flags: needinfo?(ishikawa)

(In reply to ISHIKAWA, Chiaki from comment #29)

(In reply to artem.semenov from comment #23)

(In reply to ISHIKAWA, Chiaki from comment #16)
Oh, so at least the problem has been fixed with my patch. That sounds encouraging. We don't have to touch the M-C portion in the short term, then.

No, I was premature.

Someone ought to check that a remote profile can be read by TB under macOS.
I am submitting a tryserver run with a full set of local TB patches. If the binary works, then the patches are at least more or less ready (modulo the problematic read/write error-handling patches).

Probably best to continue in bug 1612414

Flags: needinfo?(mkmelin+mozilla)