Closed Bug 1854586 Opened 1 year ago Closed 10 months ago

Inconsistency when trying to fetch imap messages after Thunderbird started with disabled network (and set to NOT sync messages)

Categories

(MailNews Core :: Networking: IMAP, defect)

defect

Tracking

(thunderbird_esr115 fixed, thunderbird120 fixed, thunderbird121 affected)

RESOLVED FIXED
121 Branch
Tracking Status
thunderbird_esr115 --- fixed
thunderbird120 --- fixed
thunderbird121 --- affected

People

(Reporter: TbSync, Assigned: gds)

References

Details

Attachments

(1 file, 1 obsolete file)

STR:

  1. setup the sync settings of an IMAP account as shown in the image attached to Bug 1854361
  2. send yourself a couple of test messages which have an attachment to make sure they are larger than 1KB and hit the no-sync-criteria
  3. do not read the received messages
  4. disable the network connection and restart TB
  5. when accessing one of the new messages, there is no connection error, but also the message is not loaded
  6. enable your network connection again
  7. the messages will still not show up and will also not show any error

Even if step 5 is skipped, the message will not be shown in step 7.

Blocks: 1851074

The difference in behavior happens here:
https://searchfox.org/comm-central/rev/c919ceff4b319030e3f7b8e76701235dbcb87c6b/mailnews/imap/src/nsImapIncomingServer.cpp#424-458

If we start Thunderbird without network connection and then request to stream a non-synced imap-message, aProtocol is falsy and we get into the else statement and there (!m_skipRetryQueued) is false as well and we end up in the else there as well. Then nothing happens anymore. None of the consumer callback functions are called. Even if we re-enable network, aProtocol remains falsy and we will always end up in this dead end.

If we start Thunderbird with an active network connection, disable network after startup has fininished and then try to stream a non-synced imap-message, aProtocol is not falsy and a real connection attempt is made.

Ben or Magnus, any idea?

Flags: needinfo?(mkmelin+mozilla)
Flags: needinfo?(benc)

OK, the function GetImapConnection exits early here:

https://searchfox.org/comm-central/rev/c919ceff4b319030e3f7b8e76701235dbcb87c6b/mailnews/imap/src/nsImapIncomingServer.cpp#717-722

because msgWindow is falsy.

  1. If I remove that entire if statement, everything works.
  2. Why do we need msgWindow? It is not used at all in the function.

Can I remove it?

Assignee: nobody → john

That part was added in bug 1793599, see bug 1793599 comment 19. Gene has a better picture than me

Flags: needinfo?(mkmelin+mozilla) → needinfo?(gds)

That block of code caused a problem in another bug described here: bug 1842655 comment 11.
I've been meaning to look at that again and this reminded me.
I haven't tried to run your STR yet, but will.
By any chance is your STR also using oauth2?

Flags: needinfo?(gds)
See Also: → 1842655

@Ben @Gene: After looking at the code in question and the comments in the other related bugs, I realized that I forgot to mention important information in my STR: I tested with an OAUTH Google IMAP account, which does not have a password stored. Following my STR with a normal IMAP account with a stored password (gmx.de), I did not observe the faulty queuing behavior.

I think there are more empty-password edge case requests, which will never complete because they are pushed into the queue:

  1. OAUTH authenticated IMAP account (which does not have a stored password) and starting Thunderbird without network connection
  2. Normal IMAP Account with an invalid server and no password (starting with or without network)
  3. Normal IMAP Account where the user decided to not store the password in the password manager

I will do more tests and provide more information. But my first impression is that in general, we should not push requests into the queue, if the server is not reachable and instead let the request proceed and fail to inform the consumer.

I can't pretend to understand all the details here, but it does seem like any requests made while the network is down should fail rather than be pushed onto m_urlQueue.
My take on m_urlQueue is that it's a buffer for pending requests waiting for a real connection to execute upon. Not for handling requests made while disconnected.

But I'm not sure that's the root cause of problems here. I'd have thought it was step 5 in your STR which causes the queuing. But the fact you can reproduce the problem while skipping step 5 kind of hints that there's something else going on...

I won't be able to dig into this deeper for a couple of days, but if you want me to try reproducing it and digging down into it, NI me again and I'll start on some serious printf debugging :-)

Flags: needinfo?(benc)

I've been looking at this the last few days and can't find a clean fix on top of the existing code. So I'm inclined to go back to the original code as it was after Bug 163964 landed. Here's a summary of the relevant changes in the order they landed:

Bug 163964 caused initial discoverallboxes and select URL's to occur in parallel in two connection rather than serially in one connection to speed up startup for users with lots of folders. This change didn't intentionally queue any URLs to run later.
The request to improve the startup sequence was initiated in Bug 1660672 (eventually marked as duplicate of Bug 163964).

Bug 1768121 was found by the user putting in the wrong password and getting their account locked out. The fix for this was to do the discoverallboxes and queue the select until authentication completes, then run the select. This is where the queuing of URLs began and is the root of the problem. I think if this issue had been a "wontfix" the current problems we are seeing would not be present.

Bug 1764770 then required some more fixes for the queuing due to the need to show an override prompt for TLS errors.

Bug 1793599 added more queuing and other changes to handle the case where the password is not stored and has to be prompted for. Also handled the case where the first URL has no associated window so you can't prompt for password.

Bug 1798161 allows imap PREAUTH to not require a password. I think this is OK as-is, but will look closer.

I'll look into getting rid of all these changes except for Bug 163964. I think the only downside is that some users may more easily get locked out of their accounts if they use the wrong password as occurred at Bug 1768121.

Edited: Bug 1764770 landed before Bug 1793599

Here's a try build of the proposed back-out to only include Bug 163964 which gets rid of the concept of queuing the initial select URL and other "windowless" URLs when a password entry or other authentication might be needed.
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=b05198b4024b2e85054dc11f1884b09a32723244
This fixes the current problem that imap connection are never initiated when TB starts up with the network down and then later the network comes back. With this patch, this now works as long as the password or authentication token is stored.
However, after the network comes back without a stored password, the prompt only occurs on a user action such as "Get Mail" or selecting another folder, which is how it's always been.

However, after the network comes back without a stored password, the prompt only occurs on a user action such as "Get Mail" or selecting another folder, which is how it's always been.

I think this is fine, I hope it fails properly, will test this.

My main concern is that the backed out parts had a reason why they have been added. Are we going to introduce a regression here? Are we able to solve these other issues differently?

I think the only downside is that some users may more easily get locked out of their accounts if they use the wrong password as occurred at Bug 1768121.

I this the only relevant issue, or are we risking to bring back the empty folder pane issue? Could we fix the lock-out issue by introducing a throttle-mechanism, that limits the server connection attempts after the login failed?

Attachment #9354657 - Attachment is obsolete: true

(In reply to John Bieling (:TbSync) from comment #12)

Is this the try?
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=b05198b4024b2e85054dc11f1884b09a32723244

Yes. Sorry, I forgot to include the link.

(In reply to John Bieling (:TbSync) from comment #11)

However, after the network comes back without a stored password, the prompt only occurs on a user action such as "Get Mail" or selecting another folder, which is how it's always been.

I think this is fine, I hope it fails properly, will test this.

My main concern is that the backed out parts had a reason why they have been added. Are we going to introduce a regression here? Are we able to solve these other issues differently?

All the parts that I propose to back out are really just added because of Bug 1768121; otherwise they are not needed except for a part of Bug 1793599 which I kept, but I need to test some more to make sure that is really helpful.

I think the only downside is that some users may more easily get locked out of their accounts if they use the wrong password as occurred at Bug 1768121.

I this the only relevant issue, or are we risking to bring back the empty folder pane issue?

Don't know what you mean by "empty folder pane issue".

Could we fix the lock-out issue by introducing a throttle-mechanism, that limits the server connection attempts after the login failed?

That might be possible. I should have tried that to begin with.

I this the only relevant issue, or are we risking to bring back the empty folder pane issue?

Don't know what you mean by "empty folder pane issue".

Oh, sorry. I may have confused something. I was under the impression that one of the related bugs also dealt with the issue, that IMAP completely locked up during startup, which caused the user to end up with folder list and message list being completely empty. Thank you for pointing out the related bugs, which do not seem to include this, so everything is good on that end! \o/

Thanks for your work on this!

I tested your try run and issue #1 and issue #2 listed in Comment 7 are now properly failing.

Issue #3 is failing only if the network connection is down, but with an active network connection and a declined password prompt it does not fail but calls onStopRequest here, with a statuscode of 0 and an empty message:

headers: {}
​isEncrypted: false
​partName: ""
​parts: []

We talked about issue #3 only briefly and it is not the main issue, we could move this to a different bug, if you think that is better.

Issue #3 is failing only if the network connection is down,

Yes, you have to click "Get Message" or do something else like select a different folder before you get the password prompt. This is because a URL is needed and, after startup and before account login, URLs are only triggered automatically by biff (checking for new mail on timed interval). But the policy is for URLs triggered by unattended activities (like biff or filter action) to not provide a "window" so no prompt "out of the blue" occurs. But this is how it's always worked (I think) and I haven't heard complaints about it.

but with an active network connection and a declined password prompt it does not fail but calls onStopRequest here, with a statuscode of 0 and an empty message:

The issue I'm seeing here is that if you type in the wrong password, you actually get two prompts to retry/re-enter the password, one on top of the other. This is due to URLs discoverallboxes and select occurring in their own imap connection which have to be separately authenticated and is due to changes for Bug 163964. Sometimes if you enter a good password the other prompt remains even though the password was already accepted. I think a bug was reported about this but I don't have a record of the specific bug number right now.

I'm not sure if this relates at all to the onStopRequest issue you point out.

Another possible solution to the issues generated by changes in Bug 163964 might be to make the changes there configurable in the "Advanced Imap Setting" or maybe just in a pref. This is not an essential feature for most users who don't have a huge number of folders like Richard Leger (Re: bug 1660672 comment 7 and others).

(In reply to gene smith from comment #16)

Issue #3 is failing only if the network connection is down,

Yes, you have to click "Get Message" or do something else like select a different folder before you get the password prompt. This is because a URL is needed and, after startup and before account login, URLs are only triggered automatically by biff (checking for new mail on timed interval). But the policy is for URLs triggered by unattended activities (like biff or filter action) to not provide a "window" so no prompt "out of the blue" occurs. But this is how it's always worked (I think) and I haven't heard complaints about it.

You are completely correct that the user has to click on "Get Message" in the UI to get unattended requests working again in case of issue #3. But that is not the situation I am looking at. I am looking at the situation where the user has not yet clicked on "Get Message" and an unattended request occurs. For me it is important to get a proper fail in that case.

The observed behavior for issue #3 with your backout-patch/try-run is, that I get the proper fail only if the network connection is down. If the network connection is up and the user has skipped the initial password prompt during startup (or network was down during startup) and has also not yet clicked on "Get Message" (which should re-trigger the password prompt) and an unattended request occurs, it does not fail but instead succeeds with an empty message.

It could be that this behavior is not new and I could deal with this case in my consumer code and manually consider an empty message as a fail, but I wanted to make you aware of this edge case. I am interested in your opinion, if we should handle this case in this bug, or create a dedicated bug for it. All the other cases are fixed by your backout-patch.

If I start up with network up and I have no stored password, I get a password prompt which I cancel. I had set the biff interval to 1 minute so I see in imap log that two URLs are attempted every minute, discoverallboxes and select, and both fail and report the error here: https://searchfox.org/comm-central/rev/38b9c6416aa503df4003dbf400271d1e82f3bc75/mailnews/imap/src/nsImapProtocol.cpp#2055. The passed in rv value is not NS_OK but is NS_ERROR_FAILURE.
It appears to eventually invoke OnStopRunningUrl here https://searchfox.org/comm-central/rev/38b9c6416aa503df4003dbf400271d1e82f3bc75/mailnews/base/src/nsMsgMailNewsUrl.cpp#235 still with aExitCode = NS_ERROR_FAILURE. However, I don't know where the actual OnStopRunningUrl that it invokes is located.
With JS debugger, I'm not seeing it go to the place you link to in comment 15 (https://searchfox.org/comm-central/rev/12ced1ee9c283341d768495ec156fd4dbd7e8dc3/mailnews/db/gloda/modules/MimeMessage.jsm#101) which seems to have something to do about global search (gloda).

Anyhow, I probably need more details on exactly how you are seeing the issue.

Anyhow, I probably need more details on exactly how you are seeing the issue.

Flags: needinfo?(john)
Summary: [imap-cpp] Inconsistency when trying to fetch messages after Thunderbird started with disabled network (and set to not sync messages) → Inconsistency when trying to fetch imap messages after Thunderbird started with disabled network (and set to NOT sync messages)

If started with network down, imap connections don't occur and new messages are
not downloaded until "Get Messages" is clicked. Reported also in
Bug 1842655 and possibly Bug 1843190.
The issues are describe at Bug 1854586 comment 9.
Note: This leaves Bug 1768121 unfixed but I'll re-open that bug and I have a
proposed patch for it.

Assignee: john → gds
Status: NEW → ASSIGNED
Target Milestone: --- → 121 Branch

Note: This leaves Bug 1768121 unfixed but I'll re-open that bug and I have a
proposed patch for it.

I couldn't find a way to "re-open" Bug 1768121 so I cloned it at bug 1862111 and have attached a WIP patch there.

See Also: → 1862111
See Also: → 1843190

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/631fce35d378
Backout URL queueing causing failure to connect with network down. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Duplicate of this bug: 1842655
Duplicate of this bug: 1843190

Gene is this good for beta? Or do you prefer longer on nightly?
Do you still need more info from John?

Flags: needinfo?(gds)

(In reply to Wayne Mery (:wsmwk) from comment #25)

Gene is this good for beta? Or do you prefer longer on nightly?

I should be OK for beta.

Do you still need more info from John?

No, don't think so.

Flags: needinfo?(gds)
Attachment #9361119 - Flags: approval-comm-beta?

Comment on attachment 9361119 [details]
Bug 1854586 - Backout URL queueing causing failure to connect with network down. r=mkmelin

[Triage Comment]
Approved for beta

Attachment #9361119 - Flags: approval-comm-beta? → approval-comm-beta+

good for 115?

Flags: needinfo?(john) → needinfo?(gds)

Or maybe this should wait on bug 1862111 being ready?

(In reply to Wayne Mery (:wsmwk) from comment #30)

Or maybe this should wait on bug 1862111 being ready?

No, I think what this (bug 1854586) fixes is worse than the problem fixed by bug 1862111 . So I would say go to 115 with this and I'll try to get back to fixing bug 1862111 asap.

Flags: needinfo?(gds)

Comment on attachment 9361119 [details]
Bug 1854586 - Backout URL queueing causing failure to connect with network down. r=mkmelin

[Triage Comment]
Approved for esr115

Attachment #9361119 - Flags: approval-comm-esr115+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: