Some messages may be inaccessible after selecting a folder that is being compacted
Categories
(Thunderbird :: General, defect)
Tracking
| Version | Tracking | Status |
|---|---|---|
| thunderbird_esr115 | --- | unaffected |
| thunderbird_esr128 | --- | unaffected |
| thunderbird138 | --- | wontfix |
| thunderbird139 | --- | affected |
| thunderbird140 | + | affected |
People
(Reporter: welpy-cw, Unassigned)
References
(Regression)
Details
(Keywords: regression)
Steps to reproduce:
- Open a local folder with several messages, sort by "Order Received", and delete a message in the middle.
- Select an adjacent folder.
- Open the context menu of the first folder and select "Compact".
- Immediately select the first folder (for example by arrow key).
Expected result:
All messages in the compacted folder are accessible.
Actual result:
Messages with a key higher than that of the previously deleted message are not accessible: the message pane doesn't change when they are selected. After a restart, Thunderbird shows "Building summary file for [compacted folder]…" and all messages are accessible again.
I can reproduce this both with an artifact and a debug build of Daily, but not with ESR 128; I am not sure whether that is affected too.
Comment 1•11 months ago
Comment 2•10 months ago
(In reply to Hartmut Welpmann [:welpy-cw] from comment #0)
Steps to reproduce:
- Open a local folder with several messages, sort by "Order Received", and delete a message in the middle.
- Select an adjacent folder.
- Open the context menu of the first folder and select "Compact".
- Immediately select the first folder (for example by arrow key).
My guess:
The user flicks onto the folder being compacted (step 4), but it still has the old database, from which the list of messages is shown.
Then the compaction completes and the new (compacted) database is installed, replacing the old one, but the view is still holding the old database object in existence and still showing it in the view.
All the messages before the deleted one retain the same storeToken after a compaction, but everything after the deleted message will have a new storeToken.
So when one of the messages after the deleted one is clicked, the view still uses the old storeToken from the old database and tries to read the message from the new (compacted) mbox file. That fails because the offset in the old storeToken no longer points at a "From " separator in the new mbox file.
The old database will still continue to work for messages before the deleted one because the storeTokens haven't changed.
It's likely that the quick-switch to the other folder in step 4 means that the "AboutToCompact" folder notification is delivered before the compacting folder is the focus.
The "CompactionCompleted" folder notification will arrive OK.
Maybe the GUI is relying on the "AboutToCompact" and "CompactionCompleted" notifications to always be symmetrical i.e. both arrive for the same focused folder?
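The stale-offset theory above can be illustrated with a toy model. This is only a sketch: it assumes a storeToken is simply the byte offset of a message's "From " separator in the mbox file, and the helper names `build_mbox` and `points_at_message` are made up for the illustration, not Thunderbird code.

```python
# Toy model: a storeToken is treated as the byte offset of a message's
# "From " separator line inside the mbox file (an assumption for illustration).

def build_mbox(messages):
    """Concatenate messages into mbox bytes; return (data, offsets)."""
    data = b""
    offsets = []
    for body in messages:
        offsets.append(len(data))
        data += b"From sender@example.invalid\n" + body + b"\n\n"
    return data, offsets


def points_at_message(mbox, offset):
    """A token is usable only if it lands exactly on a "From " separator."""
    return mbox[offset:offset + 5] == b"From "


messages = [b"first", b"second message", b"third", b"fourth one"]
old_mbox, old_tokens = build_mbox(messages)

# Delete the message at index 1, then compact: survivors are rewritten back to back.
survivors = [m for i, m in enumerate(messages) if i != 1]
new_mbox, new_tokens = build_mbox(survivors)

# Tokens still cached in the old database, for the surviving messages only.
stale_tokens = [t for i, t in enumerate(old_tokens) if i != 1]

stale_ok = [points_at_message(new_mbox, t) for t in stale_tokens]
fresh_ok = [points_at_message(new_mbox, t) for t in new_tokens]
print(stale_ok, fresh_ok)
```

In this model the message before the deleted one keeps a valid offset, while the stale offsets of every later message land in the middle of another message or past the end of the file.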
Comment 3•9 months ago
It's not the view that's holding on to the old database. Reopening the view while compact is happening causes the folder to reopen and hold onto the old database. (Basically until the end of time, or until you can convince everything to let the folder go and close it, whichever comes sooner.)
Here's a profile of it happening (in the marker chart tab, scroll down to the mailnews section). At about the 80ms mark, I switch back to the folder being compacted and the database opens again.
What do we do about it? I'm not sure. We could block reloading the view while the folder is being compacted but telemetry tells us that some people have folders that take minutes to compact. So that's not great. Do we ask the folder to close the old database again as compaction ends and before the compacted database gets moved into place?
Comment 4•9 months ago
FYI, I can reproduce on Linux using a big local folder.
Comment 5•9 months ago
Note I'm not acting quickly. It fails even if I wait 1-2 seconds after I have initiated the compact. So it probably isn't caused by an initial notification being out of order.
Comment 6•9 months ago
(In reply to Geoff Lankow (:darktrojan) from comment #3)
What do we do about it? I'm not sure. We could block reloading the view while the folder is being compacted but telemetry tells us that some people have folders that take minutes to compact. So that's not great.
Still, I would prefer that.
We have a progress indicator, so people can know something is happening. Better to show nothing than to show wrong data and be broken.
Comment 7•9 months ago
(In reply to Geoff Lankow (:darktrojan) from comment #3)
What do we do about it? I'm not sure. We could block reloading the view while the folder is being compacted but telemetry tells us that some people have folders that take minutes to compact. So that's not great. Do we ask the folder to close the old database again as compaction ends and before the compacted database gets moved into place?
In theory, reading from the old database should be fine while the compaction proceeds.
It's just that any writes made in the meantime will be lost once the new compacted database is emplaced (and that includes "passive" things like marking a message as "read").
But in practice:
- we don't have any concept of a "read-only" mode in the database.
- the current code seems to lock the database file even if it's only being read from. So even just reading from the DB locks it for writing. And holds that lock until (I think) the DB object is closed.
So for now, I tend toward the "don't allow access" to a compacting folder solution.
We could look at optimising compaction. We should probably be doing that at some point anyway. Really it should be entirely I/O bound on the mbox read/write, but we're a long long way off that at the moment...
No mbox should take minutes to compact.
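As a rough sketch of the "don't allow access" option favoured above, the folder could refuse to hand out a database while compaction is running. Everything here is hypothetical: `Folder`, `FolderBusyError` and the method names are stand-ins invented for the example and do not correspond to Thunderbird's actual folder or database classes.

```python
class FolderBusyError(Exception):
    """Raised when a view is requested for a folder that is mid-compaction."""


class Folder:
    # Hypothetical stand-in, not Thunderbird's real folder implementation.
    def __init__(self, name, database):
        self.name = name
        self.database = database  # here: a dict of message -> storeToken
        self.compacting = False

    def begin_compaction(self):
        # Drop the old database so nothing can keep reading stale storeTokens.
        self.compacting = True
        self.database = None

    def finish_compaction(self, compacted_database):
        # Install the compacted database, then allow access again.
        self.database = compacted_database
        self.compacting = False

    def open_view(self):
        if self.compacting:
            raise FolderBusyError(f"{self.name} is being compacted")
        return self.database


folder = Folder("Inbox", {"second message": 79})
folder.begin_compaction()
try:
    folder.open_view()
    blocked = False
except FolderBusyError:
    blocked = True

folder.finish_compaction({"second message": 35})
tokens_after = folder.open_view()
print(blocked, tokens_after)
```

The trade-off is the one already raised in this bug: the folder shows nothing for as long as compaction takes, but it can never show data backed by the wrong database.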
Comment 8•9 months ago
See bug 1959858 comment 79. This appears to be the Linux incarnation of the same underlying problem; the STR from comment 0 on Thunderbird 140 Beta under Windows produce the error message "The folder … could not be compacted because writing to folder failed. Verify…".
With the patch from bug 1959858 comment 80 I wasn't able to reproduce this any more. Chapeau!
Comment 9•9 months ago
I get different behavior depending on platform.
On Linux, I can reproduce this bug here every time.
Using a very big local folder (4 GB), delete a message from the top of the list (sort by order received), compact, switch to other folder and switch back, wait until compact finished, then click on a message at the bottom of the list, and it doesn't load.
The patch from bug 1959858 comment 80 does NOT fix that bug for me on Linux. I'm reopening, because I'm not yet convinced it's fixed.
On Windows, I cannot reproduce this bug at all. All messages always load for me, while compact is still running, and after compact is done.
Comment 10•9 months ago
(In reply to Kai Engert [:KaiE:] from comment #9)
On Linux, I can reproduce this bug here every time.
Using a very big local folder (4 GB), delete a message from the top of the list (sort by order received), compact, switch to other folder and switch back, wait until compact finished, then click on a message at the bottom of the list, and it doesn't load.
The patch from bug 1959858 comment 80 does NOT fix that bug for me on Linux. I'm reopening, because I'm not yet convinced it's fixed.
With that patch, I can't replicate this bug - it all seems to be working as expected as far as I can tell.
Here are the steps I perform:
- set up a fresh profile, fresh IMAP account, with a 6GB INBOX, and wait for it all to download and settle down.
- switch to table view, show the "Order Received" column (the msgKey), sort by order received.
- figure out the first message (the one with storeToken 0 - i.e. the very start of the mbox) and delete it.
- right-click on the Inbox and pick "Compact"
- quickly click on another folder (Trash, or one of the other folders my IMAP account has)
- click back on Inbox
- wait for the compaction to complete
- click on messages which were after the deleted one
results and notes:
- All the messages appeared without problem when clicked on.
- I had MOZ_LOG="compact:3" so I could be sure the compaction completed with no error.
- I added in some printfs on nsMsgHdr::SetStoreToken() and nsMsgHdr::GetStoreToken(), and can see that the correct (post-compaction) storeTokens were present. I'd picked a few messages and noted the storeToken before compaction, expecting the compaction to update them and it did.
- In step 2 I tried staying on the card view instead (trickier, because it doesn't show the message key ("order received")), but it seemed to work fine too.
I just did another test (again with Set/GetStoreToken() printfs active) where I:
- deleted the first message
- started the compaction
- clicked quickly to another folder and then back to Inbox
- clicked on the second message (it displayed fine, using the pre-compaction storeToken)
- waited for the compaction to complete
- clicked on the second message again (it again displayed fine, but this time it was using the post-compaction storeToken)
(at step 5, the message view reset and stopped displaying the second message).
This is exactly what I'd expect - the view was using the old database right up until the compaction switched them over, and then it started using the new database (I guess the front end was using the OnCompactionComplete() notification to trigger switchover)
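The switchover described here can be modelled in a few lines. This is a sketch under assumptions: the `Folder` and `View` classes, the `on_compaction_complete` callback and the token values are all invented for illustration, and the guess that the front end reacts to a completion notification is taken from the comment above rather than verified against the code.

```python
class Folder:
    # Hypothetical model: the "database" is a dict of message key -> storeToken.
    def __init__(self, tokens):
        self.tokens = dict(tokens)
        self.listeners = []

    def complete_compaction(self, new_tokens):
        # Install a new database object; the old dict is left untouched.
        self.tokens = dict(new_tokens)
        for listener in self.listeners:
            listener.on_compaction_complete(self)


class View:
    def __init__(self, folder):
        self.tokens = folder.tokens  # keeps a reference to the current database
        self.selected = None
        folder.listeners.append(self)

    def select(self, key):
        self.selected = key
        return self.tokens[key]

    def on_compaction_complete(self, folder):
        # Switch to the compacted database and reset the message display.
        self.tokens = folder.tokens
        self.selected = None


folder = Folder({1: 0, 2: 35, 3: 79})     # message 1 is about to be deleted
view = View(folder)

token_before = view.select(2)              # served from the old database
folder.complete_compaction({2: 0, 3: 44})  # message 1 gone, offsets shifted
selection_after = view.selected            # the view was reset by the notification
token_after = view.select(2)               # now served from the new database
print(token_before, selection_after, token_after)
```

If the view never received the notification, it would keep returning the pre-compaction token, which is the failure mode this bug describes.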
Kai:
- is the compaction completing successfully? (MOZ_LOG="compact:3" will tell you)
- is the storeToken for a message after the deleted message correctly updated? (Add a printf to nsMsgHdr::GetStoreToken() and note the value - both before compaction, then afterwards).
Worth noting that the message key ("Order Received" in the GUI) does not necessarily follow the same ordering as the storeToken (i.e. the order in which messages are written into the mbox file). For IMAP the message key is the UID from the server, so it depends on what the server does. My local Dovecot install seems to be all over the place. I figured out the first message by adding a printf to nsMsgHdr::SetStoreToken() and noting down the message key for which storeToken=0 during the original download.
Hopefully there's something there that might help nail down what's happening...
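To make the point about ordering concrete, here is a small sketch with made-up data. The keys and tokens are hypothetical values, not taken from a real profile; the example only shows that the message sorted first by key need not be the one at the start of the mbox.

```python
# Hypothetical (message key, storeToken) pairs, as might be noted from printf output.
headers = [
    {"message_key": 1042, "store_token": 0},
    {"message_key": 17, "store_token": 5120},
    {"message_key": 530, "store_token": 2048},
]

# The message at the very start of the mbox is the one with the smallest storeToken.
first_in_mbox = min(headers, key=lambda h: h["store_token"])

# Sorting by key (what "Order Received" shows) gives a different first message.
first_by_key = min(headers, key=lambda h: h["message_key"])

print(first_in_mbox["message_key"], first_by_key["message_key"])
```

With data like this, deleting the top row in an "Order Received" sort would not remove the first message in the mbox, so fewer storeTokens would shift than expected.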
Comment 11•9 months ago
I apologize. The patch fixes this bug for me on Linux, too.
It looks like yesterday I had failed to properly apply the patch on Linux.
It's embarrassing and sorry for the extra work I've caused you!
Comment 12•9 months ago
(In reply to Kai Engert [:KaiE:] from comment #11)
It looks like yesterday I had failed to properly apply the patch on Linux.
It's embarrassing and sorry for the extra work I've caused you!
No problem whatsoever! It was a good exercise to go through and really examine what was happening in detail and confirm that the mental model was correct!