Closed Bug 1415723 Opened 2 years ago Closed 2 years ago

Severe regression in at least Tb55 when repairing/rebuilding/compacting folder database (msf) file.

Categories

(MailNews Core :: Database, defect, critical)

defect
Not set
critical

Tracking

(thunderbird_esr52 unaffected, thunderbird55 wontfix, thunderbird56 wontfix, thunderbird57 affected, thunderbird58 fixed, thunderbird59 fixed)

RESOLVED FIXED
Thunderbird 59.0
Tracking Status
thunderbird_esr52 --- unaffected
thunderbird55 --- wontfix
thunderbird56 --- wontfix
thunderbird57 --- affected
thunderbird58 --- fixed
thunderbird59 --- fixed

People

(Reporter: alta88, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [regression:TB55])

Attachments

(4 files)

In Tb52.4, a folder with the comm-central changelog feed, containing 17k messages and size of 22M with a .msf database of 11M will take ~2 seconds to repair or rebuild on selection if the .msf is deleted.

In Tb55 beta (not yet at 56) and on daily, this takes ~80 seconds, pegs a core's CPU, and freezes Tb. If enough rebuilds are requested, as when copying a profile with lots of folders with lots of messages, I learned an 8core 16M linux os will soon blow up impressively.
wsm, can you reproduce this? It's a borderline blocker, imo.
Confirmed. With 57 beta candidate (but NOT in safe mode) on SSD, a 12k article slashdot feed took 2 minutes. on 52.2.1 it was <2 seconds.  A 12k message local folder took one minute. 

I narrowed the regression to be between 2017-04-21 and 2017-04-22 (55.0a1)
https://hg.mozilla.org/comm-central/pushloghtml?startdate=2017-04-21+03%3A05%3A00&enddate=2017-04-22+03%3A05%3A00
Whiteboard: [regression:TB5?]
Compact is also affected. No dupes that I can see in https://mzl.la/2zu1pGm

In the process of testing the range 2017-04-19 to 2017-04-26 I experienced frequent redownloading of gmail folders and indexing. (and nsMsgDatabase::GetUint32Property bp-16d0c827-711f-4cd7-afe4-91c4e0171109 fixed by bug 1360873 on 2017-05-01)
Summary: Severe regression in at least Tb55 when repairing/rebuilding folder database (msf) file. → Severe regression in at least Tb55 when repairing/rebuilding/compacting folder database (msf) file.
I've seen this regression range before with the
  Jorg K — Bug 1358444 - port bug 1356843 to mailnews: Fix clang's -Wcomma warnings. rs=bustage-fix DONTBUILD

But none of the stuff in the range can be the culprit. Like in bug 1368786 comment #1 you're off by a day or so and the culprit is, ta dah, ... bug 853881 which landed on 22nd April 2017 in two changesets, bug 853881 comment #34 and bug 853881 comment #42. Note the crash fix we had to do on it in bug 1368786.

Joshua, can you tell us why rolling in a different caching scheme causes this performance degradation.

I've noticed semi-hangs as well with a lot of disk activity (but didn't pay much attention to them). I have the TB data on a magnetic disk, and I can hear it rattle.
Severity: major → critical
Flags: needinfo?(Pidgeot18)
Alta88, you could back out the changesets from bug 853881 and bug 1368786 and see whether that fixes things.
I'm not sure if what I'm seeing might be related:

I'm seeing massive delays (TB not responding) of 60 seconds and more, both on flat install and normal release, when opening a simple draft message from POP account with 14 plain text attachments (different extensions) totalling 1KB (sic!). I wouldn't claim my laptop is very fast, but I have 12 GB or RAM and I'm surprised to see Daily with 30% CPU load while trying to load that simple message.
Great, jsmimed again. Jorg, your optimism in expecting a needinfo response is touching, but history shows misplaced.

Nothing in bug 853881 indicates the change was important. Core did a lot of work on PLDHashTable, and uses it extensively, since the original patch, whatever reason drove it in the first place. (abc is also hard to read when one is 3).

I strongly suggest it and related patches be backed out, unless someone is going to dig into it. And even then, what demonstrable benefit is there.  If it's not perf equivalent, it's a fail.

It's really tiresome to constantly Groundhog Day things like Bug 418551.
(In reply to Jorg K (:jorgk, GMT+1) from comment #4)
> I've seen this regression range before with the
>   Jorg K — Bug 1358444 - port bug 1356843 to mailnews: Fix clang's -Wcomma
> warnings. rs=bustage-fix DONTBUILD
> 
> But none of the stuff in the range can be the culprit. Like in bug 1368786
> comment #1 you're off by a day or so 

hmm. I retested to confirm my findings. To put a finer point on my regression range
* works 04-21 buildid 20170421030231 built from https://hg.mozilla.org/comm-central/rev/e2beacaab75ad46459484a89d2d0ae479db18f52
* fails 04-22 buildid 20170422030207 built from https://hg.mozilla.org/comm-central/rev/62bfaab2731fc47ee7ed26aefa001e63c4f2c371 --  

CPU isn't high, in fact it is low during the repair process - ranging 1-15% - which is very strange.

Attached is developer tools performance profile, from beginning to end of "repair folder"

** downloaded from https://archive.mozilla.org/pub/thunderbird/nightly/2017/04/2017-04-21-03-02-31-comm-central/  
  File 	thunderbird-55.0a1.en-US.win32.installer.exe 	33M 	21-Apr-2017 10:58
(In reply to Jorg K (:jorgk, GMT+1) from comment #4)
> Joshua, can you tell us why rolling in a different caching scheme causes
> this performance degradation.

It may be due to some bug is unused. Or the hash function changed and is not very appropriate to the kind of keys (message keys) we store and there are many collisions. In the case of msf rebuild, we read the keys sequentially (and they may be monotonically increasing), which may be a pattern where this particular cache/hash has an edge case bottleneck.
I had a chat to Eric (:erahm) and Nathan (:froydnj) on IRC to learn a bit more about those tables.

Turns out that the new tables nsTHashtable and friend (https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Guide/Hashtables) are a wrapper for PLDHashTable, see:
https://dxr.mozilla.org/mozilla-central/rev/2535bad09d720e71a982f3f70dd6925f66ab8ec7/xpcom/ds/nsTHashtable.h#348

Some comments from IRC:

22:34:28 - jorgk: So if the original code worked with PLDHashTable, what would have been the motivation to switch to the new interface and put another layer in between?
22:38:51 - froydnj: jorgk: less code and more comprehensible code?
22:38:58 - erahm: jorgk: the motivation is it's easier to use correctly, handles various ref and string types for you, etc. tbh if you have something that's working okay I wouldn't waste dev time on moving away unless you have spare cycles. Like remove mork before you switch the table type
22:42:25 - jorgk: OK guys. That's for the answers. Sadly we decided to turn something working based on PLDHashTable into something not-working based on the newer classes and we burned a damn awful lot of cycles (we don't really have) to do it. Now I need to decide whether to invest more cycles or just back it all out.

Eric offered to glance over the changes, so here they are:
https://hg.mozilla.org/comm-central/rev/62bfaab2731fc47ee7ed26aefa001e63c4f2c371 - bug 853881 comment #34 (main patch)
https://hg.mozilla.org/comm-central/rev/5c3191edbbd4fe552450b582d520ba9be4cd431b - bug 853881 comment #42
https://hg.mozilla.org/comm-central/rev/a0bd8f2b70be - bug 1368786 comment #47, fix memory issues.

Note that the performance regression was already caused after landing the first changeset.

Eric, here's the information you need, thanks for the offer.
Flags: needinfo?(erahm)
As mentioned in comment 8 where I attached the performance profile, and to reiterate, this is NOT causing high CPU. So some other resource issue is going on with the patch. And I forget to say it is not causing significant memory increase. 

Because this has been in the field so long we should not rush for a fix. But we definitely need to fix before 59 goes to beta.
(In reply to Wayne Mery (:wsmwk) from comment #8)
> CPU isn't high, in fact it is low during the repair process - ranging 1-15%
> - which is very strange.

Low CPU usage suggests either I/O bottleneck or excessive sleep pauses.

(In reply to Jorg K (:jorgk, GMT+1) from comment #10)
> 22:34:28 - jorgk: So if the original code worked with PLDHashTable, what
> would have been the motivation to switch to the new interface and put
> another layer in between?
> 22:38:51 - froydnj: jorgk: less code and more comprehensible code?

I'll point out that the original PLDHashTable-based implementation had some really nasty ownership issues, since it predates nsCOMPtr (maybe, I'm not 100% sure of when that helper was first added) and used manual refcount calls that were placed in somewhat odd locations.

(In reply to Jorg K (:jorgk, GMT+1) from comment #4)
> Joshua, can you tell us why rolling in a different caching scheme causes
> this performance degradation.

I'd suspect it's issues caused from accidentally missing one of the crazy manual NS_ADDREF/NS_RELEASE calls to nsMsgHdr objects or something in that vein.
Flags: needinfo?(Pidgeot18)
Comment on attachment 8927448 [details] [diff] [review]
Backout patch for bug 853881 and bug 1368786

I can verify that in a build with this backout patch applied the regression is eliminated.
Attachment #8927448 - Flags: review?(acelists)
Just to explain how the patch was created:
I backed out the three changesets listed in comment #10 in reverse order using "hg backout -r ...". There was a merge conflict where I chose "C" (IIRC). I exported the resulting patches with TortoiseHg and then folded them into one patch using "hg qfold". I have a Windows build with that patch
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=e16019b64aea30afcdcec68b1ef13bc83b4cceea
which I'm using now to see whether the occasional severe delays when accessing large folders will disappear. The X failures are not due to the patch but due to bug 1410780.

That said, it would be better to fix the regression than to back out the whole lot. Anyway, not for me to decide.
So if I read the attached profile correctly (in the Developer tools inside TB), 70% of the time is spent in nsInputStreamPump::OnStateTransfer():545 (which calls nsInputStreamPump::OnInputStreamReady():400) and 22% in nsXULWindow::ShowModal().

If there would be excessive hash collisions or bad caching of objects, I would expect there to be 100% CPU usage.
But Wayne says there isn't. May there be significant CPU usage in the OS (but not in TB)? Or high I/O activity? I could imagine something like closing and reopening the same file again for each message which could cause work for the OS/filesystem/disk.
Wayne, can you retest? Of course, the reopening could be caused by bad caching caused by the patch, we just need to find out in which area the code regressed.
(In reply to Jorg K [exited Thunderbird project due to non-renewal of contract] from comment #4)
> I've noticed semi-hangs as well with a lot of disk activity (but didn't pay
> much attention to them). I have the TB data on a magnetic disk, and I can
> hear it rattle.
Please refer to my comment #4. During the freeze/hang/slowness there is a lot of *disk* activity which I can hear, but users of an SSD can't. That matches what Aceman wrote in comment #16: It comes to disk I/O. If I were to make a wild guess, I'd say that something in the cache goes wrong and TB decides to read the entire MSF or even recreate it. Anyway, that comment applies to long access times to large folders.
Attached image bug 1415723 disk IO.png
I confirm an IO increase with 57.0b1 compared to 04-21 build - 3x more IO reads (10k reads vs 3.5k reads, write count is unchanged)  to repair my 12k message slashdot folder on SSD - see attached, tested only once.  

But that read rate increase alone is I think does not seem enough to account for a 20-25x increase in execution time. 3-4 seconds at about 1k reads per second for 04-21 vs ~2 minutes at 80 reads per second!
profile for the faster 04-21 build
p.s. disabling global search made no difference
Wayne, try
https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-e16019b64aea30afcdcec68b1ef13bc83b4cceea/try-comm-central-win32/
It's got bug 853881 and bug 1368786 backed out (and also the M-C change that renders add-ons unusable, so this is a Daily that can be used for real).
Flags: needinfo?(vseerror)
Jorg, thanks for the try  build.  

Try build IO reads and write are similar to 04-21. But note, my numbers/old math for 0421 build writes were wrong, write are 16k, not 26k.  So a 56% increase in writes for 04-22.  STILL, a 56% increase in write IO seemingly could not cause a 20x increase in run time.  Someone might check my math :)

*new math
                  reads  writes
04-21/try build   3.5k    16k
04-22/57.0b1      10k     25k
Blocks: 853881
Flags: needinfo?(vseerror)
Whiteboard: [regression:TB5?] → [regression:TB55]
I added notes on possible issues with the original patches in bug 853881, comment 45. The might be not completely correct, I didn't have time to dive in super deep.
Flags: needinfo?(erahm)
Comment on attachment 8927448 [details] [diff] [review]
Backout patch for bug 853881 and bug 1368786

No review required here. Bug 853881 and bug 1368786 will be backed out with this patch RSN(TM) for the following reasons:
- Bug 853881 comment #45 by Eric Rahm suggests that the original
  patch had some design problems.
- It is uncertain when those problems can be addressed.

If we want to pursue the hash table replacement, we should do so in a new bug with a fresh start and with a developer who is committed to see the bug through to the end including regressions, etc.
Attachment #8927448 - Flags: review?(acelists)
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/10d6a264303b
back out bug 853881 and follow-up bug 1368786 for causing a performance regression. a=jorgk
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Fixed by backout.

We need to fix the target here once TB 59 becomes available.
Target Milestone: --- → Thunderbird 58.0
Target Milestone: Thunderbird 58.0 → Thunderbird 59.0
You need to log in before you can comment on or make changes to this bug.