Closed Bug 1415723 Opened 2 years ago Closed 2 years ago
Severe regression in at least Tb55 when repairing/rebuilding/compacting folder database (msf) file
In Tb52.4, a folder subscribed to the comm-central changelog feed, containing 17k messages (22 MB, with an 11 MB .msf database), takes ~2 seconds to repair or rebuild on selection if the .msf is deleted. In Tb55 beta (not yet tested at 56) and on Daily, this takes ~80 seconds, pegs one core's CPU, and freezes Tb. If enough rebuilds are requested, as when copying a profile with lots of folders containing lots of messages, I learned that an 8core 16M linux os will soon blow up impressively.
wsm, can you reproduce this? It's a borderline blocker, imo.
Confirmed. With the 57 beta candidate (but NOT in safe mode) on an SSD, a 12k-article slashdot feed took 2 minutes; on 52.2.1 it was <2 seconds. A 12k-message local folder took one minute. I narrowed the regression to between 2017-04-21 and 2017-04-22 (55.0a1): https://hg.mozilla.org/comm-central/pushloghtml?startdate=2017-04-21+03%3A05%3A00&enddate=2017-04-22+03%3A05%3A00
Compact is also affected. No dupes that I can see in https://mzl.la/2zu1pGm. While testing the range 2017-04-19 to 2017-04-26 I experienced frequent redownloading and reindexing of gmail folders (and the nsMsgDatabase::GetUint32Property crash bp-16d0c827-711f-4cd7-afe4-91c4e0171109, fixed by bug 1360873 on 2017-05-01).
Summary: Severe regression in at least Tb55 when repairing/rebuilding folder database (msf) file. → Severe regression in at least Tb55 when repairing/rebuilding/compacting folder database (msf) file.
I've seen this regression range before; it ends with:
Jorg K — Bug 1358444 - port bug 1356843 to mailnews: Fix clang's -Wcomma warnings. rs=bustage-fix DONTBUILD
But none of the stuff in that range can be the culprit. Like in bug 1368786 comment #1, you're off by a day or so, and the culprit is, ta dah, ... bug 853881, which landed on 22nd April 2017 in two changesets, bug 853881 comment #34 and bug 853881 comment #42. Note the crash fix we had to do on it in bug 1368786. Joshua, can you tell us why rolling in a different caching scheme causes this performance degradation? I've noticed semi-hangs as well with a lot of disk activity (but didn't pay much attention to them). I have the TB data on a magnetic disk, and I can hear it rattle.
Severity: major → critical
Alta88, you could back out the changesets from bug 853881 and bug 1368786 and see whether that fixes things.
I'm not sure whether what I'm seeing is related: I'm seeing massive delays (TB not responding) of 60 seconds and more, both on a flat install and the normal release, when opening a simple draft message from a POP account with 14 plain-text attachments (different extensions) totalling 1 KB (sic!). I wouldn't claim my laptop is very fast, but I have 12 GB of RAM and I'm surprised to see Daily at 30% CPU load while trying to load that simple message.
Great, jsmimed again. Jorg, your optimism in expecting a needinfo response is touching, but history shows it misplaced. Nothing in bug 853881 indicates the change was important. Core has done a lot of work on PLDHashTable, and uses it extensively, since the original patch, whatever reason drove it in the first place. (abc is also hard to read when one is 3.) I strongly suggest it and related patches be backed out, unless someone is going to dig into it. And even then, what demonstrable benefit is there? If it's not perf-equivalent, it's a fail. It's really tiresome to constantly Groundhog Day things like bug 418551.
(In reply to Jorg K (:jorgk, GMT+1) from comment #4)
> I've seen this regression range before with the
> Jorg K — Bug 1358444 - port bug 1356843 to mailnews: Fix clang's -Wcomma
> warnings. rs=bustage-fix DONTBUILD
>
> But none of the stuff in the range can be the culprit. Like in bug 1368786
> comment #1 you're off by a day or so

Hmm. I retested to confirm my findings. To put a finer point on my regression range:
* works: 04-21, buildid 20170421030231, built from https://hg.mozilla.org/comm-central/rev/e2beacaab75ad46459484a89d2d0ae479db18f52
* fails: 04-22, buildid 20170422030207, built from https://hg.mozilla.org/comm-central/rev/62bfaab2731fc47ee7ed26aefa001e63c4f2c371

CPU isn't high; in fact it is low during the repair process (ranging 1-15%), which is very strange. Attached is a developer tools performance profile, from beginning to end of "repair folder".

** 04-21 build downloaded from https://archive.mozilla.org/pub/thunderbird/nightly/2017/04/2017-04-21-03-02-31-comm-central/
File: thunderbird-55.0a1.en-US.win32.installer.exe, 33M, 21-Apr-2017 10:58
(In reply to Jorg K (:jorgk, GMT+1) from comment #4)
> Joshua, can you tell us why rolling in a different caching scheme causes
> this performance degradation.

It may be that, due to some bug, the cache is unused. Or the hash function changed and is not very appropriate to the kind of keys (message keys) we store, so there are many collisions. In the case of an msf rebuild, we read the keys sequentially (and they may be monotonically increasing), which may be a pattern where this particular cache/hash hits an edge-case bottleneck.
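Aceman's collision hypothesis can be illustrated with a toy simulation (purely hypothetical; this is not the PLDHashTable or nsTHashtable code, and the multiplier below is just the standard Knuth golden-ratio constant): structured, monotonically increasing keys can crowd into a fraction of the buckets under a naive bucket function, while a multiplicative scrambling of the same keys spreads them out.

```python
# Toy model: how key structure interacts with the bucket function of a
# power-of-two hash table. Message keys are simulated as a sequential
# run with a fixed stride; real .msf keys may differ.

TABLE_SIZE = 4096                        # power-of-two table, 2**12 buckets
keys = range(0, 4 * TABLE_SIZE, 4)       # 4096 sequential keys, stride 4

def naive_bucket(key):
    # Plain modulo: stride-4 keys can only ever land on 1/4 of the buckets,
    # so every used bucket collects a 4-deep collision chain.
    return key % TABLE_SIZE

def scrambled_bucket(key):
    # Multiplicative (golden-ratio) scrambling, then take the top 12 bits.
    h = (key * 2654435761) & 0xFFFFFFFF
    return h >> (32 - 12)

naive_used = len({naive_bucket(k) for k in keys})
scrambled_used = len({scrambled_bucket(k) for k in keys})
print(f"buckets used: naive={naive_used}, scrambled={scrambled_used}")
```

If the new wrapper dropped or weakened such scrambling for message keys, sequential inserts during an .msf rebuild could degenerate into long probe chains, which would fit a rebuild-time blowup.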
I had a chat with Eric (:erahm) and Nathan (:froydnj) on IRC to learn a bit more about those tables. Turns out that the new tables, nsTHashtable and friends (https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Guide/Hashtables), are a wrapper for PLDHashTable, see:
https://dxr.mozilla.org/mozilla-central/rev/2535bad09d720e71a982f3f70dd6925f66ab8ec7/xpcom/ds/nsTHashtable.h#348

Some comments from IRC:
22:34:28 - jorgk: So if the original code worked with PLDHashTable, what would have been the motivation to switch to the new interface and put another layer in between?
22:38:51 - froydnj: jorgk: less code and more comprehensible code?
22:38:58 - erahm: jorgk: the motivation is it's easier to use correctly, handles various ref and string types for you, etc. tbh if you have something that's working okay I wouldn't waste dev time on moving away unless you have spare cycles. Like remove mork before you switch the table type
22:42:25 - jorgk: OK guys. Thanks for the answers. Sadly we decided to turn something working based on PLDHashTable into something not-working based on the newer classes, and we burned a damn awful lot of cycles (we don't really have) to do it. Now I need to decide whether to invest more cycles or just back it all out.

Eric offered to glance over the changes, so here they are:
https://hg.mozilla.org/comm-central/rev/62bfaab2731fc47ee7ed26aefa001e63c4f2c371 - bug 853881 comment #34 (main patch)
https://hg.mozilla.org/comm-central/rev/5c3191edbbd4fe552450b582d520ba9be4cd431b - bug 853881 comment #42
https://hg.mozilla.org/comm-central/rev/a0bd8f2b70be - bug 1368786 comment #47, fix memory issues.

Note that the performance regression was already present after the first changeset landed. Eric, here's the information you need, thanks for the offer.
As mentioned in comment 8, where I attached the performance profile, and to reiterate: this is NOT causing high CPU. So some other resource issue is going on with the patch. And I forgot to say that it is not causing a significant memory increase either. Because this has been in the field so long, we should not rush a fix. But we definitely need to fix it before 59 goes to beta.
(In reply to Wayne Mery (:wsmwk) from comment #8)
> CPU isn't high, in fact it is low during the repair process - ranging 1-15%
> - which is very strange.

Low CPU usage suggests either an I/O bottleneck or excessive sleep pauses.

(In reply to Jorg K (:jorgk, GMT+1) from comment #10)
> 22:34:28 - jorgk: So if the original code worked with PLDHashTable, what
> would have been the motivation to switch to the new interface and put
> another layer in between?
> 22:38:51 - froydnj: jorgk: less code and more comprehensible code?

I'll point out that the original PLDHashTable-based implementation had some really nasty ownership issues, since it predates nsCOMPtr (maybe; I'm not 100% sure when that helper was first added) and used manual refcount calls placed in somewhat odd locations.

(In reply to Jorg K (:jorgk, GMT+1) from comment #4)
> Joshua, can you tell us why rolling in a different caching scheme causes
> this performance degradation.

I'd suspect it's issues caused by accidentally missing one of the crazy manual NS_ADDREF/NS_RELEASE calls to nsMsgHdr objects, or something in that vein.
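Joshua's point about manual refcounting can be sketched with a toy model (hypothetical; nsMsgHdr, NS_ADDREF/NS_RELEASE and nsCOMPtr are the real C++ names, but none of this is Mozilla code): one early-return path that skips the matching release leaks a reference forever, which is exactly the class of bug an RAII holder like nsCOMPtr rules out.

```python
import contextlib

class MsgHdr:
    """Toy stand-in for a refcounted nsMsgHdr; the creator holds one ref."""
    def __init__(self):
        self.refcnt = 1
    def addref(self):            # NS_ADDREF-style
        self.refcnt += 1
    def release(self):           # NS_RELEASE-style
        self.refcnt -= 1

def lookup_manual(hdr, fail):
    hdr.addref()                 # manual reference taken
    if fail:
        return None              # BUG: early return skips the release below
    hdr.release()
    return hdr

@contextlib.contextmanager
def holding(hdr):
    """nsCOMPtr-style scope guard: the release cannot be forgotten."""
    hdr.addref()
    try:
        yield hdr
    finally:
        hdr.release()            # runs on every exit path

def lookup_raii(hdr, fail):
    with holding(hdr):
        if fail:
            return None          # release still runs when the scope unwinds
        return hdr
```

Running both with the failure path taken, the manual version leaves the count one too high (a leaked header that is never freed), while the RAII version returns to the original count.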
Comment on attachment 8927448 [details] [diff] [review]
Backout patch for bug 853881 and bug 1368786

I can verify that, in a build with this backout patch applied, the regression is eliminated.
Just to explain how the patch was created: I backed out the three changesets listed in comment #10 in reverse order using "hg backout -r ...". There was a merge conflict where I chose "C" (IIRC). I exported the resulting patches with TortoiseHg and then folded them into one patch using "hg qfold". I have a Windows build with that patch, https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=e16019b64aea30afcdcec68b1ef13bc83b4cceea, which I'm using now to see whether the occasional severe delays when accessing large folders disappear. The X failures are not due to the patch but to bug 1410780. That said, it would be better to fix the regression than to back out the whole lot. Anyway, not for me to decide.
So if I read the attached profile correctly (in the developer tools inside TB), 70% of the time is spent in nsInputStreamPump::OnStateTransfer():545 (which calls nsInputStreamPump::OnInputStreamReady():400) and 22% in nsXULWindow::ShowModal(). If there were excessive hash collisions or bad caching of objects, I would expect 100% CPU usage. But Wayne says there isn't. Might there be significant CPU usage in the OS (but not in TB)? Or high I/O activity? I could imagine something like closing and reopening the same file for each message, which could cause work for the OS/filesystem/disk. Wayne, can you retest? Of course, the reopening could itself be caused by bad caching introduced by the patch; we just need to find out in which area the code regressed.
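The reopen-per-message hypothesis is easy to model (a rough sketch with made-up record sizes, not Thunderbird code): reading N records by reopening the file each time issues an open/seek/close per record, which costs the OS and filesystem far more than one sequential pass, even while the application's own CPU usage stays low.

```python
import os
import tempfile
import time

RECORD, N = 128, 2000              # hypothetical fixed-size "message" records

# Build a throwaway store file of N records.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(os.urandom(RECORD * N))

def read_reopen_each():
    """Open, seek, read, close for every single record."""
    out = []
    for i in range(N):
        with open(path, "rb") as f:
            f.seek(i * RECORD)
            out.append(f.read(RECORD))
    return out

def read_single_open():
    """One open, then sequential reads that OS readahead can service."""
    with open(path, "rb") as f:
        return [f.read(RECORD) for _ in range(N)]

t0 = time.perf_counter(); a = read_reopen_each()
t1 = time.perf_counter(); b = read_single_open()
t2 = time.perf_counter()
print(f"reopen per record: {t1 - t0:.3f}s, single open: {t2 - t1:.3f}s")
os.remove(path)
```

Both functions return identical data; only the access pattern differs. On a warm page cache the gap is modest, but against a cold cache or a magnetic disk the per-record open/seek pattern would show up exactly as Wayne observed: long wall-clock time with low in-process CPU.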
(In reply to Jorg K [exited Thunderbird project due to non-renewal of contract] from comment #4)
> I've noticed semi-hangs as well with a lot of disk activity (but didn't pay
> much attention to them). I have the TB data on a magnetic disk, and I can
> hear it rattle.

Please refer to my comment #4. During the freeze/hang/slowness there is a lot of *disk* activity, which I can hear, but which users of an SSD can't. That matches what Aceman wrote in comment #16: it comes down to disk I/O. If I were to make a wild guess, I'd say that something in the cache goes wrong and TB decides to read the entire MSF or even recreate it. Anyway, that comment applies to the long access times on large folders.
I confirm an I/O increase with 57.0b1 compared to the 04-21 build: 3x more I/O reads (10k reads vs 3.5k reads; write count is unchanged) to repair my 12k-message slashdot folder on SSD. See attached; tested only once. But that read-count increase alone does not seem enough to account for a 20-25x increase in execution time: 3-4 seconds at about 1k reads per second for 04-21 vs ~2 minutes at 80 reads per second!
profile for the faster 04-21 build
p.s. disabling global search made no difference
Wayne, try https://email@example.com/try-comm-central-win32/ It's got bug 853881 and bug 1368786 backed out (and also the M-C change that renders add-ons unusable, so this is a Daily that can be used for real).
Jorg, thanks for the try build. Try build I/O reads and writes are similar to 04-21. But note, my numbers/old math for the 04-21 build writes were wrong: writes are 16k, not 26k. So that's a 56% increase in writes for 04-22. STILL, a 56% increase in write I/O seemingly could not cause a 20x increase in run time. Someone might check my math :)

New math:
                 reads   writes
04-21/try build  3.5k    16k
04-22/57.0b1     10k     25k
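Taking Wayne up on checking the math (a quick sketch using only the numbers from the table above, with his ~20x run-time estimate taken at face value): the operation counts grew far less than the run time did, so the average cost per operation must also have regressed, consistent with a cache-defeating access pattern rather than just more I/O.

```python
# Wayne's measured I/O counts for repairing the 12k-message folder.
reads_old, reads_new = 3_500, 10_000     # 04-21/try build vs 04-22/57.0b1
writes_old, writes_new = 16_000, 25_000

read_ratio = reads_new / reads_old                        # ~2.86x more reads
write_increase = (writes_new - writes_old) / writes_old   # ~56% more writes
print(f"reads: {read_ratio:.2f}x, writes: +{write_increase:.0%}")

# Total ops grew ~1.8x, but elapsed time grew ~20x, so the implied average
# time per operation grew roughly 20 / 1.8, i.e. on the order of 10x.
ops_ratio = (reads_new + writes_new) / (reads_old + writes_old)
per_op_slowdown = 20 / ops_ratio
print(f"total ops: {ops_ratio:.2f}x, implied per-op slowdown: ~{per_op_slowdown:.0f}x")
```

So the arithmetic backs Wayne's intuition: the extra reads and writes alone cannot explain the slowdown.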
Whiteboard: [regression:TB5?] → [regression:TB55]
I added notes on possible issues with the original patches in bug 853881 comment 45. They might not be completely correct; I didn't have time to dive in super deep.
Comment on attachment 8927448 [details] [diff] [review]
Backout patch for bug 853881 and bug 1368786

No review required here. Bug 853881 and bug 1368786 will be backed out with this patch RSN(TM) for the following reasons:
- Bug 853881 comment #45 by Eric Rahm suggests that the original patch had some design problems.
- It is uncertain when those problems can be addressed.
If we want to pursue the hash table replacement, we should do so in a new bug, with a fresh start and with a developer who is committed to seeing the bug through to the end, including regressions, etc.
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/comm-central/rev/10d6a264303b back out bug 853881 and follow-up bug 1368786 for causing a performance regression. a=jorgk
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Fixed by backout. We need to fix the target here once TB 59 becomes available.
Target Milestone: --- → Thunderbird 58.0
Backed out from Beta, TB 58: https://hg.mozilla.org/releases/comm-beta/rev/a53e3e2a9ba92ac4d8d4190d03222043004afa48