Open Bug 1642292 Opened 8 months ago Updated 1 month ago

Calendar CalDAV - Thunderbird processing / "Not Responding" 10+ minutes at startup with 77.0b3 - Application Unusable - Major Regression

Categories

(Calendar :: General, defect)

Thunderbird 77
x86_64
All
defect

Tracking

(thunderbird_esr78+ affected)

Tracking Status
thunderbird_esr78 + affected

People

(Reporter: richard.leger, Unassigned, NeedInfo)

References

(Depends on 1 open bug, Blocks 2 open bugs, Regression, )

Details

(Keywords: perf, regression)

Attachments

(30 files)

1.45 MB, application/x-zip-compressed
Details
680.74 KB, application/x-zip-compressed
Details
1.72 MB, application/x-zip-compressed
Details
86.67 KB, image/png
Details
44.18 KB, image/png
Details
88.23 KB, text/plain
Details
1.73 MB, application/x-zip-compressed
Details
24.87 KB, image/png
Details
42.36 KB, image/png
Details
37.14 KB, text/plain
Details
2.40 KB, text/plain
Details
2.45 KB, text/plain
Details
6.96 KB, image/png
Details
7.05 KB, image/jpeg
Details
7.22 KB, image/png
Details
23.83 KB, image/png
Details
43.73 KB, image/png
Details
980.29 KB, application/x-zip-compressed
Details
1.41 MB, application/x-zip-compressed
Details
1.50 MB, application/x-zip-compressed
Details
30.93 KB, image/png
Details
4.81 MB, application/x-zip-compressed
Details
6.08 MB, application/x-zip-compressed
Details
79.85 KB, image/png
Details
36.83 KB, image/png
Details
2.96 MB, application/x-zip-compressed
Details
83.28 KB, image/png
Details
25.25 KB, image/png
Details
7.95 KB, text/plain
Details
87.88 KB, image/png
Details

+++ This bug was initially created as a clone of Bug #1502923 +++
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0

Steps to reproduce:

Upgraded to TB 77.0b3 from 77.0b2...

Actual results:

The application take ages to start and keep processing/not responding 10mn+ ad vitam aeternam... the UI ix so slow and freezing intermittently that the application is simply unusable... though if you try to open menu after long wait it would do it so the UI is not completely frozen. In safe mode issue is still occurring, and as Lightning is now integrated to Core it has become impossible to disable it... to at least be able to use email feature :-)

Note: After major improvements (via workaround e.g lazzy loading) introduced in the past months, TB was loading within about ~1m10s when using large multiple Caldav calendars... but since recent betas and especially 77.x branch it has got worth over time... with a culmination in 77.0b3 which has now became unusable... just remain in Not Responding mode with UI freeze for while at regular intervals...

I managed to gatter (with great difficulties and patience) the attached performance report which I hope may shade some light on the issue... and help sorting it out...

Expected results:

After starting the application, it should be usable within 20 seconds maximum...

As a workaround, while still running in Safe Mode, I have managed to regain access to emails by:

  • Putting TB in Offline Mode
  • Disabling all my CalDAV calendars
  • Putting TB in Online Mode
    ==> I can now access emails...

Issue re-occurs as soon as I turn ON one of my CalDAV calendar (as per Properties > Enable this Calendar)... it does not seems to matter what view is selected e.g:

  • Events in the next 7 days + Week view
    or
  • Event on current selected Day + Day view

Putting TB in Offline Mode (which requires to click on the icon once and wait a long time) then issue stops again. So definitely a problem with the Calendar feature of TB... especially when syncing online...

Last lines from the console:

(...)
Lightning: CalDAV: New webdav-sync Token: data:,2283020 CalDavRequestHandlers.jsm:832
Lightning: aChangeLogListener=null
calendarURI=https://*****/davical/caldav.php/richard/home/
iscached=false
this.mQueuedQueries.length=0 CalDavCalendar.jsm:1024
Lightning: CalDAV: No response status doing multiget for calendar Richard's Calendar CalDavRequestHandlers.jsm:798
Lightning: CalDAV: Get failed: CalDAV: Error: got status undefined fetching calendar data for Richard's Calendar, null CalDavCalendar.jsm:1068
Lightning: There has been an error reading data for calendar: Richard's Calendar. However, this error is believed to be minor, so the program will attempt to continue. Error code: 0x80004005. Description: CalDAV: Error: got status undefined fetching calendar data for Richard's Calendar, null CalCalendarManager.jsm:736
Lightning: There has been an error reading data for calendar: Richard's Calendar. However, this error is believed to be minor, so the program will attempt to continue. Error code: READ_FAILED. Description: CalCalendarManager.jsm:736
Lightning: CalDAV: New webdav-sync Token: data:,2283021 CalDavRequestHandlers.jsm:832
Lightning: aChangeLogListener=null
calendarURI=https://*****/davical/caldav.php/richard/home/
iscached=false
this.mQueuedQueries.length=0 CalDavCalendar.jsm:1024
Lightning: CalDAV: onStopRequest: no reader CalDavRequestHandlers.jsm:851
debuggee `resource://devtools/shared/base-loader.js:289' would run

Restarted TB with all CalDAV calendars disabled and notice that TB was still processing for 1-2mn without allowing me access to emails... while the UI much more responsive...

Attached is a perf recording of 30s while TB was processing if that can shade some more light...

I managed to workaround the issue for now by putting TB in offline mode shortly after it startup. And then put it back in online mode.
Then both email/calendar seems usable after a short while without the Not Responding issue occurring.

But really not an ideal situation... this issue also seem to cause some unexpected side effect Bug 1642580...

See Also: → 1582578

Issue still happening in 78.0b1 (32-bit), it has gone worth as now the CPU fan is running crescendo upon TB startup/not responding... and remain maximum running... very noisy... situation stopping immediately after putting TB in offline mode...

Shouldn't this be considered as a critical issue? Especially considering that 78.x branch is supposed to be the next stable version...

By now TB performance shall have seen improvements not worthen!!!
Clearly performance still does not seem a priority for the dev team especially the calendar team...

While there may be major rework going on under the hood and new features being implemented, it should go hand-in-hand with performance improvements... not to the detriment of the later... this has been raised as a concern months ago and appears still as not being tackled in regards of the results in TB following months of developments, sad :-(

Looking forward to major perf improvements in next betas prior stable version pops out!

(In reply to Richard Leger from comment #5)

Issue still happening in 78.0b1 (32-bit), it has gone worth as now the CPU fan is running crescendo upon TB startup/not responding... and remain maximum running... very noisy... situation stopping immediately after putting TB in offline mode...

Shouldn't this be considered as a critical issue? Especially considering that 78.x branch is supposed to be the next stable version...

By now TB performance shall have seen improvements not worthen!!!
Clearly performance still does not seem a priority for the dev team especially the calendar team...

While there may be major rework going on under the hood and new features being implemented, it should go hand-in-hand with performance improvements... not to the detriment of the later... this has been raised as a concern months ago and appears still as not being tackled in regards of the results in TB following months of developments, sad :-(

Looking forward to major perf improvements in next betas prior stable version pops out!

I don't think there is a need to be harsh toward the devs here. I am not seeing this CPU spike upon start. Although I am on 71.0b4 x64 (due to GData Provider 68.2.1 only working with my Google calendars), it has never take more than 10 seconds +/- to launch on my older system with SSD either at work or at home. I have tested briefly with all newer betas from 72 to 78 and do not see this issue. My main profile is 46.6GB and encompasses 20 years of email. If anyone is a candidate for possibly slow startup due to a huge profile, it would be me. But, I don't use CalDAV and the one time I tried, I couldn't get it to work right. How big is your profile? Have you though about moving to 64-bit TB or is your OS 32-bit?

It was said many eons ago that TB was not going to be getting as much support as Firefox. That being said, I have seen perf improve when I went to x64 a long time ago. All I am running is Lightning and GData Provider for Google calendars (68.2.1), just not CalDAV. I have seen others running CalDAV and having similar issues.

Fyi, TB running on Windows 10 1909 x64 bits i7 with SSD. With multiple network Caldav calendar setup but only one enabled my main calendar with 4000+ items.
Full calendar take 1-2sec to load entirely via http (.ics)
My user profile has not changed since few beta back where things were working "fine" (1-2mn startup time max an improvement compared to the 100+ minutes sometime ago...not that far!) situation deteriorated upon successive update of TB beta versions... to 77.x branch causing TB to become unusable from startup (nor responding) then to 78.x branch where situation worthen.
I can understand that someone using a different version of TB and setup, not encountering the issue, may think I may be too harsh, but everyone shall be able to understand that having TB becoming unresponsive at startup and unusable over auto-update especially to next stable candidate branch is extremly frustrating and shall be a major concern. I cannot believe I am the inly 10 year+ TB user (imap/smtp/caldav) out there... encountering perf issues...
This is a culmination of perf issues encountered and reported/worked on in the past two years by reporting bugs and analysing issues... some of them fixed, some pending... most of them due to Calendar system and CalDAV (in my case).
The reason I suspect the Calendar system is that when it occurs the calendar items are loading very slowly in UI... but a Gecko or other underlying issue may not be excluded either...

I will try your suggestion to uninstall 32bits and install 64bits see if that makes any difference... though it shall not in theory...

I'm using CalDAV with very large calendars, and I haven't noticed any problems.
There must be something in particular with your profile/data that is causing the problem.

FYI, find attached a perf report taken in safe mode while the issue occurs. Unless I am mistaking, it seems to clearly show that the culprit is linked to Calendar CalDav feature... especially the CalDavRequest.jsm (as seen in CallTree section of the report).

Here is a printscreen for convenience...

Attached image image.png

FYI, here are my calendar options...
So what should I change in my profile to make it work you think?

FYI, find attached a profile preferences diff between TB 78.0.b1 blank profile and my current profile...

Would you spot anything that could cause the "Not Responding" issue at startup?

All I have done is to update from one TB to another version within the beta channel... no changes were applied to profile between the moment it was working and the moment it was not... apart from updating/upgrading TB as prompted from within the app...

(In reply to Richard Leger from comment #12)

Created attachment 9157275 [details]
profile-prefs-diff-78.0b1-blank-vs-prod-anonymized.txt

FYI, find attached a profile preferences diff between TB 78.0.b1 blank profile and my current profile...

Would you spot anything that could cause the "Not Responding" issue at startup?

All I have done is to update from one TB to another version within the beta channel... no changes were applied to profile between the moment it was working and the moment it was not... apart from updating/upgrading TB as prompted from within the app...

I take it bumping up to TB x64 made zero difference?

(In reply to Arthur K. from comment #13)

(In reply to Richard Leger from comment #12)

I take it bumping up to TB x64 made zero difference?

After sending attachments as per above comments generated with TB 78.0b1 (32-bits) I have:

  • Uninstalled TB 78.0b1 (32-bits)
  • Rebooted computer
  • Renamed C:\Program Files (x86)\Mozilla Thunderbird into C:\Program Files (x86)\Mozilla Thunderbird.old, folder was still present after uninstall and not empty (contains an updated folder with files and sub-folders). I would have expected the TB (32-bits) uninstaller to remove entirely the x86 install folder.
  • Renamed C:\Program Files\Mozilla Thunderbird into C:\Program Files\Mozilla Thunderbird.old, folder still exists from previous install/uninstall of TB (64-bits) and still containing some .dll files. I would have expected the TB (64-bits) uninstaller previously used to remove entirely the x64 install folder.
  • Installed TB 78.0b1 (64-bits) from https://www-stage.thunderbird.net/en-US/thunderbird/beta/all/#E the version English US > Windows 64-BITS
    -- use standard installation and default options

==> Same "Not Responding" issue occurs at startup online (even in Safe Mode)

Now that Calendar feature has been integrated into core, is there a way to disable it to see if it makes any difference?
Before we could simply disable the add-on but now that the feature is integrated into core, it seems no longer possible to disable it.
Any clue?

(In reply to Richard Leger from comment #14)

Now that Calendar feature has been integrated into core, is there a way to disable it to see if it makes any difference?

The new way to do this is to disable each of your calendars. (Note that the calendar component won't be entirely disabled, but this should suffice in most cases for diagnostic purposes.)

Just going on what you have written and the timing, I suspect it may be related to the changes in bug 1546606.

FYI, see attached perf report recorded with TB 78.0b1 (64-bits).

Situation is about 10x worth with that version as UX is almost not responding at all... very long lag between interaction and reaction of UX... making it very difficult to go offline or take such perf report...

Looks like the same issue...

As per attached, CalDAV REPORT requests/response seems to take a very long time to be processed by TB... 4x requests handled in 6mn+

In the past, when last checked, it took about 48 requests for my calendar to be fully loaded... so at the above speed it is gonna take quite a while... 1H+ :-)

I don't think I can wait that long :-)

Issue seems only to occurs at startup. If I go offline, and back online after reading my first email, then the enabled calendar seems to work ok... sync, manage items, etc...

This issue also remind me bug 1515123 "devtools network response from a Lightning REPORT request loads very slowly (long delay) with UI freeze/not responding"

Could it be an issue with the response parsing again?

(In reply to Paul Morris [:pmorris] from comment #15)

(In reply to Richard Leger from comment #14)

Now that Calendar feature has been integrated into core, is there a way to disable it to see if it makes any difference?

The new way to do this is to disable each of your calendars. (Note that the calendar component won't be entirely disabled, but this should suffice in most cases for diagnostic purposes.)

When disabling all calendars, I can read my first email after 37s upon startup... still not great but faster than with one calendar enabled :-)
37s in itself seems also a bit slow to me I think before it was closer to 20-25s... so the underlying issue may also affect perf for email sync/loading/parsing possibly... or be a small cumulation of issues in different places perhaps?

Just going on what you have written and the timing, I suspect it may be related to the changes in bug 1546606.

Likely to be linked to handling of the request response...

I don't think I can provide more info, but if you need anything more to look into it, let me know.

I hope it would be fixable for the next beta...

Attached image 52requests.png

One more thing...

If I re-enable my calendar while TB is already opened, it takes about 10mn+ for it to load completely (4000+ items) and about 52 network requests...

See attached statistics as per dev tools > network print screen...

Via Task Manager > Wifi I can see huge gaps between each network requests (aka TB processing time)... that slightly increase over time...

I'm confused

  • the whiteboard has [not regression]
  • but comment 0 implies regression after updating 77.0b2 -> b3

Seems inconsistent. What am I missing? If this is a regression, then the whiteboard should be removed and a keyword regression added to this bug.

Also, if this issue is caldav specific please add it to the bug summary. And is bug 1546606 (shipped in 77.0b1) unrelated to the performance issue?

(In reply to Richard Leger from comment #18)

...
I hope it would be fixable for the next beta...

Just to set reasonable expectations - betas for the next couple months will be at a rate of 1-2 per week, so unless the both the exact cause and coding solution are obvious (which is more likely IF this is in fact a very recent regression), plus a developer has the time, I would not expect a fix so quickly.

See Also: → 1502923, 1572823, 1543953
Keywords: regression
Whiteboard: [not regression]
Summary: Thunderbird processing / "Not Responding" 10+ minutes at startup with 77.0b3 - Application Unusable - Major Regression → Calendar CalDAV - Thunderbird processing / "Not Responding" 10+ minutes at startup with 77.0b3 - Application Unusable - Major Regression

FYI, find attached the TB 78.0b1 error+debug appearing in Console when loading only my calendar, nothing else (via new profile with only one Caldav Calendar setup, no email account setup). I thought perhaps it may help as well... if related...

Also, so far, I have kept track of Calendar CalDAV statistics (e.g LoadingTime perf improvements/regressions) for various TB version (especially time it take to load one CalDAV Calendar with 4000+ items entirely - first time) in Bug 1572823, if that is of interest to anyone for CalDAV past and future comparison of performance improvements/regressions.

Also got errors below towards the end (after 10mn) as just noticed:

Also got DevTools > Console > errors attached towards the end of calendar loading (after 10mn) as just noticed...

FYI, issue still happening in TB 78.0b2, it seems to be linked to area of code as per attachment.

Possibly due to multiple call to async function imbricated into each other... it seems to look like one async function is run for each event item processed... but I could be mistaking... while in theory it should only run an async function for the REPORT request/response to/from the server and then just await for complete response and process the completed response in one go (no more async/await required from that point).

Currently the DevTools > Debugger does not allow to step into async functions nor to see what is happening during await period... I could not find a way to see the Async stack in TB DevTools > Debugger though such feature seems to have been introduced since Firefox 75 (according to Bug 1616322 Comment 5).

Basically at this level of code, the debugger stop and I have to click 25 to 30 times on the "Resume F8" (during the clicks the only UI change that occurs in DevTools > Debugger is an empty yellow area appearing below the "Resume F8" button) for actions to keep running afterwards...

Hope that may help you find the culprit... and fix it :-)

See Also: → 1546606

Richrad,

How did you get the number of events in your CalDAV calendar(s)? Do I remember right that you said you had 4000+ events or something like that? Maybe that's from another bug that I'm thinking about. I'm only asking because today, with a nudge from Walt, I switched from Provider for Google Calendar extension to CalDAV on TB 78.0b3 and it's working like a champ. I have two GMail CalDAVs set up but heck if I know how many events I have in total for each. Where does it show that number?

I know you've pretty much summed up what appears to be the code causing you grief but I'm trying to see if there's anything I can see on my end now that I made the switch myself.

Flags: needinfo?(richard.leger)

(In reply to Arthur K. from comment #25)

How did you get the number of events in your CalDAV calendar(s)?

10 years of using TB daily and its CalDav feature (Lightning previously, core now...), also synced with my phone via Davx and occasionally synced with Outlook via CalDavSynchronizer but only for testing purpose.

Do I remember right that you said you had 4000+ events or something like that?

Yes. I also have 4x additional CalDav calendars (other people, shared) that I keep disabled because otherwise TB keep Not Responding at startup and is unusable. Now I cannot even open my own calendar at startup ;-) My environment is the real life experience ;-)

Maybe that's from another bug that I'm thinking about. I'm only asking because today, with a nudge from Walt, I switched from Provider for Google Calendar extension to CalDAV on TB 78.0b3 and it's working like a champ. I have two GMail CalDAVs set up but heck if I know how many events I have in total for each. Where does it show that number?

I don't think using only GMail for testing is the right thing to do... as it may use different backend perhaps? My calendar is setup and distributed via a Davical server...

I think the number of items appears somewhere in the request/response... or I may have got the number by exporting my calendar to gmail to test Gmail calendar sync (GSync provider) in the past and it showed number of items there in status bar while loading...

I know you've pretty much summed up what appears to be the code causing you grief but I'm trying to see if there's anything I can see on my end now that I made the switch myself.

In all honesty, I am just highlighting what I see and experience at my end, without really knowing what is causing the issue. Hoping it can help the TB team identify issue. So if you have an idea please pursue it... feel free to let me know what your idea is and if there is anyway I can help...

It might also be linked to Bug 1647657 but too early to tell if the fix would have positive impact...

Flags: needinfo?(richard.leger)

(In reply to Richard Leger from comment #26)

It might also be linked to Bug 1647657 but too early to tell if the fix would have positive impact...

I just took a peek at that bug. One tiny case of a capitalized O might be all it was? Hope it can get tested and, if it fixes it, put into a 78.0b4. I can imagine there are other instances of these minor case-typos and we don't even realize it.

(In reply to Arthur K. from comment #27)

(In reply to Richard Leger from comment #26)

It might also be linked to Bug 1647657 but too early to tell if the fix would have positive impact...

I just took a peek at that bug. One tiny case of a capitalized O might be all it was? Hope it can get tested and, if it fixes it, put into a 78.0b4. I can imagine there are other instances of these minor case-typos and we don't even realize it.

As per Bug 1647657 Comment 7 it is confirmed that fix for Bug 1647657 does not seems to resolve the issue described here in this bug (Bug 1642292).
So back to square one I am afraid...

Possible Calendar delay workaround

A long time ago when this delay to start was first reported, it was suggested to me to set my display to "Events in the Next 14 Days".

I also have a large calendar.

(In reply to Craig Arno from comment #30)

A long time ago when this delay to start was first reported, it was suggested to me to set my display to "Events in the Next 14 Days".

Thanks for the tip, I can confirm that even with "Currently Selected Day" set in display options of unifinder, issue is still occurring.

That tip relate to Bug 1502923 which I have analysed at length for which some improvements were brought to light but issue never really sorted and pushed under the rug till better days by introducing workaround such as Bug 1502923 "Lazily load calendar unifinder tree" which help bring the loading down from 1H+ to 1mn+ at the time... as confirmed in Bug 1572823 Comment 19.

Since then though the situation has deteriorated to the point that since 77.x TB is Not Responding at startup and situation worthen with 78.x branch which triggered registration of this bug... with my CalDAV calendar taking 10mn+ to load... nothing changed at my end apart from auto-updating TB.

Analysis above seems to suggest this bug is not related to Bug 1502923 so far. But I can't seems to get access to the async stack to see what happen when async function is run.

As said previously, the performance report seems to suggest there are multiple imbricated call to async functions at the moment... which may not complete fast enough...

According to DevTool > Network, the network request appear and then seems stuck for a while (as seen in Devtool UI) with no access to the request details or response... till the next one is triggered after a long while... which get stuck as well... both request timing and Windows Task Manager shows that server responds in due time without much delay... the TB not responding seems appearing during the processing of each item in the response... items do appear one by one in the view but at a very slow pace like every 15-20s or more...

Attached image devtool-async-prefs.png

I found two devtools prefs related to async stack... see attached...

Would anyone be able to clarify what the difference between async-captured-stacks and async-live-stacks?

Would changing any of those options help analyse the problem deeper via DevTool?

If anyone from the TB team want remote access to my machine to look at issue first-end, it can also be arranged via remote session/screen sharing... if you cannot reproduce the issue at your end... and if that can help to move forward in sorting the issue... preferably prior the next ESR is out...

(In reply to Magnus Melin [:mkmelin] from comment #8)

I'm using CalDAV with very large calendars, and I haven't noticed any problems.
There must be something in particular with your profile/data that is causing the problem.

For the sake of comparison, I have re-run the same test on the same machine as per Bug 1572823 Comment 19 and my above comments/analysis, basically:

  • use 72.0a1 (2019-10-22) (64-bit)
  • create a new profile with no email account - keep default
  • add just one CalDAV network calendar (mine with my 4000+ items)
  • run the sync

Here is the result... it takes only 2mn+ in 72.0a1 to load the entire calendar... not great... but nothing compared to the 10mn+ in 78.x branch...

# TB Version Add-ons UnifinderView LoadingTime(Disappearance) DevTools > Networking
# 72.0a1 (2019-10-22) (64-bit) Lightning Caldav Libical (NoReminder/NoOfflineSupport) Events in the Next 7 days, NoHeaderColumnSorting ~2m18s (33s) 49 Requests - 9.25MB Transferred - Finish: 1mn68s

This is just to confirm once more that issue is:

  • Not Profile related (use a complete blank new profile for testing)
  • Not Data related (use the same data - my CalDAV calendar)
  • Not Environment related (use the same environment - my laptop)

Just to wind the clock back a bit here…

(In reply to Richard Leger from comment #0)

Upgraded to TB 77.0b3 from 77.0b2...

There were no changes to the calendar code between these two versions. This is what did change on comm-beta and on mozilla-beta.

(In reply to Geoff Lankow (:darktrojan) from comment #34)

Just to wind the clock back a bit here…

(In reply to Richard Leger from comment #0)

Upgraded to TB 77.0b3 from 77.0b2...

There were no changes to the calendar code between these two versions. This is what did change on comm-beta and on mozilla-beta.

Then it means the issue may not necessarily be linked to that particular upgrade... but it is still occurring... maybe the upgrade cleared some cache or else perhaps? Also I did upgraded to TB 78.x branch afterwards hoping issue would have been fixed but it did not...

Find attached comparison of network performance as seen in Task Manager between two different version of TB, each time run sync in a new profile with not email account setup...

As noted on the graph, the issue start to raise after the 4th request which may be the first one to actually load data... the first three just preparing the job... and gap between request increases over time... :-)

(In reply to Richard Leger from comment #35)

(In reply to Geoff Lankow (:darktrojan) from comment #34)

Just to wind the clock back a bit here…

(In reply to Richard Leger from comment #0)

Upgraded to TB 77.0b3 from 77.0b2...

There were no changes to the calendar code between these two versions. This is what did change on comm-beta and on mozilla-beta.

Then it means the issue may not necessarily be linked to that particular upgrade...

Or could it be due to an issue within a library used by the calendar system?
Gecko/javascript core/async feature - request/response object/class or parsing tool perhaps? Just guessing here...

Attached image image.png

I run a Mozregression Bisection for Thunderbird - 64 - comm-central - new profile - reuse for period between 2020-01-01 and 2020-06-25

Find attached the results... and link below...

https://hg.mozilla.org/comm-central/pushloghtml?fromchange=a5a26b1ee9ecaa232a01eea49a1cba758a05778b&tochange=38aefe44ce12cec9e7e7bbf6f0df23e46516e065

I got a warning at the end saying it did not have enough data to bisect or something like that...

According to the results, issue started between:

app_name: thunderbird
build_date: 2020-04-07
build_file: C:*masked*.mozilla\mozregression\persist\2020-04-07--comm-central--thunderbird-77.0a1.en-US.win64.zip
build_type: nightly
build_url: https://archive.mozilla.org/pub/thunderbird/nightly/2020/04/2020-04-07-08-46-18-comm-central/thunderbird-77.0a1.en-US.win64.zip
changeset: 177fcd5c42fc5e692b1334da174d82fd509bef12
pushlog_url: https://hg.mozilla.org/comm-central/pushloghtml?fromchange=177fcd5c42fc5e692b1334da174d82fd509bef12&tochange=6d53255df005ef1a1f849b7b328b19155956d42e
repo_name: comm-central
repo_url: https://hg.mozilla.org/comm-central

...and...

app_name: thunderbird
build_date: 2020-04-08
build_file: C:*masked*.mozilla\mozregression\persist\2020-04-08--comm-central--thunderbird-77.0a1.en-US.win64.zip
build_type: nightly
build_url: https://archive.mozilla.org/pub/thunderbird/nightly/2020/04/2020-04-08-10-51-27-comm-central/thunderbird-77.0a1.en-US.win64.zip
changeset: 6d53255df005ef1a1f849b7b328b19155956d42e
pushlog_url: https://hg.mozilla.org/comm-central/pushloghtml?fromchange=f15b0f3b36af47ec42943e90ad83f1bc7eb5ad48&tochange=6d53255df005ef1a1f849b7b328b19155956d42e
repo_name: comm-central
repo_url: https://hg.mozilla.org/comm-central

(In reply to Richard Leger from comment #38)

app_name: thunderbird
build_date: 2020-04-08
build_file: C:*masked*.mozilla\mozregression\persist\2020-04-08--comm-central--thunderbird-77.0a1.en-US.win64.zip
build_type: nightly
build_url: https://archive.mozilla.org/pub/thunderbird/nightly/2020/04/2020-04-08-10-51-27-comm-central/thunderbird-77.0a1.en-US.win64.zip
changeset: 6d53255df005ef1a1f849b7b328b19155956d42e
pushlog_url: https://hg.mozilla.org/comm-central/pushloghtml?fromchange=f15b0f3b36af47ec42943e90ad83f1bc7eb5ad48&tochange=6d53255df005ef1a1f849b7b328b19155956d42e

CORRECTION:

Just tested with build 2020-04-08 (77.0a1 64-bit) which was to be starting point according to bisection but it works ok... so not sure if the bisection test and result are what was to be expected... or I may have done something wrong somewhere... unless somehow TB updated automatically during bisection and I was not testing the version I thought I was as per UI info? Will continue to try to narrow down which build the issue starts to appear...

CORRECTION 2:

TB 64 builts:

2020-04-07: loads ok ~2mn+

2020-04-08: hit and miss - when tested via bisection it mostly does not work, issue occurs - when unzipping and running TB from https://archive.mozilla.org/pub/thunderbird/nightly/2020/04/2020-04-08-10-51-27-comm-central/thunderbird-77.0a1.en-US.win64.zip it mostly work ok but not always!

2020-04-09: issue occurs

So it may be around that time-frame possibly that issue started and got worth on after...

(In reply to Richard Leger from comment #40)

CORRECTION 2:

TB 64 builts:

2020-04-07: loads ok ~2mn+

2020-04-08: hit and miss - when tested via bisection it mostly does not work, issue occurs - when unzipping and running TB from https://archive.mozilla.org/pub/thunderbird/nightly/2020/04/2020-04-08-10-51-27-comm-central/thunderbird-77.0a1.en-US.win64.zip it mostly work ok but not always!

2020-04-09: issue occurs

So it may be around that time-frame possibly that issue started and got worth on after...

Are you able to use Mozgression and plug in the Build ID from the good version (2020-04-07) and then the Build ID for the semi-working version (2020-04-08) and see what it spits out in terms changes? Forgive me if you've already been there and done that.

Let me ask the obvious,
Is it still true that this slow/stuck behavior only occurs at startup/initialization? Like is there a dependency which isn't being met during initialization? After the 10-30 minute startup delay does everything run well? Is there "init" code which -assumes- a fully functioning run-time environment which may not be true, like login and connection is all established and functioning?

Bug 1546606 landed for the 2020-04-09 daily.

Should Bill G and Frank-Rainer Grahl get looped in?

Here is the perf profile for 2020-04-07, load in 2mn+ ==> ok

Here is the perf profile for 2020-04-08, load in 2mn+ ==> ~ok more or less if I use build version not bisect version (but via Task Manager start to see gap appearing between request...)

The perf profile looks different premise of possible issue...

Here is the perf profile for 2020-04-09, load in 10mn+ ==> ko - Not responding issue occurs definitely...

The perf profile look the same as reported in Comment 10 related to other most recent version of TB.

So by the look of it, build 2020-04-09 seems the starting point...

Find attached comparison of network performance as seen in Task Manager between 77.0a1 64-bit:

  • build 2020-04-07
  • build 2020-04-08
  • build 2020-04-09

As noted on the graphs, there seems to be a premise of an issue 2020-04-08 (not obvious but I can notice a difference) with definitely issue occurring and obvious huge gap appearing between requests in 2020-04-09... the gap increasing over time each time...

==> 2020-04-09 definitively the culprit!

app_name: thunderbird
build_date: 2020-04-09
build_file: C:\Users\richard.mozilla\mozregression\persist\2020-04-09--comm-central--thunderbird-77.0a1.en-US.win64.zip
build_type: nightly
build_url: https://archive.mozilla.org/pub/thunderbird/nightly/2020/04/2020-04-09-20-42-33-comm-central/thunderbird-77.0a1.en-US.win64.zip
changeset: 27f756d7fac7f2f2bb6055db00f45bafe0d776b9
pushlog_url: https://hg.mozilla.org/comm-central/pushloghtml?fromchange=177fcd5c42fc5e692b1334da174d82fd509bef12&tochange=27f756d7fac7f2f2bb6055db00f45bafe0d776b9
repo_name: comm-central
repo_url: https://hg.mozilla.org/comm-central

Let's see if Paul has ideas.

Severity: major → S1
Regressed by: 1546606
See Also: 1546606

FYI, I suppose in all logic that those below would be the change (based on changeset) between...

I cannot seems to be able to run a bisection between 07<->08 and 08<->09 to show changes (by build_date or changeset)... I suspect you would need more than two build to run a bisection...

Nor I know where to find buildId...

(In reply to Craig Arno from comment #42)

Let me ask the obvious,
Is it still true that this slow/stuck behaviour only occurs at startup/initialization? Like is there a dependency which isn't being met during initialization? After the 10-30 minute startup delay does everything run well? Is there "init" code which -assumes- a fully functioning run-time environment which may not be true, like login and connection is all established and functioning?

I believe there are two different issues currently affecting TB 78.x branch startup performance... which combined make of a awful experience...

  • Slow general startup of TB
    With all calendars disabled and IMAP/SMTP account setup (no cache), TB 78.0b3 (64-bit) is currently not starting up at the speed of light:

    • 2 sec to get prompt for choosing profile
    • 20 sec after selecting profile to start TB and get the main UI up
    • 35 sec to be able to read first IMAP email (no cache set) - as soon as main UI is visible I can click on message in the list but content is not immediately loaded in preview pane. It takes this extra 35s for message content to load...
    • = 57 sec waiting in total from Start to Read 1st email!
  • Slow loading of CalDAV network calendar
    With one CalDAV network calendar setup, no email, TB 78.0b3 (64-bit) started with calendar disabled, enabling calendar to get prompt for authentication, authenticating:

    • 10mn+ to load one calendar (without thousands of items) from authentication to completion
      This is the time to load just one CalDAV network calendar, at this stage TB is already started and fully initialised...

At Startup
Now if you startup TB online with calendar enabled, all the issues are triggered at the same time aggravating the situation to the point TB becomes unusable because almost constantly Not Responding... I haven't waited 30mn but I am sure it would come up and become available normally after a long period of time... I just don't have the patience :-)

Workaround
What I do as a workaround is to:

  • start TB offline (the option appears in the choose user profile wizard)
  • enable online mode afterwards,
  • disable offline mode
  • re-enable it.
    I do it twice because the first time it automatically triggers the calendar loading which prevent access to email... which it does not the second time...
  • then if I need the calendar, I right click on it and "Synchronise Calendar" then wait 10mn+ for it to load :-)
    Alternatively I could disable the calendar but I tend to forget to disable prior closing TB... so first way seems best...

After that TB works like a charm (with view slowdown here and there sometime to time)...
That is the only way for me to be able to use TB 78.0b3 (64-bit) at the moment :-)

Also changing network connection from a Broadband ADSL to 4G mobile data client side, does not make any difference in terms of test results. In case some are wondering...

Richard, thanks for the perf profiles. I've started looking at the one in comment 47 and can use that to tell where in the code the time is being spent. So far I don't see any one big thing that's an obvious drag on performance, it just looks like there are a lot of items to process that involve a lot of steps that each take time. Which is to say that improvement may be a matter of incremental gains. But that's just a first take, we'll see. Will post back here with progress as I explore further.

I'm using profiler.firefox.com to analyze the data. Would you mind if I publish it there and share a link to make it easier for others to see? (You've already uploaded it here so it's already available in that form, but I thought I'd ask.)

For future reference, for anyone reading this, if you want to share a performance profile, feel free to go ahead and upload it to profiler.firefox.com, publish it there, and then post the link in a bug or wherever. That will save a few steps, process-wise.

When comparing the profiles from comment 47 and comment 45 they seem similar in terms of where in the code the time is being spent. I noticed the profile from comment 47 only covers about 50 seconds (and the one from comment 45 only about 35 seconds). Would it be possible to capture one that shows the problem (ideally with current TB beta) that covers the full 10 minutes? It will be really large (maybe too large?) but that should show whether things get bogged down or delayed at some point in there.

If smaller profile size is wanted, for example to help cover a larger time period, the sampling interval might be increased (with resulting decreased resolution obviously). I don't see where to do that in Firefox developer tools, but it had been possible in the profiler extension.

For perf report, I am limited by the buffer size is there a way I can increase its size?

Also limited by the fact TB become not responsive very quickly preventing to start stop performance recording later in time during the process but it would show more or less the same thing I believe but I can try...

In past dev team asked for small size perf profile that is why I record only about 1mn or till buffer is full it overright autmatically after 100% reached... I dont think I can record a 10mn perf profile even if I could it would be way too big in size... unless the profiler.firefox.com allow upload of any size profile... Personally I think recording perf between two network requests (or few of them) shall be sufficient to find the culprit...

I don't think there are any personal info in Devtools > Performance profile so feel free to publish on profiler.firefox.com and share/provide link to it back here or as you see fit to help resolve the issue ;-)

Part of it seems linked to imbricated async functions thst are called in turn during processing of items without really showing what it does as Devtool > Debugger dont seems to be able to step in or capture the async stack... it become a black box... could also be indication of something going wrong within Gecko, Javascript engine, build-in APIs or Devtools debugger itself... on the top of it!

(In reply to Richard Leger from comment #56)

For perf report, I am limited by the buffer size is there a way I can increase its size?

Maybe try the settings mentioned at https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler#Profiling_Firefox_Startup - but without MOZ_PROFILER_STARTUP=1

Find here attached a perf profile corresponding to the period of time between the 4th minute and 5th minute while calendar is loading... which shows the same issue as recorded before...

I am not able to record a full 10mn performance profile...

Find here attached a perf profile corresponding to the period of time between the 7th minute and 10th minute while calendar is loading... captured with 78.0b3 (64-bit). Showing the same issue as recorded before...

I am not convinced that recording 10mn of data would make much of a difference... I can't seems to be able to do it anyway...

Here is my attempt to share the result of the 7-10mn profile perf via profiler.firefox.com
https://share.firefox.dev/2YAYaeB

Also attached a print screen of the Call Tree for convenience...

It seems to me the resource:///modules/caldav/CalDavRequest.jsm get function call on itself in some sort of "endless" loop... with async functions running within it...

Here attached is what you see when you invert call stack...

Good find Richard! The profile you shared on profiler.firefox.com works for me. I see that get loop you are seeing. Looking at the code, that loop is what would happen when bug 1647657 is not yet fixed.

It looks like that patch has not made it to our beta releases yet. Would you be able to try with a nightly build of Thunderbird and see how that goes? You can download one from https://www.thunderbird.net by scrolling down to where it talks about beta and clicking the "nightly channel" button.

That patch will be in the next 78 beta (beta 4) so that's another option.

(In reply to Paul Morris [:pmorris] from comment #62)

Good find Richard! The profile you shared on profiler.firefox.com works for me. I see that get loop you are seeing. Looking at the code, that loop is what would happen when bug 1647657 is not yet fixed.

As per my test and my Bug 1647657 Comment 7, I can already confirm issue described here is not fixed by this patch.

But I can try again in b4 when out... to re-confirm it!

;-)

(In reply to Richard Leger from comment #51)

  • Slow general startup of TB
    With all calendars disabled and IMAP/SMTP account setup (no cache), TB 78.0b3 (64-bit) is currently not starting up at the speed of light:

    • 2 sec to get prompt for choosing profile
    • 20 sec after selecting profile to start TB and get the main UI up
    • 35 sec to be able to read first IMAP email (no cache set) - as soon as main UI is visible I can click on message in the list but content is not immediately loaded in preview pane. It takes this extra 35s for message content to load...
    • = 57 sec waiting in total from Start to Read 1st email!

A separated dedicated new Bug 1649103 has been created to look after the general slow performance of TB at startup (Comment 51) separately from the CalDAV issue that occurs on the top of it dealt with here...

I will shortly load a startup perf report there...
If someone is willing to help on this separate issue please do so in the new parallel bug created...

Hi Arthur,

FYI below following previous exchange...

(In reply to Richard Leger from comment #26)

(In reply to Arthur K. from comment #25)

... I have two GMail CalDAVs set up but heck if I know how many events I have in total for each. Where does it show that number?

I don't think using only GMail for testing is the right thing to do... as it may use different backend perhaps? My calendar is setup and distributed via a Davical server...

I think the number of items appears somewhere in the request/response... or I may have got the number by exporting my calendar to gmail to test Gmail calendar sync (GSync provider) in the past and it showed number of items there in status bar while loading...l

As of today I have ~4481 events in my calendar :-)

I now recall how I know exactly how much items I have in my CalDAV calendar.

I use Cyberduck app (https://cyberduck.io/) to connect to the CalDAV calendar to list each event in the format of individual .ics files (it takes ~14 seconds) and in status bar it shows how many items are listed.

This app is also handy for troubleshooting purpose to load/open individual event in a text editor and check data or edit/save it manually in ical format, if need be...

Just sharing in case it may be useful to someone else...

Cheers,
Richard

Hi Paul,

(In reply to Richard Leger from comment #64)

(In reply to Paul Morris [:pmorris] from comment #62)

Good find Richard! The profile you shared on profiler.firefox.com works for me. I see that get loop you are seeing. Looking at the code, that loop is what would happen when bug 1647657 is not yet fixed.

As per my test and my Bug 1647657 Comment 7, I can already confirm issue described here is not fixed by this patch.

But I can try again in b4 when out... to re-confirm it!

FYI, just upgraded to 78.0b4 (64-bit) and same issue occurs, I'm afraid!
Any suggestions?

Regards,

Flags: needinfo?(paul)

Do your calendars have "Offline Support" ticked?

(In reply to Magnus Melin [:mkmelin] from comment #68)

Do your calendars have "Offline Support" ticked?

No, that option is unticked for me as always. I don't really like cache mode as it is unreliable (though haven't really tested lately)... in any case I should be able to work online normally and only be limited by the network... not by TB... IMAP/SMTP works very well without cache mode...

(In reply to Richard Leger from comment #69)

(In reply to Magnus Melin [:mkmelin] from comment #68)

Do your calendars have "Offline Support" ticked?

No, that option is unticked for me as always. I don't really like cache mode as it is unreliable

Is there a bug for it?

IMAP/SMTP works very well without cache mode...

Not necessarily a useful or fair comparison

FYI, I don't know if that may help or not...

Putting a break point at line 526 (first line in the get function), the aProp variable seems to get very different values during iteration... which not all seems to correspond to an event... is it normal?

CalDavRequest.jsm
(...)
524  this._listener = new Proxy(this.request._handler, {
525    get(aTarget, aProp, aReceiver) {
526      if (aProp == "onStartRequest") {
(...)

Here is a few example of possible values of aProp over iterations...

'QueryInterface'
'doMultiGet'
'calendar' (appears multiple times within iteration)
'itemsNeedFetching' (appears multiple times within iteration)
'_reader'
'onDataAvailable'
...

It seems for each of those which are not onStartRequest or onStopRequest, the following line is triggered...

(...)
549    return Reflect.get(...arguments);
(...)

which itself triggers...

(...)
112    async commit() {
(....)
127        this.channel.asyncOpen(this.response.listener, this.channel);

129        **await** this.response.responded;
(...)

One of those two last functions triggers again the get function at line 525 somehow... so it keeps calling to "itself" in some sort of imbricated async loop... possibly over each calendar item processed...

Maybe all this is normal but I just thought to mention it... as it may be what is causing the slow down...
Though I cannot step-in at some point as said previously so there is part of the code on which the DevTools > Debugger pause on without showing calling stack or info... and require to click multiple time in a row on the Resume F8 to keep going... also at some point it start Not Responding for a while and then it comes back but cannot figure out exactly at which point...

(In reply to Wayne Mery (:wsmwk) from comment #70)

(In reply to Richard Leger from comment #69)

(In reply to Magnus Melin [:mkmelin] from comment #68)

Do your calendars have "Offline Support" ticked?

No, that option is unticked for me as always. I don't really like cache mode as it is unreliable

Is there a bug for it?

It is on my list but not yet, because I am more interested in working online with TB... which was working more or less fine till this bug occurs...
One step at a time... not need to add more bugs when previous ones have not been fixed!

Also not everyone is willing or wanting to work with Offline Support ON. For example a user on a local LAN network connected to the server don't need cache mode.

The cache mode shall be used for allowing people to access and manage calendar when offline but not to compensate performance issues in TB... especially those when using CalDav feature online (Offline Support unticked).

IMAP/SMTP works very well without cache mode...

Not necessarily a useful or fair comparison

Well end-user would expect the CalDav feature of TB to work as well... smoothly and with performance... what ever the settings of end-user choice, as provided by developers themselves :-)

That should be the aim... and getting closer to that at every iteration...

Not saying we shouldn't fix whatever the new problem is with online-only calendars.

However, not using the cache is just... what? You expect Thunderbird to process multi-megabytes of calendar data super fast. The network lag alone is going to give you bad UX. An email is usually in < 100KB, so then it doesn't matter. If you need to load many large calendars maybe 1000 larger than the mail, and display the data fast: that is cache is is for.

(In reply to Magnus Melin [:mkmelin] from comment #73)

You expect Thunderbird to process multi-megabytes of calendar data super fast.
The network lag alone is going to give you bad UX.
An email is usually in < 100KB, so then it doesn't matter. If you need to load many large calendars maybe 1000 larger than the mail, and display the data fast: that is cache is is for.

Please note that people loading their calendar for the first time will be affected by this bug!

Yes cache mode (Offline Support) allow to work offline...
Yes cache mode is helping to prevent too much network access and compensate network lags...
... but let me be clear ones more here, it is is not the problem at stake nor the fix for this bug...

So let's just disregard cache mode for now shall we?

I cannot help but feel that I am misunderstood (and other end-users encountering issues out there - not just me).

I will try to draw the picture differently, lets concentrate about speed of loading the same CalDav calendar (4000+ items) with Offline Support disabled in a new blank profile in various version of TB, that something you may be able to relate to:

TB Version Loading Time DevTools > Networking
60.8.0 (32-bit) ESR ~100mn+ (with sorting) 48 requests - 7.98MB transferred
60.8.0 (32-bit) ESR ~6mn23s (without sorting) idem
70.0b2 (64-bit) ~1mn02s 49 Requests - 8.25MB Transferred
78.0b1 (64-bit) ~10mn30s 50 Requests - 9.25MB Transferred

Am I really the only one seeing those results?

As you can see clearly (I hope), the fastest measured ~1mn02s was obtained so far with TB 70.0b2 with the same calendar over an ADSL line.
(they might have been slightly better than this but cannot recall stats... but you get the picture!)

This result was accomplished thanks to work done in Bug 1502923 and few follow up bugs... but that had been quite a fight at the time and lot time spent to get the result! Since then I have kept statistics on regular bases in one of the follow up Bug 1572823.

The only difference between tests above is the TB version, the rest is the same.
Similar results obtained when TB running on the same LAN as the server or a 4G network! So network is not the issue here :-)

Clearly this should be sufficient evidence that great performance results can be obtained with TB CalDav feature, without using cache mode (which would just be the cherry on the cake!).
If the dev team only test with Offline Support enabled (the default) they might not see issues during the development cycle linked to online mode! They also need to test the online mode on large calendars... and not advertise cache mode as the fix for performance issue... which is not! Just hiding dust under the carpet to me :-)

To go even further, reach a target of below 1mn which is likely achievable, you would then need to work on the number of network requests (see dedicated Bug 1572823) which seems way too numerous, I believe that network request could be reduced to 4 if the batch were requested in 1000 items instead of 100 currently...

My entire .ics calendar is ~2MB in size and takes 1-2 sec to load entirely via http! Yes CalDav XML protocol create a lot of overhead so size of data transferred is increased by about x3 time... but that is not an issue for now... this can be dealt with later perhaps... network and size of data transfer is not the issue here.

In summary, here are the few major issues currently affecting TB CalDav in online mode as I see it:

  • Gap between pics of network request/response /\_/\__/\___ which correspond to the time TB takes to process the response received and slowly load items in the various views (UX) before sending next request, delay often increasing over time. At best the network traffic should look like this /\/\/\/\ meaning no processing delay by TB (no gap) between two network request/response (one pic = one request+response). Attachment 9159427 [details] published in Comment 48 shall give you a rough idea of what I mean it clearly show deterioration... also more network graph attachment available in Bug 1502923 and Bug 1572823 for various TB version.

  • The number of network requests sent by TB to load the calendar - 50 requests really?! This is way too much to my opinion... still to be dealt with in separate dedicated bug Bug 1543953.
    There may be additional way of improving request/response by not preloading all the data at first perhaps but only the essentials... especially when working online... I only need to load body and attachment when I click on item to edit it/view it... but that is another story entirely.

  • The dev team not seeing such issues (for some unknown reason) or not testing for it perhaps, to allow them to fix and prevent regression over iteration... as said previously each iteration cycle shall bring us closer to better performance with CalDav... at least between ESR version... the fact the such regression slipped between the finger seems to suggest that the testing suite of CalDav feature during development may need to be reviewed somehow...

  • Performance deteriorating fast as number of enabled CalDAV calendar increases...

  • UX performance itself to load seemlessly items in the various views...

Only the first point in that list is to be dealt with here.
I am not sure how I can explain better or help better than the above.

Don't get me wrong I love TB and the team is doing a great job just trying to raise concern about a performance culprit observed and experienced for way too long (prior the new governance) with hope it may help to fix long term and before next ESR :-)

Regards,

(In reply to Richard Leger from comment #67)

FYI, just upgraded to 78.0b4 (64-bit) and same issue occurs, I'm afraid! Any suggestions?

Thanks for confirming that this did not fix the issue. What you describe in comment 71 is what I would expect. (That code sets up a proxy object, a wrapper that intercepts all property accesses on a wrapped object and then does something different with onStartRequest or onStopRequest but passes the other accesses through to the object it's wrapping.)

I haven't had time to look into this further this week, but rest assured that we do want to improve the slower performance you're seeing. We are just spread too thin to get to everything as quickly as we'd like.

If you have a chance to upload and share a new profile (using the latest beta) to the profiler.firefox.com site, that would be helpful (now that the fix for that one issue is in).

Also, I think you should be able to debug async functions/code in the debugger, at least it works for me. Try using the "step in" button rather than the "step over" button, and that should let you step into async function calls (e.g. where there is an "await"). That's worth a try if you haven't tried that yet, and are still motivated.

Flags: needinfo?(paul)

(In reply to Paul Morris [:pmorris] from comment #75)

(In reply to Richard Leger from comment #67)

I haven't had time to look into this further this week, but rest assured that we do want to improve the slower performance you're seeing.

Thank you for acknowledging and looking into it... it keeps me motivated :-)

We are just spread too thin to get to everything as quickly as we'd like.

Understandable... but this situation is not new nor it is gonna change anytime soon... while performance issues have been going on for years...
So at what point do you think the performance issue(s) would to be tackled, do you reckon? Any time frame in mind? Postponed advitam eternam because of no time?
Would post 78 ESR be a good time to start tackling performance issues? In a follow up update(s) perhaps or prior the following ESR at the latest?

For the time being, would you think it may be feasible to land a fix for this particular bug in 78.1 or 78.2 update prior everyone auto-update within the ESR channel?

Long term, do you have plan to automatically measure and test Calendar feature performance as part of your test suite during each development iteration cycle to:

  • quickly identify performance regression before they even land in beta
  • improve performance over time by measuring progress

Could Telemetry be of help perhaps to measure type of calendar in use, number of items, loading time as a basic start line, then progressively expand depending on options: cache enabled/disabled, filtering options, sorting options, etc...? Maybe starting with cache mode on/off option first... I am saying this because in Bug 1502923 (2 years old - closed 11 months ago) showed that sorting (which is not activated by default) had a tremendous impact on performance (causing calendar to load in 100mn+ instead of 10mn-) at the time... and this is only the tip of the iceberg as you may already know... so it may be essential to automate performance testing... Then you can also measure performance when 1,2,3.... more large calendar are setup and enabled at the same time... to identify scaling perf issues (long term)...

Also DevTools > Networking already is able to record quite some info about network pic (request/response) and possibly delay/gap between them (TB overall processing time)...

Just some ideas... maybe you are already working on it... but just thought to suggest it...

This way you may be able to get automatically data to evaluate performance issues and bottlenecks without wasting too much time on it?
My manual testing of performance here is indeed taking tremendous amount of (my) time, but it is necessary to raise awareness in the hope it would be fixed at some point not only this bug but the overall performance of Calendar feature as a whole...
Automatic measurement and testing may be the key to save time while tackling the perf issues in Calendar feature... or at least integrate them in the current allocated time dev team may have during development iteration(s)...

I was hoping that while integrating Lightning into Core and as part of refactoring, this would have started to happen slowly in parallel to get in a better position now for the ESR and beyond but it hasn't, it seems...

In addition you may be able to get more hands on by considering setting up a Thunderbird Summer of Code program perhaps (as part of the Google Summer Code program or on its own) to bring people on board for a short period of time to work on specific (very) long lasting issues to have them properly analysed to identify possible solutions. Could that help further in any way?

If you have a chance to upload and share a new profile (using the latest beta) to the profiler.firefox.com site, that would be helpful (now that the fix for that one issue is in).

My bad I thought I had done so, I may have forgotten to paste the link in the comment at the time...

Attachment 9160700 [details] as per Comment 67 has been uploaded on profiler.firefox.com, here is the link:
https://share.firefox.dev/2NWyEKK

Also, I think you should be able to debug async functions/code in the debugger, at least it works for me. Try using the "step in" button rather than the "step over" button, and that should let you step into async function calls (e.g. where there is an "await"). That's worth a try if you haven't tried that yet, and are still motivated.

Believe me I do "step in" but at some point (which I haven't been able to identify) it gets "stuck" and the debugger view does not change as you run or step-in for about ~x30 times then it comes back... and goes again... at some point it also start to Not Responding but I cannot find why nor at what point (line of code), cannot get to it... believe me I tried and I would have shared info if I could... that is why I am seeking your help here :-)

As I said before, if someone want to do a private remote session with me to look at issue first hand, it can be arranged if that can help in any way.

Thank you for your support and making TB better everyday!

Regards,
Richard

Hi,

(In reply to Richard Leger from comment #76)

Attachment 9160700 [details] as per Comment 67 has been uploaded on profiler.firefox.com, here is the link:
https://share.firefox.dev/2NWyEKK

Any finding or progress following perf profile published about the issue?

Regards,
Richard

Flags: needinfo?(paul)

(In reply to Richard Leger from comment #77)

Any finding or progress following perf profile published about the issue?

Sorry, no, I have been focusing on other priorities and have also been away from work a bit, so I haven't yet looked at this again. It is in my todo items. Thanks for the details and profile link in your previous comment.

Flags: needinfo?(paul)
See Also: → 1658032
OS: Windows 10 → All
See Also: → 1658026
See Also: → 1659558
See Also: 1582578
See Also: → 1660672

FYI, this issue is affecting 78.x and 80.0b5 basically still takes 10mn to load my 4000 items network calendar calDAV.

My suspicion would also be issues surrounding the resource:///modules/caldav/CalDavRequestHandlers.jsm > characters (aValue) {...} function - parsing of the XML response...

From my recent testing and troubleshooting I have noticed the following during the async caldav multiget requests:

  • TB is going into a "Not Responding" state during the parsing by TB of a network response, in DevTools > Network I can see:
    -- The request starts (the response from server arrive quickly as seen in Task Manager)
    -- TB become "Not Responding" for a while
    -- TB resume normally the status code 207 appears in front of the request
    -- The next request starts and the same issue occurs again

The "Not Responding" delay seems to become more and more prominent over time and really start to freeze TB at about the 20th requests (out of ~50 in total) then each request takes ages!

One may think that it is a network/server issue, but it is not because:

  • The response time as indicated in DevTools is very short (also confirmed via Task Manager > Performance > Wifi graph) - and does not match the "Not Responding" delay
  • The server logs also confirm the response is send back in due time - no delay server side...

I also noticed putting a breakpoint on the characters function that:

  • Delay occurs at some point between two pauses on that breakpoint
  • For one event item the function characters (aValue) {...} can be triggered 10,20 or more time as the aValue does not correspond to the overall item calendar caldata but only chunck of it e.g:

aValue in characters function can have the following values:

 "\n" at iteration 1
 "\n" at iteration 2, 
 "BEGIN:VCALENDAR" at iteration 3
 "\n" at iteration 4
 "PRODID:-//Mozilla.org/NONSGML Mozilla Calendar V1.1//EN" at iteration 5
 "\n" at iteration 6 
 "VERSION:2.0" at iteration 7 
 "\n" at iteration 8 ,
 etc... 

For an event like:

BEGIN:VCALENDAR
PRODID:-//Mozilla.org/NONSGML Mozilla Calendar V1.1//EN
VERSION:2.0
(...)

After a high number of iteration over the item calendar data, the varialbe this.currentResponse[this.tag] ends up with a value like:
this.currentResponse["caldata"] = "\n\nBEGIN:VCALENDAR\nPRODID:-//Mozilla.org/NONSGML Mozilla Calendar V1.1//EN\nVERSION:2.0\n etc..."

An this is just for one items... it does it again for each calendar item... with sometime the value of aValue being just "<" or ">"... surely there must be a better way to parse calendar data from an event item in one go...

Why the calendar item data is parsed in chunks and not all at once when grasping it from the XML response?

My suspicion may be issue related to the SAX xml parser currently in use which seems deprecated and non longer usable in Firefox (according to: https://developer.mozilla.org/en-US/docs/Archive/SAX)

I tend to convert email to Event or Task a lot causing item main body content to contain html (aka xml) data, so maybe could that code choke the SAX xml parser somehow at some point... unless it may be a character encoding issue... or too much iteration over something... or the javascipt engine not liking processing from the SAX parser?

Due to the async processing of such request, it make it impossible to step in the code to identify where exactly the slow down not responsiveness starts to occur... which could also indicate a Core issue...

Alternatively it could be that the parsed object is then passed over to an observer to refresh data in a view and that the view choke on the data?

Also noted, each iteration to the *characters" function trigger calls each time it is run to the this.calendar.verboseLogging() function which run a comparison prior returning a value... that seems to me as an expensive and UN-necessary repetitive computation.
Why not to use just this.calendar.mVerboseLogging variable instead which only need to be initialised once upon calendar object creation or even better don't use the character function to log the XMLlog data, either just get it from the response itself direction which is already in html format or copy the data store in this.currentResponse after computation has completed for the event item. Don't think there is any need to concatenate twice the same value into two location via the characters function... which should solely be used to grap the caldata the event item in an iCal format...

Looking at those may also be a good start to fix perf issue with CalDAV.

In comparison, the same calendar takes about 1mn10s (x10 time less) to load in TB 68.11.0 (32-bit) on the same computer with the same network connection :-)

FYI, additional perf profiles and statistics have been published in Bug 1572823 Comment 26
There seems to be quite some improvements in TB 81.0b2 (64-bit) as delay seems cut in half now down to 7mn compare to 78.x :-)
Possibly due to other fix/optimisation in core and elsewhere...

(In reply to Richard Leger from Bug 1572823 Comment 26)

Loading one network calendar calDAV with 4000+ items...

# PerfProfile TB Version LoadingTime(Disappearance) DevTools > Networking Add-ons UnifinderView
08 DevTools Not Responding when stopping recording of a perf profile making it impossible to save the profile to view/save it for sharing 78.2.1 (64-bit) ~15m15s(8s) 53 requests - 9.60MB Transferred - Finish: 14.97mn TB Core Caldav (NoReminder/NoOfflineSupport) Next 7 Days, NoHeaderColumnSorting
09 https://share.firefox.dev/2Z4Iv6Y (first 4mn) 80.0b5 (64-bit) Not Available Not Available TB Core Caldav (NoReminder/NoOfflineSupport) Next 7 Days, NoHeaderColumnSorting
10 https://share.firefox.dev/2GyoEqw (first 2mn) 81.0b2 (64-bit) ~7m (3s) 52 requests - 9.60MB Transferred - Finish: 6.98mn TB Core Caldav (NoReminder/NoOfflineSupport) Next 7 Days, NoHeaderColumnSorting

As per perf profile and attachment, the culprit seems remaining in the following part of the code or surrounding it (e.g character function - see Comment 79 for more info):

CalDavRequest.jsm Line 535 get function as per TB 81.0b2 (64-bit)

(...)
/**
 * Response class for legacy requests. Contains a listener that proxies the
 * external request handler object (e.g. CalDavMultigetSyncHandler,
 * CalDavWebDavSyncHandler, CalDavEtagsHandler) in order to resolve or reject
 * the promises for the response's "responded" and "completed" status.
 */
class LegacySAXResponse extends CalDavResponseBase {
  /** @return {nsIStreamListener} The listener passed to the channel's asyncOpen */
  get listener() {
    if (!this._listener) {
      let legacySAXResponse = this;

      legacySAXResponse._listener = new Proxy(legacySAXResponse.request._handler, {
        get(aRequestHandler, aProp) {       // Line 535 in TB 81.0b2 (64-bit)
          if (aProp == "onStartRequest") {
            return function(...args) {
              try {
                let result = aRequestHandler.onStartRequest.apply(this, args);
                legacySAXResponse._onresponded();
                return result;
(...)

(In reply to Richard Leger from comment #81)

FYI, additional perf profiles and statistics have been published in Bug 1572823 Comment 26
There seems to be quite some improvements in TB 81.0b2 (64-bit) as delay seems cut in half now down to 7mn compare to 78.x :-)
Possibly due to other fix/optimisation in core and elsewhere...

I too noticed startup being a lot faster in 81.0b2 on my old work PC. Subjectively, using my timer app on Android, it went from a warm start of around 35s to about 29/30s (when I stop seeing activity in the status bar).

Richard, if you know where to get it, 81.0b3 might also be worth profile testing to see if it's any better as well.

(In reply to Arthur K. from comment #82)

(In reply to Richard Leger from comment #81)

FYI, additional perf profiles and statistics have been published in Bug 1572823 Comment 26
There seems to be quite some improvements in TB 81.0b2 (64-bit) as delay seems cut in half now down to 7mn compare to 78.x :-)
Possibly due to other fix/optimisation in core and elsewhere...

I too noticed startup being a lot faster in 81.0b2 on my old work PC. Subjectively, using my timer app on Android, it went from a warm start of around 35s to about 29/30s (when I stop seeing activity in the status bar).

Richard, if you know where to get it, 81.0b3 might also be worth profile testing to see if it's any better as well.

Nothing between 81.0b1 and 81.0b2 that would have improved performance IIRC from https://bugzilla.mozilla.org/buglist.cgi?bug_id=1661160%2C1660687%2C1602966%2C1660403%2C1658890%2C1661203%2C1658293%2C1660917%2C1641720%2C1658438&list_id=15401730

Different story for 80.0b5 to 80.0b1 https://hg.mozilla.org/releases/comm-beta/pushloghtml?fromchange=THUNDERBIRD_80_0b5_RELEASE&tochange=THUNDERBIRD_81_0b1_RELEASE&full=1

(In reply to Wayne Mery (:wsmwk) from comment #83)

(In reply to Arthur K. from comment #82)

(In reply to Richard Leger from comment #81)

FYI, additional perf profiles and statistics have been published in Bug 1572823 Comment 26
There seems to be quite some improvements in TB 81.0b2 (64-bit) as delay seems cut in half now down to 7mn compare to 78.x :-)
Possibly due to other fix/optimisation in core and elsewhere...

I too noticed startup being a lot faster in 81.0b2 on my old work PC. Subjectively, using my timer app on Android, it went from a warm start of around 35s to about 29/30s (when I stop seeing activity in the status bar).

Richard, if you know where to get it, 81.0b3 might also be worth profile testing to see if it's any better as well.

Nothing between 81.0b1 and 81.0b2 that would have improved performance IIRC from https://bugzilla.mozilla.org/buglist.cgi?bug_id=1661160%2C1660687%2C1602966%2C1660403%2C1658890%2C1661203%2C1658293%2C1660917%2C1641720%2C1658438&list_id=15401730

Different story for 80.0b5 to 80.0b1 https://hg.mozilla.org/releases/comm-beta/pushloghtml?fromchange=THUNDERBIRD_80_0b5_RELEASE&tochange=THUNDERBIRD_81_0b1_RELEASE&full=1

I am on beta channel so upgraded from 80.0b5 to 81.0b2... I can try again on b3 when available via TB update on the channel...

Depends on: 1659558

(In reply to Richard Leger from comment #84)

I am on beta channel so upgraded from 80.0b5 to 81.0b2... I can try again on b3 when available via TB update on the channel...

Yes, in probably a couple more days. Thanks Richard.

Something seems really wrong with TB 81.0b3 (64-bit), I simply cannot load my CalDAV calendar it seems to choke on first few network GET requests... only 3 requests handled in 4mn+ out of the usual 50+ requests required! See attached...

Here attached are the corresponding TB 80.0b3 console errors appearing... if that can be of help...

And here is the TB 80.0b3performance profile... first few minutes... to load one calDAV calendar (which in effect is not loading... only 3 requests processed?)

https://share.firefox.dev/33einrO

For info, the same calendar is loading fine in TB 60.8.0 ESR on the same computer, with the same network connection... so I doubt issue is linked to my environment...

Depends on: 1658026
See Also: 1658026
Flags: wanted-calendar1.0?
Flags: blocking-calendar1.0?

FYI,

(In reply to Richard Leger from Bug 1543953 comment 19)

Created Bug 1543953 attachment 9176520 [details]
tb.81.0b4.devtools.networking.caldav.one.calendar.4000items.png

(In reply to Richard Leger from comment #7)

I think it would be more efficient to do the following at TB startup:

  1. Get the full calendar as .ics file via a GET request instead and parse it
    This take only 2seconds with my 4000 items calendar

Am I dreaming or I have been heard? What a surprise today! Yes, Yes, Yes... finally!

Just upgraded to TB 81.0b4 (64-bit) and testing loading my ~4000 items caldav network calendar... while measuring and monitoring I could see only one network request and I first thought... damn calendar loading is broken again?!

...but checking via DevTools > Networking...
...what a surprise!!!

The 50 multi-get request that have been hindering calendar loading performance for so long have now been replaced by one single GET request that load the entire .ics calendar at once!!! Oh my god! What a speed difference max 22 seconds top... (result obtained with new profile, no email, one caldav network calendar newly setup with ~4000 items)

  • Was it intended or an unintended side effect of something else implemented, corrected or removed?
  • Is that gonna be the new way?
  • Would that perf improvement "stick" from now on?
    Can someone confirm?

No info in the release note:
https://www.thunderbird.net/en-US/thunderbird/81.0beta/releasenotes/?uri=/thunderbird/releasenotes/&locale=en-US&version=81.0&channel=beta&os=WINNT&buildid=20200914232954

Nor could I find any changelog in the source code that may explain such change:
https://hg.mozilla.org/releases/comm-beta/log/a8ea452b9f4cc059a4afebaf5d54af4e72e4b5c4
If someone could point out perhaps the changeset related to this change?

If that is intended that is great, great, great news to all TB calendar users out there!
Thank you...

I keep dreaming for now... hoping someone would confirm this reality!

I have yet to test difference on startup performance with my current default profile!

FYI, in addition of previous comment...

(In reply to Richard Leger from Bug 1543953 comment #20)

(In reply to Richard Leger from Bug 1543953 comment #19)

I have yet to test difference on startup performance with my current default profile!

I just re-tested and bad news, my existing already set calendar in my default profile (that I also use for emails) does not behave in the same manner. When I enabled it and run the sync, instead of one GET request as described in my previous comment, it still generate multi-GET requests to sync and therefore still affected by Bug 1642292.

This suggest that to get the new behaviour and related performance the calendar needs either to be deleted/re-created from scratch or that settings in my current already set calendars may need some adjustments to benefit from new behaviour... I would expect TB to take care of that upon upgrade :-)

Any clue?

(In reply to Richard Leger from comment #90)

Just upgraded to TB 81.0b4 (64-bit) and testing loading my ~4000 items caldav network calendar... while measuring and monitoring I could see only one network request and I first thought... damn calendar loading is broken again?!
The 50 multi-get request that have been hindering calendar loading performance for so long have now been replaced by one single GET request that load the entire .ics calendar at once!!! Oh my god! What a speed difference max 22 seconds top... (result obtained with new profile, no email, one caldav network calendar newly setup with ~4000 items)

False hope :-(
Explanation Bug 1543953 Comment 21
Back to square one...

Would the TB dev team have any update (and timetable expected) on the resolution of this bug?

Still happening in TB 83.0b1 (64-bit), same issue...
See Call Tree picture attached and related perf profile at https://share.firefox.dev/3jJMvBA

(In reply to Richard Leger from comment #93)

...
Would the TB dev team have any update (and timetable expected) on the resolution of this bug?

This regression awaits a new calendar developer, unless Geoff is prepared to take it.

There is perhaps also a question of which should be done first, this or Bug 1544375 - Lightning - Add network calendar UI options: Chunk size, Initial sync time range filters (sync timespan future and past)

Flags: needinfo?(geoff)
You need to log in before you can comment on or make changes to this bug.