Closed Bug 880289 Opened 11 years ago Closed 9 years ago

slow 20-30 sec to respond, and blank new message alerts after wake from hibernation only when application (TB/FF) was hibernated in foreground in V.22.0b1. We are spinning in NtUserWaitMessage

Categories

(MailNews Core :: Networking, defect)

x86_64
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: crueby1, Unassigned)

References

Details

(Keywords: perf, regression, testcase-wanted, Whiteboard: [regression:tb22.0a1][lacks testcase])

User Agent: Mozilla/5.0 (Windows NT 5.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0 (Beta/Release)
Build ID: 20130511120803

Steps to reproduce:

System up and running, including Thunderbird. Hibernated system, later come back and wake it. 


Actual results:

Sometimes after waking, thunderbird is very very slow - seems to take 20 or 30 seconds to respond to any mouse clicks or actions (scrolling, picking message, buttons, anything). If I shut down/restart thunderbird it is then fine. This only started in the latest beta release - v.22.0b1. Before that update it was fine. It does not do this every time, seems to be most in cases where new mail came in and is ready to download from the pop server during the time it was hibernated. Not sure if that is a trigger, but it seems to make it happen more. The popup to notify of new mail from the system tray pops up just as a blank frame. It can be dismissed if I click where the x would be, but the main application is still running slow after that.
There was a simaler bug I had reported - 746345 - where just the pulldown menus were slow. That one effected both thunderbird and firefox menus. This time, everything in thunderbird is slow. Have not seen any problem in Firefox.


Expected results:

should have run normally
Another occurance - this time did not have any new unread mail, same hang as described above.
When you resume from hibernate, is there much disk IO occurring from other applications?

Does behavior improve if you have set Thunderbird to be offline before hibernating?  I.e. File | Offline | Work Offline?
Flags: needinfo?(crueby1)
Keywords: perf
Summary: hangs/slow after wake from hibernation in V.22.0b1 → slow after wake from hibernation in V.22.0b1
not much disk io at all. Only apps running are thunderbird, Firefox, and explorer (disk, not internet). AVG antivirus may be active at startup.

One other thing I have noticed recently: While tbird shows this problem about 20-25% of the time, every once in a while Firefox (v22.0) is doing the same thing (about 5% of the time).

Also, since I filed this bug tbird has gone through a couple beta updates, now says it is v23.0. Same issues still happening.

I will try the offline setting and report back.
Flags: needinfo?(crueby1)
Well, that was a quick test - put tbird in offline mode, hibernated, woke it up, and it showed the issue first time. That ought to narrow things down some.
Are running "windows server" on desktop or on server?
Does this reproduce with Thunderbird being the only app running?
Version: 22 → 23
Am running XP Pro 64, 2003, SP2, on an Intel I7 processor, desktop.

It does happen with no other apps running (just verified it).
Same issue still happening with Beta 24.0 version.
Version: 23 → 24
Chris, so you had also run version 21 beta?

Not necessary to change the version #. We assume the problem continues to all future versions unless otherwise stated. I set a bad example in comment 5 - not sure why I changed the version #.
Whiteboard: [regression:tb22.0b1]
Correct - first saw it in 21, and same in the updates since then.
Correction - it was TB V.22.0b1 that I first reported it in.
Does it also happen with 
22.0a2 - ftp://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2013/04/2013-04-26-00-40-01-comm-aurora/thunderbird-22.0a2.en-US.win32.installer.exe
22.0a1 - ftp://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2013/03/2013-03-03-03-05-05-comm-central/thunderbird-22.0a1.en-US.win32.installer.exe

backup your thunderbird profile before testing
Summary: slow after wake from hibernation in V.22.0b1 → slow after wake from hibernation with networking enabled in V.22.0b1
Just tested both of those - problem occurred the same in both on first try.
thansk for checking.

so you will want to check thunderbird-22.0a1 builds of ftp://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2013/ of the comm-central/ variety, in a binary search till you hit a one day interval where problem happens with builddate and not builddate-1.  

ftp://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2013/04/2013-04-01-03-08-20-comm-central/thunderbird-22.0a1.en-US.win32.installer.exe might be a decent starting point
That is going wrong direction, isn't it? If it happens with the 3/3 version, why would I go newer with the 4/1 version?

I'll try some of the earlier ones to find where it went wrong.
Okay, looks like we have a timeframe - builds from 2/1 through 2/21 all worked fine, multiple attempts. Anything from 2/22 on shows the issue. I tried the ones several days either side of the 2/22 boundary, all before are fine, all after show the problem.

Hope this helps you find it.
Thanks for the great work on finding the range. 

approximating the regression to http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2013-02-21+03%3A00&enddate=2013-02-22+05%3A00 ... nothing in that list stands out for me (and Bug 842008 is only checkin for comm-central). perhaps irving or mconley can hazard a guess.

also, no standout matches in bugs reported as regressions since 2013-02-21 - https://bugzilla.mozilla.org/buglist.cgi?keywords=regression%2C%20&keywords_type=allwords&f1=OP&o3=anywordssubstr&list_id=7688076&resolution=---&classification=Client%20Software&classification=Components&o2=anywordssubstr&f4=CP&chfieldto=Now&chfield=[Bug%20creation]&query_format=advanced&j1=OR&chfieldfrom=2013-02-21&f3=keywords&f2=short_desc&product=MailNews%20Core&product=Thunderbird
Flags: needinfo?(mconley)
Flags: needinfo?(irving)
Summary: slow after wake from hibernation with networking enabled in V.22.0b1 → slow after wake from hibernation with networking enabled in V.22.0b1. 20-30 to respond, and blank new message elerts
Whiteboard: [regression:tb22.0b1] → [regression:tb22.0a1]
Version: 24 → 22
Chris, for completeness I just realized it will help for you to check development tip to ensure it hasn't been fixed there (now version 26.0a1). You can find it at  https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-comm-central/
Worth the Hail Mary try, but still there...
(In reply to Chris Rueby from comment #19)
> Worth the Hail Mary try, but still there [when using TB26] 

This is a little odd given your regression range is eerily close to the regression range of a firefox issue with similar symptoms and the following history ...

FF21 2013-02-14 checkin of bug 822490 - Radically simplify windows TimeStamp implementation

caused regressions in FF21 like Bug 873110 - When OS Windows 7 wakes up after sleep Firefox doesn't respond to actions 

fixed in FF23 by Bug 836869 - Make QueryPerformanceCounter bad leap detection heuristic smarter
Well, I've mentioned a number of times in this bug that I've seen the same thing in Firefox, just no where near as often. Currently running 23.0.1 FF. Maybe that fix was not quite smart enough, still some issues with it?
(In reply to Wayne Mery (:wsmwk) from comment #16)
> Thanks for the great work on finding the range. 
> 
> approximating the regression to
> http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2013-02-
> 21+03%3A00&enddate=2013-02-22+05%3A00 ... nothing in that list stands out
> for me (and Bug 842008 is only checkin for comm-central). perhaps irving or
> mconley can hazard a guess.
> 
> also, no standout matches in bugs reported as regressions since 2013-02-21 -
> https://bugzilla.mozilla.org/buglist.
> cgi?keywords=regression%2C%20&keywords_type=allwords&f1=OP&o3=anywordssubstr&
> list_id=7688076&resolution=---
> &classification=Client%20Software&classification=Components&o2=anywordssubstr
> &f4=CP&chfieldto=Now&chfield=[Bug%20creation]&query_format=advanced&j1=OR&chf
> ieldfrom=2013-02-
> 21&f3=keywords&f2=short_desc&product=MailNews%20Core&product=Thunderbird

I'm afraid nothing for me stands out in that list. A manual bisection is probably your best bet.
Flags: needinfo?(mconley)
Chris, can you run Thunderbird from a command shell with the output redirected to a file, so we can capture log messages?

Please set the environment variable NSPR_LOG_MODULES=all:5 before you run Thunderbird. This will create a huge spew of tracing messages, including quite a bit of private information (names of email accounts, contents of messages, etc) so you probably won't want to attach it to the bug. You can go through and remove private information if you want, or if you're comfortable with this, contact me directly (bugzilla will show you my email address) to get a copy of the log output to me.

See https://wiki.mozilla.org/MailNews:Logging for more information about NSPR_LOG_MODULES. Note that I'd rather you didn't use NSPR_LOG_FILE, so that the captured log contains both the NSPR log messages *and* the other messages that TB prints directly to the terminal.
Flags: needinfo?(irving)
(In reply to :Irving Reid from comment #23)
> Chris, can you run Thunderbird from a command shell with the output
> redirected to a file, so we can capture log messages?
> 
> Please set the environment variable NSPR_LOG_MODULES=all:5 before you run
> Thunderbird. This will create a huge spew of tracing messages, including
> quite a bit of private information (names of email accounts, contents of
> messages, etc) so you probably won't want to attach it to the bug. You can
> go through and remove private information if you want, or if you're
> comfortable with this, contact me directly (bugzilla will show you my email
> address) to get a copy of the log output to me.
> 
> See https://wiki.mozilla.org/MailNews:Logging for more information about
> NSPR_LOG_MODULES. Note that I'd rather you didn't use NSPR_LOG_FILE, so that
> the captured log contains both the NSPR log messages *and* the other
> messages that TB prints directly to the terminal.
Hi Irving,

Okay, got 3 log files for you - am sending them in seperate email to you. In all
three it showed the problem (was hoping to get one that did not, but it seemed to
be doing it every time). In each case, I started up tbird, clicked on your email, waited
a bit, then hibernated, waited a bit, then woke up the system. Waited a bit, then
tried clicking on messages, buttons, and it was running really slow, some responses
taking a minute or more.

One note - I had to use the NSPR_LOG_FILE option to get anything at all to log. Without 
it, nothing at all come back to the command window (when running with no redirect) or
to the log file (when with redirecting). 

Anything else you need, just let me know.
I missed one detail - can you run the test again with NSPR_LOG_MODULES=all:5,timestamp to get timestamps on the messages, so we can see where the delay is?

Another approach would be to try the Gecko Profiler. You can get the latest pre-release version of the Profiler add-on from https://github.com/bgirard/Gecko-Profiler-Addon/raw/master/geckoprofiler.xpi (it is much better than the one at addons.mozilla.org), go to Tools/Add-ons, click the gear icon in the upper right and select "install add-on from file" to load the .xpi file into Thunderbird.

Once it's installed, click on "disabled" at the bottom right of the TB main window; it should change to say "enabled". Do your suspend and resume test, and then click "dump profile" (again, at the bottom right of the main window).

Feel free to look through the profile yourself; if you want one of us to look at it, click on the "share" button at the lower right of the Cleopatra window - this uploads a copy of the profile to a Mozilla server and generates a unique URL you can send us.
(In reply to :Irving Reid from comment #25)
> I missed one detail - can you run the test again with
> NSPR_LOG_MODULES=all:5,timestamp to get timestamps on the messages, so we
> can see where the delay is?
> 
> Another approach would be to try the Gecko Profiler. You can get the latest
> pre-release version of the Profiler add-on from
> https://github.com/bgirard/Gecko-Profiler-Addon/raw/master/geckoprofiler.xpi
> (it is much better than the one at addons.mozilla.org), go to Tools/Add-ons,
> click the gear icon in the upper right and select "install add-on from file"
> to load the .xpi file into Thunderbird.
> 
> Once it's installed, click on "disabled" at the bottom right of the TB main
> window; it should change to say "enabled". Do your suspend and resume test,
> and then click "dump profile" (again, at the bottom right of the main
> window).
> 
> Feel free to look through the profile yourself; if you want one of us to
> look at it, click on the "share" button at the lower right of the Cleopatra
> window - this uploads a copy of the profile to a Mozilla server and
> generates a unique URL you can send us.

Just sent you another logfile, this time with the timestamps. After the wakeup,
was clicking on message lines to display emails, and noticed errors at times
like 12:08:10.859000 in the log - may be a good lead?
The lines around 12:08:10 and later that contain "os error = 10035" are OK; those come from us using non-blocking network I/O.

Looking at the timestamped log, nothing jumps out at me. I'm inferring from the log that your computer was asleep from 12:03:29 until 12:05:02, since that's the only wide gap in timestamps. From then on there are several gaps in the timestamps, but no indication of any odd activity.

Do you have extensions installed? From the logs it looks like you're using the adaptive junk mail filter; could you try (a) restarting with add-ons disabled (from the Help menu) and (b) turn off the adaptive junk filter (it's in the Account Settings for each account), and see if the problem still occurs.

Aside from that, I think our best next step is to catch this in the Gecko Profiler.
Honza, we're trying to sort out a resume-from-sleep bug in Thunderbird, and it sounds a lot like the problems you fixed in bug 836869.

We've captured detailed logs of the suspend/resume (not attached to the bug due to user-private content). The only messages from the Timestamp module are right at the beginning of the log:

2013-09-19 12:03:05.593000 UTC - 0[210f140]: TimeStamp: HasStableTSC=1
2013-09-19 12:03:05.593000 UTC - 0[210f140]: TimeStamp: QPC frequency=2660000000
2013-09-19 12:03:05.593000 UTC - 0[210f140]: TimeStamp: timeIncrement=156250 [100ns]

There are no TimeStamp messages after the resume.

Any suggestions on how to investigate further to determine whether TimeStamp is involved in this problem?

Wayne, comment 4 indicates that the problem *does* occur in offline mode so I've edited the subject.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(honzab.moz)
Summary: slow after wake from hibernation with networking enabled in V.22.0b1. 20-30 to respond, and blank new message elerts → slow to respond, and blank new message alerts after wake from hibernation in V.22.0b1. 20-30
(In reply to :Irving Reid from comment #27)
> The lines around 12:08:10 and later that contain "os error = 10035" are OK;
> those come from us using non-blocking network I/O.
> 
> Looking at the timestamped log, nothing jumps out at me. I'm inferring from
> the log that your computer was asleep from 12:03:29 until 12:05:02, since
> that's the only wide gap in timestamps. From then on there are several gaps
> in the timestamps, but no indication of any odd activity.
> 
> Do you have extensions installed? From the logs it looks like you're using
> the adaptive junk mail filter; could you try (a) restarting with add-ons
> disabled (from the Help menu) and (b) turn off the adaptive junk filter
> (it's in the Account Settings for each account), and see if the problem
> still occurs.
> 
> Aside from that, I think our best next step is to catch this in the Gecko
> Profiler.

busy here at the moment - should be able to set up the gecko profiler and get
you that data by Monday.
(In reply to :Irving Reid from comment #27)
> The lines around 12:08:10 and later that contain "os error = 10035" are OK;
> those come from us using non-blocking network I/O.
> 
> Looking at the timestamped log, nothing jumps out at me. I'm inferring from
> the log that your computer was asleep from 12:03:29 until 12:05:02, since
> that's the only wide gap in timestamps. From then on there are several gaps
> in the timestamps, but no indication of any odd activity.
> 
> Do you have extensions installed? From the logs it looks like you're using
> the adaptive junk mail filter; could you try (a) restarting with add-ons
> disabled (from the Help menu) and (b) turn off the adaptive junk filter
> (it's in the Account Settings for each account), and see if the problem
> still occurs.
> 
> Aside from that, I think our best next step is to catch this in the Gecko
> Profiler.

Tried it with both the adaptive junk filter and addons off, still saw the 
problem. Looking into setting up the profiler now....
(In reply to :Irving Reid from comment #25)
> I missed one detail - can you run the test again with
> NSPR_LOG_MODULES=all:5,timestamp to get timestamps on the messages, so we
> can see where the delay is?
> 
> Another approach would be to try the Gecko Profiler. You can get the latest
> pre-release version of the Profiler add-on from
> https://github.com/bgirard/Gecko-Profiler-Addon/raw/master/geckoprofiler.xpi
> (it is much better than the one at addons.mozilla.org), go to Tools/Add-ons,
> click the gear icon in the upper right and select "install add-on from file"
> to load the .xpi file into Thunderbird.
> 
> Once it's installed, click on "disabled" at the bottom right of the TB main
> window; it should change to say "enabled". Do your suspend and resume test,
> and then click "dump profile" (again, at the bottom right of the main
> window).
> 
> Feel free to look through the profile yourself; if you want one of us to
> look at it, click on the "share" button at the lower right of the Cleopatra
> window - this uploads a copy of the profile to a Mozilla server and
> generates a unique URL you can send us.

Ran the profiler and uploaded the profile - took a long time to do that since
the GUI was running so slow due to the problem. Here is the url:
http://people.mozilla.org/~bgirard/cleopatra/?1379855359230#report=cc8a717fe801efb7b4a19c89ee1ad61d498b4c21
(In reply to :Irving Reid from comment #28)
> Honza, we're trying to sort out a resume-from-sleep bug in Thunderbird, and
> it sounds a lot like the problems you fixed in bug 836869.
> 
> We've captured detailed logs of the suspend/resume (not attached to the bug
> due to user-private content). The only messages from the Timestamp module
> are right at the beginning of the log:
> 
> 2013-09-19 12:03:05.593000 UTC - 0[210f140]: TimeStamp: HasStableTSC=1
> 2013-09-19 12:03:05.593000 UTC - 0[210f140]: TimeStamp: QPC
> frequency=2660000000
> 2013-09-19 12:03:05.593000 UTC - 0[210f140]: TimeStamp: timeIncrement=156250
> [100ns]
> 
> There are no TimeStamp messages after the resume.
> 
> Any suggestions on how to investigate further to determine whether TimeStamp
> is involved in this problem?
> 
> Wayne, comment 4 indicates that the problem *does* occur in offline mode so
> I've edited the subject.

HasStableTSC=1 means the timestamp is using directly the QPC API.  If there is a problem with it, then it's in QPC it self.

I'll think of whether hi values of the frequency could overflow easily the 64-bit storage and cause trouble that way.  It might be the cause.  The frequency is in hex 9E8C6100, so up to a 32 bit number.  However, it wraps every cca 220 years.
Flags: needinfo?(honzab.moz)
Chris, thanks for the profile. It shows that we're spending over 99% of the time in NtUserWaitMessage(). I'm not familiar with the details of our Windows event queue handling, so I'm not sure what to look at next.
Is this happening with the application based on platform 23 ?
(In reply to Honza Bambas (:mayhemer) from comment #34)
> Is this happening with the application based on platform 23 ?

I am seeing it with both TB and FF (happens much more often in TB, only
about 1 in 10 times with FF). Currently am running version 24 Tb, and
version 23.0.1 FF. Since first reporting this have updated several times,
seen it in all since then. Earlier comments list out the regression tests
I did that identified when it started.

Hope that answers your question - am assuming the platform number 
relates to the version major number?
Possible new clue - went a week or two without any occurance of
the problem, then had it again. One thing that seems to matter, is
whether the TB/FF windows are showing at all when going to hibernation,
or are completely obscured by other applications. If they are
not showing, it seems to prevent the problem, and if on top/showing, it
happens. Some interaction with waking/refreshing the windows maybe? I'll
keep track of what apps are on top when hibernating and see if that is
a definite pattern, or just a coincidence, and post again....
Interesting - seems to be holding true that problem only shows if
the TB window is visible (not hidden by another applications window)
when the computer is hibernated.

(In reply to Chris Rueby from comment #36)
> Possible new clue - went a week or two without any occurance of
> the problem, then had it again. One thing that seems to matter, is
> whether the TB/FF windows are showing at all when going to hibernation,
> or are completely obscured by other applications. If they are
> not showing, it seems to prevent the problem, and if on top/showing, it
> happens. Some interaction with waking/refreshing the windows maybe? I'll
> keep track of what apps are on top when hibernating and see if that is
> a definite pattern, or just a coincidence, and post again....
Anyone ever find a possible cause for this issue? It must be something common between firefox and thunderbird, since it shows up in BOTH products now - has gotten so bad that I have gone back to an old version of Firefox, since it was hanging completely on almost every wake from hibernation....
(In reply to Chris Rueby from comment #38)
> Anyone ever find a possible cause for this issue? It must be something
> common between firefox and thunderbird, since it shows up in BOTH products
> now - has gotten so bad that I have gone back to an old version of Firefox,
> since it was hanging completely on almost every wake from hibernation....

How v23 works?  See bug 836869.
As mentioned back in comment 3, V23 of tbird showed same issue.
As for V23 of firefox, have gone back to that version since it shows the issue less
often than the latest, but it is still there. Having the tbird and/or ff windows completely
behind other applications when hibernating seems to make it happen less often - if the
app is on top when hibernating, it is pretty much guaranteeded to hang up on the restart.
Chris, I don't have new info. But to get a new baseline, could you install 38.0b1 from beta link at http://www.mozilla.org/en-US/thunderbird/channel/
Is the problem gone?
Flags: needinfo?(crueby1)
Summary: slow to respond, and blank new message alerts after wake from hibernation in V.22.0b1. 20-30 → slow 20-30 sec to respond, and blank new message alerts after wake from hibernation only when application (TB/FF) was hibernated in foreground in V.22.0b1. We are spinning in NtUserWaitMessage
Wayne - I no longer have the PC from when this was first put in back in 2013, so cannot test it anymore. The newer machine is running win7, dont see the issue with this one. Problem never went away on old win-XP machine till its retirement late last year.
Flags: needinfo?(crueby1)
Well, it was a valiant effort Chris.

So to summarize, on XP Pro 64bit first appears TB22.0a1 2013-02-22, and TB23.0a1 after bug 836869 was fixed 2013-05-01, and still occurred in TB26.  

I'll leave it open for a bit more, but we lack a "working" testcase.  Other two hibernate bugs we have open are bug 896072 and bug 835766
OS: Windows Server 2003 → Windows XP
Whiteboard: [regression:tb22.0a1] → [regression:tb22.0a1][lacks testcase]
These have tended to be networking related, so moving there
Component: Untriaged → Networking
Product: Thunderbird → MailNews Core
Whiteboard: [regression:tb22.0a1][lacks testcase] → [closeme 2015-09-01][regression:tb22.0a1][lacks testcase]
closing based on comment 43


(In reply to Wayne Mery (:wsmwk, use Needinfo for questions) from comment #43)
> Other two hibernate bugs we have open are bug 896072 and bug 835766

and bug 598313.  http://mzl.la/1QCGT7r
Status: NEW → RESOLVED
Closed: 9 years ago
Keywords: testcase-wanted
Resolution: --- → INCOMPLETE
See Also: → 836869
Whiteboard: [closeme 2015-09-01][regression:tb22.0a1][lacks testcase] → [regression:tb22.0a1][lacks testcase]
You need to log in before you can comment on or make changes to this bug.