Closed Bug 1197152 Opened 9 years ago Closed 9 years ago

Timer rescheduling can incorrectly offset timers created after wakeup but before wake notification fires

Categories

(Core :: XPCOM, defect, P1)

40 Branch
x86_64
All
defect

Tracking

()

RESOLVED FIXED
mozilla43
Tracking Status
firefox40 --- wontfix
firefox41 + verified
firefox42 + verified
firefox43 --- verified
firefox-esr38 --- wontfix
thunderbird_esr38 41+ fixed

People

(Reporter: mozbugs, Assigned: jesup)

References

Details

(Keywords: regression)

Attachments

(13 files, 2 obsolete files)

1.07 KB, text/plain
Details
1.67 KB, text/plain
Details
308.13 KB, text/plain
Details
1.00 MB, text/plain
Details
871.07 KB, text/plain
Details
91.35 KB, text/plain
Details
3.16 KB, text/plain
Details
3.84 KB, text/plain
Details
3.79 KB, patch
Details | Diff | Splinter Review
3.24 KB, text/plain
Details
3.17 KB, text/plain
Details
3.13 KB, text/plain
Details
3.63 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:41.0) Gecko/20100101 Firefox/41.0
Build ID: 20150817163452

Steps to reproduce:

- start Firefox with a new profile
- ensure the Firefox is the active application!!
- put Mac to sleep (Apple-menu -> Sleep)
- wait for 10-40 seconds to ensure the Mac is in sleep mode
- bring back Mac from sleep (press shift key)
- click + to open a new tab several times (2-5 times), because nothing happens


Actual results:

Nothing happens for up to 10-15 seconds, after that the new tabs are opened


Expected results:

A new tab should have opened immediately
Trunk regression range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=03e78fff1844&tochange=f7e1f596d57d

Aurora regression range:
https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=95ff4ba30e93&tochange=1f4185f18537

Regressed between Firefox 40.0b3 and 40.0b4
https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=ec483a9f4dbf&tochange=664b98d33a1b

Bug suspected of causing this regression, the bug is in all the 3 ranges:

Randell Jesup — Bug 1178890: Update timer arrays after sleep to account for time sleeping r=bwc,froydnj
OS: Unspecified → Mac OS X
Hardware: Unspecified → x86_64
Keywords: regression
I can reproduce the problem on:
- 13" Macbook Pro Retina with MacOS 10.10.5
- 13" Macbook Pro (2010) with MacOS 10.10.5
Can confirm this behavior.

FF 40.0.2, Yosemite 10.10.5, Retina MBP

If computer goes to sleep long enough to require my password to be entered, upon logging back in Firefox becomes unresponsive for a significant period of time. Reverted to 39.0.3 and the problem disappears.
Blocks: 1178890
Assignee: nobody → rjesup
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Component: Untriaged → XPCOM
Product: Firefox → Core
I think the approach to troubleshooting this is to see if removing the sleep adjustment helps. It may be that Yosemite does something that confuses TimeDuration when sleep occurs.
I've updated one of my Macs to Yosemite, so I can check that
Can anyone verify if it happens on pre-10.10 Macs?
Ok, 40.0.2, new profile on 10.10.4 MBP 13" retina 3GHz (2014?): I don't see a delay, either waiting 30ish seconds to a minute, or waiting long enough to force password entry.  New tab is instantaneous after coming out of sleep.  I'll update to 10.10.5 just in case.

Perhaps something else is running, or some other setting is interfering?  People who can verify this, please try to strip down your systems (disable helper stuff, turn off unusual/non-default options) and see if anything correlates to this occurring.

Even if this affects a subset of mac users, we want to nail it.  Andrew has commented in the past that we have had reports going back years of beachballing coming out of sleep on mac, intermittently.
Flags: needinfo?(mozbugs)
Flags: needinfo?(greenythebeast)
Flags: needinfo?(eshepherd)
Flags: needinfo?(continuation)
10.10.5 - no difference; works well
I've done some extra testing on: Macbook Pro Retina running on battery (MacOS 10.10.5)

Disabled BetterTouchTool (http://www.bettertouchtool.net/) on the machine, which might be a factor, but the problem still occurs.

Steps to reproduce:

- create new user account on Mac
- login to this account
- start Firefox 40.0.2
- set Firefox as default browser
- close Firefox
- reboot Mac
- login to new account
- start Firefox again
- wait 2 minutes, until the display switches off (sometimes it takes 4 minutes before the screen switches off)
- wait another 5 seconds
- wake machine with Shift-key
- use touchpad to click on the new tab + sign

With these steps I can 100% reproduce the problem.

If you wait until the lock screen gets active and then wake the machine the problem does not occur.
(lock screen default after 5 minutes, but I had to set it to 30 minutes otherwise the lock screen sometimes got active after just 2 minutes)

During the hang / freeze there is no beach ball, the Firefox GUI is just frozen.

I can only reproduce the problem about 50% of the time with the steps described in comment 0:
- put Mac to sleep (Apple-menu -> Sleep)
- wait for 60 seconds to ensure the Mac is in sleep mode
Flags: needinfo?(mozbugs)
Similar to Onno, I have BetterTouchTool as well, but I do not believe it is the problem I quit the program to test and the problem still occurs 100%. My steps to reproduce are similar to Onno, however, I do not even have to wait several minutes. I select Sleep from the Apple menu and then rewake the computer using the touchpad. Then using, the touchpad I attempt to open a new tab and the GUI does not respond for 10-15 seconds.

Notes:
•I feel that the longer I let the computer sleep, the longer the GUI freeze occurs.
•During the GUI freeze, Firefox's Apple toolbar (File, Edit, View, History, Bookmarks, etc.) is responsive.
•GUI freeze still occurs when attempting to use "⌘+T" shortcut as well. The Apple toolbar flashes to denote a shortcut was entered, but the new tab doesn't appear for several seconds.
Flags: needinfo?(greenythebeast)
More notes:
•GUI freeze does not occur if Mac goes to sleep while the farthest left tab is active.
Yeah, people keep reporting it, going back for years, but I unfortunately never figured anything out.
Flags: needinfo?(continuation)
I'm still unable to repro it with Eli's sequence either, though I haven't tested it with more than one tab open - I'll try that this morning.

A diagnostic patch attached to another possible regression from this (in Thunderbird):
https://bugzilla.mozilla.org/attachment.cgi?id=8652739
Run with NSPR_LOG_MODULES=nsTimerImpl:3 NS_LOG_FILE=whatever

builds against inbound with the patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0bd296c164e

If that doesn't help me figure it out (or maybe just to speed things up), then try it with  Bug 1178890 backed out (so I can compare with/without bug), and perhaps trying one with with the diagnostic patch (and the bug) and NSPR_LOG_MODULES=nsTimerImpl:4 (which will generate an enormous amount of logs - try to repro quickly! and then quit).  It may be so large it'll need to be transferred by dropbox/etc.
Flags: needinfo?(mozbugs)
Flags: needinfo?(greenythebeast)
(In reply to Randell Jesup [:jesup] from comment #13)
> I'm still unable to repro it with Eli's sequence either, though I haven't
> tested it with more than one tab open - I'll try that this morning.
> 
> A diagnostic patch attached to another possible regression from this (in
> Thunderbird):
> https://bugzilla.mozilla.org/attachment.cgi?id=8652739
> Run with NSPR_LOG_MODULES=nsTimerImpl:3 NS_LOG_FILE=whatever
> 
> builds against inbound with the patch:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0bd296c164e
> 
> If that doesn't help me figure it out (or maybe just to speed things up),
> then try it with  Bug 1178890 backed out (so I can compare with/without
> bug), and perhaps trying one with with the diagnostic patch (and the bug)
> and NSPR_LOG_MODULES=nsTimerImpl:4 (which will generate an enormous amount
> of logs - try to repro quickly! and then quit).  It may be so large it'll
> need to be transferred by dropbox/etc.

Yeah, I think it's important to try it with at least 2 tabs. As I said, the bug does not occur if the leftmost tab is active. So if you only have 1 tab open, that would satisfy that situation I would think. I always have a pinned tab when using Firefox.
Flags: needinfo?(greenythebeast)
Attachment #8652904 - Attachment description: 2 tabs open, problem reproduced (level 4) → 2 tabs open, problem reproduced (level 4) (child-1)
Reproduced with try-build, with steps from comment 9:
- 2 tabs open (this bug and New Tab page)

log files with level 3 and level 4 logging attached
Flags: needinfo?(mozbugs)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f6b3405e3177

I added more details to the log so I can correlate each timer entry with later debugs (with :4 logging).  Can you generate new logs with :4?  Thanks!
Flags: needinfo?(mozbugs)
From the  Thunderbird bugreport (which is apparently easier to hit):

So this provides a smoking gun as to the cause:  timers that are set after we wake up get modified as well.  Likely this was also the cause of the hard-to-reproduce problems that mccr8 has referred to going back a Long Time (the old timer code pushed all timers forward, and effectively restarted them from scratch, so a 100sec timer with 1 second left went back to 100s).  This bug can make that worse, if the machine is asleep a long time, *and* the timer gets rescheduled before the timer code itself processes the wakeup (which may be rare in "normal" use; in Firefox I've been totally unable to reproduce the OSX-freezes-on-wake bug even with step-by-step instructions).
Attached file try2-level-4.txt
Flags: needinfo?(mozbugs)
Yeah, this matches: 2080355072[1003a52d0]:   
Update mTimeout[22] (timer=12aad8c20, type = 0) from 3951ms to 16377ms (original delay 4000ms)
Update mTimeout[23] (timer=11d3643a0, type = 0) from 4848ms to 17275ms (original delay 5000ms)
Update mTimeout[24] (timer=11d583dc0, type = 0) from 9999ms to 22426ms (original delay 10000ms)
Byron notes (correctly) that not adjusting them all could break an (unstated) invariant that set_timer(n); set_timer(n+m for any m >= 0) will always expire in that order.  If "sleep" happens between the two set_timer calls and the second is *before* the wakeup, with this patch could break that expectation.  This is a pretty edge-case-ish issue, but if that's an invariant then the current patch could break it.  So that's something we should decide.

Also, instead of trying to figure out what was before or after, we could just not adjust anything.  Downside might be high - large numbers of timers expiring at once on wakeup.  *Some* usecases (mostly *long* multi-minute/hour/day timers) might want that.  Most wouldn't.  (We could even expose an API point (flag) for indicating "I want this timer to be absolute" and not reschedule those, while reschduling all the others - but that's a major-ish change.)
Flags: needinfo?(nfroyd)
Flags: needinfo?(docfaraday)
So, let's look at the spec for setTimeout, since we're going to need to be able to honor that, and it uses the same timer code.

http://www.w3.org/TR/2011/WD-html5-20110525/timers.html#dom-windowtimers-settimeout

In "The setTimeout() method must run the following steps: ", we see:

"If the method context is a Window object, wait until the Document associated with the method context has been fully active for a further timeout milliseconds (not necessarily consecutively)."

The definition of "fully active" does not explicitly mention machine sleep, but it would be reasonable to interpret this as applying, meaning that we ought to perform the adjustment. However, I'm not seeing any code that actually causes timers to stop ticking if a Document goes inactive, so it may be that we're already pretty non-compliant with this.

We also have:

"Wait until any invocations of this algorithm started before this one whose timeout is equal to or less than this one's have completed."

This is the invariant I've been talking about.
Flags: needinfo?(docfaraday)
Attached file try3-run1.txt
Attached file try3-run2.txt
(In reply to Randell Jesup [:jesup] from comment #24)
> Possible fix:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=4e5887e3288e
> (patch is on the other bug 1196662)

The problem still exists, tried twice, both times the hang / freeze occurred
Onno - thanks.  I think I know why it still failed; I'd need to re-sort the array if I don't offset them all by the same amount.  (I made this one quickly while about to run out the door, and only verified it compiled.)  That's a somewhat less trivial patch, though fundamentally straightforward.

After thinking about this, I'm strongly considering dropping all attempts to adjust the timers on wakeup; this would mean that many timers may fire immediately on wakeup.  For some it's useless or unproductive; for others it can actually help ("re-register every N minutes" or "refresh every N minutes" type - though a lot of them going off at once can have it's own downsides (perf load/sluggishness on wakeup)).  We don't handle sleep/wakeup on linux today.  It may slightly violate the "fully active" part above; but it's a bit unclear what that means, and in any case we don't adjust on Linux today.

If we can reliably tell if a timer was set after wakeup or not, I think we can adjusttimers while retaining the invariant above by adding some bookkeeping perhaps.  If we reliably know a lower bound on the time slept, then we can adjust all timer values by that that don't have an apparent start time in the gap, and the invariant should hold.  The trick is to be sure of a lower bound on that, which isn't so easy.   But, per above, maybe we shouldn't try so hard to do so.
[Tracking Requested - why for this release]:
Important regression (though hard to hit in most cases)
Note: Windows and Mac only (no sleep/wakeup on Linux)

See also byron's comment in bug 1196662 comment 50 about how one might maintain the invariant, at a fair bit of complexity cost.
Severity: normal → major
OS: Mac OS X → All
Priority: -- → P1
Summary: Firefox hangs for several seconds after wakeup (MacOS 10.10) → Timer rescheduling can incorrectly offset timers created after wakeup but before wake notification fires
Blocks: 1196662
Fully removes all sleep/wake functionality; the large 'changed' block is just re-indented after removing the if (mSleeping) test.  Note that the observer notifications still happen, so code can react to sleep/wakeup.
Safe patch that's upliftable to remove wakeup timer adjustments.  Note that the 'remove all' patch I put up may not want to be so extreme - it should probably still Notify() the monitor to wake up the timer thread, even if it doesn't touch the array, so the observers, runnable and ::DoAfterSleep (looking much like this patch) would remain.
(In reply to Byron Campen [:bwc] from bug 1196662 comment 44)
> So the callback is firing late or something? 

Scenarios where a timer creeps in:
* Another wake observer runs before ours (likely the Thunderbird case, see bug 1196662 comment 52).  Note that Download Manager also watches wakeup.
* We went to sleep while processing an event on MainThread, and on wakeup, before processing new messages and noticing the wakeup, we scheduled a timer
* Another thread ran after wakeup and before MainThread processed the wakeup notification, and set a timer.
* and your bug 1196662 comment 45 (timer thread loop ran before wakeup notification fires); I'm not sure I want to think about that, but if that did happen, it would appear like a *very* short sleep (we're reset the last run time), and we can ignore doing adjustments on all "short" sleeps in any case, so this one is solvable.
(In reply to Randell Jesup [:jesup] from comment #33)
> Created attachment 8653929 [details] [diff] [review]
> possible fix; use with NSPR_LOG_MODULES=nsTimerImpl:3 (has invariant issues)
> 
> should fix the sorting issue, but not retaining all invariants.  Try:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfb800636cf8

Tried with 3 different sleep times: 30 seconds, 5 minutes and 10 minutes sleep.

I can not reproduce the hang / freeze problem with this build :)
Blocks: 1197625
Comment on attachment 8653946 [details] [diff] [review]
Alternative patch to minimally remove wakeup timer adjustment

Let's land the simple disabling, which should be totally safe, and uplift it.  We can continue to discuss if we should go to the effort to retain all the invariants while still adjusting timers, or drop all support for timer adjustment (as there are use-cases for timers where you wouldn't want them adjusted).

Note: I'll add a comment in place of the removed block: // Timer adjustment after wakeup disabled; see bug 1197152

Nathan may not be available, so r? to Andrew as well.
Attachment #8653946 - Flags: review?(nfroyd)
Attachment #8653946 - Flags: review?(continuation)
Flags: needinfo?(eshepherd)
Comment on attachment 8653946 [details] [diff] [review]
Alternative patch to minimally remove wakeup timer adjustment

I'm happy to review this, but I've never looked at this code before, so it is going to take me some time. Maybe bwc can finish a review of this before I'm done with that, if he's around.
Attachment #8653946 - Flags: review?(docfaraday)
Comment on attachment 8653946 [details] [diff] [review]
Alternative patch to minimally remove wakeup timer adjustment

In IRC, bsmedberg said we should just wait for froydnj, who should be back in a day or two, so I'm clearing the other review requests.
Attachment #8653946 - Flags: review?(docfaraday)
Attachment #8653946 - Flags: review?(continuation)
Comment on attachment 8653946 [details] [diff] [review]
Alternative patch to minimally remove wakeup timer adjustment

Review of attachment 8653946 [details] [diff] [review]:
-----------------------------------------------------------------

I can't give this an r+ since I'm not a peer, but here's some feedback.

::: xpcom/threads/TimerThread.cpp
@@ -755,5 @@
>  void
>  TimerThread::DoAfterSleep()
>  {
>    // Mainthread
>    TimeStamp now = TimeStamp::Now();

We don't need this anymore.

@@ -776,2 @@
>    mSleeping = false;
>    mLastTimerEventLoopRun = now;

We don't need this member anymore.

@@ -776,4 @@
>    mSleeping = false;
>    mLastTimerEventLoopRun = now;
>  
>    // Wake up the timer thread to process the updated array

I'm not sure it is necessary to wake up the timer thread here, but even if it is this comment is not right anymore.
Wondering if this is a recent regression. Tracked for 41 and 42. Tabs opening up after a 10-second delay is bad :(
(In reply to Byron Campen [:bwc] from comment #44)
> >    // Mainthread
> >    TimeStamp now = TimeStamp::Now();
> 
> We don't need this anymore.
> 
> @@ -776,2 @@
> >    mSleeping = false;
> >    mLastTimerEventLoopRun = now;
> 
> We don't need this member anymore.

Yeah, for the "full" remove-unneeded-stuff patch, see one of the others.  This is safer for uplift to beta, and we may want to discuss full removal more before committing to it.

> 
> @@ -776,4 @@
> >    mSleeping = false;
> >    mLastTimerEventLoopRun = now;
> >  
> >    // Wake up the timer thread to process the updated array
> 
> I'm not sure it is necessary to wake up the timer thread here, but even if
> it is this comment is not right anymore.

True, easily dealt with. We do want to wake up the timer thread I think; I'm not sure I trust everything to happen "right" without it.  And it's safe.
Comment on attachment 8653946 [details] [diff] [review]
Alternative patch to minimally remove wakeup timer adjustment

Review of attachment 8653946 [details] [diff] [review]:
-----------------------------------------------------------------

Long comment just to log this stuff in a bug somewhere:

The initial behavior, originally from bug 197863, on wakeup was that we'd reset the expiration point of the timer to |wakeup_time + delay|, where |delay| was the original delay of the timer.  On reflection, that's not correct because a timer might have been almost ready to fire when we went to sleep.  Moving it forward by the delay of the timer means that the timer was effectively set for |2*delay|, and might possibly never go off, depending on suspend/resume cycles and how long the delay was.

The modified behavior from bug 1178890 basically does the same thing, except that we're trying to keep the timer array invariants correct.

I think there are two reasonable things to do:

1. No adjustments whatsoever, let timers expire en masse on resume.  This treats the delays on timers as requests for absolute times in the future.
2. For timers set prior to suspend, determine how much time they had remaining, and reschedule them for that amount of time post-resume.  This treats the delays on timers as "I want to do this after Firefox has been running for N ${UNIT}s."  Time spent suspended is then treated as time that Firefox isn't running.

This patch implements #1.  It's worth pointing out that bug 197863 was implemented specifically because the storm of timers going off after sleep caused CPU issues.  While CPUs have gotten faster in the last decade or so, having all those timers wake up doesn't seem like the greatest thing.

It's also worth pointing out the high CPU usage problems were originally seen on Macs.

I've tried looking around to see what other folks do.  I haven't been able to find the equivalent code in WebKit (but I don't have a WebKit checkout and haven't figured out how to let MXR give me lots of results for search queries).  I'm having a hard time grokking the Linux kernel code...I think it might do #2?  I do not have a copy of the OS X kernel source handy to grovel around in, and clicking through the website seems tedious.

I'm reluctant to take this, given the problems that it initially fixed (poorly, let it be said).
Attachment #8653946 - Flags: review?(nfroyd)
I think I addressed the ni?, canceling.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #47)
> 
> This patch implements #1.  It's worth pointing out that bug 197863 was
> implemented specifically because the storm of timers going off after sleep
> caused CPU issues.  While CPUs have gotten faster in the last decade or so,
> having all those timers wake up doesn't seem like the greatest thing.
> 
> It's also worth pointing out the high CPU usage problems were originally
> seen on Macs.

I would not be surprised if the horrible load coming out of sleep was due to TYPE_REPEATING_PRECISE (which was removed in bug 1190735); I would expect the old TYPE_REPEATING_PRECISE to rapidly reschedule itself until it had fired enough times to cover the duration spent in sleep. If we uplift the patch for disabling timer adjustment, this could be a problem; we would likely need to also uplift bug 1190735, or a more minimal patch. Even taking this in consideration, this is a vastly smaller (and less risky) changeset than would be required to reschedule while maintaining the invariants we are supposed to.
I should also point out that invariant-preserving rescheduling would need to treat repeating timers differently, adding further to the complexity of the code. Lastly, we have no test-cases for sleep whatsoever, which means that this code would have no test coverage.
(In reply to Byron Campen [:bwc] from comment #49)
> (In reply to Nathan Froyd [:froydnj] from comment #47)
> > 
> > This patch implements #1.  It's worth pointing out that bug 197863 was
> > implemented specifically because the storm of timers going off after sleep
> > caused CPU issues.  While CPUs have gotten faster in the last decade or so,
> > having all those timers wake up doesn't seem like the greatest thing.
> > 
> > It's also worth pointing out the high CPU usage problems were originally
> > seen on Macs.
> 
> I would not be surprised if the horrible load coming out of sleep was due to
> TYPE_REPEATING_PRECISE (which was removed in bug 1190735); I would expect
> the old TYPE_REPEATING_PRECISE to rapidly reschedule itself until it had
> fired enough times to cover the duration spent in sleep.

Ah, thanks for pointing that out!  That's a reasonable hypothesis.

> If we uplift the
> patch for disabling timer adjustment, this could be a problem; we would
> likely need to also uplift bug 1190735, or a more minimal patch. Even taking
> this in consideration, this is a vastly smaller (and less risky) changeset
> than would be required to reschedule while maintaining the invariants we are
> supposed to.

Taking bug 1190735 seems less risky than writing a bunch of new code.
Comment on attachment 8653946 [details] [diff] [review]
Alternative patch to minimally remove wakeup timer adjustment

I am OK with this patch for beta as long as bug 1190735 comes with it, as bwc pointed out.

ni? to jesup for what he wants to do with the more complete removal patch.  Seems like it would be nice to land the complete removal patch if possible...
Flags: needinfo?(rjesup)
Attachment #8653946 - Flags: review+
REPEATING_PRECISE would in fact cause some fun (which also makes me wonder whether anyone has been complaining on Linux... since it doesn't appear to have sleep/wakeup triggers).  I hadn't noticed that only landed in 42.  "hog the CPU for minutes, while processing timers that should have fired during the time that the machine was asleep" => implies to me strongly REPEATING_PRECISE or equivalent.

As for complete removal - if we're going to stick with no rescheduling, then it makes total sense to land complete removal, either on Nightly with the just-disable patch on Aurora/Beta, or land just-disable on all branches, and follow up shortly with complete removal on Nightly.  (This would be to get extra testing in on Nightly while we're landing it on Aurora/Beta/ESR38.)

At this point, given the arguability of "correct" behavior, and the complexity of trying to retain all invariants in all cases, I lean towards a complete removal, with the just-disable patch used for the uplifts.  (The specs as Byron notes do imply we should offset the timers on wakeup, but I doubt anyone depends on that - though it would be interesting to find out what chrome does.)  

It's too bad we can't reliably get notified first.  If we want to continue to offset, we could infer on new-timer-scheduling (or timer-firing) that we'd been asleep if the time is a long time after it "should" be, or a long time after sleep notification.  Still tricky to get right, and I think not worth the hassle.
Flags: needinfo?(rjesup)
Attachment #8653943 - Flags: review?(nfroyd)
Attachment #8653943 - Flags: review?(nfroyd)
Comment on attachment 8653943 [details] [diff] [review]
Alternative to remove all sleep/wake functionality from Timers

We shouldn't remove quite so much (I threw this together to get an idea what might be in a full removal); we want to re-process the array on wakeup, and the code to slow timer firing during sleep should be retained unless we have a good reason to yank it (though perhaps that warrants separate investigation as to the use/benefit/reason for it).  New patch shortly, which is much closer to the basic removal patch (and thus safer).
Attachment #8653943 - Attachment is obsolete: true
this should be good for nightly/aurora/beta (with the REPEATING_PRECISE change)
Attachment #8658214 - Flags: review?(nfroyd)
Attachment #8658214 - Flags: review?(nfroyd) → review+
Depends on: 1190735
Attachment #8653946 - Attachment is obsolete: true
Comment on attachment 8658214 [details] [diff] [review]
Bug 1197152: Remove all sleep/wake adjustments from Timers

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:

Thunderbird builds on ESR38, and this makes Thunderbird fairly unusable.  Note: this will need the REPEATING_PRECISE removal on ESR38 as well.


Approval Request Comment
[Feature/regressing bug #]: bug 1178890

[User impact if declined]: Problems waking up on mac/windows, especially on Thunderbird, but also some instances on some macs (depends on timing).  May cause a few hard-to-reproduce problems.

[Describe test coverage new/current, TreeHerder]: Lots use timers, but nothing tests or can easily test sleep.

[Risks and why]: Low risk - simply doesn't touch the timer array at all on wakeup; almost the same as what happens on linux - we just wake up the Timer thread to re-process the array and fire any expired timers.  Note: needs the dependent bug for beta (remove REPEATING_PRECISE timers)

[String/UUID change made/needed]: none
Attachment #8658214 - Flags: approval-mozilla-esr38?
Attachment #8658214 - Flags: approval-mozilla-beta?
Attachment #8658214 - Flags: approval-mozilla-aurora?
We will need the patch for ESR from bug 1190735 as well for this.
Nathan, could you please provide your assessment on the risk associated with uplifting this fix to Beta41? 

We will gtb 41RC on Monday so we do not have a lot of wiggle room for unexpected regressions. I am wondering if this is a fix that is low risk and an issue that is a must for Beta41 or can it wait to be fixed in 42?
Flags: needinfo?(nfroyd)
FYI, part of my concern with fixing this (outside of the impact on Thunderbird) is that in cases where it does bite us, we may not realize it (i.e. timer never fires, or fires hours late is very hard to diagnose).  The Mac tab-open bug (which I still can't repro, but others like lizzard can) might lead to a delay as long as the time the machine was asleep before the timer in question fires.  If asleep for a day, it effectively will be the same as "browser is locked when opening, and I had to kill it".  Those sorts of instances are hard to track, and can lead to "it's broken, I'll just use Xxxxx".

The positive side would be that it is hard to hit it appears (outside of thunderbird), and it's unclear if any people are hitting it on windows (again, outside of thunderbird which apparently watches for wake notifications).

The diagnostic patches did very clearly highlight the problem.
Adding checkin-needed since the tree is still closed and I may be afk when it opens
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/aa9951b7efeb
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
(In reply to Ritu Kothari (:ritu) from comment #58)
> Nathan, could you please provide your assessment on the risk associated with
> uplifting this fix to Beta41? 
>
> We will gtb 41RC on Monday so we do not have a lot of wiggle room for
> unexpected regressions. I am wondering if this is a fix that is low risk and
> an issue that is a must for Beta41 or can it wait to be fixed in 42?

It's an issue that's pretty annoying when it comes up, so I'd like to see it fixed.  jesup has repeatedly said that this patch is low risk, and I don't think these patches can make the situation much worse than it already is.
Flags: needinfo?(nfroyd)
Comment on attachment 8658214 [details] [diff] [review]
Bug 1197152: Remove all sleep/wake adjustments from Timers

This bug is pretty severe and the risk assessment provided by devs is low. Given that let's uplift to Beta41 and Aurora42.
Attachment #8658214 - Flags: approval-mozilla-beta?
Attachment #8658214 - Flags: approval-mozilla-beta+
Attachment #8658214 - Flags: approval-mozilla-aurora?
Attachment #8658214 - Flags: approval-mozilla-aurora+
Flags: qe-verify+
OK, after some discussion with the rest of the team, we are not taking this for esr38 since we're going to try to keep more strictly to security fixes for 38 ongoing. 

But, we could take this on a relbranch for Thunderbird so that it addresses what seems to be the most painful point of failure. That's Sylvestre's suggestion, and I'm not sure how that works, so maybe he can chime in here or n-i whoever should be aware of that intention.
If I understand the relbranch for TB correctly, the idea would be to put the fix for this bug into the ESR for TB. As one of the many users affected by bug 1196662, I would urge this fix go in as quickly as feasible for TB. This affects TB for everyone who hibernates or sleeps their devices. For devices which are put to sleep several times a day, the work-around for bug 1196662 is more painful than for us who only hibernate or sleep our devices at night. That work-around is closing, then restarting TB. Forgetting to do this means not receiving new e-mails until one remembers to recycle TB. For critical endeavors, this could be costly.
The TB developers are familiar with this process. This has been done before.
Flags: needinfo?(sledru)
Attachment #8658214 - Flags: approval-mozilla-esr38? → approval-mozilla-esr38-
Reproduced the initial issue using an iMac 21.5-inch, Late 2013 on Firefox 40.0.2 following steps from comment 9. Verified that the issues does not reproduce on Firefox 41.0 build 3.
Also verified as fixed using Firefox 42 beta 2 and latest Developer Edition 43.0a2 on the same machine with Mac OS X 10.10.5.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: