Closed Bug 762620 Opened 11 years ago Closed 11 years ago

Evidence that idle-daily can start misfiring

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Android
defect
Not set
critical

Tracking

(firefox14+ wontfix, firefox15+ fixed, blocking-fennec1.0 .N+)

RESOLVED FIXED
Firefox 16
Tracking Status
firefox14 + wontfix
firefox15 + fixed
blocking-fennec1.0 --- .N+

People

(Reporter: gcp, Assigned: gcp)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(6 files, 1 obsolete file)

It seems that some stuff that fires in idle-daily can start to go bonkers in Native Fennec:

https://bugzilla.mozilla.org/show_bug.cgi?id=758193
https://bugzilla.mozilla.org/show_bug.cgi?id=762590
Severity: normal → critical
Telemetry is getting basically DDOS-ed from Fennec devices, so this is important.
blocking-fennec1.0: --- → ?
Blocks: 762590
qawanted for str
Assignee: nobody → gpascutto
blocking-fennec1.0: ? → betaN+
Keywords: qawanted
Can the telemetry guys give us some indication of they can see if its specific to a certain Android device or Android version?
Gian-Carlo, On the data service I was seeing Firefox 14.0 on Android mostly. I don't record the payloads themselves right now when I reject the requests so I can't give you much more detail than that. Here's an example user-agent string that I log:

Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0
Do you have the possibility to record one or some of the payloads?
Need a way of logging or tracking to continue here; re comment #2.
GCP - Can you add logging to verify that idle-daily is really firing too much?
Is this a problem with the general idle-daily mechanism or is it specific to Telemetry?  I'm wondering if things like AUS or Blocklist could also be getting "DOS'ed"
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #8)
> Is this a problem with the general idle-daily mechanism or is it specific to
> Telemetry?  I'm wondering if things like AUS or Blocklist could also be
> getting "DOS'ed"

it's a general problem... So it's very likely those will be affected too, can you check logs?

For telemetry, I'm not sure how much metrics can do since duplicate telemetry pings are discarded asap on server. Xavier I think the way to go here is to check when Fennec 14 started firing high-frequency idle-daily pings to get  a regression range. Do we have http logs that go a month or two back?
I'd suggest to put this in Nightlies so we have more info/are able to detect if someone manages to reproduce.
Attachment #632006 - Flags: review?(mark.finkle)
blocking-basecamp: --- → ?
blocking-basecamp: ? → ---
Comment on attachment 632006 [details] [diff] [review]
Patch 1. Add logging for idle-daily

I'm OK with adding this for a week or two, until we get a better feel for the situation.

Let's remember to remove it though.
Attachment #632006 - Flags: review?(mark.finkle) → review+
>I'm wondering if things like AUS or Blocklist could also be getting "DOS'ed"

FWIW AUS/blocklist are on the update/application-timer, not on idle-daily, so we're expecting them not to be affected. Of course if it's something like a device/firmware bug that is triggering this then all bets are off.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #12)
> >I'm wondering if things like AUS or Blocklist could also be getting "DOS'ed"
> 
> FWIW AUS/blocklist are on the update/application-timer, not on idle-daily,
> so we're expecting them not to be affected. Of course if it's something like
> a device/firmware bug that is triggering this then all bets are off.

I've asked the AMO guys to let us know if they're seeing higher than normal traffic, just in case.
I/PRLog   (10837): 855712[5f051080]: idleService: Register idle observer 5d50d5a0 for 300 seconds
I/PRLog   (10837): 855712[5f051080]: idleService: Register: adjusting next switch from -1 to 300 seconds
I/PRLog   (10837): 855712[5f051080]: idleService: next timeout 1339486359856191 usec (-551 msec from now)
I/PRLog   (10837): 855712[5f051080]: idleService: SetTimerExpiryIfBefore: next timeout 1339486359856191 usec
I/PRLog   (10837): 855712[5f051080]: idleService: reset timer expiry from 0 usec to 1339486360418479 usec
I/PRLog   (10837): 855712[5f051080]: idleService: Get idle time: polled 1546536376 msec, valid = 0
I/PRLog   (10837): 855712[5f051080]: idleService: Get idle time: time since reset 300561 msec
I/PRLog   (10837): 855712[5f051080]: idleService: Idle timer callback: current idle time 300561 msec
I/PRLog   (10837): 855712[5f051080]: idleService: Reset idle timeout (last interaction 300561 msec)
I/PRLog   (10837): 855712[5f051080]: idleService: next timeout 1339486359857350 usec (-561 msec from now)
I/PRLog   (10837): 855712[5f051080]: idleService: SetTimerExpiryIfBefore: next timeout 1339486359857350 usec
I/PRLog   (10837): 855712[5f051080]: idleService: reset timer expiry from 0 usec to 1339486360428651 usec
I/PRLog   (10837): 855712[5f051080]: idleService: ReconfigureTimer: no idle or waiting observers
I/PRLog   (10837): 855712[5f051080]: idleService: Idle timer callback: tell observer 5d50d5a0 user is idle
I/PRLog   (10837): 855712[5f051080]: idleService: Remove idle observer 5d50d5a0 (300 seconds)
I/PRLog   (10837): 855712[5f051080]: idleService: Get idle time: polled 1546536376 msec, valid = 0
I/PRLog   (10837): 855712[5f051080]: idleService: Get idle time: time since reset 300569 msec
I/PRLog   (10837): 855712[5f051080]: idleService: Idle timer callback: current idle time 300569 msec
I/PRLog   (10837): 855712[5f051080]: idleService: Reset idle timeout (last interaction 300569 msec)

Some observations:
1) I had to fix the logging to get the "-551 msec from now". The logging in the code is wrapping around. Is this value supposed to be negative?
2) If I understand what the code is doing, it's reprogramming the next timeout *before* the last one fired. There is some code that protects against that by adding 10 ms to the timeout. As a result if fires twice consecutively.

I'm going to dig a bit more into this, but the above looks a bit strange to me. If anyone more familiar with this service wants to comment...
> Some observations:
> 1) I had to fix the logging to get the "-551 msec from now". The logging in
> the code is wrapping around. Is this value supposed to be negative?

Yes, this is the case where somebody registers an idle observer saying "notify me when we've been idle for 300 seconds" and the application has already been idle for 300.551 seconds.

> 2) If I understand what the code is doing, it's reprogramming the next
> timeout *before* the last one fired. There is some code that protects
> against that by adding 10 ms to the timeout. As a result if fires twice
> consecutively.
> 
> I'm going to dig a bit more into this, but the above looks a bit strange to
> me. If anyone more familiar with this service wants to comment...

I might have an idea of what's happening. Does Telemetry want the behaviour "call my callback every 300 seconds while the application is idle"? The idle service doesn't actually support that right now. Instead what you get is:

Telemetry registers idle observer for 300 seconds
Idle timer fires when app is idle 300 seconds 
 Calls Telemetry observer
   Telemetry observer de-registers itself, does its action, registers a *new* idle observer for 300 seconds
Idle timer fires immediately (plus 10 ms fudge factor) because app is already idle more than 300 seconds
  Calls Telemetry observer again
    (and we either go into our loop, or fail out based on the bug below)


The difference that I saw, working on desktop Thunderbird, was that the Telemetry observer would throw an exception rather than continually re-register the observer; I filed https://bugzilla.mozilla.org/show_bug.cgi?id=743776 about that.
Blocks: 763991
just for information, idle-daily is nothing special, it is just a simple idle observer that once gets an idle notification verifies if it did fire in the last 24 hours, and, if not, it notifies.
The problem here seems to be more related to the idle service itself, than idle-daily, maybe the Android backed is just wrongly measuring idle?
Telemetry just needs to be notified about the first idle _after_ idle-daily.
For what it's worth, nsIdleDaily on Android doesn't use polling, but just timers and resets based on Gecko Events. i.e. there is no Android-specific code in the service when compiled for Android.

Even if it were measuring idle time wrongly, I don't think that should start the rapid-fire behavior that the reports in the description seem to suggest.

I'd certainly not be surprised if the clocks on some Android devices could behave..."unexpectedly" to trigger this, though.
Attached file Log of timer service
Here's some more strange behavior caught when logging.
Add more logging so we can see what is going on. This will be fairly low output if things work normally, but more verbose if stuff goes bonkers.
Attachment #632006 - Attachment is obsolete: true
Attachment #632351 - Flags: review?(mark.finkle)
Attachment #632351 - Flags: review?(mark.finkle) → review+
You need to use the whiteboard annotation [1], or the (new, still in testing) tool will close as fixed regardless :-)

[1] https://wiki.mozilla.org/Tree_Rules/Inbound#Please_do_the_following_after_pushing_to_inbound
Whiteboard: [leave open]
(In reply to Gian-Carlo Pascutto (:gcp) from comment #17)
> For what it's worth, nsIdleDaily on Android doesn't use polling

I guess you meant nsIdleService here, as I said idleDaily is just a normal observer added to the service, I think we are making confusion here, the problem must be idle, not the daily notification.
Btw, maybe the "difference" is exactly this one, polling has been fixed/changed recently, while the events based one may be mis-firing.
with telemetry disabled, I don't think this blocks anymore
blocking-fennec1.0: betaN+ → ?
blocking-fennec1.0: ? → .N+
(In reply to Gian-Carlo Pascutto (:gcp) from comment #18)
> Created attachment 632347 [details]
> Log of timer service
> 
> Here's some more strange behavior caught when logging.

That log is interesting in that it shows the underlying one-shot timer we set up at http://mxr.mozilla.org/comm-central/source/mozilla/widget/xpwidgets/nsIdleService.cpp#680 is calling us back much sooner than we expect; this could point at a problem with the timer service on Android.

The log also shows that the idle service correctly keeps resetting the timer until the desired idle time has passed, and only calls the idle listener then.


The idle daily service is also using the timer service to wait for one day between activities (see http://mxr.mozilla.org/comm-central/source/mozilla/widget/xpwidgets/nsIdleService.cpp#113). If the underlying timer service is calling back too quickly this would cause the daily event to re-run. Adding a log call to nsIdleServiceDaily::DailyCallback() would help us spot this.
To clarify, the sequence of events I'm postulating is:

1) idle-daily registers listener waiting for 300 sec. of idle time

2) idle service calls back idle-daily handler

3) idle-daily handler de-registers listener, notifies daily observers, registers @mozilla.org/timer;1 for SECS_PER_DAY * MSEC_PER_SEC

4) timer service calls back idle-daily long before the requested day has passed

5) idle-daily timer handler registers idle listener waiting for 300 sec.

6) idle service calls back handler listener immediately, because we've already been idle more than 300 sec.

7) Go to 3) and repeat
(In reply to Irving Reid (:irving) from comment #27)
> To clarify, the sequence of events I'm postulating is:
> 
> 1) idle-daily registers listener waiting for 300 sec. of idle time
> 
> 2) idle service calls back idle-daily handler
> 
> 3) idle-daily handler de-registers listener, notifies daily observers,
> registers @mozilla.org/timer;1 for SECS_PER_DAY * MSEC_PER_SEC
> 
> 4) timer service calls back idle-daily long before the requested day has
> passed
> 
> 5) idle-daily timer handler registers idle listener waiting for 300 sec.
> 
> 6) idle service calls back handler listener immediately, because we've
> already been idle more than 300 sec.
> 
> 7) Go to 3) and repeat

Irving, are you able to provide steps to reproduce for this bug?
Tony, I haven't observed the bug myself; that's :gcp. I'm just trying to help interpret his log file based on the log messages I added to the timer service a little while ago.
I never observed this bug either, the log I pasted is from running Fennec normally. We seem to be seeing that this basically never works and likely never worked correctly on Android.
So would this mean just keeping the device charged and then having fennec in the foreground, but just letting the device go to sleep?
As requested above in the comments.
Attachment #637386 - Flags: review?(blassey.bugs)
Attachment #637386 - Flags: review?(blassey.bugs) → review+
>The idle daily service is also using the timer service to wait for one day between 
>activities (see http://mxr.mozilla.org/comm-central/source/mozilla/widget/xpwidgets
>/nsIdleService.cpp#113). If the underlying timer service is calling back too 
>quickly this would cause the daily event to re-run. Adding a log call to 
>nsIdleServiceDaily::DailyCallback() would help us spot this.

DailyCallback only calls AddIdleObserver, which would make a log entry. In the strange log that is only shown once. Of course absence of evidence is not evidence of absence here.
I see several things wrong here:

I/IdleService(15997): Storing last idle time as 1341212144 (Mon, 02 Jul 2012 06:55:44 GMT)
I/IdleService(16793): DailyCallback started
I/IdleService(16793): Storing last idle time as 1341213045 (Mon, 02 Jul 2012 07:10:45 GMT)
I/IdleService(16878): DailyCallback started
I/IdleService(16878): Storing last idle time as 1341213529 (Mon, 02 Jul 2012 07:18:49 GMT)

So within the space of 25 minutes, we fired the DailyCallback twice, which shouldn't be possible.

I/IdleService(15997): next timeout 1341210519361288 usec (-1625476 msec from now)
I/IdleService(15997): SetTimerExpiryIfBefore: next timeout 1341210519361288 usec
I/IdleService(15997): next timeout 5636177514361288 usec (4294965369508 msec from now)
I/IdleService(15997): SetTimerExpiryIfBefore: next timeout 5636177514361288 usec

Note that the android log uses %lld, so the second value is really that big and not negative. Something in the code is overflowing fatally.

Additionally, I found a logic error in the IdleService::Init function:
We check if more than a day has passed since the last check. If so, we call DailyCallback. If not, we set the timer a day from now. This means that it can take up to *two* days before that timer refires.

Unfortunately I don't have real STR. But the above will give some clues where to start looking.
qawanted removed; tried to find STR but could not.  It seems like there is something more than just idling device...
Keywords: qawanted
I found STR for the wrong behavior. Not the behavior that can explain the rapid-fire of idles, but a very normal usage scenario that will cause idle-daily to be triggered multiple times per day.

1) Launch Fennec
2) Browse a webpage
3) Switch away from Fennec (putting it into the background)
4) Wait for 5 minutes so idle-daily fires
5) Force Fennec to quit via task manager
6) Repeat

Each iteration will fire idle-daily. Note that step 5 will happen naturally as other Android apps start using RAM.

I'm guessing the Preferences aren't hitting disk before Fennec is killed.
What is happening is the following: the moment the user switches away from Fennec, we get ACTIVITY_PAUSING followed by ACTIVITY_STOPPING.

https://mxr.mozilla.org/mozilla-central/source/widget/android/nsAppShell.cpp#418

This forces a SavePrefFile call at that point. There are no further message when the activity gets destructed, though I do see a COMPOSITOR_PAUSE just before we get killed.

The problem is now that the idle-daily call will fire after 5 minutes, while the app is back-grounded. idle-daily will then store its updated last-run time in preferences, but at this point there will be no more SavePrefFile. As a result, the updated preference is lost if the activity dies with the user foregrounding it again.

The most simple solution here seems to modify nsIdleServiceAndroid to force a SavePrefFile call after the idle-daily has run, though a case could be made to just do it in the generic code, too (it would avoid repeated idle-daily calls if Firefox crashes, for example).
Be careful calling savePrefFile as it will block main thread IO: bug 683808
I think we can live with that given that it'll trigger once per day after the application has been idle for 5 minutes.
This adds the save in the generic path, I think this will help in situations where desktop Firefox crashes as well.
Attachment #638988 - Flags: review?(blassey.bugs)
>4) timer service calls back idle-daily long before the requested day has passed
>
>5) idle-daily timer handler registers idle listener waiting for 300 sec.
>
>6) idle service calls back handler listener immediately, because we've already 
>been idle more than 300 sec.

I believe you're exactly right here. I was looking at logic errors in the code because the logs looked so strange, but I couldn't pinpoint any. (The logic is quite...complicated in places, but it looks correct)

I then modified the code to artificially let the "one day" timer fire too quickly, and lo and behold, I get exactly the bad behavior *and* the same log as above.

nsITimer doesn't work correctly on Android...sometimes.

To prevent this loop, perhaps we can do the same as the Idle Timer is doing, and verify that the passed time is actually what we set the timer for, then refire if it returned to soon.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #43)
...
> To prevent this loop, perhaps we can do the same as the Idle Timer is doing,
> and verify that the passed time is actually what we set the timer for, then
> refire if it returned to soon.

I was thinking about taking that approach too. The other way would be to use a recurring nsITimer for the daily trigger, and then schedule a one-shot idle timer each time we get the daily event from the recurring timer. We might still want an extra check to make sure the recurring timer really is only calling us once a day.
Quick heads up guys - looks like you guys have this on lockdown, but we should be targeting Aurora 15 this week, for uplift to Beta 14 prior to our final beta go-to-build Monday (7/9).
(In reply to Gian-Carlo Pascutto (:gcp) from comment #36)
...
> I/IdleService(15997): next timeout 1341210519361288 usec (-1625476 msec from
> now)
> I/IdleService(15997): SetTimerExpiryIfBefore: next timeout 1341210519361288
> usec
> I/IdleService(15997): next timeout 5636177514361288 usec (4294965369508 msec
> from now)
> I/IdleService(15997): SetTimerExpiryIfBefore: next timeout 5636177514361288
> usec
> 
> Note that the android log uses %lld, so the second value is really that big
> and not negative. Something in the code is overflowing fatally.

The overflow is because we've somehow failed to correctly keep track of (or handle) the next timeout event - it shouldn't be in the past, as the "-1625427 msec from now" line shows.

Do we get idle events while the device is asleep? If not, that could be a factor.

The patch looks good, I'm doing a bit of testing on it now, but I'm nowhere near a peer for this module so I'll give feedback but we should find a peer for the r?.
>The overflow is because we've somehow failed to correctly keep track of (or 
>handle) the next timeout event - it shouldn't be in the past, as the "-1625427 
>msec from now" line shows.

This will happen if the idle timeout has already been exceeded and the daily-timer fires. It will check when the next timeout (300s idle) should be, see that it's in the past (because we've been idle longer), and get that negative value. The large one is not an overflow - I assumed so too, but it's really mean to be INT_MAX - it just means that there is no "next" time to fire, because all observes are in Idle already.

>Do we get idle events while the device is asleep? If not, that could be a factor.

At least on my test device idle-daily fires even if it's turned "off".
Attachment #639402 - Flags: review?(roc)
Attachment #639402 - Flags: review?(irving)
Attachment #639402 - Flags: feedback?(irving)
Just a thought: are there any signed vs. unsigned compiler warnings in the Android build of this code? It builds without warnings on Mac OS X.
Attachment #638988 - Flags: review?(blassey.bugs) → review+
>Just a thought: are there any signed vs. unsigned compiler warnings in the Android 
>build of this code?

IIRC only in the logging code. As noted in comment 48 the negative/large values in the log look "correct" to me.
Comment on attachment 638988 [details] [diff] [review]
Patch 1. Force saving perferences after updating idle time

gcp, please request aurora and beta approval for both of these patches
Comment on attachment 638988 [details] [diff] [review]
Patch 1. Force saving perferences after updating idle time

[Approval Request Comment]
User impact if declined: Required to re-enable telemetry on Fennec.
Testing completed (on m-c, etc.): Landed on m-c a day ago.
Attachment #638988 - Flags: approval-mozilla-beta?
Attachment #638988 - Flags: approval-mozilla-aurora?
Comment on attachment 639402 [details] [diff] [review]
Patch 2. Restart daily timers that return early.

[Approval Request Comment]
User impact if declined: Required to re-enable Fennec telemetry.
Testing completed (on m-c, etc.): Landed on m-c day ago.
Risk to taking this patch (and alternatives if risky): Daily timers not firing correctly (all platforms) if there's a bug.
Attachment #639402 - Flags: approval-mozilla-beta?
Attachment #639402 - Flags: approval-mozilla-aurora?
Comment on attachment 638988 [details] [diff] [review]
Patch 1. Force saving perferences after updating idle time

[Triage Comment]
.N+ blocker, so let's land on all branches in preparation for our final beta build Monday.

Please advise how QA can verify this fix.
Attachment #638988 - Flags: approval-mozilla-beta?
Attachment #638988 - Flags: approval-mozilla-beta+
Attachment #638988 - Flags: approval-mozilla-aurora?
Attachment #638988 - Flags: approval-mozilla-aurora+
Attachment #639402 - Flags: approval-mozilla-beta?
Attachment #639402 - Flags: approval-mozilla-beta+
Attachment #639402 - Flags: approval-mozilla-aurora?
Attachment #639402 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/e860b5fe213b
https://hg.mozilla.org/releases/mozilla-aurora/rev/e10bd54565c8

https://hg.mozilla.org/releases/mozilla-beta/rev/ad82f6e15f40
https://hg.mozilla.org/releases/mozilla-beta/rev/93e99923951b

QA: verify that in the Android log, under the IdleService tag, the following message pops up exactly once per day: "DailyCallback registering Idle observer", after Firefox is idle for 5 minutes (but still in memory of the phone). You might need to visit at least one webpage for this to trigger.

The steps described in comment 38 should not cause it to be printed more than once per day.
Just confirming here, that the only message with the IdleService tag I have seen after delay is: 'Setting timer a day from now'.
This means that the service has already run once today. You should see the message "DailyCallback registering Idle observer" when you use that phone again after 24hrs. (Or on a phone you didn't use for at least a day)
Target Milestone: --- → Firefox 16
Comment on attachment 639402 [details] [diff] [review]
Patch 2. Restart daily timers that return early.

Review of attachment 639402 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, was on vacation. Glad that this is already in the tree.
Attachment #639402 - Flags: feedback?(irving) → feedback+
Blocks: 792801
Blocks: 834327
Depends on: 1085554
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.