Closed Bug 1051083 Opened 7 years ago Closed 6 years ago

[OTA update] Automatic OTA update not working

Categories

(Firefox OS Graveyard :: Gaia::System, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(tracking-b2g:backlog, b2g-v2.0 affected, b2g-v2.1 affected)

RESOLVED WONTFIX
2.1 S9 (21Nov)
tracking-b2g backlog
Tracking Status
b2g-v2.0 --- affected
b2g-v2.1 --- affected

People

(Reporter: anshulj, Assigned: gsvelto)

References

()

Details

(Keywords: foxfood, Whiteboard: [caf priority: p2][CR 725351][systemsfe])

Attachments

(6 files, 4 obsolete files)

The OTA update setting of Daily/weekly/monthly doesn't seem to be working.

STR:
1. Go to the settings app -> Device Information and select daily under Check for updates section.
2. Enable the same settings on multiple phones and leave the phones untouched for more than a day.

Observed: None of the phones showed that there is an OTA update available. Worse yet, none of the phones even attempted to try to look for an update as observed in the adb logs.

Expected: With check for updates setting set to daily, the devices should check for an OTA update once a day.
I have tried this experiment several times over few weeks and was never able to get an update automatically. On clicking "Check now" option to check for updates, all the devices always find an update. The devices also find OTA update on bootup.
I'm seeing that also on our 2.1 nightlies.
I a seeing this issue on 2.0 as well.
[Blocking Requested - why for this release]:

This was a feature introduced in 1.01, so this is definitely a regression. It's a bad one at that, since it involves periodic system update checks not working at all. I'm adding qawanted to do additional branch checks.
blocking-b2g: --- → 2.0?
Keywords: qawanted, regression
blocking-b2g: 2.0? → 2.0+
NI :gerard/dhylands to get help with investigation here...
Flags: needinfo?(lissyx+mozillians)
Flags: needinfo?(dhylands)
This is not code I touched, and there is no log to investigate.
Flags: needinfo?(lissyx+mozillians)
I don't have time to look into this right now.

I suspect that the update server URL is incorrect and the phone is looking in the wrong place.

Having some logcat would be useful.

Perhaps something like:

> adb logcat -v threadtime | grep AUS > log.txt

If you do a Check Now does that work?

If Check Now doesn't work - then daily updates are definitely not going to work.

If Check Now works, then we'd need to look at the update.xml file being served up and see how the version information in there compares to what's in the phone doing the update requests.
Flags: needinfo?(dhylands)
`Check now works`, the once per day timer doesn't (at least on nighlty). So it's not an issue with the update url. The forced check on startup works too iirc.
So then I think the next most important thing to do is to get a logcat with timestamps (adb -v threadtime)

I wonder if this is related to the RTC not working problems?
(In reply to Dave Hylands [:dhylands] from comment #9)
>
> I wonder if this is related to the RTC not working problems?

RTC works well on Anshul's build
Note that I only tested on Flame, and I have no user visible date/time issue with the device.

Can QA set up one for a long lasting test run?
QA Contact: jmercado
I was able to reproduce this issue on the Flame 2.0 and Buri 2.1 nightly builds.  The phone did not find an update during the daily check.

2.0 Flame
Environmental Variables:
Device: Flame 2.0
BuildID: 20140813000201
Gaia: cade2fdbb2230670788dcf2fc7b100f4a37b6458
Gecko: a7c673dae1ed
Version: 32.0 (2.0) 
Firmware Version: v123
User Agent: Mozilla/5.0 (Mobile; rv:32.0) Gecko/32.0 Firefox/32.0

2.1 Buri
Environmental Variables:
Device: Buri Master
BuildID: 20140813040202
Gaia: 9f35fca9d818b26c06aa6b7e5c0bef25886f8f20
Gecko: 7fc96293ada8
Version: 34.0a1 (Master) 
Firmware Version: v1.2device.cfg
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0

I was unable to reproduce this issue on the 2.1 Flame nightly build.  The phone did find an update.

2.1 Flame
Environmental Variables:
Device: Flame Master
BuildID: 20140813040202
Gaia: 9f35fca9d818b26c06aa6b7e5c0bef25886f8f20
Gecko: 7fc96293ada8
Version: 34.0a1 (Master) 
Firmware Version: v123
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(jmitchell)
Keywords: qawanted
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmitchell)
This will need to be covered by a new test case.
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(rmitchell)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage?][2.0-signoff-need]
QA Whiteboard: [QAnalyst-Triage?][2.0-signoff-need] → [QAnalyst-Triage?][2.0-signoff-need+]
Alex, can you take a look?
Flags: needinfo?(lissyx+mozillians)
Sure. Would it be the bug that Natalia was having and we were not able to properly isolate for so long?
Flags: needinfo?(lissyx+mozillians)
(In reply to Alexandre LISSY :gerard-majax from comment #15)
> Sure. Would it be the bug that Natalia was having and we were not able to
> properly isolate for so long?

Okay, according to the STR, it's exactly this.
Assignee: nobody → lissyx+mozillians
Anshul, can we get your logs ? I'm going to have a look at the Gecko code and try to see how much logging we could do more.
Flags: needinfo?(anshulj)
Whiteboard: [systemsfe]
(In reply to Alexandre LISSY :gerard-majax from comment #17)
> Anshul, can we get your logs ? I'm going to have a look at the Gecko code
> and try to see how much logging we could do more.
Alexander, the logs really don't provide much info. As I said the nsUpdaterService doesn't even attempt to do an OTA. Everytime nsUpdaterService attempts to find an OTA, there are a bunch of messages it prints and I don't see any even if let the phone sit for two days or more.
Flags: needinfo?(anshulj)
Yes, the first thing to do is probably to set up the update timer to a smaller interval and check if it triggers or why it doesn't.
Attached patch updates.diff (obsolete) — Splinter Review
Anshul, this patch should help determining whether the timers are properly setup. Please spy |logcat -v threadtime| after applying it. It should not take more than 5 to 10 minutes to have the proper informations in the logcat.

If you can run it on your current Gaia profile, do a second run on a Gaia profile after doing a wipe, and a third run after changing |app.update.interval| in the default settings and doing a wipe.

Changing the default setting value means modifying the |app.update.interval| value in |build/config/common-settings.json|, and then issuing a reset-gaia.
Flags: needinfo?(anshulj)
This patch can also help by exposing finer update interval.
Attached file updates.log
Log of update with both attachment 8474158 [details] [diff] [review] and attachment 8474240 [details] [diff] [review] applied, update interval set to 5 minutes from the Settings app, and disabling screen timeout. Phone was idling and connected to USB.

This is on a Flame, with master build from yesterday.
After more testing, it seems the check is never triggered when the screen is turned off.
Device with the screen locked, there is no periodic checking (was configured at 5 min), but as soon as I unlock the device, I see a check performed in logcat.
(In reply to Alexandre LISSY :gerard-majax from comment #24)
> Device with the screen locked, there is no periodic checking (was configured
> at 5 min), but as soon as I unlock the device, I see a check performed in
> logcat.

Interesting... but I never get the check when set to once a day, even if my phone is turned on and connected to the network several times per day.
(In reply to Fabrice Desré [:fabrice] from comment #25)
> (In reply to Alexandre LISSY :gerard-majax from comment #24)
> > Device with the screen locked, there is no periodic checking (was configured
> > at 5 min), but as soon as I unlock the device, I see a check performed in
> > logcat.
> 
> Interesting... but I never get the check when set to once a day, even if my
> phone is turned on and connected to the network several times per day.

I've found some more interesting things:

> $ grep 'Checker:checkForUpdates' updates2_screentimeout-1m_updatesinterval_1m.log 
> 08-18 14:31:22.500   288   288 I Gecko   : *** AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml?force=1
> 08-18 14:31:22.500   288   288 E GeckoConsole: AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml?force=1
> 08-18 14:31:27.730   288   288 I Gecko   : *** AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml?force=1
> 08-18 14:31:27.730   288   288 E GeckoConsole: AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml?force=1
> 08-18 14:40:12.471   288   288 I Gecko   : *** AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml
> 08-18 14:40:12.471   288   288 E GeckoConsole: AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml
> 08-18 14:42:12.601   288   288 I Gecko   : *** AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml
> 08-18 14:42:12.601   288   288 E GeckoConsole: AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml
> 08-18 14:44:12.731   288   288 I Gecko   : *** AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml
> 08-18 14:44:12.731   288   288 E GeckoConsole: AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml
> 08-18 14:46:12.941   288   288 I Gecko   : *** AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml
> 08-18 14:46:12.951   288   288 E GeckoConsole: AUS:SVC Checker:checkForUpdates - sending request to: http://localhost/update.xml

This was with:
 - screen timeout configured on 1m
 - update interval configured on 1m

The first checks are after reboot. Then we can see checks every 2 minutes instead of every one minute. I'm going to continue to dig to understand.
Attached patch updates-gecko.diff (obsolete) — Splinter Review
Anshul and fabrice, since you seems to be reproducing the issue, can you check with this new gecko patch? it should make the output more readable.
Attachment #8474158 - Attachment is obsolete: true
Flags: needinfo?(fabrice)
So far I cannot see any problem when playing with intervals, whatever the values:
 - low values shows a proper timer registration, and triggering even when device has its screen powered off
 - higher values (1d) shows proper timer registration, I haven't had time yet to wait for the completion.
If I turn screen off and unplug the device, I don't see:
 - any timer activity
 - any update checking activity

This is after leaving the device with update interval configured at 15 min, screen off, unplugged, for more than one hour.

When plugging it back and turning on screen, there is no update checking attempt performed.
And 15 minutes after plugging and turning on the screen, TimerManager does its job and I see an update checking attempt. So I fear we would need the device to be fully alive for the period of the update interval checking, but this is far from being totally clear to me.

Anshul, I really need more informations on the way you reproduce this. Are the devices left unattented, still plugged with USB ? Can you check the same way I did with shorter intervals?
(In reply to Alexandre LISSY :gerard-majax from comment #29)
> If I turn screen off and unplug the device, I don't see:
>  - any timer activity
>  - any update checking activity

Note that the timer service is affected by power events so the issue might be lurking somewhere there:

http://hg.mozilla.org/mozilla-central/file/bf27e27c994d/xpcom/threads/TimerThread.cpp#l502
(In reply to Alexandre LISSY :gerard-majax from comment #30)
> And 15 minutes after plugging and turning on the screen, TimerManager does
> its job and I see an update checking attempt. So I fear we would need the
> device to be fully alive for the period of the update interval checking, but
> this is far from being totally clear to me.
> 
> Anshul, I really need more informations on the way you reproduce this. Are
> the devices left unattented, still plugged with USB ? Can you check the same
> way I did with shorter intervals?
Alexandre, I left the device plugged in the USB with adb logging running so I can check if the device is at least attempting an OTA. I did try to reduce the timer time and still see no attempt. Could you please also provide a patch to reduce the timer from 1 day to 1 minute.
Flags: needinfo?(anshulj)
(In reply to Anshul from comment #32)
> Alexandre, I left the device plugged in the USB with adb logging running so
> I can check if the device is at least attempting an OTA. I did try to reduce
> the timer time and still see no attempt. Could you please also provide a
> patch to reduce the timer from 1 day to 1 minute.

So that would NOT be what I'm seeing. Besides, I already attached those patches as of comment 21, comment 22 and comment 27.

Can you get me more informations on the builds you are using ? So far it seems to be okay on my side.

What's your build target, .userconfig and any other environment variables that can impact ?
Flags: needinfo?(anshulj)
The 1m update interval being done every two minutes I was commenting on in comment 26 is irrelevant: TimerManager has a timer minimum delay of 120 secs.
Attached patch updates-gecko.diff (obsolete) — Splinter Review
Anshul, I've added more debug in this new patch. Especially, I could see on my Flame that the very early boot has a strange system date (01-01 06:42:00 for example) and that Gecko gets started before the proper date is being set.

With this in mind, I added those logs in the early triggering of TimerManager, and I could get funny output:
> 01-01 01:16:17.610   291   291 I Gecko   : *** UTM:SVC _ensureTimer(120000): this._timer.initWithCallback, this.lastTimerReset=977612
> 01-01 01:16:17.610   291   291 E GeckoConsole: UTM:SVC _ensureTimer(120000): this._timer.initWithCallback, this.lastTimerReset=977612

this.lastTimerReset is being set by a |Date.now()| call. I'm wondering what kind of funny bugs suchs big leap may trigger.
Attachment #8474521 - Attachment is obsolete: true
(In reply to Alexandre LISSY :gerard-majax from comment #35)
> this.lastTimerReset is being set by a |Date.now()| call. I'm wondering what
> kind of funny bugs suchs big leap may trigger.

Ugh, Date.now() is not monotonic so it cannot be used for such comparisons.

You might want to try the data gathered from nsIAppStartup.getStartupInfo() instead. In bug 793735 I made it so that it recalculates the process creation time based on the current value of PR_Now() (which is used to implement Date.now()). Basically whenever you call nsIAppStartup.getStartupInfo() if the time base for Date.now() has changed you'll get a different value of the process creation time (|process| field) adjusted to the new "present" time. In a nutshell Date.now() will be monotonic WRT to that field because the time elapsed from the process creation time is computed using a monotonic timer.

You don't have to cache the result of nsIAppStartup.getStartupInfo() for this to work though, the trick works because the values it returns are recomputed at the time you call it.
Test case added to moztrap:

https://moztrap.mozilla.org/manage/case/14376/
QA Whiteboard: [QAnalyst-Triage?][2.0-signoff-need+] → [QAnalyst-Triage+][2.0-signoff-need+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(rmitchell)
Flags: in-moztrap+
I was a bit tired yesterday night so I'm not sure if I made much sense in comment 36. What I meant is: if you need a monotonic timer since startup (which I think you need) use something like:

function uptime() {
  return Date.now() - Services.startup.getStartupInfo().process;
}

That's guaranteed to be monotonic even in the face of adjustments to the system clock. This will probably need some extra adjustments since we're storing the last update time into a pref and that's an absolute value.
Thanks gabriele, I'm investigating and seeing funny behavior with the updateNextDelay() calls where we are making use of negative values :).
QA Whiteboard: [QAnalyst-Triage+][2.0-signoff-need+] → [2.0-signoff-need+]
At least I'm seeing:
> 08-20 08:17:52.070   305   305 E GeckoConsole: [JavaScript Error: "assignment to undeclared variable lastUpdateTime" {file: "jar:file:///system/b2g/omni.ja!/components/nsUpdateTimerManager.js" line: 270}]

Line 270 is line 239 in the original source code.
Depends on: 1056034
(In reply to Alexandre LISSY :gerard-majax from comment #41)
> At least I'm seeing:
> > 08-20 08:17:52.070   305   305 E GeckoConsole: [JavaScript Error: "assignment to undeclared variable lastUpdateTime" {file: "jar:file:///system/b2g/omni.ja!/components/nsUpdateTimerManager.js" line: 270}]
> 
> Line 270 is line 239 in the original source code.

Tracked in bug 1056034
Attached patch updates-gecko.diff (obsolete) — Splinter Review
Logging more things and changing let to var, just in case.
Attachment #8474684 - Attachment is obsolete: true
Sorry Alexandre, I am not able to find time to try your patch. I will get back to you early next week.
Alex, do you still need something from me or do you think that the fix in bug 1056034 will be enough?
Flags: needinfo?(fabrice)
(In reply to Fabrice Desré [:fabrice] from comment #45)
> Alex, do you still need something from me or do you think that the fix in
> bug 1056034 will be enough?

Nope, for now it should be good :)
Whether I try with an engineering build or a userdebug build, I still have updates properly checked every 5 minutes as asked.
I did a try over a long period:
 - device booted at ~17:20
 - update interval configured at 1h
 - device unplugged and left unattented (in my pocket)

A first check at 18:45 shows no update check being performed, despite one should have been done.
A second check at 21:30 shows no update check being performed. At this point, reported uptime was:
> $ uptime                                                         
up time: 04:08:31, idle time: 00:09:25, sleep time: 04:03:08

So we should have had at least 2 checks for sure.

Connecting data (with no credit over it, so just the APN gets up but then no trafic can go through) triggers instantly an update check.
Was wifi connected for comment 48 ?

Checking for updates only occurs when the device is online.
(In reply to Dave Hylands [:dhylands] from comment #49)
> Was wifi connected for comment 48 ?
> 
> Checking for updates only occurs when the device is online.

I could see triggering of the checks when WiFi was enabled but connected to no network. This was of course failing because of being offline, but still, the "Checker:checkForUpdates" lines and their "AUS" friends were present.
Anshul -- can you please try this out to see if this is still an issue.
Target Milestone: --- → 2.1 S3 (29aug)
Switching to Gabriele, since I'm off for this week.
Assignee: lissyx+mozillians → gsvelto
I've toyed with this today and for now I'm seeing the same behavior as described by Alexandre: with the phone plugged in via USB and the WiFi connection active the timer triggers correctly and updates are checked. Leaving the phone unattended but still connected via USB did also cause updates to be checked correctly (apparently). Unplugging the phone caused the update checks to stop, not sure why but I think that's the key to figuring out what's going wrong. There's also another issue I noticed. With update checks set to 1h the actual interval was more like ~45 minutes.
I did some further investigation on this and I've noticed that update checks stop almost exclusively because of network connectivity going away (that happens normally when the phone is off and left idle for a while). Once network connectivity is restored we normally check for an update if the predefined interval has elapsed but I cannot confirm this happens reliably. One thing that seems to be off are the following update checks. Basically after network connectivity is restored the update timer doesn't seem to tick at the right times anymore and the checks are staggered at different times.
Some more testing tonight, I've left the phone with the screen turned off and unattended for the whole night with the update interval set to 1h. This morning when I turned it on it correctly checked for updates. Leaving it alone with the screen on and network active also made it check for updates correctly at 1h intervals.

This *seems* to be working correctly, the only thing that still confused me is the timer firing at somewhat unpredictable times but I've come to the conclusion that's also expected: we effectively use a single timer to cover multiple events staggered at different times and we keep adjusting its delay depending on the events we're waiting for. Frankly I'm not sure why this component was implemented that way instead of using multiple timers; it would have saved us a lot of convoluted code that makes figuring out how this works very complicated.

Anyway at this stage I'm convinced that bug 1056034 fixed the problem. Anshul I'll be waiting for your confirmation here before closing the bug.
striking regression-window wanted tag to prevent anyone sinking man-hours into it when this bug seems to be fixed. If it comes back that this bug does still repro please feel free to re-add the tag. Thanks!
(In reply to Gabriele Svelto [:gsvelto] from comment #55)
> Some more testing tonight, I've left the phone with the screen turned off
> and unattended for the whole night with the update interval set to 1h. This
> morning when I turned it on it correctly checked for updates. Leaving it
> alone with the screen on and network active also made it check for updates
> correctly at 1h intervals.
> 
> This *seems* to be working correctly, the only thing that still confused me
> is the timer firing at somewhat unpredictable times but I've come to the
> conclusion that's also expected: we effectively use a single timer to cover
> multiple events staggered at different times and we keep adjusting its delay
> depending on the events we're waiting for. Frankly I'm not sure why this
> component was implemented that way instead of using multiple timers; it
> would have saved us a lot of convoluted code that makes figuring out how
> this works very complicated.
> 
> Anyway at this stage I'm convinced that bug 1056034 fixed the problem.
> Anshul I'll be waiting for your confirmation here before closing the bug.
Gabriele, for some reason the ADB is getting disconnected over night when I left the phone unattended couple of times I tried. Could you please let me know how to reset the timer to less than 24 hrs so I can do more meaningful test.
Flags: needinfo?(anshulj)
(In reply to Anshul from comment #57)
> Gabriele, for some reason the ADB is getting disconnected over night when I
> left the phone unattended couple of times I tried.

You can disable the adb disable timer using the following script:

https://gist.github.com/dhylands/6253999

> Could you please let me
> know how to reset the timer to less than 24 hrs so I can do more meaningful
> test.

Alexandre has posted a gaia patch that exposes shorter durations in comment 21.
Anshul, do you have news?
Flags: needinfo?(anshulj)
The only way I get an update notification is if I test manually or if I reboot.... (flame, 2.0)
I've done some further side-by-side testing with my Keon and I don't seem to be getting any updates on the Flame but while I get them regularly on the Keon. What's strange is that if I leave the Flame alone for a while not only it doesn't check periodically for updates but it doesn't even check when I turn the internet connectivity on :-/
(In reply to Joshua Mitchell [:Joshua_M] from comment #56)
> striking regression-window wanted tag to prevent anyone sinking man-hours
> into it when this bug seems to be fixed. If it comes back that this bug does
> still repro please feel free to re-add the tag. Thanks!

This bug isn't fixed. We really need a window here, since this seems to be a harder bug to fix.
(In reply to Jason Smith [:jsmith] from comment #62)
> This bug isn't fixed. We really need a window here, since this seems to be a
> harder bug to fix.

Getting a regression window will be complicated to say the least as currently it seems that you need at least 24 hours to reproduce the bug. That being said I'm now running an experiment with my Flame and Hamachi set to the same configuration and some logging to see if the bug reproduces on both. I'm doing this because for some reason it seems to reproduce only on my Flame while my other devices do regularly get updates.
Oh yeah, that's true. Clearing the flag then.
Gabriele, I was thinking about this bug on my bicycle this morning, and I was wondering that maybe, for some reason, the clock may skew at some point on the Flame like it happens. We already checked that skew at startup was not a problem, but I don't think we asserted that improper time streching happening during the run itself would not have an impact. I'm thinking of this, because I still have a dogfooding Flame that, while it has proper network, sometimes looses its time.

I'm not sure, however, how we could check this.

The flow would be:
 - improper start time
 - starting of nsUpdateTimerManager
 - device time gets fixed
 - first 120s delay of nsUpdateTimerManager is not affected
 - long timers installed (1d)
 - clock skew and goes back in time of 2 days
Flags: needinfo?(gsvelto)
Hubert, you told me that you constantly reproduce this issue. So far, you are our only hope. could you please help us by applying the patches and document the behavior?

Thanks!
Flags: needinfo?(hub)
That mean this phone will no longer be a dogfooding phone. That's the rule.

Why can't we just roll the fix in and I can see if it fix by doing a manual OTA...
Flags: needinfo?(hub)
(In reply to Hubert Figuiere [:hub] from comment #67)
> That mean this phone will no longer be a dogfooding phone. That's the rule.
> 
> Why can't we just roll the fix in and I can see if it fix by doing a manual
> OTA...

There's no fix, just a logging patch. Forced update checks work, not the ones triggered by the daily timer.
I am sorry I am really tied up with 2.1 work and won't have time to test it this week.
Flags: needinfo?(anshulj)
I'll see what I can do. It is not as easy as it sounds.
Gecko patch doesn't apply cleanly :-(
Flags: needinfo?(lissyx+mozillians)
also will that work on a release build?
I set the interval check to 1 minute.


I/Gecko   (12312): *** UTM:SVC TimerManager:notify: value=@mozilla.org/b2g/webapps-update-timer;1,getService,background-update-timer,webapps.update.interval,86400
I/Gecko   (12312): *** UTM:SVC TimerManager:notify: prefInterval: webapps.update.interval
I/Gecko   (12312): *** UTM:SVC TimerManager:notify: defaultInterval: 86400
I/Gecko   (12312): *** UTM:SVC TimerManager:notify: interval: 86400
I/Gecko   (12312): *** UTM:SVC TimerManager:notify: prefLastUpdate: app.update.lastUpdateTime.background-update-timer
I/Gecko   (12312): *** UTM:SVC TimerManager:notify: lastUpdateTime: 1410329914
I/Gecko   (12312): *** UTM:SVC TimerManager:notify: now: 1410369215
I/Gecko   (12312): *** UTM:SVC TimerManager:notify: tryFire1 1410416314
I/Gecko   (12312): *** UTM:SVC TimerManager:notify:tryFire: [webapps.update.interval] IN
I/Gecko   (12312): *** UTM:SVC TimerManager:notify:tryFire: [webapps.update.interval] intendedTime=1410416314
I/Gecko   (12312): *** UTM:SVC TimerManager:notify:tryFire: [webapps.update.interval] now=1410369215
I/Gecko   (12312): *** UTM:SVC TimerManager:notify:tryFire: [webapps.update.interval] updateNextDelay(47099)
I/Gecko   (12312): *** UTM:SVC TimerManager:updateNextDelay(47099) nextDelay=null
I/Gecko   (12312): *** UTM:SVC TimerManager:updateNextDelay(47099) nextDelay updated.
I/Gecko   (12312): *** UTM:SVC TimerManager:notify: tryFire2 1410898827
I/Gecko   (12312): *** UTM:SVC TimerManager:notify:tryFire: [timerID:user-agent-updates-timer] IN
I/Gecko   (12312): *** UTM:SVC TimerManager:notify:tryFire: [timerID:user-agent-updates-timer] intendedTime=1410898827
I/Gecko   (12312): *** UTM:SVC TimerManager:notify:tryFire: [timerID:user-agent-updates-timer] now=1410369215
I/Gecko   (12312): *** UTM:SVC TimerManager:notify:tryFire: [timerID:user-agent-updates-timer] updateNextDelay(529612)
I/Gecko   (12312): *** UTM:SVC TimerManager:updateNextDelay(529612) nextDelay=47099
I/Gecko   (12312): *** UTM:SVC TimerManager: after callbackToFire: [null]: nextDelay=47099 ;; skippedFirings=false
I/Gecko   (12312): *** UTM:SVC TimerManager: after callbackToFire: [null]: timer.delay=47099000
(In reply to Hubert Figuiere [:hub] from comment #72)
> also will that work on a release build?

Yes.
Flags: needinfo?(lissyx+mozillians)
Updated Gecko patch.
Attachment #8475899 - Attachment is obsolete: true
So I've run my little experiment flashing the exact same version of Gecko and Gaia on a Keon and on a Flame and left them around for a couple of days with wifi on. The Keon regularly checked for updates and correctly reported that one was available at the correct time of day while the Flame didn't.

Since the issue seems to be Flame-specific I'm now fairly convinced that Flame's creatively jumping system clock has something to do with it. Now there's two solutions we can follow in this case.

The first one is to switch the update timer manager service to use a monotonic clock instead of Date.now(). Since this is JavaScript and we don't have access to the TimeStamp class we must use the approach I've outlined in comment 38. This will require some tweaks as the service is storing absolute timestamps in the prefs and those will need to be tweaked to use a relative interval instead.

The second approach would be to drop the updater manager service weird use of a single timer for all subscribers and use multiple timers instead. Since the timer service is already using a monotonic clock internally this should solve any problems caused by an unreliable system clock but would require significantly more changes to the code. This also begs the question, why weren't multiple timers used in the first place? The whole multiplex-multiple-intervals-over-a-single-timer approach in the update timer service looks incredibly convoluted to me.
Flags: needinfo?(gsvelto)
(In reply to Fabrice Desré [:fabrice] from comment #77)
> Gabriele, Components.utils.now() is also monotonic (see
> https://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/
> XPCComponents.cpp#3557).

I know, but it's been implemented after I had left my original comment :)

Anyway here's a tentative patch that drops Date.now() in favor of a base timestamp + monotonic timer approach. This *should* fix the issue if it's caused by the system clock jumping around but since it takes 24h to test we'll have to wait until tomorrow to be sure.
Target Milestone: 2.1 S3 (29aug) → 2.1 S4 (12sep)
Comment on attachment 8487979 [details] [diff] [review]
[PATCH WIP] Make the update timer manager use a monotonic clock to avoid problems with clock skew

Review of attachment 8487979 [details] [diff] [review]:
-----------------------------------------------------------------

Nice solution, I'm jealous not to have implemented this.
Hurray! With attachment 8487979 [details] [diff] [review] applied my Flame received its first update automatically. Time to polish the patch.
... but unfortunately it wasn't a system update. It was an app update :-/ I'll leave the device alone again and see if I can get a gecko+gaia update.
(In reply to Gabriele Svelto [:gsvelto] from comment #81)
> ... but unfortunately it wasn't a system update. It was an app update :-/
> I'll leave the device alone again and see if I can get a gecko+gaia update.

Can you share logcat? As far as I can say, detecting app updates is triggered by the very same mechanism.
Flags: needinfo?(gsvelto)
Attached file logcat output
This is the relevant bit of logcat. I think that this line means that at least the timer for app updates has triggered properly:

09-11 18:45:08.174   291   291 E GeckoConsole: UTM:SVC TimerManager:notify - notified @mozilla.org/b2g/webapps-update-timer;1

And this would be the regular update IIRC:

09-11 18:47:08.184   291   291 E GeckoConsole: UTM:SVC TimerManager:notify - notified timerID: user-agent-updates-timer

No update was proposed though but that could have been because there wasn't any (I flashed the nightly build and only altered the relevant JS file in the omni.ja archive).
Flags: needinfo?(gsvelto)
Attached file logcat.log
Here's the full logcat for today.
(In reply to Gabriele Svelto [:gsvelto] from comment #83)
> Created attachment 8488498 [details]
> logcat output
> 
> This is the relevant bit of logcat. I think that this line means that at
> least the timer for app updates has triggered properly:
> 
> 09-11 18:45:08.174   291   291 E GeckoConsole: UTM:SVC TimerManager:notify -
> notified @mozilla.org/b2g/webapps-update-timer;1

This one is the b2g specific apps only update timer, that fires once per day if the global update timer is not set to once per day. So it's a good sign that it was triggered.

> And this would be the regular update IIRC:
> 
> 09-11 18:47:08.184   291   291 E GeckoConsole: UTM:SVC TimerManager:notify -
> notified timerID: user-agent-updates-timer

No, this one is the timer to fetch the UA string overrides.
Gabriele, you can also look at https://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/nsUpdateTimerManager.js that manages all the long lasting timers. But if this one was failing you probably would not have any app update triggered, so...
(In reply to Fabrice Desré [:fabrice] from comment #86)
> Gabriele, you can also look at
> https://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/
> nsUpdateTimerManager.js that manages all the long lasting timers. But if
> this one was failing you probably would not have any app update triggered,
> so...

Yes, that's what I'm looking at. I've left both phones unattended during the weekend with my patch applied but wasn't prompted for an update on both... and then I realized that Performance.now() doesn't return an integer, it returns a floating point number and I don't know how that plays with the rest of the code. So I've reverted to my solution and I'm now monitoring the phones again. The Flame clock seem to have finally adjusted to the actual time but upon starting up it still goes back to the epoch causing bug 1021698 whenever I run an update.
Quick update: I'm doing more testing with my WIP and with a proper setup this time and this morning's result look promising but I'll have to wait until this afternoon to be sure everything worked well.
Target Milestone: 2.1 S4 (12sep) → 2.1 S5 (26sep)
Whiteboard: [systemsfe] → [CR 725351][systemsfe]
Whiteboard: [CR 725351][systemsfe] → [caf priority: p2][CR 725351][systemsfe]
I've got my tests wrong again, unfortunately finding out after 24h doesn't help. I'm setting up another test for tomorrow but I'll have to use my Flame for further development today so I might mess with the date again. If somebody else has a Flame available and lots of spare time feel free to test my patch here to see if it works for you.
And just FYI, why I was shy of installing a custom Gecko on a dogfooding phone. When trying to restore (shallow flash with the pvt tool) a version of Gecko + Gaia I ended up losing all my settings and such.

*sigh*
This bug is officially driving me nuts. I've left my phones alone for the weekend in order to run another long-term test and it seems that my patch isn't working at all. In fact when flashed on a non-Flame it seems to regress and prevent updates from being checked correctly.

I did some further tests back on the Flame w/o my patch applied and found out something weird that had nothing to do with timers: for some reason my Flame stubbornly checks for updates on the 'default' channel. Setting it to 'nightly' in the prefs or doing it by running ./edit-prefs.sh doesn't seem to change the outcome: the channel setting remains set to 'default'. Has anyone else experienced this?
One more day one more update: I verified that the underlying issue is not the clock jumping around - that would have been fixed by my patch - but rather the timer not firing at all. I haven't dug into the TimerThread/nsTimerImpl code yet but from all the data I've gathered I'm absolutely positive that yesterday the timer used by the nsUpdateTimerService was primed at 4PM for a 24h delay and by 6PM today it hadn't fired yet.

I'll try to dig further but I think there might be something very wrong going on not only with the regular system clock but also the realtime ones and I'm not sure if it's worth wasting more time trying to find a fix or workaround for something that's supposed to work on all modern hardware.
(In reply to Gabriele Svelto [:gsvelto] from comment #92)
> One more day one more update: I verified that the underlying issue is not
> the clock jumping around - that would have been fixed by my patch - but
> rather the timer not firing at all. I haven't dug into the
> TimerThread/nsTimerImpl code yet but from all the data I've gathered I'm
> absolutely positive that yesterday the timer used by the
> nsUpdateTimerService was primed at 4PM for a 24h delay and by 6PM today it
> hadn't fired yet.
> 
> I'll try to dig further but I think there might be something very wrong
> going on not only with the regular system clock but also the realtime ones
> and I'm not sure if it's worth wasting more time trying to find a fix or
> workaround for something that's supposed to work on all modern hardware.
Hi :gsvelto,

Is there an update here, we've blocked 2.0+ on this and running out of time..I see you've spent quite a bit of time debugging this, but unclear what's the status given the last few comments..can you help?
Flags: needinfo?(gsvelto)
Do you think it will help if QA gave your WIP patch a shot or anyother way we could help move this along?
(In reply to bhavana bajaj [:bajaj] from comment #93)
> Is there an update here, we've blocked 2.0+ on this and running out of
> time..I see you've spent quite a bit of time debugging this, but unclear
> what's the status given the last few comments..can you help?

I haven't had time to come back to this recently. I've waited for the Flame KK build to become commonly available before re-testing and it seems that the system clock is as broken as it was in the JB one. From the symptoms I encountered in this bug there's no real way of fixing it in gecko: timers that need to fire on long-term intervals seem almost completely unreliable on this device.
Flags: needinfo?(gsvelto)
Adding qawanted to give this a final shot if its reproducible, else we will blocking -
Keywords: qawanted
Please try it on latest 2.0 and 2.1 with KK
Target Milestone: 2.1 S5 (26sep) → 2.1 S6 (10oct)
Just another data point: I've tested a device with the latest flame-kk master build on it and it exhibits the exact same behavior. Alexandre pointed out that this SoC doesn't seem to have an RTC - or the kernel doesn't recognize it - you can find this line in the dmesg output:

<6>[22432.269171] alarm_set_rtc: Failed to set RTC, time will be lost on reboot
This issue still occurs on both 2.2 Flame KK and 2.1 Flame KK.

Results:  The device did not find the new update (built at 4pm) automatically.

Environmental Variables:
Device: Flame 2.2
BuildID: 20141006040204
Gaia: 470826d13ae130a5c3d572d1029e595105485fb0
Gecko: e0d714f43edc
Version: 35.0a1 (2.2) 
Firmware Version: L1TC10011800
User Agent: Mozilla/5.0 (Mobile; rv:35.0) Gecko/35.0 Firefox/35.0

Environmental Variables:
Device: Flame 2.1
Build ID: 20141006000205
Version: 34.0a2 (2.1)
Gaia:  778ebac47554e1c4b7e9a952d73e850f58123914
Gecko: c4a4b04c617c
Version: 34.0a2 (2.1)
Firmware Version: L1TC10011800
User Agent: Mozilla/5.0 (Mobile; rv:35.0) Gecko/35.0 Firefox/35.0
QA Whiteboard: [2.0-signoff-need+] → [QAnalyst-Triage?][2.0-signoff-need+]
Flags: needinfo?(jmitchell)
Keywords: qawanted
QA Whiteboard: [QAnalyst-Triage?][2.0-signoff-need+] → [QAnalyst-Triage+][2.0-signoff-need+]
Flags: needinfo?(jmitchell)
(In reply to Gabriele Svelto [:gsvelto] from comment #98)
> Just another data point: I've tested a device with the latest flame-kk
> master build on it and it exhibits the exact same behavior. Alexandre
> pointed out that this SoC doesn't seem to have an RTC - or the kernel
> doesn't recognize it - you can find this line in the dmesg output:
> 
> <6>[22432.269171] alarm_set_rtc: Failed to set RTC, time will be lost on
> reboot

On master you need an updated blob for b2g_time to maintain the RTC over reboots as a Gecko interface changed.  Fabrice might know about that.   If you see the RTC issue on v2.0 the stock Flame image (/system/b2g/distribution/ preserved), LMK as b2g_time should be fully functional there.
Target Milestone: 2.1 S6 (10oct) → 2.1 S7 (Oct24)
No longer blocks: CAF-v2.1-FC-metabug
:gsvelto,

Based on :m1's feedback and given QA reproducibility, does that help move along?
Flags: needinfo?(gsvelto)
(In reply to bhavana bajaj [:bajaj] from comment #101)
> Based on :m1's feedback and given QA reproducibility, does that help move
> along?

I'm testing on 2.0 now to see if this reproduces; I'll come back with feedback ASAP but remember that this takes at least 24h to test so it's not going to be quick.
Flags: needinfo?(gsvelto)
Quick update: I've tested a 2.0 build verifying that the b2g_time executable was the one from the base image as described in comment 100. On 2.0 I can't reproduce the issue, the updates are regularly checked which confirms IMHO that the problem is caused on master by problems with timekeeping. What I don't understand however is why long-term timers are not firing on master; if only the system clock was affected they should fire and the patch in attachment 8487979 [details] [diff] [review] should be enough to solve the issues. Instead not only the time jumps around more or less randomly on master but long term timers don't seem to fire at all (I haven't verified at which level this is happening, it seems to affect nsITimers but I'm not sure if it's because pthread_cond_timedwait() is not returning at the right time or something else).
Spoke offline with gabriele on this and he is unable to reproduce the issue on 2.0 with the 188 base image anymore. Can QA please give this a shot too with the latest base image and 2.0 build ? Thanks!

Gabriele will also help document his understanding on 2.1/master so far on why we are seeing this issue.
Flags: needinfo?(jmercado)
Keywords: qawanted
Target Milestone: 2.1 S7 (24Oct) → 2.1 S8 (7Nov)
I was able to reproduce this issue on 2.0 Flame KK.

Actual Results: No update was found automatically.

Environmental Variables:
Device: Flame 2.0
BuildID: 20141027000202
Gaia: 2183b4f3ec0eb47ab1f133c31732ec53b08ad253
Gecko: 43bee45176c4
Gonk: 6e51d9216901d39d192d9e6dd86a5e15b0641a89
Version: 32.0 (2.0)
Firmware: V188
User Agent: Mozilla/5.0 (Mobile; rv:32.0) Gecko/32.0 Firefox/32.0
QA Whiteboard: [QAnalyst-Triage+][2.0-signoff-need+] → [QAnalyst-Triage?][2.0-signoff-need+]
Flags: needinfo?(jmercado) → needinfo?(jmitchell)
Keywords: qawanted
i noticed that Daily update will send the notification, only if you reboot the device.
QA Whiteboard: [QAnalyst-Triage?][2.0-signoff-need+] → [QAnalyst-Triage+][2.0-signoff-need+]
Flags: needinfo?(jmitchell)
QA Contact: jmercado
Duplicate of this bug: 1090210
I've re-tested on master with a v188 base build and it still happens there. I've also re-tested on a couple of other devices (keon & hamachi) and OTAs work fine there so this is really Flame-specific and most likely *not* a gecko issue.

I might add that no matter how much I test this the results are always inconclusive. On my personal phone I seem to get regular updates and it's a v188 + master build too, yet on my other Flame which I've used explicitly for testing this reproduces.

Anyhow we *need* help from the vendor here, this bug and the other time-centric issues we're having (bug 1061797) happen only on this device which is a pretty strong indication that there's something amiss with the base firmware.
(In reply to Gabriele Svelto [:gsvelto] from comment #108)
> I've re-tested on master with a v188 base build and it still happens there.
> I've also re-tested on a couple of other devices (keon & hamachi) and OTAs
> work fine there so this is really Flame-specific and most likely *not* a
> gecko issue.
> 
> I might add that no matter how much I test this the results are always
> inconclusive. On my personal phone I seem to get regular updates and it's a
> v188 + master build too, yet on my other Flame which I've used explicitly
> for testing this reproduces.
> 
> Anyhow we *need* help from the vendor here, this bug and the other
> time-centric issues we're having (bug 1061797) happen only on this device
> which is a pretty strong indication that there's something amiss with the
> base firmware.

:mvines, can you help here or know of existing issues that can impact the timing on the OS?
Flags: needinfo?(mvines)
No longer blocks: CAF-v2.1-CC-metabug
Target Milestone: 2.1 S8 (7Nov) → 2.1 S9 (21Nov)
(In reply to bhavana bajaj [:bajaj] from comment #109)
> (In reply to Gabriele Svelto [:gsvelto] from comment #108)
> > I've re-tested on master with a v188 base build and it still happens there.
> > I've also re-tested on a couple of other devices (keon & hamachi) and OTAs
> > work fine there so this is really Flame-specific and most likely *not* a
> > gecko issue.
> > 
> > I might add that no matter how much I test this the results are always
> > inconclusive. On my personal phone I seem to get regular updates and it's a
> > v188 + master build too, yet on my other Flame which I've used explicitly
> > for testing this reproduces.
> > 
> > Anyhow we *need* help from the vendor here, this bug and the other
> > time-centric issues we're having (bug 1061797) happen only on this device
> > which is a pretty strong indication that there's something amiss with the
> > base firmware.
> 
> :mvines, can you help here or know of existing issues that can impact the
> timing on the OS?

This bug has been open for a really long time now. Michael, can you please respond?
(:cserran - this bug is on my radar still but I won't make the time to look at it for a while.  Given that we haven't been seeing daily, weekly or even monthly updates to commercial devices in practice to date, this probably shouldn't be v2.0+ at this point.)
I don't really see any point on blocking on this at this point, so removing it from the nom. Ofcourse, if we can get more help from folks to help investigate the root cause, we will move it along as new information comes across.
blocking-b2g: 2.0+ → -
Flags: needinfo?(mvines)
[Blocking Requested - why for this release]:
Issue happening on Aries too.
blocking-b2g: - → 2.5?
Keywords: foxfood
Some weird prefs on my device:

app.update.lastUpdateTime.background-update-timer: 1439574655 (1970-01-17T15:52:54.655Z)
app.update.lastUpdateTime.user-agent-updates-timer: 1439331650 (1970-01-17T15:48:51.650Z)

Some other prefs:
app.update.url: https://aus4.mozilla.org/update/3/%PRODUCT%/%VERSION%/%BUILD_ID%/%PRODUCT_DEVICE%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.xml
app.update.channel: dogfood



Following Alexandre's advice I reseted the 2 weird prefs above and I'll see if I get an update next week.
> app.update.lastUpdateTime.background-update-timer: 1439574655 (1970-01-17T15:52:54.655Z)
> app.update.lastUpdateTime.user-agent-updates-timer: 1439331650 (1970-01-17T15:48:51.650Z)

=> ok, I forgot to add 000 at the end to get ms. So here are the right values:

> app.update.lastUpdateTime.background-update-timer: 1439574655 (2015-08-14T17:50:55.000Z)
> app.update.lastUpdateTime.user-agent-updates-timer: 1439331650 (2015-08-11T22:20:50.000Z)

08/11 is likely when I installed an update lastly.
08/14 is likely when last update was issued (but not sure).
I think this should be a feature since it's never been done from v1.0. Wilfred, can you or someone make the call if we need it in v2.5?
Flags: needinfo?(wmathanaraj)
[Tracking Requested - why for this release]:

Since it looks like a new feature, we prefer put it back to backlog.
blocking-b2g: 2.5? → ---
Keywords: regression
Flags: needinfo?(wmathanaraj) → needinfo?(aaime)
From the UI perspective this is a feature that is already available as there is an option to select the frequency of the updates check (settings-->device information--> software updates-->check for updates daily, weekly, montly). In addition this issue would impact the number of FOTA updates since the user isn't notified about the availability of a new software but he has to manually check for it.

Therefore, based on the above I would recommend to keep it for 2.5.
Flags: needinfo?(aaime)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.