Closed Bug 806494 Opened 7 years ago Closed 7 years ago

The first sync after startup always fails

Categories

(Firefox :: Sync, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla20

People

(Reporter: kamil.paral, Assigned: rnewman)

References

Details

Attachments

(9 files, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
Build ID: 20121026113527

Steps to reproduce:

I use Sync to synchronize Firefox data between home computer and office computer (both Fedora 17 at the moment). On the home computer everything works perfect. But on the office computer, every time I start Firefox (after system reboot, resume, or long period of inactivity) the initial sync fails ("bad username or password" error). If I manually trigger another sync, it sometimes work, sometimes it still fails ("unknown reason" error). Third sync usually works correctly.

I should note that on the office computer I run two instances of Firefox separated by --profile and --no-remote command line options. When I start my "home" Firefox, the "company" Firefox is already running on that machine. That might be relevant.

This has been happening for many months, very regularly (I see it literally all the time).
1351534432702	Sync.Service	INFO	Logging in user vi4usibtxmx2cgoutp5piyzw2uqtlbod
1351534432702	Sync.Service	DEBUG	Caching URLs under storage user base: https://scl2-sync4.services.mozilla.com/1.1/vi4usibtxmx2cgoutp5piyzw2uqtlbod/
1351534445602	Sync.Resource	DEBUG	mesg: GET fail 401 https://scl2-sync4.services.mozilla.com/1.1/vi4usibtxmx2cgoutp5piyzw2uqtlbod/info/collections
1351534445602	Sync.Resource	DEBUG	GET fail 401 https://scl2-sync4.services.mozilla.com/1.1/vi4usibtxmx2cgoutp5piyzw2uqtlbod/info/collections
1351534445603	Sync.Service	WARN	401: login failed.
1351534445603	Sync.Service	DEBUG	Finding cluster for user vi4usibtxmx2cgoutp5piyzw2uqtlbod
1351534446587	Sync.Resource	DEBUG	mesg: GET success 200 https://auth.services.mozilla.com/user/1.0/vi4usibtxmx2cgoutp5piyzw2uqtlbod/node/weave
1351534446587	Sync.Resource	DEBUG	GET success 200 https://auth.services.mozilla.com/user/1.0/vi4usibtxmx2cgoutp5piyzw2uqtlbod/node/weave
1351534446587	Sync.Service	DEBUG	Cluster value = https://scl2-sync4.services.mozilla.com/
1351534446587	Sync.Status	DEBUG	Status.login: success.login => error.login.reason.account
1351534446587	Sync.Status	DEBUG	Status.service: success.status_ok => error.login.failed
1351534576224	Sync.Collection	WARN	Got exception calling onProgress handler during fetch of https://scl2-sync4.services.mozilla.com/1.1/vi4usibtxmx2cgoutp5piyzw2uqtlbod/storage/history?newer=1351526738.82&full=1
1351534576225	Sync.Collection	DEBUG	JSON.parse: unexpected character JS Stack trace: deserialize@record.js:65 < @record.js:607 < Channel_onDataAvail@resource.js:520
1351534576226	Sync.Status	DEBUG	Status for engine history: error.engine.reason.unknown_fail
1351534576226	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed_partial
1351534576226	Sync.ErrorHandler	DEBUG	history failed: NS_ERROR_XPC_JS_THREW_JS_OBJECT JS Stack trace: Res_get@resource.js:398 < SyncEngine__processIncoming@engines.js:873 < SyncEngine__sync@engines.js:1284 < WrappedNotify@util.js:142 < Engine_sync@engines.js:481 < WeaveSvc__syncEngine@service.js:1393 < @service.js:1286 < WrappedNotify@util.js:142 < WrappedLock@util.js:97 < _lockedSync@service.js:1186 < @service.js:1177 < WrappedCatch@util.js:71 < sync@service.js:1165
1351534795951	Sync.Collection	WARN	Got exception calling onProgress handler during fetch of https://scl2-sync4.services.mozilla.com/1.1/vi4usibtxmx2cgoutp5piyzw2uqtlbod/storage/tabs?full=1
1351534795952	Sync.Collection	DEBUG	JSON.parse: unexpected character JS Stack trace: deserialize@record.js:65 < @record.js:607 < Channel_onDataAvail@resource.js:520
1351534795952	Sync.Status	DEBUG	Status for engine tabs: error.engine.reason.unknown_fail
1351534795952	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed_partial
1351534795953	Sync.ErrorHandler	DEBUG	tabs failed: NS_ERROR_XPC_JS_THREW_JS_OBJECT JS Stack trace: Res_get@resource.js:398 < SyncEngine__processIncoming@engines.js:873 < SyncEngine__sync@engines.js:1284 < WrappedNotify@util.js:142 < Engine_sync@engines.js:481 < WeaveSvc__syncEngine@service.js:1393 < @service.js:1286 < WrappedNotify@util.js:142 < WrappedLock@util.js:97 < _lockedSync@service.js:1186 < @service.js:1177 < WrappedCatch@util.js:71 < sync@service.js:1165
1351534795953	Sync.Service	INFO	Sync completed at 2012-10-29 19:19:55 after 21.95 secs.
1351534795953	Sync.SyncScheduler	DEBUG	Next sync in 90000 ms.
1351534795953	Sync.ErrorHandler	DEBUG	Some engines did not sync correctly.
Trying to change Product and Component to Firefox Sync, I hope this is correct.
Component: Untriaged → Firefox Sync: Backend
Product: Firefox → Mozilla Services
Version: 16 Branch → unspecified
I experience similar problems on my Mac [806494]
https://bugzilla.mozilla.org/show_bug.cgi?id=806494
That should be https://bugzilla.mozilla.org/show_bug.cgi?id=807074

(cannot edit comment)
Lots of people are having this issue. We're working on it. There's some discussion in bug 805386.
Status: UNCONFIRMED → NEW
Depends on: 805693
Ever confirmed: true
OK, there are two bugs here. One is Bug 805693, and the other is Bug 794652.

I'm going to treat this as the latter.

Kamil, are you still seeing an issue with the current version of Firefox?
No longer depends on: 805693
Flags: needinfo?(kamil.paral)
Hello, Richard, yes I do. Firefox 17.0.1. It might be a bit less frequent, but I still see it very often.

Do you need a fresh set of logs?
Flags: needinfo?(kamil.paral)
(In reply to Kamil Páral from comment #11)
> Hello, Richard, yes I do. Firefox 17.0.1. It might be a bit less frequent,
> but I still see it very often.
> 
> Do you need a fresh set of logs?

Yes please!
Ironically I can't trigger it right now, but I attach last tree error logs I see in about:sync-log.
Excellent, thanks for the logs.

You seem to be being bitten by Bug 798430, which is supposed to be fixed in Firefox 17. Evidently that's not the case, so let's look at that.
Blocks: 794652
Priority: -- → P1
Kamil, would you be able to provide me with HTTP logs, as described here?

https://developer.mozilla.org/en-US/docs/HTTP_Logging

Also set

  services.sync.log.logger.network.resources

to "Trace" (including capital "T").

Enable HTTP logging, then restart Firefox and trigger a sync with Tools > Sync Now. If you get a yellow bar, great; please email me the new error log and your HTTP log.

You can then set that config value back to "Debug", and restart Firefox.

With any luck, this will provide enough information to figure out what's going on…

Thanks!
Flags: needinfo?(kamil.paral)
Duplicate of this bug: 808416
I've emailed you the logs, Richard. It seems to me that setting services.sync.log.logger.network.resources to Trace triggers the error much more often than Debug. I've hit it on the first attempt. But maybe it's just coincidence.
Flags: needinfo?(kamil.paral)
 09:28:21.098549 617957184[590]: Creating HttpBaseChannel @f3d8000
 09:28:21.098552 617957184[590]: Creating nsHttpChannel [this=7f2b0f3d8000]
 09:28:21.098558 617957184[590]: HttpBaseChannel::Init [this=7f2b0f3d8000]
 09:28:21.098563 617957184[590]: host=scl2-sync4.services.mozilla.com port=-1
 09:28:21.098566 617957184[590]: uri=https://scl2-sync4.services.mozilla.com/1.1/USERNAME/info/collections
 09:28:21.098568 617957184[590]: Creating nsHttpConnectionInfo @bab1d40
 09:28:21.098577 617957184[590]: nsHttpChannel::Init [this=7f2b0f3d8000]
 09:28:21.098583 617957184[590]: nsHttpChannel::SetupReplacementChannel [this=7f2b0efcb000 newChannel=7f2b0f3d8050 preserveMethod=1]
 09:28:21.098586 617957184[590]: HttpBaseChannel::SetupReplacementChannel [this=7f2b0efcb000 newChannel=7f2b0f3d8050 preserveMethod=1]
 09:28:21.098657 617957184[590]: HttpBaseChannel::SetApplyConversion [this=7f2b0f3d8000 value=1]
 09:28:21.098665 617957184[590]: nsHttpChannel::WaitForRedirectCallback [this=7f2b0efcb000]
 09:28:21.098672 617957184[590]: nsHttpChannel::CloseCacheEntry [this=7f2b095f8800] mStatus=804b001e mCacheAccess=2
 09:28:21.098675 617957184[590]:   dooming cache entry!!
 09:28:21.099034 617957184[590]: nsHttpChannel::OnRedirectVerifyCallback [this=7f2b0efcb000] result=0 stack=2 mWaitingForRedirectCallback=1
 09:28:21.099041 617957184[590]: nsHttpChannel::AsyncOpen [this=7f2b0f3d8000]
 09:28:21.099048 617957184[590]: Resolving host [scl2-sync4.services.mozilla.com].
 09:28:21.099070 617957184[590]: HttpBaseChannel::SetRequestHeader [this=7f2b0f3d8000 header="Cookie" value="" merge=0]
 09:28:21.099074 617957184[590]: nsHttpChannelAuthProvider::AddAuthorizationHeaders? [this=7f2b0b81f6e0 channel=7f2b0f3d82d0]
 09:28:21.099079 617957184[590]: nsHttpAuthCache::GetAuthEntryForPath [key=https://scl2-sync4.services.mozilla.com:-1 path=/1.1/USERNAME/info/]
 09:28:21.099084 617957184[590]: nsHttpHandler::NotifyObservers [chan=f3d8050 event="http-on-modify-request"]
 09:28:21.099156 617957184[590]: nsHttpChannel::Connect [this=7f2b0f3d8000]
 09:28:21.099165 617957184[590]: nsHttpConnectionMgr::SpeculativeConnect [ci=.S.scl2-sync4.services.mozilla.com:443]
 09:28:21.099170 617957184[590]: STS dispatch [7f2b0f20a6c0]
 09:28:21.099178 617957184[590]: nsHttpChannel::OpenCacheEntry [this=7f2b0f3d8000]
 09:28:21.099179 442496768[ae0]:     ...returned after 2 milliseconds
 09:28:21.099182 617957184[590]: OpenCacheEntry failed [rv=80040111]
 09:28:21.099186 617957184[590]: nsHttpChannel::SetupTransaction [this=7f2b0f3d8000]
 09:28:21.099186 442496768[ae0]: nsHttpConnectionMgr::OnMsgSpeculativeConnect [ci=.S.scl2-sync4.services.mozilla.com:443]
 09:28:21.099192 442496768[ae0]: Creating nsHttpConnectionInfo @f27ecc0
 09:28:21.099193 617957184[590]: Creating nsHttpTransaction @86c1650
 09:28:21.099197 617957184[590]: nsHttpTransaction::Init [this=86c1650 caps=9]
 09:28:21.099198 442496768[ae0]: nsHttpConnectionMgr::AtActiveConnectionLimit [ci=.S.scl2-sync4.services.mozilla.com:443 caps=1]
 09:28:21.099201 442496768[ae0]:    connection count = 0, limit 6
 09:28:21.099204 442496768[ae0]:   result: false
 09:28:21.099206 617957184[590]: http request [
 09:28:21.099207 442496768[ae0]: Creating nsHalfOpenSocket [this=7f2b0b8add30 trans=7f2b0b8670b0 ent=scl2-sync4.services.mozilla.com]
 09:28:21.099210 617957184[590]:   GET /1.1/USERNAME/info/collections HTTP/1.1
 09:28:21.099213 442496768[ae0]: creating nsSocketTransport @71a10a0
 09:28:21.099214 617957184[590]:   Host: scl2-sync4.services.mozilla.com
 09:28:21.099219 442496768[ae0]: nsSocketTransport::Init [this=71a10a0 host=scl2-sync4.services.mozilla.com:443 proxy=:0]
 09:28:21.099222 617957184[590]:   User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0
 09:28:21.099226 617957184[590]:   Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
 09:28:21.099227 442496768[ae0]: Reset callbacks for secinfo=0 callbacks=7f2b0b8add40
 09:28:21.099229 617957184[590]:   Accept-Language: cs,en-us;q=0.7,en;q=0.3
 09:28:21.099230 442496768[ae0]: nsSocketTransport::OpenOutputStream [this=71a10a0 flags=2]
 09:28:21.099232 617957184[590]:   Accept-Encoding: gzip, deflate
 09:28:21.099233 442496768[ae0]: nsSocketTransport::PostEvent [this=7f2b071a10a0 type=0 status=0 param=0]
 09:28:21.099235 617957184[590]:   Connection: keep-alive
 09:28:21.099237 442496768[ae0]: STS dispatch [7f2b0b1a9df0]
 09:28:21.099239 617957184[590]:   Pragma: no-cache
 09:28:21.099242 617957184[590]:   Cache-Control: no-cache


To me, this looks like an internal redirect occurs, then auth headers are set. Honza et al: does this (a) show any evidence of custom headers being set in the async redirect handler, or (b) appear to show necko setting auth and cookie headers regardless of what happens inside the redirect handler?
Flags: needinfo?(honzab.moz)
Added some logging:

https://hg.mozilla.org/integration/mozilla-inbound/rev/4b7e8f82a341
https://hg.mozilla.org/services/services-central/rev/5c5e1e93e3a9
Assignee: nobody → rnewman
Status: NEW → ASSIGNED
Whiteboard: [leave open]
(In reply to Richard Newman [:rnewman] from comment #20)
> To me, this looks like an internal redirect occurs, then auth headers are
> set. Honza et al: does this (a) show any evidence of custom headers being
> set in the async redirect handler, or (b) appear to show necko setting auth
> and cookie headers regardless of what happens inside the redirect handler?

Do you have a complete log?
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #22)

> Do you have a complete log?

Sent to your m.c address.
Duplicate of this bug: 734828
(In reply to Ryan VanderMeulen from comment #25)
> https://hg.mozilla.org/mozilla-central/rev/4b7e8f82a341

Thanks for doing this.  I'm going to contact Kamil directly and let him try this build.

I've modified locally one of our xpcshell necko tests to check whether copying "Authorization" header when we go the AsyncDoReplaceWithProxy path works.  Header is copied correctly.
Duplicate of this bug: 754729
I was unable to reproduce the problem using FF Nightly. Maybe it was fixed in the mean time? I provided Honza Bambas with the success log, maybe you can find some details there.
(In reply to Kamil Páral from comment #28)
> I was unable to reproduce the problem using FF Nightly. Maybe it was fixed
> in the mean time? I provided Honza Bambas with the success log, maybe you
> can find some details there.

After looking at the logs again I can say:

In the old (faulty) log I can see we call SetRequestHeader for "user-agent" and "authorization" only on the *old* channel.  But not on the *new* channel, so the code to copy headers is apparently faulty ("authorization" is not in the array of headers to copy?  Could case sensitivity be involved?  I was trying to understand the piece of code around headers but it was very complicated).

In the new (success) log I don't see any redirect, so hard to say.
(In reply to Honza Bambas (:mayhemer) from comment #30)

> In the old (faulty) log I can see we call SetRequestHeader for "user-agent"
> and "authorization" only on the *old* channel.  But not on the *new*
> channel, so the code to copy headers is apparently faulty ("authorization"
> is not in the array of headers to copy?  Could case sensitivity be involved?
> I was trying to understand the piece of code around headers but it was very
> complicated).

I *hope* case-sensitivity isn't involved -- we originally set the headers using the same strings that we then use for lookup when copying.

1. AsyncResource.setHeader writes into this._headers with the lowercased name.
2. AsyncResource._createRequest walks those headers (plus the ones from the authenticator), calling channel.setRequestHeader(key, value, false). It also creates and registers ChannelNotificationListener with the header names that it used.
3. When that listener is passed two channels, it copies the named headers from the old to the new.

We know that code gets to run in Kamil's case:

1355822901098	Sync.Resource	TRACE	Copying headers for safe internal redirect.

but until that last logging patch, we didn't get to see if individual headers were actually copied. Now that Kamil can't repro, we're out of luck.


But this is part of a broader issue: I have no way to test this code in the context of a redirect, because as far as I can tell there is no way from JS (or maybe even from C++ until Bug 765934 lands) to cause an internal redirect to occur. The only way we *were* able to do it -- setting a PAC file -- disappeared a few releases ago.

So this code is essentially untested, and will remain so until there's some way to test internal redirect handlers. :/

How likely is it that Necko will add a channel flag that forces an internal redirect to allow testing?
I have tried Nightly just a few times and the problem didn't occur. Unfortunately with FF 17 I don't see it consistently either, the error happened yesterday but not today. I'll try to use Nightly full-time for a week and I'll report back whether I'll see anything suspicious.
I can see two reasons this doesn't work:
- copy header code doesn't work
- auth cache is involved in clearing or overriding the header somehow, however it seems not to be likely according Kamil's faulty logs

(In reply to Richard Newman [:rnewman] from comment #31)
> But this is part of a broader issue: I have no way to test this code in the
> context of a redirect, because as far as I can tell there is no way from JS
> (or maybe even from C++ until Bug 765934 lands) to cause an internal
> redirect to occur. The only way we *were* able to do it -- setting a PAC
> file -- disappeared a few releases ago.

Not true.  test_proxy-failover_passing.js is an example how to invoke an internal redirect when proxy is failing.  It is using PAC.

I did some modifications to it to check we don't drop Authorization header on internal redirect, no luck to reproduce this issue, but you can get inspired to build a test.

> 
> So this code is essentially untested, and will remain so until there's some
> way to test internal redirect handlers. :/
> 
> How likely is it that Necko will add a channel flag that forces an internal
> redirect to allow testing?

You can apply the patch from bug 765934 to try.
Well, that'll do it:

Error copying headers: newChannel.setRequestHeader is not a function JS Stack trace: asyncOnChannelRedirect@resource.js:639
Attached patch Proposed patch. v1 (obsolete) — Splinter Review
Thanks for the test pointer, Honza. I managed to write a test, and found that the problem was a failure to QI the new channel, so we failed to set the header. With the additional logging I added earlier in this bug, the failure was obvious once reproduced.
Attachment #698221 - Flags: review?(gps)
Slightly tidier version.

Note that rest.js already QIs appropriately, and correctly includes the third argument to setRequestHeader, and thus is not affected by this bug. If we had less legacy code, we'd have fewer problems.
Attachment #698221 - Attachment is obsolete: true
Attachment #698221 - Flags: review?(gps)
Attachment #698222 - Flags: review?(gps)
Comment on attachment 698222 [details] [diff] [review]
Proposed patch. v2

Review of attachment 698222 [details] [diff] [review]:
-----------------------------------------------------------------

Please land on inbound so this makes 20.
Attachment #698222 - Flags: review?(gps) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/08570379483e

If this stays green and cheerful, I will request uplift to beta next week.
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: [leave open]
Target Milestone: --- → mozilla20
And how is it that getRequestHeader works on the old channel that is not QI'ed to nsIHttpChannel?
And by the way, if I'm not wrong, we then never was able to copy the headers, so how that could fix the 401 uptick?

And even more, how is that the error log is not visible in the faulty logs from Kamil?  Could it be that _log.error() doesn't work for some reason on release branches?
> And how is it that getRequestHeader works on the old channel that is not
> QI'ed to nsIHttpChannel?

Presumably because it's the same object that we already QIed when we set the headers in the first place. newChannel is the only new object.


> And by the way, if I'm not wrong, we then never was able to copy the
> headers, so how that could fix the 401 uptick?

It didn't fix the uptick, but the numbers weren't conclusive enough to tell. (Some 401s are acceptable, and the problem is apparently not easy to repro in the wild.)  So it was "wait and see"… and we were lucky enough to get one log that confirmed that the bug was still present, after several users reported that fx17 fixed the problem for them.

> And even more, how is that the error log is not visible in the faulty logs
> from Kamil?  Could it be that _log.error() doesn't work for some reason on
> release branches?

His error log was produced before that additional logging code landed, and now he can't repro. Error logs work just fine in all branches -- I rely on them a lot :D
https://hg.mozilla.org/mozilla-central/rev/08570379483e
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I wasn't able to reliably reproduce this to begin with.  Reporter, can you verify this is fixed on Nightly builds?
I have been using Nightly since 2013-01-03 (i.e. Nightly not containing the fix). I haven't seen a single error log since that time.
It is a week now that I've been using Nightly and I haven't yet seen any error.log. That means I can't confirm whether the commit resolved the issue or not, maybe something else resolved it in the past. But the good news is that with Nightly this is no longer a problem.
Kamil,  That's great news. Thank you for following up.  I'm going to verify this per your comments.
Status: RESOLVED → VERIFIED
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.