Closed Bug 1045691 Opened 10 years ago Closed 10 years ago

Firefox sync fails with "unknown error" graybar message and 502s

Categories

(Cloud Services :: Operations: Miscellaneous, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: michal, Assigned: bobm)

References

Details

(Whiteboard: [qa+])

Attachments

(6 files)

Attached file sync-log.txt
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0 (Beta/Release)
Build ID: 20140716183446

Steps to reproduce:

Since 28/7/2014 my two Firefox installations started to spit "unknown errors" while trying to sync data. One is FF31 on Windows 7 32bit, other one FF31 on Gentoo Linux. This issue has been confirmed with two other users, who reported simillar issue on #sync IRC channel.



Actual results:

502 bad gateway error is seen in attached sync-log. The problem is persistent. Happens every time a sync attempt is initiated, on both platforms.


Expected results:

Successfull sync shoud occur. ;-)
There were some server-side issues on a couple of our storage nodes, which should now be resolved.  Can you please confirm whether sync is working now for you?
Flags: needinfo?(michal)
Attached file sync-log.txt
Hello, I don't see any more 502 bad gateway error in sync-log, but a new one is beeing shown. Don't know, if it is relevant to this issue. Log attached.
Flags: needinfo?(michal)
I see: "history failed: Error: App. Quitting"

This usually indicates that Firefox was shut down mid-sync, and things should pick up again when you restart firefox.  If this happens sporadically then it shouldn't be a problem, but if you see these errors from pretty much every run of firefox, it may indicate a deeper issue.

I will tentatively close this as resolved since the 502/grey-bar is gone; please don't hesitate the file a new bug if the above errors persist.
Status: UNCONFIRMED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
After letting Firefox more time to sync with all accumulated changes, problem seems to be solved indeed. Thank you.
Well, looks like someone is playing with me. :-) 502 bad gateway error just got back with graybar unknown error message. :-( I'm attaching sync-log again.
Status: RESOLVED → VERIFIED
Attached file sync-log.txt
Status: VERIFIED → REOPENED
Ever confirmed: true
Resolution: FIXED → ---
Whiteboard: [qa+]
Summary: Firefox sync fails with "unknown error" graybar message → Firefox sync fails with "unknown error" graybar message and 502s
I have this exact same issue.  Log attached.
I'm running Disconnet.Me, Ad Block Plus, and HTTPS everywhere if those would affect sync connectivity
me too

1406830720223	Sync.Collection	DEBUG	mesg: POST fail 502 https://sync-39-us-west-2.sync.services.mozilla.com/1.5/1798505/storage/history
1406830720223	Sync.Collection	DEBUG	POST fail 502 https://sync-39-us-west-2.sync.services.mozilla.com/1.5/1798505/storage/history
1406830720224	Sync.Engine.History	DEBUG	Uploading records failed: <html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.6.0</center>
</body>
</html>

1406830720244	Sync.Status	DEBUG	Status for engine history: error.engine.reason.record_upload_fail
1406830720244	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed_partial
1406830720244	Sync.ErrorHandler	DEBUG	history failed: <html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.6.0</center>
</body>
</html>

I don't get this with CLI:
https://github.com/edmoz/fxa-sync-client
Looks like ongoing performance issues on sync-39-us-west-2.sync.services.mozilla.com, escalating for Bob's attention.  Bob, does this machine have the increased --graceful-timeout value?  If not, it may be worth rolling out even without the other config changes.
Flags: needinfo?(bobm)
I have been having these issues for a couple of days now across multiple platforms.
I do not get a 502, but instead a 504 from sync-40-us-west-2.sync.services.mozilla.com

1406876833967	Sync.Resource	DEBUG	mesg: GET fail 504 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/<snip>/info/collections
1406876833967	Sync.Resource	DEBUG	GET fail 504 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/<snip>/info/collections
I randomly get both errors 504 and 502, which lead to that graybar appearing randomly. You can see the errors here:

https://gist.github.com/gonzalezsieira/c28dbd8a8c151f2f66c9
I have this problem too. I think it is worth mentioning this issue on the status page since I had to go to #sync on IRC to learn about this ticket:

>15:39 <BYK> anyone there?
>15:39 <BYK> Getting 504 for https://sync-40-us-west-2.sync.services.mozilla.com/1.5/1871300/info/collections request for the last 2 days now
>15:46 <@rnewman> BYK: known problem, bug 1045691 is probably the best place to track
>15:46 <firebot> https://bugzil.la/1045691 — REOPENED — Firefox sync fails with "unknown error" graybar message and 502s
>15:47 <BYK> rnewman: oh, thanks!
>15:47 <BYK> the status page should indicate this IMO
>15:47 <BYK> any plans for that?
>15:47 <@rnewman> I think it's an AWS issue that's only affecting one node; it's up to services ops to decide whether to post on one of the status pages
>15:47 <@rnewman> might be done on a different one to the one you're looking at
>15:49 <BYK> rnewman: alright, will mention this on the ticket

Also it's a bit weird to see that I'm trying to connect to US West Coast servers since I'm based in Turkey.
(In reply to Ryan Kelly [:rfkelly] from comment #10)
> Looks like ongoing performance issues on
> sync-39-us-west-2.sync.services.mozilla.com, escalating for Bob's attention.
> Bob, does this machine have the increased --graceful-timeout value?  If not,
> it may be worth rolling out even without the other config changes.

It does.  All machines have a --graceful-timeout setting now.  Also, ~20% of the users on that node have been migrated.  Operations is discussing how to revamp the status page for Sync.  I filed a bug 1047498 to track that issue.
Flags: needinfo?(bobm)
Apologies if I'm not submitting this correctly.
Sync log showing the intermittent nature of the problem - initial GET and POST both return 200, but subsequent post returns 504.
Still having problems

1407187537416	Sync.Collection	DEBUG	mesg: POST success 200 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/4693169/storage/forms
1407187537416	Sync.Collection	DEBUG	POST success 200 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/4693169/storage/forms
1407187537538	Sync.Engine.Forms	INFO	Uploading 100 - 200 out of 3596 records
1407187537538	Sync.BrowserIDManager	DEBUG	_ensureValidToken already has one
1407187537541	Sync.Collection	DEBUG	POST Length: 30093
1407187547388	Sync.Collection	DEBUG	mesg: POST fail 502 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/4693169/storage/forms
1407187547388	Sync.Collection	DEBUG	POST fail 502 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/4693169/storage/forms
1407187547389	Sync.Engine.Forms	DEBUG	Uploading records failed: <html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.6.0</center>
</body>
</html>

1407187547455	Sync.Status	DEBUG	Status for engine forms: error.engine.reason.record_upload_fail
1407187547455	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed_partial
1407187547455	Sync.ErrorHandler	DEBUG	forms failed: <html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.6.0</center>
</body>
</html>
 No traceback available
1407187547455	Sync.Synchronizer	INFO	Aborting sync for failure in forms
1407187547455	Sync.Synchronizer	INFO	Sync completed at 2014-08-04 23:25:47 after 178.54 secs.
1407187547456	Sync.Declined	DEBUG	Handling remote declined: []
1407187547456	Sync.Declined	DEBUG	Handling local declined: []
1407187547456	Sync.Declined	DEBUG	Declined changed? false
1407187547456	Sync.Service	INFO	No change to declined engines. Not reuploading meta/global.
1407187547457	Sync.SyncScheduler	DEBUG	Starting client-initiated backoff. Next sync in 4573852 ms.
1407187547457	Sync.SyncScheduler	DEBUG	Next sync in 4573852 ms.
1407187547457	Sync.ErrorHandler	DEBUG	Some engines did not sync correctly.
I'm seeing this problem too, on an OSX machine and and 2 Win7 machines.
Node is sync-40-us-west-2 same as Michal.

1407188151148	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151149	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151149	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151149	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151149	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151149	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151149	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151150	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151150	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151150	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188151151	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188241148	Sync.Service	DEBUG	User-Agent: Firefox/31.0 FxSync/1.33.0.20140716183446.
1407188241149	Sync.Service	INFO	Starting sync at 2014-08-05 07:37:21
1407188241160	Sync.SyncScheduler	DEBUG	Clearing sync triggers and the global score.
1407188241161	Sync.Status	INFO	Resetting Status.
1407188241161	Sync.Status	DEBUG	Status.service: success.status_ok => success.status_ok
1407188241161	Sync.BrowserIDManager	DEBUG	_ensureValidToken already has one
1407188301738	Sync.Resource	DEBUG	mesg: GET fail 504 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/1843843/info/collections
1407188301738	Sync.Resource	DEBUG	GET fail 504 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/1843843/info/collections
1407188301739	Sync.SyncScheduler	DEBUG	Starting client-initiated backoff. Next sync in 1225409 ms.
1407188301739	Sync.SyncScheduler	DEBUG	Next sync in 1225409 ms.
At this point I consider sync non-functional and in outage.
:michal Ops is working on it. Usually the first thing is to move sync accounts off of the problematic nodes: 39 and 40.  After service is restored, we should investigate route cause.  I got moved to sync-78 and I'm not longer having any issues.
Component: General → Operations
Assignee: nobody → bobm
Status: REOPENED → ASSIGNED
Related to the cause but more about the sync UI - can we do something about sync error bar popping up constantly? It's annoying and obscures webpage and having to to close it again and again...
Still having problems here. Seems like the 502 error has dissapeared from the logs, but the graybar with the error is still appearing.

Error trace is the following:

1407255017642	Sync.Collection	DEBUG	SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data (resource://services-sync/record.js:77:4) JS Stack trace: deserialize@record.js:77:5 < Collection.prototype.recordHandler/this._onProgress@record.js:624:9 < Channel_onDataAvail@resource.js:542:7 < waitForSyncCallback@async.js:99:7 < Res__request@resource.js:389:7 < Res_get@resource.js:413:5 < SyncEngine.prototype._processIncoming@engines.js:1069:11 < SyncEngine.prototype._sync@engines.js:1481:7 < WrappedNotify@util.js:141:15 < Engine.prototype.sync@engines.js:655:5 < _syncEngine@enginesync.js:199:7 < sync@enginesync.js:149:13 < onNotify@service.js:1254:7 < WrappedNotify@util.js:141:15 < WrappedLock@util.js:96:9 < _lockedSync@service.js:1248:1 < sync/<@service.js:1239:7 < WrappedCatch@util.js:70:9 < sync@service.js:1227:5
1407255017643	Sync.Status	DEBUG	Status for engine tabs: error.engine.reason.unknown_fail
1407255017643	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed_partial
1407255017644	Sync.ErrorHandler	DEBUG	tabs failed: Error: NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS (resource://services-sync/resource.js:393:10) JS Stack trace: Res_get@resource.js:413:5 < SyncEngine.prototype._processIncoming@engines.js:1069:11 < SyncEngine.prototype._sync@engines.js:1481:7 < WrappedNotify@util.js:141:15 < Engine.prototype.sync@engines.js:655:5 < _syncEngine@enginesync.js:199:7 < sync@enginesync.js:149:13 < onNotify@service.js:1254:7 < WrappedNotify@util.js:141:15 < WrappedLock@util.js:96:9 < _lockedSync@service.js:1248:1 < sync/<@service.js:1239:7 < WrappedCatch@util.js:70:9 < sync@service.js:1227:5
I'm having this issue on multiple computers, both OS X and Windows. Been happening for about a week.
Well, I did not say anything. In the subsequent logs I received both 504 and 502 errors, too.
It would appear from the continuing reports, that the problem has been resolved for sync-39-us-west-2 but persists (or has appeared) on sync-40-us-west-2.  Can we get the same love they got? :)
1407256751578	Sync.Engine.History	INFO	Uploading 100 - 200 out of 200 records
1407256751578	Sync.BrowserIDManager	DEBUG	_ensureValidToken already has one
1407256751583	Sync.Collection	DEBUG	POST Length: 55730
1407256812817	Sync.Collection	DEBUG	mesg: POST fail 504 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/1849639/storage/history
1407256812818	Sync.Collection	DEBUG	POST fail 504 https://sync-40-us-west-2.sync.services.mozilla.com/1.5/1849639/storage/history
1407256812818	Sync.Engine.History	DEBUG	Uploading records failed: <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.6.0</center>
</body>
</html>

1407256812824	Sync.Status	DEBUG	Status for engine history: error.engine.reason.record_upload_fail
1407256812824	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed_partial
1407256812824	Sync.ErrorHandler	DEBUG	history failed: <html>


Yup, same here, 504 https://sync-40-us-west-2.sync.services.mozilla.com.
We'll be rolling out some node maintenance today (I'm looking at you sync-40).  As well as making some operational improvements, that should alleviate the 502 and 504 errors in general, later this week.
Mentor: rnewman
Mentor: rnewman
As of this morning, the sync errors seem to have been eliminated.  Am I being overly optimistic?
In the morning I got a complete sync (the first in the week), but an hour ago I received both 502  and 504 errors again.
I don't see warning either, but sync fails every now and then. Yesterday it failed every time, today only five times, so there's some progress. I can also see history and settings synchronized.

Time whehn synchronization failed in CEST:
12:08:28 AM
12:27:42 AM
1:04:29 AM
4:57:15 PM
5:35:14 PM
I'm cautiously optimistic - last error logged was at 1:55pm CDT (1855Z) yesterday, and curiously that was a 401, not a 502 or 504.  No sign of any popups or logged errors since then.  Thanks!
The server side issues have been resolved.  There may be occaisional load related errors, however the servers have been configured to convert these errors into a response that the Sync client should deal with gracefully (not cause a popup or error bar), and scale back traffic if necessary.

Please note, a 401 isn't the sign of a failed Sync, it just indicates that the client should fetch a new authentication token, and perhaps new Sync node.  For more information see: https://docs.services.mozilla.com/storage/apis-1.5.html

Operations would like to close this bug.  Anyone still seeing an error bar?
Looks good to me. Thanks.
Looking good from here.  Thanks for your efforts.
:bobm thumbsup mate, resolution looks good to me.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Calling this one fixed.
Status: RESOLVED → VERIFIED
Fixed, thank you.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: