Closed
Bug 1066493
Opened 11 years ago
Closed 5 years ago
TPS failure: Error: signIn() failed with: Weave logged in; expected "success.login", got "error.sync.reason.serverMaintenance" (resource://tps/auth/fxaccounts.jsm:93:12)
Categories
(Cloud Services :: Operations: Miscellaneous, task)
Cloud Services
Operations: Miscellaneous
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cosmin-malutan, Assigned: jrgm)
References
Details
(Whiteboard: [tps][needs fix for bug 1057892])
Attachments
(1 file)
7.42 MB,
text/plain
|
Details |
This fails in intermittently in more tests but at least once in a testrun:
>13:02:34 CROSSWEAVE INFO: ----------event observed: weave:service:login:error
>13:02:34 CROSSWEAVE ERROR: [phase1] Exception caught: Error: signIn() failed with: undefined (resource://tps/auth/fxaccounts.jsm:93:12) JS Stack trace: signIn@fxaccounts.jsm:93:13 < Login@tps.jsm:867:5 < TPS.RunNextTestAction@tps.jsm:595:1
>13:02:34 CROSSWEAVE INFO: ----------event observed: quit-application-requested
>13:02:34 CROSSWEAVE INFO: Wiping data from server.
>13:02:34 CROSSWEAVE INFO: ----------event observed: weave:service:login:error
>13:02:35
>13:02:35 TEST-UNEXPECTED-FAIL | test_bug501528.js | [phase1] Exception caught: Error: signIn() failed with: undefined (resource://tps/auth/fxaccounts.jsm:93:12) JS Stack trace: signIn@fxaccounts.jsm:93:13 < Login@tps.jsm:867:5 < TPS.RunNextTestAction@tps.jsm:595:1
>13:02:35
>13:02:35 phase1: FAIL
>13:02:35 Running test test_bug575423.js
I'll try to reproduce it and return with an update.
Comment 1•11 years ago
|
||
Strange. We get an exception but without a message set. I wonder what this is:
http://mxr.mozilla.org/mozilla-central/source/services/sync/tps/extensions/tps/resource/auth/fxaccounts.jsm#93
Reporter | ||
Comment 2•11 years ago
|
||
>00:50:20 CROSSWEAVE INFO: Login user: coversheet-261c7acc9163@restmail.net
>00:50:21 CROSSWEAVE INFO: ----------event observed: fxaccounts:onlogin
>00:50:21 CROSSWEAVE INFO: Logging into Weave.
>00:50:21 CROSSWEAVE INFO: ----------event observed: weave:engine:start-tracking
>00:50:21 CROSSWEAVE INFO: ----------event observed: weave:service:setup-complete
>00:50:21 CROSSWEAVE INFO: ----------event observed: weave:service:login:error
>00:50:21 CROSSWEAVE ERROR: [phase1] Exception caught: Error: signIn() failed with: undefined (resource://tps/auth/fxaccounts.jsm:93:12) JS Stack trace: signIn@fxaccounts.jsm:93:13 < Login@tps.jsm:867:5 < TPS.RunNextTestAction@tps.jsm:595:1
>00:50:21 CROSSWEAVE INFO: ----------event observed: quit-application-requested
>00:50:21 CROSSWEAVE INFO: Wiping data from server.
>00:50:22 CROSSWEAVE INFO: ----------event observed: weave:service:login:error
>00:50:22
>00:50:22 TEST-UNEXPECTED-FAIL | test_prefs.js | [phase1] Exception caught: Error: signIn() failed with: undefined (resource://tps/auth/fxaccounts.jsm:93:12) JS Stack trace: signIn@fxaccounts.jsm:93:13 < Login@tps.jsm:867:5 < TPS.RunNextTestAction@tps.jsm:595:1
>00:50:22
>00:50:22 phase1: FAIL
As you can see in the extended log above it logs "CROSSWEAVE INFO: Logging into Weave." so the function in which it fails is the next one, "Weave.Service.login();"
>http://mxr.mozilla.org/mozilla-central/source/services/sync/tps/extensions/tps/resource/auth/fxaccounts.jsm#85
The reason why it fails with "undefined" is that in login method we throw sometimes strings and not exception objects, only the latest one has an "message" attribute:
http://mxr.mozilla.org/mozilla-central/source/services/sync/modules/service.js#994
Run the code below in console:
try {
throw "Some text"
} catch (e) {
console.log("e = " + e + "\n e.message = " + e.message)
}
What I think is that we should file a bug to fix those cases in services/sync first
http://mxr.mozilla.org/mozilla-central/source/services/sync/modules/service.js
Reporter | ||
Updated•11 years ago
|
Assignee: nobody → cosmin.malutan
Status: NEW → ASSIGNED
Reporter | ||
Comment 3•11 years ago
|
||
Or we could change the way we re-raise the exception in the link below to:
throw new Error("signIn() failed with: " + (error.message || error));
http://mxr.mozilla.org/mozilla-central/source/services/sync/tps/extensions/tps/resource/auth/fxaccounts.jsm#93
Henrik which one do you think it will be the right approach here?
Flags: needinfo?(hskupin)
Comment 4•11 years ago
|
||
File a bug for Weave and get it fixed. I think that a simple search and replace. Hopefully no consumers will be broken because of it.
Personally I would try to limit changes to TPS until we get the new repository created, and the code stripped of from mozilla-central.
On the other side please run the tests with debugging/tracing enabled. This should give us more details about the underlying failure, so we can work on it until the Weave bug is fixed.
Flags: needinfo?(hskupin)
Reporter | ||
Comment 5•11 years ago
|
||
I changed the way we handle the exception temporarily in test in order to see what was thrown:
>TEST-UNEXPECTED-FAIL | test_bug563989.js | [phase1] Exception caught: Error: signIn() failed with: ASSERTION FAILED! Weave logged in; expected "success.login", got "error.sync.reason.serverMaintenance" (resource://tps/auth/fxaccounts.jsm:93:12) JS Stack trace: signIn@fxaccounts.jsm:93:13 < Login@tps.jsm:867:5 < TPS.RunNextTestAction@tps.jsm:595:1
>
> phase1: FAIL
That means that the callback executed correctly, it couldn't connect to the sync-server, and what we catch is the assertion from:
http://mxr.mozilla.org/mozilla-central/source/services/sync/tps/extensions/tps/resource/auth/fxaccounts.jsm?rev=6196c121c4f4#87
Where Weave.Status.login = error.sync.reason.serverMaintenance, which is "503 Service Unavailable" error.
I don't think it's feasible to mock-up the sync server, because we use the real one here, also when we hit this kind of issue we should either re-tray or abort the testrun as requirements for testing the build wasn't satisfied.
Comment 6•11 years ago
|
||
(In reply to Cosmin Malutan from comment #5)
> http://mxr.mozilla.org/mozilla-central/source/services/sync/tps/extensions/
> tps/resource/auth/fxaccounts.jsm?rev=6196c121c4f4#87
> Where Weave.Status.login = error.sync.reason.serverMaintenance, which is
> "503 Service Unavailable" error.
Ouch. So fxa-pyhton-client creates fxa accounts on hosts which are under maintenance all the time? This is crazy. Cosmin, can you please check again with debug output so we can get an idea which hosts are affected here?
Richard and Mark, do you know who is responsible for maintaining the backend servers?
Flags: needinfo?(rnewman)
Flags: needinfo?(mhammond)
Comment 7•11 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #6)
> Richard and Mark, do you know who is responsible for maintaining the backend
> servers?
Depends. TPS used to use prod, which is not under maintenance most of the time. If you're using the dev hosts, yeah, they'll be broken a lot of the time.
CCed a bunch of folks.
Flags: needinfo?(rnewman)
Flags: needinfo?(mhammond)
Comment 8•11 years ago
|
||
So fxa-python-client is using the following hosts to interact with the fxa server:
https://github.com/mozilla/fxa-python-client/blob/master/fxa_client/fxa_crypto.py
https://api.accounts.firefox.com/
https://verifier.accounts.firefox.com/v2
I think it would be good to know what else you need as debug info to be able to figure out why it is broken. Please let me know, and we can check that.
Updated•11 years ago
|
Summary: TPS failure: Error: signIn() failed with: undefined (resource://tps/auth/fxaccounts.jsm:93:12) → TPS failure: Error: signIn() failed with: Weave logged in; expected "success.login", got "error.sync.reason.serverMaintenance" (resource://tps/auth/fxaccounts.jsm:93:12)
Updated•11 years ago
|
OS: Linux → All
Hardware: x86_64 → All
Reporter | ||
Comment 9•11 years ago
|
||
So the 503 status is for services temporally unavailable due to high load.
https://github.com/mozilla/fxa-auth-server/blob/master/docs/api.md#response-format
Under the Backoff-Protocol section it says that we have "retryAfter" attribute in the message we recite
https://github.com/mozilla/fxa-auth-server/blob/master/docs/api.md#backoff-protocol
I think we could catch this exception, add sleep and retry after a delay.
http://mxr.mozilla.org/mozilla-central/source/services/sync/tps/extensions/tps/resource/auth/fxaccounts.jsm?rev=6196c121c4f4#81
Comment 10•11 years ago
|
||
(In reply to Cosmin Malutan from comment #9)
> I think we could catch this exception, add sleep and retry after a delay.
> http://mxr.mozilla.org/mozilla-central/source/services/sync/tps/extensions/
> tps/resource/auth/fxaccounts.jsm?rev=6196c121c4f4#81
So if this is really a problem with high load, we have to find a proper solution to get this fixed for the TPS tests! I'm not going with the option to delay tests by 30s or even more because we cannot sign-in. Keep in mind that this would happen to all of our tests!
I hope we can find a solution here sooner than later. Asking direct feedback from Bob and Danny.
Flags: needinfo?(dcoates)
Flags: needinfo?(bobm)
Comment 11•11 years ago
|
||
We cannot turn on the new CI system until this problem has been fixed. So putting it on our blocking list.
Whiteboard: [qa-automation-blocked]
Comment 12•11 years ago
|
||
Its possible, though highly unlikely for the production auth-server (https://api.accounts.firefox.com) to return a 503. I don't recommend adding any special cases for auth-server 503s in TPS tests.
The error noted in comment #5 occurred after the FxAccountsClient logged in successfully.
Flags: needinfo?(dcoates)
Comment 13•11 years ago
|
||
Cosmin, can you please run such a test via the --debug pref, and attach the full console log output to this bug? This might help us to get closer to this problem.
Flags: needinfo?(cosmin.malutan)
Comment 14•11 years ago
|
||
For mozilla-central and mozilla-aurora I have enabled the debug setting in the jobs to run it by default. It should give us some additional info about the user id, which Bob can query for in the logs.
We will hvae a couple of uids by tomorrow.
Flags: needinfo?(cosmin.malutan)
Reporter | ||
Comment 15•11 years ago
|
||
Here is a snippet of log from the one attached.
1413251062352 Sync.Resource TRACE In _onComplete. Error is null.
1413251062352 Sync.Resource TRACE Channel: [xpconnect wrapped (nsISupports, nsIChannel, nsIRequest, nsIHttpChannel, nsIUploadChannel)]
1413251062352 Sync.Resource TRACE Action: PUT
1413251062352 Sync.Resource TRACE Status: 503
1413251062352 Sync.Resource TRACE Success: false
1413251062352 Sync.Resource DEBUG mesg: PUT fail 503 https://sync-145-us-west-2.sync.services.mozilla.com/1.5/10822912/storage/meta/global
1413251062352 Sync.Resource DEBUG PUT fail 503 https://sync-145-us-west-2.sync.services.mozilla.com/1.5/10822912/storage/meta/global
1413251062352 Sync.Resource TRACE PUT body: <html>
<head><title>503 Service Unavailable</title></head>
<body bgcolor="white">
<center><h1>503 Service Cloudy: Try again later. </h1></center>
</body>
</html>
1413251062352 Sync.Resource TRACE Processing response headers.
1413251062353 Sync.Service DEBUG verifyLogin failed: <html>
<head><title>503 Service Unavailable</title></head>
<body bgcolor="white">
<center><h1>503 Service Cloudy: Try again later. </h1></center>
</body>
</html>
No traceback available
1413251062353 Sync.Status DEBUG Status.login: success.login => error.login.reason.network
1413251062353 Sync.Status DEBUG Status.service: success.status_ok => error.login.failed
1413251062353 Sync.ErrorHandler DEBUG Got Retry-After: 3600
1413251062353 Sync.Status DEBUG Status.login: error.login.reason.network => error.sync.reason.serverMaintenance
1413251062353 Sync.Status DEBUG Status.service: error.login.failed => error.login.failed
Comment 16•11 years ago
|
||
What happens if you create a new account?
Reporter | ||
Comment 17•11 years ago
|
||
All accounts are newly created and removed after each testrun.
Also it's worth mentioning that not all tests failed, only some of them consecutively.
Comment 18•11 years ago
|
||
I talked with Bob about that yesterday and we tried something quick. Sadly it didn't reproduce at this time. I tried later again by running the TPS tests in debug mode, and I hit it again.
Cosmin, can you please check all the results from the last days when we failed with this failure? It can be across branches. Important here would be the time and the user ID (not email) when it happened. The user ID can be extracted from the URLs as seen above in your comment. So here it would be 10822912. Mind collecting some more and all the information and commenting it here? Bob can have a look then.
Reporter | ||
Comment 19•11 years ago
|
||
Time Version Email ID
10/14/2014 7:15:35 36.0a1.en-US.linux-x86 coversheet-de762355dafb@restmail.net 10839278
10/14/2014 6:52:56 36.0a1.en-US.mac coversheet-3299ff030a89@restmail.net 10838560
10/13/2014 18:37:51 36.0a1.en-US.mac coversheet-f578e3436795@restmail.net 10822912
10/14/2014 5:47:45 35.0a2.en-US.mac coversheet-4997b321b9f4@restmail.net 10836566
Those are the jobs that ran with debug enabled, for all the others I can't retrieve the user ID.
I hope it helps, if I can do anything else please let me know.
Comment 20•11 years ago
|
||
Btw. I also see errors like the following beside the maintenance one above:
TEST-UNEXPECTED-FAIL | test_addon_restartless_xpi.js | [phase01] Exception caught: Error: signIn() failed with: ASSERTION FAILED! Weave logged in; expected "success.login", got "error.login.reason.network" (resource://tps/auth/fxaccounts.jsm:93:12) JS Stack trace: signIn@fxaccounts.jsm:93:13 < Login@tps.jsm:867:5 < TPS.RunNextTestAction@tps.jsm:595:1
Comment 21•11 years ago
|
||
Bob, have you had a chance to look into this for the given users in comment 19?
Comment 22•11 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #21)
The errors all look like subtle variations on this:
"PUT /1.5/10838560/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a1 FxSync/1.38.0.20141014030201.desktop" 0.083 0.083 "503"
Unfortunately, I can't easily dig into the application logs that far back to get a detailed traceback. However, the servers were operating normally during these time frames. I suspect that TPS is acting somewhat out of the range of normal user behavior, but I can't say for certain unless we get a detailed traceback for one of these meta/global POST failures.
Can you force a new one?
Flags: needinfo?(bobm) → needinfo?(hskupin)
Comment 23•11 years ago
|
||
There was happening one more yesterday and digging deeper into the debug trace it's clear that this is a server side issue:
Log: http://tps-ci-production.qa.scl3.mozilla.com:8080/job/mozilla-central_fx-account/426/console
User ID: 11071539
Here an excerpt of the trace output when this failure happens:
22:33:12 1413783362609 Sync.Service DEBUG Uploading meta/global: {"payload":"{\"syncID\":\"RHyYZvIKfB1v\",\"storageVersion\":5,\"declined\":[]}","id":"global"}
22:33:12 1413783362609 Sync.Resource TRACE In _doRequest.
22:33:12 1413783362609 Sync.BrowserIDManager DEBUG _ensureValidToken already has one
22:33:12 1413783362610 Sync.Resource TRACE HTTP Header authorization: ***** (suppressed)
22:33:12 1413783362610 Sync.Resource DEBUG PUT Length: 94
22:33:12 1413783362610 Sync.Resource TRACE PUT Body: {"payload":"{\"syncID\":\"RHyYZvIKfB1v\",\"storageVersion\":5,\"declined\":[]}","id":"global"}
22:33:13 1413783362728 Sync.Resource TRACE onStartRequest called for channel [xpconnect wrapped (nsISupports, nsIChannel, nsIRequest, nsIHttpChannel, nsIUploadChannel)].
22:33:13 1413783362728 Sync.Resource TRACE onStartRequest: PUT https://sync-147-us-west-2.sync.services.mozilla.com/1.5/11071539/storage/meta/global
22:33:13 1413783362728 Sync.Resource TRACE Channel for PUT https://sync-147-us-west-2.sync.services.mozilla.com/1.5/11071539/storage/meta/global: isSuccessCode(0)? true
22:33:13 1413783362728 Sync.Resource TRACE Channel: flags = 17024, URI = https://sync-147-us-west-2.sync.services.mozilla.com/1.5/11071539/storage/meta/global, HTTP success? false
22:33:13 1413783362728 Sync.Resource TRACE In _onComplete. Error is null.
22:33:13 1413783362728 Sync.Resource TRACE Channel: [xpconnect wrapped (nsISupports, nsIChannel, nsIRequest, nsIHttpChannel, nsIUploadChannel)]
22:33:13 1413783362728 Sync.Resource TRACE Action: PUT
22:33:13 1413783362728 Sync.Resource TRACE Status: 503
22:33:13 1413783362728 Sync.Resource TRACE Success: false
22:33:13 1413783362728 Sync.Resource DEBUG mesg: PUT fail 503 https://sync-147-us-west-2.sync.services.mozilla.com/1.5/11071539/storage/meta/global
22:33:13 1413783362729 Sync.Resource DEBUG PUT fail 503 https://sync-147-us-west-2.sync.services.mozilla.com/1.5/11071539/storage/meta/global
22:33:13 1413783362729 Sync.Resource TRACE PUT body: <html>
22:33:13 <head><title>503 Service Unavailable</title></head>
22:33:13 <body bgcolor="white">
22:33:13 <center><h1>503 Service Cloudy: Try again later. </h1></center>
22:33:13 </body>
22:33:13 </html>
22:33:13 1413783362729 Sync.Resource TRACE Processing response headers.
22:33:13 1413783362729 Sync.Service DEBUG verifyLogin failed: <html>
22:33:13 <head><title>503 Service Unavailable</title></head>
22:33:13 <body bgcolor="white">
22:33:13 <center><h1>503 Service Cloudy: Try again later. </h1></center>
22:33:13 </body>
22:33:13 </html>
22:33:13 No traceback available
22:33:13 1413783362729 Sync.Status DEBUG Status.login: success.login => error.login.reason.network
22:33:13 1413783362729 Sync.Status DEBUG Status.service: success.status_ok => error.login.failed
22:33:13 1413783362729 Sync.ErrorHandler DEBUG Got Retry-After: 3600
22:33:13 1413783362729 Sync.Status DEBUG Status.login: error.login.reason.network => error.sync.reason.serverMaintenance
22:33:13 1413783362729 Sync.Status DEBUG Status.service: error.login.failed => error.login.failed
22:33:13 1413783362729 Sync.SyncScheduler TRACE Handling weave:service:backoff:interval
22:33:13 1413783362729 Sync.SyncScheduler DEBUG Got backoff notification: 3600000ms
22:33:13 1413783362729 Sync.SyncScheduler DEBUG Fuzzed minimum next sync: 1413786962729
22:33:13 1413783362729 Sync.Service TRACE Event: weave:service:login:error
22:33:13 1413783362730 Sync.SyncScheduler TRACE Handling weave:service:login:error
22:33:13 1413783362730 Sync.SyncScheduler DEBUG Clearing sync triggers and the global score.
22:33:13 1413783362730 Sync.SyncScheduler TRACE _checkSync returned "".
22:33:13 1413783362730 Sync.SyncScheduler TRACE Requested interval 3600000 ms is smaller than the backoff interval. Using backoff interval 4436096.165866296 ms instead.
22:33:13 1413783362730 Sync.SyncScheduler DEBUG Next sync in 4436096.165866296 ms.
22:33:13 1413783362730 Sync.ErrorHandler TRACE Handling weave:service:login:error
22:33:13 1413783362730 Sync.ErrorHandler DEBUG Flushing file log.
22:33:13 1413783362731 Sync.ErrorHandler TRACE Beginning stream copy to error-1413783362731.txt: 1413783362731
Flags: needinfo?(hskupin) → needinfo?(bobm)
Comment 24•11 years ago
|
||
Cosmin, I will take that over so that you have more time for mozmill related tests.
Assignee: cosmin.malutan → hskupin
QA Contact: hskupin
Comment 25•11 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #23)
> There was happening one more yesterday and digging deeper into the debug
> trace it's clear that this is a server side issue:
This appears to have happened on the 20th, and is just beyond our app log rotation. We can go back farther with our HTTP logs, but not so far with the application logs, unless we go to the anonymized logs. Do you have anything more recent?
Flags: needinfo?(bobm) → needinfo?(hskupin)
Comment 26•11 years ago
|
||
Exactly this problem happened way lesser the last days. I will have to observe the tests and maybe trigger some additional ones, unless I'm able to reproduce the signIn() failure.
Comment 27•11 years ago
|
||
Bob, has something been changed server-side the last days? The whole last day I haven't seen any single instance of this failure anymore. I will keep watching...
Comment 28•11 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #27)
> Bob, has something been changed server-side the last days? The whole last
> day I haven't seen any single instance of this failure anymore. I will keep
> watching...
Not significantly.
Comment 29•11 years ago
|
||
Ok, so lots of failures happening again over the weekend and today with the release candidate builds of the next 33.0 release, and the Aurora build. The following is the excerpt of a today's build, so you should be able to get the log details from the server.
03:31:15 1414406059640 Sync.Resource TRACE onStartRequest: DELETE https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage
03:31:15 1414406059641 Sync.Resource TRACE Channel for DELETE https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage: isSuccessCode(0)? true
03:31:15 1414406059641 Sync.Resource TRACE Channel: flags = 17024, URI = https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage, HTTP success? true
03:31:15 1414406059641 Sync.Resource TRACE In _onComplete. Error is null.
03:31:15 1414406059641 Sync.Resource TRACE Channel: [xpconnect wrapped (nsISupports, nsIChannel, nsIRequest, nsIHttpChannel)]
03:31:15 1414406059641 Sync.Resource TRACE Action: DELETE
03:31:15 1414406059641 Sync.Resource TRACE Status: 200
03:31:15 1414406059641 Sync.Resource TRACE Success: true
03:31:15 1414406059641 Sync.Resource DEBUG mesg: DELETE success 200 https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage
03:31:15 1414406059641 Sync.Resource DEBUG DELETE success 200 https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage
03:31:15 1414406059641 Sync.Resource TRACE DELETE body: {}
03:31:15 1414406059641 Sync.Resource TRACE Processing response headers.
03:31:15 1414406059641 Sync.Service DEBUG Uploading meta/global: {"payload":"{\"syncID\":\"HqRR4jAZZPB9\",\"storageVersion\":5,\"declined\":[]}","id":"global"}
03:31:15 1414406059642 Sync.Resource TRACE In _doRequest.
03:31:15 1414406059642 Sync.BrowserIDManager DEBUG _ensureValidToken already has one
03:31:15 1414406059643 Sync.Resource TRACE HTTP Header authorization: ***** (suppressed)
03:31:15 1414406059643 Sync.Resource DEBUG PUT Length: 94
03:31:15 1414406059643 Sync.Resource TRACE PUT Body: {"payload":"{\"syncID\":\"HqRR4jAZZPB9\",\"storageVersion\":5,\"declined\":[]}","id":"global"}
03:31:15 1414406059755 Sync.Resource TRACE onStartRequest called for channel [xpconnect wrapped (nsISupports, nsIChannel, nsIRequest, nsIHttpChannel, nsIUploadChannel)].
03:31:15 1414406059755 Sync.Resource TRACE onStartRequest: PUT https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage/meta/global
03:31:15 1414406059755 Sync.Resource TRACE Channel for PUT https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage/meta/global: isSuccessCode(0)? true
03:31:15 1414406059755 Sync.Resource TRACE Channel: flags = 17024, URI = https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage/meta/global, HTTP success? false
03:31:15 1414406059755 Sync.Resource TRACE In _onComplete. Error is null.
03:31:15 1414406059755 Sync.Resource TRACE Channel: [xpconnect wrapped (nsISupports, nsIChannel, nsIRequest, nsIHttpChannel, nsIUploadChannel)]
03:31:15 1414406059755 Sync.Resource TRACE Action: PUT
03:31:15 1414406059755 Sync.Resource TRACE Status: 503
03:31:15 1414406059755 Sync.Resource TRACE Success: false
03:31:15 1414406059756 Sync.Resource DEBUG mesg: PUT fail 503 https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage/meta/global
03:31:15 1414406059756 Sync.Resource DEBUG PUT fail 503 https://sync-151-us-west-2.sync.services.mozilla.com/1.5/11612811/storage/meta/global
03:31:15 1414406059756 Sync.Resource TRACE PUT body: <html>
03:31:15 <head><title>503 Service Unavailable</title></head>
03:31:15 <body bgcolor="white">
03:31:15 <center><h1>503 Service Cloudy: Try again later. </h1></center>
03:31:15 </body>
03:31:15 </html>
03:31:15 1414406059756 Sync.Resource TRACE Processing response headers.
03:31:15 1414406059756 Sync.Service DEBUG verifyLogin failed: <html>
03:31:15 <head><title>503 Service Unavailable</title></head>
03:31:15 <body bgcolor="white">
03:31:15 <center><h1>503 Service Cloudy: Try again later. </h1></center>
03:31:15 </body>
03:31:15 </html>
Flags: needinfo?(hskupin) → needinfo?(bobm)
Comment 30•11 years ago
|
||
Adding Ryan Kelly. Will update bug with information from comment 19 shortly.
Comment 31•11 years ago
|
||
Based on the log entries / tracebacks from the issue above the problem TPS is experiencing is related to Bug 1057892. We will be testing an updated version of the database backing Sync 1.5 through Bug 1089945, which may resolve the problem.
The HTTP log entries, generally look like this:
A.B.C.D [2014-10-27T10:32:03+00:00] "PUT /1.5/11612811/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/35.0a2 FxSync/1.37.0.20141027004000.desktop" 0.012 0.012 "503"
Corresponding Sync 1.5 application log entries:
{"time": "2014-10-27T10:33:12.315239Z", "v": 1, "message": "Caught operational db error: (InternalError) (1032, u\"Can't find record in 'bso11'\") '/* [queryName=UPSERT_bso] */ INSERT INTO bso11 (payload_size,userid,modified,collection,id,payload) VALUES (%s,%s,%s,%s,%s,%s) ON DUPLICATE KEY UPDATE payload_size = VALUES(payload_size),userid = VALUES(userid),modified = VALUES(modified),collection = VALUES(collection),id = VALUES(id),payload = VALUES(payload)' (58, 11612811, 1414405992310, 6, u'global', '{\"syncID\":\"Ig9OI9iyFzwG\",\"storageVersion\":5,\"declined\":[]}')\nTraceback (most recent call last):\n File \"/data/server-syncstorage/syncstorage/storage/sql/dbconnect.py\", line 446, in report_backend_errors_wrapper\n return func(self, *args, **kwds)\n File \"/data/server-syncstorage/syncstorage/storage/sql/dbconnect.py\", line 542, in execute\n return self._exec_with_cleanup(connection, query_str, **params)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/mozsvc/metrics.py\", line 183, in timed_func\n return func(*args, **kwds)\n File \"/data/server-syncstorage/syncstorage/storage/sql/dbconnect.py\", line 576, in _exec_with_cleanup\n return connection.execute(sqltext(query_str), **params)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py\", line 729, in execute\n return meth(self, multiparams, params)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/sql/elements.py\", line 321, in _execute_on_connection\n return connection._execute_clauseelement(self, multiparams, params)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py\", line 826, in _execute_clauseelement\n compiled_sql, distilled_params\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py\", line 958, in _execute_context\n context)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py\", line 1160, in _handle_dbapi_exception\n exc_info\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/util/compat.py\", line 199, in raise_from_cause\n reraise(type(exception), exception, tb=exc_tb)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py\", line 951, in _execute_context\n context)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/default.py\", line 436, in do_execute\n cursor.execute(statement, parameters)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/cursors.py\", line 132, in execute\n result = self._query(query)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/cursors.py\", line 271, in _query\n conn.query(q)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py\", line 726, in query\n self._affected_rows = self._read_query_result(unbuffered=unbuffered)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py\", line 861, in _read_query_result\n result.read()\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py\", line 1064, in read\n first_packet = self.connection._read_packet()\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py\", line 826, in _read_packet\n packet.check_error()\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py\", line 370, in check_error\n raise_mysql_exception(self._data)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/err.py\", line 116, in raise_mysql_exception\n _check_mysql_exception(errinfo)\n File \"/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/err.py\", line 112, in _check_mysql_exception\n raise InternalError(errno, errorvalue)\nInternalError: (InternalError) (1032, u\"Can't find record in 'bso11'\") '/* [queryName=UPSERT_bso] */ INSERT INTO bso11 (payload_size,userid,modified,collection,id,payload) VALUES (%s,%s,%s,%s,%s,%s) ON DUPLICATE KEY UPDATE payload_size = VALUES(payload_size),userid = VALUES(userid),modified = VALUES(modified),collection = VALUES(collection),id = VALUES(id),payload = VALUES(payload)' (58, 11612811, 1414405992310, 6, u'global', '{\"syncID\":\"Ig9OI9iyFzwG\",\"storageVersion\":5,\"declined\":[]}')\n", "hostname": "ip-A-B-C-D", "pid": 5517, "op": "syncstorage.storage.sql", "name": "syncstorage.storage.sql"}
Flags: needinfo?(bobm)
Comment 32•11 years ago
|
||
Thanks Bob, and good to see we were able to get this figured out! So this seems to be clearly a bug on your side then, and not in TPS. I think it's best to move this bug over to your product/component then. Is there already an ETA set on your side for the mentioned fix? Or at least a raw estimate how long it will take? Thanks!
Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Component: TPS → Server: Sync
Depends on: 1057892
Product: Testing → Mozilla Services
QA Contact: hskupin
Comment 33•11 years ago
|
||
Just to add, this would also explain why we do not have this problem with the old CI system which is using a fixed user created ages ago.
Comment 34•11 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #33)
> Just to add, this would also explain why we do not have this problem with
> the old CI system which is using a fixed user created ages ago.
Btw. do we know how many people are affected by this who are currently trying to create their Firefox account? Is that somewhat a show stopper for us to reach the number of FxA users?
Updated•11 years ago
|
Whiteboard: [qa-automation-blocked] → [tps][qa-automation-blocked]
Updated•11 years ago
|
Component: Server: Sync → Operations
Comment 35•11 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #34)
> Btw. do we know how many people are affected by this who are currently
> trying to create their Firefox account? Is that somewhat a show stopper for
> us to reach the number of FxA users?
Users recover from the issue on subsequent Sync operations. We don't believe that this causing any serious impact. It happens more often that we'd prefer, but only accounts for 0.02% of our traffic over the last 24 hours. See https://bugzilla.mozilla.org/show_bug.cgi?id=1057892#c22
Comment 36•11 years ago
|
||
We (finally) deployed a workaround for this issue into production; :whimboo are you able to confirm whether it resolves the problem for TPS?
Flags: needinfo?(hskupin)
Comment 37•11 years ago
|
||
Sadly the fix didn't help us at all. The server side issues are still persistent. Here an actual excerpt from a log:
02:03:09 1418638071019 Sync.Service DEBUG Uploading meta/global: {"payload":"{\"syncID\":\"wwyV2Al_f9Vc\",\"storageVersion\":5,\"declined\":[]}","id":"global"}
02:03:09 1418638071019 Sync.Resource TRACE In _doRequest.
02:03:09 1418638071019 Sync.BrowserIDManager DEBUG _ensureValidToken already has one
02:03:09 1418638071020 Sync.Resource TRACE HTTP Header authorization: ***** (suppressed)
02:03:09 1418638071021 Sync.Resource DEBUG PUT Length: 94
02:03:09 1418638071021 Sync.Resource TRACE PUT Body: {"payload":"{\"syncID\":\"wwyV2Al_f9Vc\",\"storageVersion\":5,\"declined\":[]}","id":"global"}
02:03:09 1418638071067 Sync.Resource TRACE onStartRequest called for channel [xpconnect wrapped (nsISupports, nsIChannel, nsIRequest, nsIHttpChannel, nsIUploadChannel)].
02:03:09 1418638071067 Sync.Resource TRACE onStartRequest: PUT https://sync-186-us-west-2.sync.services.mozilla.com/1.5/16560340/storage/meta/global
02:03:09 1418638071068 Sync.Resource TRACE Channel for PUT https://sync-186-us-west-2.sync.services.mozilla.com/1.5/16560340/storage/meta/global: isSuccessCode(0)? true
02:03:09 1418638071068 Sync.Resource TRACE Channel: flags = 17024, URI = https://sync-186-us-west-2.sync.services.mozilla.com/1.5/16560340/storage/meta/global, HTTP success? false
02:03:09 1418638071068 Sync.Resource TRACE In _onComplete. Error is null.
02:03:09 1418638071068 Sync.Resource TRACE Channel: [xpconnect wrapped (nsISupports, nsIChannel, nsIRequest, nsIHttpChannel, nsIUploadChannel)]
02:03:09 1418638071068 Sync.Resource TRACE Action: PUT
02:03:09 1418638071068 Sync.Resource TRACE Status: 503
02:03:09 1418638071068 Sync.Resource TRACE Success: false
02:03:09 1418638071068 Sync.Resource DEBUG mesg: PUT fail 503 https://sync-186-us-west-2.sync.services.mozilla.com/1.5/16560340/storage/meta/global
02:03:09 1418638071068 Sync.Resource DEBUG PUT fail 503 https://sync-186-us-west-2.sync.services.mozilla.com/1.5/16560340/storage/meta/global
Flags: needinfo?(rfkelly)
Flags: needinfo?(hskupin)
Flags: needinfo?(bobm)
Comment 38•11 years ago
|
||
We are seeing the 'Caught operational db error:' type errors on this host. Herer are the relevant server side entries:
A.B.C.D [2014-12-15T10:03:09+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.044 0.015 "500"
A.B.C.D [2014-12-15T10:03:09+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.036 0.012 "500"
A.B.C.D [2014-12-15T10:03:52+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.032 0.008 "500"
A.B.C.D [2014-12-15T10:03:52+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.032 0.008 "500"
A.B.C.D [2014-12-15T10:03:56+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.038 0.013 "500"
A.B.C.D [2014-12-15T10:04:00+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.033 0.008 "500"
A.B.C.D [2014-12-15T10:04:00+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.033 0.009 "500"
A.B.C.D [2014-12-15T10:04:03+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.040 0.016 "500"
A.B.C.D [2014-12-15T10:04:03+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.034 0.009 "500"
A.B.C.D [2014-12-15T10:04:06+00:00] "PUT /1.5/16560340/storage/meta/global HTTP/1.1" 503 162 838 "Firefox/36.0a2 FxSync/1.38.0.20141214004008.desktop" 0.034 0.009 "500"
Flags: needinfo?(bobm)
Comment 39•11 years ago
|
||
Unfortunately, it looks like our proposed workaround was not enough to avoid the db error, see Bug 1057892 Comment 37.
Flags: needinfo?(rfkelly)
Updated•11 years ago
|
Whiteboard: [tps][qa-automation-blocked] → [tps][qa-automation-blocked][needs fix for bug 1057892]
Comment 40•10 years ago
|
||
As given on bug 1057892 comment 46 there were some backend changes lately for the servers, which seem to have fixed our problem here! No more failures of this type have been seen in the past couple of days.
I'm going to close this bug and may reopen in case it starts happening again.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [tps][qa-automation-blocked][needs fix for bug 1057892] → [tps]
Comment 41•10 years ago
|
||
It's still around and mostly for OS X but on a very low volume. :(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [tps] → [tps][qa-automation-blocked][needs fix for bug 1057892]
Updated•10 years ago
|
Assignee: nobody → jrgm
Updated•10 years ago
|
Whiteboard: [tps][qa-automation-blocked][needs fix for bug 1057892] → [tps][needs fix for bug 1057892]
Assignee | ||
Updated•5 years ago
|
Status: REOPENED → RESOLVED
Closed: 10 years ago → 5 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•