Open Bug 1294270 Opened 6 years ago Updated 5 years ago

Long running Syncs fail with NS_ERROR_NET_PARTIAL_TRANSFER, causing each Sync to try and sync every item.

Categories: Firefox :: Sync, defect, P3
Version: 51 Branch
People: Reporter: wtds.trabalho; Assignee: Unassigned
Attachments: (3 files)

Attached file: logs.zip (Sync logs)
User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
Build ID: 20160810030202

Steps to reproduce:

The browser does not respond and I have to wait for it to return:
- Browse the internet on an unstable wireless or mobile network (common everywhere in Brazil)
- Because Sync runs periodically, the problem occurs periodically

I think a network problem, or something like it, makes the browser wait, and
because of this the browser does not respond to my clicks and requests.

This file:
      view-source:resource://devtools/shared/webconsole/network-monitor.js

This line:
      return this._wrappedNotificationCallbacks.getInterface(iid);

The Sync error and other errors are in the attachment.

Sorry for my poor English, or if this is a duplicate.
I am only trying to fix the constant interruptions while browsing.



Actual results:

Constant problems and slowdowns, freezes and crashes.
I tried reinstalling and cleaning profiles on different operating systems.

Help me find the origin of the problems.




Expected results:

No instability when browsing the internet.

Thanks... 
I love Firefox!
See the many errors.
See the browser not working.
Component: Untriaged → Sync
Thanks for the report.

The logs show that bookmarks and passwords are failing with NS_ERROR_NET_PARTIAL_TRANSFER. It appears those engines never actually succeeded, so each Sync causes those engines to start from scratch. A combination of very many bookmarks/passwords (as these logs have), a slow network, and an overloaded machine probably all contribute.

(I'm not sure why it is janky, and that is likely to be better in Firefox 48, but the key problem is that these engines do a full Sync each time.)

I can reproduce this locally by writing a script that creates ~10k passwords and syncing. Each and every Sync runs for ~10 minutes before failing with:

> 1470899256200	Sync.ErrorHandler	DEBUG	passwords failed: Error: NS_ERROR_NET_PARTIAL_TRANSFER (resource://services-sync/resource.js:390:19) JS Stack trace: waitForSyncCallback@async.js:98:7 < Res__request@resource.js:383:14 < Res_get@resource.js:410:12 < SyncEngine.prototype._processIncoming@engines.js:1140:18 < SyncEngine.prototype._sync@engines.js:1569:7 < WrappedNotify@util.js:165:21 < Engine.prototype.sync@engines.js:675:5 < _syncEngine@enginesync.js:213:7 < sync@enginesync.js:163:15 < onNotify@service.js:1284:7 < WrappedNotify@util.js:165:21 < WrappedLock@util.js:120:16 < _lockedSync@service.js:1274:12 < sync/<@service.js:1266:14 < WrappedCatch@util.js:94:16 < sync@service.js:1254:5 < Res_get@resource.js:410:12 < SyncEngine.prototype._processIncoming@engines.js:1140:18 < SyncEngine.prototype._sync@engines.js:1569:7 < WrappedNotify@util.js:165:21 < Engine.prototype.sync@engines.js:675:5 < _syncEngine@enginesync.js:213:7 < sync@enginesync.js:163:15 < onNotify@service.js:1284:7 < WrappedNotify@util.js:165:21 < WrappedLock@util.js:120:16 < _lockedSync@service.js:1274:12 < sync/<@service.js:1266:14 < WrappedCatch@util.js:94:16 < sync@service.js:1254:5
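The actual reproduction script from this comment is not attached. Below is a minimal, hypothetical sketch of the kind of bulk-login generator it describes: it builds ~10k dummy login records whose field names mirror Firefox's nsILoginInfo, but as plain objects so it runs standalone in Node. In a real reproduction these would be inserted through Firefox's login manager before triggering a Sync.

```javascript
// Hypothetical sketch (not the original script): generate ~10k dummy
// login records shaped like nsILoginInfo entries. Feeding this many
// saved passwords into a profile and syncing reproduces the long-running
// Sync described above.
function makeDummyLogins(count) {
  const logins = [];
  for (let i = 0; i < count; i++) {
    logins.push({
      hostname: `https://example${i}.com`,      // origin of the saved login
      formSubmitURL: `https://example${i}.com`, // form action origin
      username: `user${i}`,
      password: `password-${i}`,
    });
  }
  return logins;
}

const logins = makeDummyLogins(10000);
console.log(logins.length); // 10000
```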

I *think* this is due to the server closing the connection, but I'm not sure. Sync itself specifies a timeout of 5 minutes for requests, but I don't think we are hitting that, and bumping that to 20 mins doesn't help.

Bob, is there some timeout imposed by the server, and if so, can we increase it significantly?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(bobm)
Summary: Browser (Firefox,Nightly,Developer) not respond when this error appears: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIInterfaceRequestor.getInterface]network-monitor.js:329 → Long running Syncs fail with NS_ERROR_NET_PARTIAL_TRANSFER, causing each Sync to try and sync every item.
See Also: → 1246840
(In reply to Mark Hammond [:markh] from comment #3)
  
> I can reproduce this locally by writing a script that creates ~10k passwords
> and syncing. Each and every Sync runs for ~10 minutes before failing with:

This is probably running into the 10k keepalive_requests limit rather than the 65-second keepalive_timeout.  Either way, it should log a relevant error, and I'd be interested in exactly which one we're hitting here (499 client timeout, 504 gateway timeout, or whatever).  We should revisit these settings in any case.

The keepalive_timeout is tuned to close bad connections more quickly and allow for more genuinely active concurrent traffic.  I'm okay with revisiting this; we might consider at least matching the TCP maximum segment lifetime of two minutes. Does the Sync client recognize the “Keep-Alive: timeout=time” header field, and if so, what does it do with it?  See: http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_timeout

We should consider disabling keepalive_requests or setting it to something appropriately large. See: http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests
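The two adjustments discussed above could look roughly like this nginx http-block fragment. This is a sketch, not the production configuration: the values are illustrative assumptions, and the optional second argument to keepalive_timeout is what emits the "Keep-Alive: timeout=..." header mentioned above.

```nginx
http {
    # Keep idle keep-alive connections open for 120s (matching the TCP
    # maximum segment lifetime of two minutes) and advertise that limit
    # to clients via a "Keep-Alive: timeout=120" response header.
    keepalive_timeout 120s 120s;

    # Allow far more requests per keep-alive connection than the 10k
    # limit discussed above. Illustrative value.
    keepalive_requests 100000;
}
```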

> I *think* this is due to the server closing the connection, but I'm not
> sure. Sync itself specifies a timeout of 5 minutes for requests, but I don't
> think we are hitting that, and bumping that to 20 mins doesn't help.

Besides the keep_alive timeout the only other timeout that I'm aware of is the upstream_proxy timeout, which we have currently set to 575 seconds (a hold over from Sync 1.5).  If we're actually logging upstream 504s I will revisit this to something larger, but I'm more inclined to lower than raise it.
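For reference, in the stock nginx proxy module the ~575-second upstream timeout described above would typically live in directives like these. Which directive the Sync deployment actually uses (and the upstream name) is an assumption here, not confirmed by this bug.

```nginx
location / {
    proxy_pass http://sync_backend;  # hypothetical upstream name

    # The 575s upstream timeout mentioned above (a Sync 1.1-era
    # holdover): how long nginx waits on the backend when reading
    # or sending before returning a 504 upstream.
    proxy_read_timeout 575s;
    proxy_send_timeout 575s;
}
```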
 
> Bob, is there some timeout imposed by the server, and if so, can we increase
> is significantly?

To dig into these specific cases I need Sync logs, or more precisely I need to know which connections to inspect.  Luckily, now that we have easy sessionization, associating network errors with these connections will be easy.

:markh is the 10k password script something :kthiessen could use to reproduce this?
Flags: needinfo?(bobm)
(In reply to Bob Micheletto [:bobm] from comment #4)

> Besides the keep_alive timeout the only other timeout that I'm aware of is
> the upstream_proxy timeout, which we have currently set to 575 seconds (a
> hold over from Sync 1.5). 

I mean a hold over from Sync 1.1.
(In reply to Bob Micheletto [:bobm] from comment #4)
 
> To dig into these specific cases I need Sync logs, or more precisely I need
> to know which connections to inspect.  Luckily now that we have easy
> sessionization associating network errors with these will be easy.

Just noticed the logs are attached!  (Looking into it now.)
(In reply to Bob Micheletto [:bobm] from comment #6)

> Just noticed the logs are attached!  (Looking into it now.)

Looked into the connections from the log for the 9th and 10th, and did not find any errors associated with the userid or IP address.  Let's reproduce this in stage.
This problem has been successfully reproduced in Stage (thanks :markh) and traced to the client reaching the default Nginx send_timeout setting.  Increasing that value allowed the client to successfully sync.  Errors that correspond to this issue are found in the Nginx error log (at the info log level) in the following format:

client timed out (110: Connection timed out) while sending to client, client:

Logging a bug to investigate a server side remediation for this issue.
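The remediation traced above, raising nginx's send_timeout from its 60-second default, would look roughly like the fragment below. The value is illustrative, not the one deployed; the comment summarizes the failure mode this bug describes.

```nginx
server {
    # send_timeout defaults to 60s: if the client does not read anything
    # for that long while nginx is sending the response body, nginx
    # closes the connection, which the Sync client surfaces as
    # NS_ERROR_NET_PARTIAL_TRANSFER mid-download. Raising it lets a slow
    # client finish a large bookmarks/passwords GET. Illustrative value.
    send_timeout 600s;
}
```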
Depends on: 1294595
Priority: -- → P3
Duplicate of this bug: 1296930