[basket] Sharp increase in timeout errors to ExactTarget



Infrastructure & Operations Graveyard
WebOps: Engagement
2 years ago


(Reporter: pmac, Assigned: ericz)



(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2038] )

We've seen a large increase in timeouts when connecting to ExactTarget, starting on the 24th. I'm assuming (hoping) this will resolve itself once the DC move is complete, but in the interim we need to raise the client connection timeout. Please set the following in settings/local.py for basket.


I think it'd be better for it to wait for the response than to error out, and most of these are from the celery workers anyway.
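For reference, a minimal sketch of how a client-side timeout like this typically takes effect in Python. The setting name EXACTTARGET_TIMEOUT comes from this bug's original summary; the 20-second value and the `connect_with_timeout` helper are illustrative assumptions, not basket's actual code or the value requested here.

```python
import socket

# Placeholder value (assumption): not the number actually requested in this bug.
EXACTTARGET_TIMEOUT = 20  # seconds


def connect_with_timeout(host, port, timeout=EXACTTARGET_TIMEOUT):
    """Open a TCP connection, waiting at most `timeout` seconds.

    Hypothetical helper for illustration only. socket.create_connection
    raises socket.timeout (an OSError subclass) if the connect attempt
    exceeds the limit, and the same limit then applies to later
    send/recv calls on the returned socket.
    """
    return socket.create_connection((host, port), timeout=timeout)
```

A larger value trades errors for latency: callers block longer before giving up, which is usually acceptable for background workers.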

Also, if these connection issues are solvable by network config, that'd be even better. I was under the impression that outbound internet requests like these would not have to cross DCs, but it appears they still do.




2 years ago
Assignee: server-ops-webops → eziegenhorn

Comment 1

2 years ago
Timeout set.
These errors have not abated much. Has the gateway move from PHX to SCL happened yet?
(In reply to Paul [:pmac] McLanahan from comment #2)
> These errors have not abated much. Has the gateway move from PHX to SCL
> happened yet?

Yes, this happened on Friday, Oct 30th.

What exactly are you contacting on ExactTarget? I'm wondering if it's either proxy- or flow-related, hence the question.
Summary: [basket] Please set EXACTTARGET_TIMEOUT setting → [basket] Sharp increase in timeout errors to ExactTarget

Comment 5

2 years ago
Shyam - can you provide an update on this? It's not impacting users yet (we think), but New Relic shows that the error rate and timeouts have gone up significantly since the move.

Let me know if you need anything additional.
Flags: needinfo?(smani)

Comment 6

2 years ago
How can I reproduce this error for troubleshooting? Curling the URL https://webservice.s4.exacttarget.com/Service.asmx from the webheads just gives me an error page (though the actual HTTP request goes through just fine). Maybe it requires logging in first or the like?
According to my interpretation of Sentry, the errors seem to have reduced somewhat on their own over the past few days. If you were to see it, it would probably be on a request like:

curl https://basket.mozilla.org/news/user/5bbd3ac2-67f2-4990-b37d-fc5dc0631082/

(the token above is a test one of mine)

That should return some JSON based on data from ET. If ET fails, that call should give a 4XX response of some kind.
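One way to catch intermittent failures is to repeat that request and tally the outcomes. The following is a rough sketch using only the standard library; the `probe` function, its defaults, and the outcome labels are made up for illustration and are not part of basket.

```python
import socket
import time
import urllib.error
import urllib.request
from collections import Counter


def probe(url, attempts=20, timeout=10.0, opener=urllib.request.urlopen):
    """Request `url` repeatedly; return (outcome counts, per-request seconds)."""
    outcomes = Counter()
    timings = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with opener(url, timeout=timeout) as resp:
                outcomes[str(resp.status)] += 1
        except urllib.error.HTTPError as exc:
            # 4xx/5xx responses, e.g. the 4XX expected when ET fails.
            outcomes[str(exc.code)] += 1
        except OSError as exc:
            # URLError and socket.timeout are both OSError subclasses.
            reason = getattr(exc, "reason", exc)
            if isinstance(exc, socket.timeout) or isinstance(reason, socket.timeout):
                outcomes["timeout"] += 1
            else:
                outcomes["error"] += 1
        timings.append(time.monotonic() - start)
    return outcomes, timings
```

Calling `probe("https://basket.mozilla.org/news/user/5bbd3ac2-67f2-4990-b37d-fc5dc0631082/")` from a webhead or worker node and comparing the counts and the slowest timing across hosts should show whether any one node is worse than the others.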

However, I'd be interested in ping times and any odd traceroute results from the webheads and worker nodes to that webservice.s4.exacttarget.com domain.
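Since the errors in question are connection timeouts, timing the TCP handshake to port 443 may be a useful complement to ICMP ping. This is a sketch of that different technique, not a replacement for ping or traceroute; the function name and defaults are assumptions.

```python
import socket
import time


def tcp_connect_times(host, port=443, samples=5, timeout=5.0):
    """Return TCP handshake durations in ms (None for each failed attempt)."""
    results = []
    for _ in range(samples):
        start = time.monotonic()
        try:
            # The socket is closed again as soon as the handshake completes.
            with socket.create_connection((host, port), timeout=timeout):
                results.append((time.monotonic() - start) * 1000.0)
        except OSError:
            # Covers timeouts, refused connections, and DNS failures.
            results.append(None)
    return results
```

Running `tcp_connect_times("webservice.s4.exacttarget.com")` on each node gives numbers that can be compared directly; any `None` entries or outliers well above the ping RTT would point at the network path rather than the application.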

Comment 8

2 years ago
Ping times are all 19 to 20 ms to webservice.s4.exacttarget.com from the python production webheads at the moment. Example:
[python3.webapp.phx1.mozilla.com] out: PING webservice.s4.exacttarget.com ( 56(84) bytes of data.
[python3.webapp.phx1.mozilla.com] out: 64 bytes from webservice.s4.exacttarget.com ( icmp_seq=1 ttl=244 time=20.3 ms
[python3.webapp.phx1.mozilla.com] out: 64 bytes from webservice.s4.exacttarget.com ( icmp_seq=2 ttl=244 time=20.1 ms
[python3.webapp.phx1.mozilla.com] out: 64 bytes from webservice.s4.exacttarget.com ( icmp_seq=3 ttl=244 time=19.9 ms
[python3.webapp.phx1.mozilla.com] out: 64 bytes from webservice.s4.exacttarget.com ( icmp_seq=4 ttl=244 time=19.9 ms
[python3.webapp.phx1.mozilla.com] out: 64 bytes from webservice.s4.exacttarget.com ( icmp_seq=5 ttl=244 time=20.1 ms
[python3.webapp.phx1.mozilla.com] out:
[python3.webapp.phx1.mozilla.com] out: --- webservice.s4.exacttarget.com ping statistics ---
[python3.webapp.phx1.mozilla.com] out: 5 packets transmitted, 5 received, 0% packet loss, time 4024ms
[python3.webapp.phx1.mozilla.com] out: rtt min/avg/max/mdev = 19.908/20.102/20.311/0.173 ms
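
To compare these numbers across many webheads without reading each block by eye, the summary lines can be parsed. A small sketch, assuming the Linux iputils ping format shown above; `parse_ping_summary` is a made-up helper name.

```python
import re

_RTT_RE = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)
_LOSS_RE = re.compile(r"(\d+) packets transmitted, (\d+) received")


def parse_ping_summary(text):
    """Extract packet counts and RTT stats (ms) from Linux iputils ping output."""
    stats = {}
    loss = _LOSS_RE.search(text)
    if loss:
        stats["transmitted"] = int(loss.group(1))
        stats["received"] = int(loss.group(2))
    rtt = _RTT_RE.search(text)
    if rtt:
        keys = ("min", "avg", "max", "mdev")
        stats.update(zip(keys, map(float, rtt.groups())))
    return stats
```

The per-host prefix in the pasted output does not get in the way, because both patterns are searched for anywhere in the text.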

Traceroutes look ok to me as well, example:
[python3.webapp.phx1.mozilla.com] out: traceroute to webservice.s4.exacttarget.com (, 30 hops max, 60 byte packets
[python3.webapp.phx1.mozilla.com] out: 1 (  0.567 ms  0.645 ms  0.579 ms
[python3.webapp.phx1.mozilla.com] out: 2 (  2.823 ms  2.790 ms  2.581 ms
[python3.webapp.phx1.mozilla.com] out: 3 (  0.896 ms  0.972 ms  0.932 ms
[python3.webapp.phx1.mozilla.com] out: 4  xe-0-0-3.border1.scl3.mozilla.net (  1.026 ms  1.103 ms  1.122 ms
[python3.webapp.phx1.mozilla.com] out: 5  xe-1-2-2.border1.pao1.mozilla.net (  1.304 ms  1.374 ms  1.444 ms
[python3.webapp.phx1.mozilla.com] out: 6  xe-3-1-0.mpr2.pao1.us.above.net (  1.406 ms  1.507 ms  1.468 ms
[python3.webapp.phx1.mozilla.com] out: 7  ae7.cr2.sjc2.us.zip.zayo.com (  2.069 ms  1.973 ms  1.939 ms
[python3.webapp.phx1.mozilla.com] out: 8  v21.ae29.cr2.lax112.us.zip.zayo.com (  11.884 ms  11.904 ms  11.877 ms
[python3.webapp.phx1.mozilla.com] out: 9  ae4.mpr2.las1.us.zip.zayo.com (  19.587 ms  18.326 ms  19.400 ms
[python3.webapp.phx1.mozilla.com] out: 10 (  20.757 ms  20.853 ms  20.829 ms
[python3.webapp.phx1.mozilla.com] out: 11  * * *
[python3.webapp.phx1.mozilla.com] out: 12  webservice.s4.exacttarget.com (  28.398 ms  28.346 ms  20.582 ms

There are no problems running that curl repeatedly either:
[python1.webapp.phx1 ~]$ curl https://basket.mozilla.org/news/user/5bbd3ac2-67f2-4990-b37d-fc5dc0631082/
{"status": "ok", "format": "T", "newsletters": ["firefox-ios", "mozilla-and-you", "os"], "created-date": "7/20/2015 2:35:42 PM", "lang": "en-US", "confirmed": true, "country": "us", "token": "5bbd3ac2-67f2-4990-b37d-fc5dc0631082", "master": true, "email": "pmac+basket-soap-test@mozilla.com", "pending": false}
Flags: needinfo?(smani)
Thanks Eric. I'm going to mark this resolved. I'll reopen if this becomes a problem again.
Last Resolved: 2 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard