Closed
Bug 1179496
Opened 10 years ago
Closed 10 years ago
/lookup-user calls error out at certain times of day
Categories
(Websites :: Basket, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rfkelly, Assigned: jdavis)
Attachments
(1 file: 65.82 KB, image/png)
In the FxA/basket integration, we're seeing regular outages of the /lookup-user endpoint. You can see the HTTP statuses in our kibana dashboard at [1], but I'm attaching a screenshot for convenience.
The red in the graph is our basket proxy returning a "500 server error". I'm not sure what response it's seeing from basket - perhaps you can look in the logs on the basket side and correlate?
As far as we can tell, this happens daily at around 7am (I think the logs are in UTC?) and lasts for around 20 minutes. Only /lookup-user requests are affected, and AFAICT no /lookup-user requests succeed during such an event.
:pmac, does this correspond to any known/expected behaviour from basket? IIUC there's some background syncing process with ET that might be responsible?
[1] https://kibana.fxa.us-west-2.prod.mozaws.net/index.html#/dashboard/elasticsearch/PROD%20-%20Basket%20HTTP
Comment 1•10 years ago
> I'm not sure what response it's seeing from basket - perhaps you can look in the logs on the basket side and correlate?
Looking at our proxy code, it is just piping the response received from basket, so looks like the 500 error comes from basket.
| Reporter | ||
Comment 2•10 years ago
I wondered whether we might be e.g. getting badly-formed output and throwing a ParseError internally, but you're right, the code doesn't seem to be doing anything that might trigger that.
Comment 3•10 years ago
7am UTC is exactly midnight Pacific, which I'm almost positive (I've not checked) is the time at which some tasks run on the Exact Target site that (very) unfortunately "lock" the tables. These locks cause the API to return faults or not respond at all, which in turn causes basket to do the same for things that require an immediate data response. For subscriptions and the like we retry the tasks, but for things like "lookup-user" there's no good way to retry. We need to re-architect the way we're using Exact Target, and that plan is forming, but for now we're kinda stuck with this.
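A minimal sketch of the retry distinction described above. This is not basket's actual code: `BackendUnavailable`, `call_with_retry`, `lookup_user`, and the error body are hypothetical names chosen for illustration. Queued work such as a subscription can be retried with backoff until the backend answers again, while a synchronous lookup has a caller waiting on the data, so it can only report the failure.

```python
import time


class BackendUnavailable(Exception):
    """Hypothetical error raised when the upstream API faults or times out."""


def call_with_retry(task, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run a queued task, retrying with exponential backoff on backend faults."""
    for attempt in range(attempts):
        try:
            return task()
        except BackendUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure
            sleep(base_delay * 2 ** attempt)


def lookup_user(fetch):
    """Synchronous lookup: the caller needs data now, so fail immediately."""
    try:
        return 200, fetch()
    except BackendUnavailable:
        # Assumed response shape, for illustration only.
        return 500, {"status": "error", "desc": "backend unavailable"}
```

With a 20-minute lock window, a background task that keeps retrying will eventually succeed, whereas every lookup made inside the window returns an error, which matches the pattern reported in this bug.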
I've NI'd Jess for more insight into the ET part.
Flags: needinfo?(jdavis)
| Reporter | ||
Comment 4•10 years ago
Thanks :pmac, even "we're stuck with this for a while" is good context for us to have. We've got work to do on the FxA side to make the error state less confusing for users, but the good news is that AFAICT the subscribe requests continue to work both from our side and yours.
| Assignee | ||
Comment 5•10 years ago
Pmac is right. 7am UTC = Midnight Pacific = when our filters run and tables lock :/
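The time-zone arithmetic can be checked with a short sketch using Python's standard `zoneinfo` module. The two dates are arbitrary examples picked for illustration, and whether the nightly jobs follow Pacific wall-clock time or a fixed UTC hour is not stated in this bug.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

utc = ZoneInfo("UTC")
pacific = ZoneInfo("America/Los_Angeles")

# 07:00 UTC on an arbitrary summer date (daylight saving time in effect).
summer_utc = datetime(2015, 7, 15, 7, 0, tzinfo=utc)
# The same UTC hour on an arbitrary winter date (standard time in effect).
winter_utc = datetime(2015, 12, 15, 7, 0, tzinfo=utc)

summer_local = summer_utc.astimezone(pacific)
winter_local = winter_utc.astimezone(pacific)

print(summer_local.strftime("%Y-%m-%d %H:%M %Z"))  # 2015-07-15 00:00 PDT
print(winter_local.strftime("%Y-%m-%d %H:%M %Z"))  # 2015-12-14 23:00 PST
```

So 7am UTC lines up with midnight Pacific while daylight saving time is in effect; under standard time the same UTC hour is 11pm Pacific.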
It's been a problem lurking around in the back of my mind, but we hadn't had a use case, until now, to put the pressure on to solve.
Will circle round with our ET team to figure out a way to finagle our account and avoid these timeouts.
Current Ideas (pros and cons to each that we have to solve for):
* Change filters to no lock (if that is possible via filter settings; I don't think it is, which is why we haven't changed it yet. SQL queries can run no lock, but that doesn't help with end-users who use the system.)
* Change the filters to not run on a nightly basis, and only manually run or run right before an email send.
Leaving the NI request to me so I keep on it.
| Assignee | ||
Comment 6•10 years ago
Just a quick update: We're working on optimizations on our end (spacing out queries and filters that run, eliminating some un-needed processes, etc.). Some tests are starting to run this week, and we'll continue to tweak until these timeouts stop. I'll keep you posted.
| Assignee | ||
Updated•10 years ago
Assignee: nobody → jdavis
Flags: needinfo?(jdavis)
| Assignee | ||
Comment 7•10 years ago
Update: Yesterday evening we paused all the programs that run in ExactTarget nightly except for the 1 vital one that supports our double opt-in subscription process.
This morning, it does not appear that there were any timeouts in the system (per my New Relic alerts).
I am going to leave the programs paused for a few days to see if/how it continues to keep timeouts from happening.
Pmac, rkelly - can you confirm the behavior you see on your end? (I know you're out at the moment - can you look on Monday 10 Aug?)
Flags: needinfo?(rfkelly)
Flags: needinfo?(pmac)
Comment 8•10 years ago
> Pmac, rkelly - can you confirm the behavior you see on your end? (I know you're out at the moment - can you look on Monday 10 Aug?)
I looked at metrics for fxa-content-server, and last night it had no 5xx-level errors at the time they are usually seen each night.
Thanks, :jdavis!
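A check like the one above could be scripted roughly as follows. This is only a sketch: the record format (a timezone-aware timestamp plus an HTTP status), the function name `count_5xx_in_window`, and the 07:00 to 07:30 UTC window are assumptions for illustration, not the actual fxa-content-server metrics pipeline.

```python
from datetime import datetime, time, timezone


def count_5xx_in_window(records, start=time(7, 0), end=time(7, 30)):
    """Count 5xx responses whose UTC time of day falls in [start, end)."""
    hits = 0
    for timestamp, status in records:
        t = timestamp.astimezone(timezone.utc).time()
        if start <= t < end and 500 <= status <= 599:
            hits += 1
    return hits


# Made-up sample records: (timezone-aware timestamp, HTTP status).
sample = [
    (datetime(2015, 8, 6, 6, 59, tzinfo=timezone.utc), 200),
    (datetime(2015, 8, 6, 7, 5, tzinfo=timezone.utc), 500),
    (datetime(2015, 8, 6, 7, 10, tzinfo=timezone.utc), 200),
    (datetime(2015, 8, 6, 7, 40, tzinfo=timezone.utc), 500),
]
print(count_5xx_in_window(sample))  # 1
```

A count of zero for the nightly window, on a night when requests were actually served, is the signal that the paused programs were the cause.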
Flags: needinfo?(rfkelly)
| Assignee | ||
Comment 9•10 years ago
NICE! Thanks :jrgm!
I'm going to leave it as is through the weekend, check back in early next week, and then slowly figure out how to rearchitect our ET system to avoid timeouts while making sure the processes we need to run can happen.
Comment 10•10 years ago
This is great! Thanks Jess. Looking forward to ridding ourselves of this for good :)
Flags: needinfo?(pmac)
| Assignee | ||
Comment 11•10 years ago
Looks like we've identified the root cause of this and have currently band-aided it. I haven't seen any New Relic alerts come through the past week.
Please reopen if you start seeing timeouts again as we try to pick back up some of our filters and email launches.
Also - per #c9 - I'm working on spec'ing out a Q4 plan to solve/avoid these problems in the long run.
Thanks all!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED