Closed Bug 1179496 Opened 10 years ago Closed 10 years ago

/lookup-user calls error out at certain times of day

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: rfkelly, Assigned: jdavis)

Details

Attachments

(1 file)

Screen Shot 2015-07-01 at 14.51.48.png 10 years ago Ryan Kelly [:rfkelly] 65.82 KB, image/png		Details

Ryan Kelly [:rfkelly]

Reporter

Description

•

10 years ago

Attached image Screen Shot 2015-07-01 at 14.51.48.png — Details

In the FxA/basket integration, we're seeing regular outages of the /lookup-user endpoint. You can see the HTTP statuses in our kibana dashboard at [1], but I'm attaching a screenshot for convenience. The red in the graph is our basket proxy returning a "500 server error". I'm not sure what response it's seeing from basket - perhaps you can look in the logs on the basket side and correlate?

As far as we can tell, this happens daily at around 7am (I think the logs are in UTC?) and lasts for about 20 minutes. Only /lookup-user requests are affected, and AFAICT no /lookup-user requests succeed during such an event.

:pmac, does this correspond to any known/expected behaviour from basket? IIUC there's some background syncing process with ET that might be responsible?

[1] https://kibana.fxa.us-west-2.prod.mozaws.net/index.html#/dashboard/elasticsearch/PROD%20-%20Basket%20HTTP

John Morrison [:jrgm]

Comment 1

•

10 years ago

> I'm not sure what response it's seeing from basket - perhaps you can look in the logs on the basket side and correlate?

Looking at our proxy code, it is just piping the response received from basket, so it looks like the 500 error comes from basket.

Ryan Kelly [:rfkelly]

Reporter

Comment 2

•

10 years ago

I wondered whether we might be e.g. getting badly-formed output and throwing a ParseError internally, but you're right, the code doesn't seem to be doing anything that might trigger that.

Paul [:pmac] McLanahan

Comment 3

•

10 years ago

7am UTC is exactly midnight Pacific, which I'm almost positive (I've not checked) is the time at which some tasks run on the Exact Target site that (very) unfortunately "lock" the tables. While they run, the API returns faults or does not respond at all, which causes basket to do the same for anything that requires an immediate data response. For subscriptions and the like we retry the tasks, but for things like "lookup-user" there's no good way to retry. We need to re-architect the way we're using Exact Target, and that plan is forming, but for now we're kinda stuck with this. I've NI'd Jess for more insight into the ET part.
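
The distinction drawn above, between deferred tasks that can be retried and synchronous lookups that cannot, can be sketched roughly as follows. This is only an illustration and not basket's actual code: `UpstreamUnavailable`, `retry_with_backoff`, `lookup_once`, and `make_flaky` are hypothetical names, and the retry count and delays are arbitrary assumptions.

```python
import time


class UpstreamUnavailable(Exception):
    """Hypothetical error: the upstream API faulted or timed out."""


def retry_with_backoff(task, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Deferred work (e.g. a subscription): retry, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return task()
        except UpstreamUnavailable:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            sleep(base_delay * 2 ** attempt)


def lookup_once(task):
    """Synchronous lookup: a client is waiting, so fail fast with an error status."""
    try:
        return {"status": 200, "body": task()}
    except UpstreamUnavailable:
        return {"status": 500, "body": None}


def make_flaky(failures):
    """Build a fake upstream call that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise UpstreamUnavailable("tables locked")
        return "ok"

    return task


# Record the delays instead of actually sleeping, to keep the demo instant.
delays = []
deferred_result = retry_with_backoff(make_flaky(2), sleep=delays.append)
lookup_result = lookup_once(make_flaky(1))
print(deferred_result, delays, lookup_result["status"])
```

In this sketch the deferred task eventually succeeds once the simulated lock clears, while the lookup surfaces a 500 immediately, which matches the symptom reported in this bug.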

Flags: needinfo?(jdavis)

Ryan Kelly [:rfkelly]

Reporter

Comment 4

•

10 years ago

Thanks :pmac, even "we're stuck with this for a while" is good context for us to have. We've got work to do on the FxA side to make the error state less confusing for users, but the good news is that AFAICT the subscribe requests continue to work both from our side and yours.

Jessilyn Davis

Assignee

Comment 5

•

10 years ago

Pmac is right. 7am UTC = Midnight Pacific = when our filters run and tables lock :/ It's been a problem lurking in the back of my mind, but until now we hadn't had a use case to put the pressure on to solve it. I'll circle round with our ET team to figure out a way to finagle our account and avoid these timeouts.

Current ideas (each has pros and cons that we have to solve for):

* Change the filters to no lock. (I don't think this is possible via filter settings, which is why we haven't changed it yet. SQL queries can run no lock, but that doesn't help with end-users who use the system.)
* Change the filters to not run on a nightly basis, and only run them manually or right before an email send.

Leaving the NI request to me so I keep on it.
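
The "7am UTC = Midnight Pacific" equivalence can be checked with a few lines of Python. The sketch below uses fixed offsets (PDT = UTC-7, PST = UTC-8) rather than a timezone database, and the dates are just examples. Note that the equivalence holds during daylight saving time, which was in effect when this bug was filed; if the nightly jobs are scheduled in Pacific local time (an assumption, not something confirmed in this bug), the window would move to 08:00 UTC during standard time.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for Pacific time: daylight (summer) and standard (winter).
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")


def to_utc(local_dt):
    """Convert a timezone-aware local datetime to UTC."""
    return local_dt.astimezone(timezone.utc)


# Midnight Pacific on a summer date and on a winter date (example dates).
summer_utc = to_utc(datetime(2015, 7, 1, 0, 0, tzinfo=PDT))
winter_utc = to_utc(datetime(2015, 12, 1, 0, 0, tzinfo=PST))

print(summer_utc.strftime("%H:%M UTC"))  # 07:00 UTC
print(winter_utc.strftime("%H:%M UTC"))  # 08:00 UTC
```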

Jessilyn Davis

Assignee

Comment 6

•

10 years ago

Just a quick update: We're working on optimizations on our end (spacing out queries and filters that run, eliminating some un-needed processes, etc.). Some tests are starting to run this week, and we'll continue to tweak until these timeouts stop. I'll keep you posted.

Jessilyn Davis

Assignee

Updated

•

10 years ago

Assignee: nobody → jdavis

Flags: needinfo?(jdavis)

Jessilyn Davis

Assignee

Comment 7

•

10 years ago

Update: Yesterday evening we paused all the programs that run in ExactTarget nightly except for the 1 vital one that supports our double opt-in subscription process. This morning, it does not appear that there were any timeouts in the system (per my New Relic alerts). I am going to leave the programs paused for a few days to see if/how it continues to keep timeouts from happening. Pmac, rkelly - can you confirm the behavior you see on your end? (I know you're out at the moment - can you look on Monday 10 Aug?)

Flags: needinfo?(rfkelly)

Flags: needinfo?(pmac)

John Morrison [:jrgm]

Comment 8

•

10 years ago

> Pmac, rkelly - can you confirm the behavior you see on your end? (I know you're out at the moment - can you look on Monday 10 Aug?)

I looked at metrics for fxa-content-server and last night, it had no 5xx level errors at the time they are usually seen each night. Thanks, :jdavis!

Flags: needinfo?(rfkelly)

Jessilyn Davis

Assignee

Comment 9

•

10 years ago

NICE! Thanks :jrgm! I'm going to leave it as is through the weekend, check back in early next week, and then slowly figure out how to rearchitect our ET system to avoid timeouts while making sure the processes we need to run can happen.

Paul [:pmac] McLanahan

Comment 10

•

10 years ago

This is great! Thanks Jess. Looking forward to ridding ourselves of this for good :)

Flags: needinfo?(pmac)

Jessilyn Davis

Assignee

Comment 11

•

10 years ago

Looks like we've identified the root cause of this and have currently band-aided it. I haven't seen any New Relic alerts come through the past week. Please reopen if you start seeing timeouts again as we try to pick back up some of our filters and email launches. Also - per #c9 - I'm working on spec'ing out a Q4 plan to solve/avoid these problems in the long run. Thanks all!

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED


Categories

(Websites :: Basket, defect)


Security

(public)
