Closed Bug 1220585 Opened 9 years ago Closed 9 years ago

[snippets] AVG Response time doubled

Categories

(Infrastructure & Operations Graveyard :: WebOps: Engagement, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: giorgos, Unassigned)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2077] )

NewRelic reports that average response time for snippets.mozilla.com-python doubled from ~25ms to ~50ms on Oct 28th.

https://rpm.newrelic.com/accounts/263620/applications/2904874

Snippet editors report a significant performance degradation.

This is probably related to the PHX1 exit. Is this something we can fix with more machine power, or is it another bottleneck?
Bumping importance to major since this reduces our ability to work with the service.
Severity: normal → major
Change Request: --- → emergency
Giorgos,

We know what's causing this and are putting in an emergency change request (because of the change freeze) to remedy it.

Thanks!
It looks like one of the virtual IPs that should have been moved over during the migration got missed. =\ I've switched that IP over to the new load balancing cluster, and the response times in New Relic look more like the ones from a couple of weeks ago.

@giorgos: Can you verify that the snippets editors are able to work with the service again?
We've backed this out because it failed after 10-15 mins of being fine. I've spoken to Ben Sternthal and we'll revisit this after the change freeze is over on the 11th Nov.
Change Request: emergency → not successful
Can we call November 11th the delivery date for this? It makes it very hard to edit and deploy snippets, and we're in the middle of a campaign cycle.
Folks, we need some eyes on this. Can you give us an ETA?
Flags: needinfo?(smani)
(In reply to Giorgos Logiotatidis [:giorgos] from comment #7)
> Folks, we need some eyes on this. Can you give us an ETA?

We're working on this with netops, should be done tomorrow.
Flags: needinfo?(smani)
James,

These are the IPs that are moving:


[shyam@snippetsadm1.private.phx1 ~]$ host snippets-rw-zeus.phx1.mozilla.com
snippets-rw-zeus.phx1.mozilla.com has address 10.8.70.90

[shyam@snippetsadm1.private.phx1 ~]$ host snippets-ro-zeus.phx1.mozilla.com
snippets-ro-zeus.phx1.mozilla.com has address 10.8.70.99
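The two lookups above could also be scripted so they can be repeated before and after the move. This is only a minimal sketch: the hostnames and addresses are copied from the `host` output, while `EXPECTED_VIPS` and `check_vips` are hypothetical names introduced here, and the thread does not say whether the addresses themselves change when the VIPs move.

```python
import socket
from typing import Callable, Dict, Optional, Tuple

# Hostnames and addresses copied from the `host` output above.
EXPECTED_VIPS: Dict[str, str] = {
    "snippets-rw-zeus.phx1.mozilla.com": "10.8.70.90",
    "snippets-ro-zeus.phx1.mozilla.com": "10.8.70.99",
}


def check_vips(
    expected: Dict[str, str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {hostname: (expected_ip, resolved_ip_or_None)} for every mismatch."""
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, want in expected.items():
        try:
            got: Optional[str] = resolve(name)
        except OSError:
            # socket.gaierror is a subclass of OSError: the name did not resolve.
            got = None
        if got != want:
            mismatches[name] = (want, got)
    return mismatches
```

Calling `check_vips(EXPECTED_VIPS)` from a host that can see the internal DNS returns an empty dict when both names still resolve to the recorded addresses. The `resolve` parameter exists only so the check can be exercised without network access.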
Flags: needinfo?(jbarnell)
We'll do this at 1100 PST tomorrow. If this goes south, we will need some time to see why and therefore the site might be offline for a bit while we do that digging around. 

Thanks!
Flags: needinfo?(jbarnell)
(In reply to Shyam Mani [:fox2mike] from comment #10)
> We'll do this at 1100 PST tomorrow. If this goes south, we will need some
> time to see why and therefore the site might be offline for a bit while we
> do that digging around. 
> 
> Thanks!
Hi Shyam!

Can you elaborate on which elements of the site will be offline?

Is it just the snippets admin? Or is about:home also at risk?
Flags: needinfo?(smani)
(In reply to Cory Price [:ckprice] from comment #11)
> Can you elaborate on which elements of the site will be offline?

Sorry, "which elements of the site _may_ be offline"
(In reply to Cory Price [:ckprice] from comment #11)
> (In reply to Shyam Mani [:fox2mike] from comment #10)
> > We'll do this at 1100 PST tomorrow. If this goes south, we will need some
> > time to see why and therefore the site might be offline for a bit while we
> > do that digging around. 
> > 
> > Thanks!
> Hi Shyam!
> 
> Can you elaborate on which elements of the site will be offline?
> 
> Is it just the snippets admin? Or is about:home also at risk?

If the snippets servers go down, users do not get a degraded experience, i.e. about:home works as expected. The only difference is that some users (the ones who do not hit the cache) will not get new snippets while the downtime lasts, which will affect the snippet views and metrics.
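The behaviour described above is a form of graceful degradation. A generic sketch of that pattern follows; it is not the actual about:home or snippets-service code, the thread does not say where the cache lives, and `load_snippets`, `fetch` and `cached` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Optional


def load_snippets(
    fetch: Callable[[], List[str]],
    cached: Optional[List[str]],
) -> List[str]:
    """Return fresh snippets if the service answers, else fall back to the cache."""
    try:
        return fetch()
    except OSError:
        # Service unreachable: callers with a cached copy keep showing it,
        # callers without one get an empty list. Either way the page that
        # embeds the snippets can still render.
        return cached if cached is not None else []
```

In this sketch a failed fetch never raises to the caller, which mirrors the point that a snippets outage costs snippet views and metrics rather than breaking the page.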
We have updated the VIPs to be hosted from neo-phx1. Presuming they remain in good standing, we're done here.
Flags: needinfo?(smani)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
I can verify that the response time dropped back to ~20ms according to New Relic. Thanks folks!
Status: RESOLVED → VERIFIED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard