Bug 623386 (Closed): Opened 14 years ago, Closed 13 years ago

update production graph code base to latest + add amo_db credentials

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: anodelman, Assigned: justdave)

From check-ins on bug 608347.

Update code on production graph server to latest.

Edit server/graphsdb.py with credentials for amo_db provided by anode/clouserw.
Blocks: 608347
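For reference, a minimal sketch of what the amo_db addition to server/graphsdb.py could look like; the variable names, host, and placeholder values are assumptions for illustration, not the actual production credentials or the file's real layout:

# Credentials for the AMO database (provided by anode/clouserw). These are
# only read when AMO results are handled, so regular talos posting does not
# depend on this database being reachable.
amo_db_host = "amo-db.example.mozilla.org"   # placeholder host
amo_db_user = "amo_user"                     # placeholder user
amo_db_passwd = "********"                   # real password sent via email
amo_db_name = "amo_db"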
> Edit server/graphsdb.py with credentials for amo_db provided by anode/clouserw.

Whoever is handling bug 620570 has those credentials.
I've no idea of the context/details here, but I want to avoid another tree closure event like the one that happened before with the graphserver changes for AMO.

I've read https://bugzilla.mozilla.org/show_bug.cgi?id=620570#c21 but it's unclear what the fixes are and who has reviewed them. If this is impacting the production graphserver, has this change been reviewed and approved by RelEng and IT? Also, should this happen in a scheduled downtime?
The changes in bug 620570 remove the dependency that caused the tree closure last time around; the reviews and testing were done by me and rhelmer.  The connection to the amo_db has been moved out of the regular code path and is only opened when amo data is being handled, so it will no longer affect regular graph server usage.

A downtime for this work would be reasonable, but I leave that up to IT, as they will be doing the work.
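To illustrate the "only opened when amo data is being handled" behaviour described above, here is a rough sketch in the style of the graph server's Python code; the function and module attribute names are hypothetical (matching the placeholder names in the earlier snippet), not the actual patch:

import MySQLdb

import graphsdb  # server/graphsdb.py, holding the amo_db credentials

_amo_db = None

def get_amo_db():
    """Open the amo_db connection lazily, the first time AMO data arrives.

    The regular talos submission path never calls this, so an unreachable
    amo_db cannot block or slow down ordinary graph server usage.
    """
    global _amo_db
    if _amo_db is None:
        _amo_db = MySQLdb.connect(host=graphsdb.amo_db_host,
                                  user=graphsdb.amo_db_user,
                                  passwd=graphsdb.amo_db_passwd,
                                  db=graphsdb.amo_db_name)
    return _amo_db

With the connection isolated like this, a bad or missing amo_db config only affects AMO submissions, not regular talos posting.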
Sorry, the changes in bug 608347 remove the amo_db dependency.
Upping severity a notch to get this scheduled.
Severity: normal → major
Assignee: server-ops → jdow
DB access granted and access credentials sent over to jabba via email (share as needed to whoever is doing this bug).
Once the code is updated, the credentials should be added to the amo_db in server/graphsdb.py.
Assignee: jdow → justdave
Any ETA on rollout?
Upping severity again.  Would like to get this rolled out to finish the 2010 Q4 goal.
Severity: major → critical
The code is already there; it just needs the config, and that's what we need the downtime for.  Last time we touched that config it broke stuff.
Sorry, I missed the subtlety about removing stuff in comment 3.  Yeah, we need a downtime for this.  You guys can schedule it; you care more about when it's working than we do. :)
Flags: needs-treeclosure?
Flags: needs-downtime+
Over to Zandr to get scheduled.
Assignee: justdave → zandr
When can we do this?
(In reply to comment #13)
> When can we do this?

How much downtime do you need? If this impacts talos, I'd guess a few hours to rerun before/after tests and look for variance? Depending on how much downtime is needed, I note there are 4 codefreezes happening this week, so maybe next week is best?

Also, please confirm this has been tested in staging, so we don't hit a surprise closure in production like last time.

(In reply to comment #4)
> Sorry, the changes in bug 608347 remove the amo_db dependency.

I note bug 608347 remains open. Hard to tell: is this a blocker?
This has tested green in staging.

This does not affect talos numbers; no talos code is altered in this change.

This will not affect the reporting of talos results, as long as the code changes to the graph server have been rolled out.  In fact, if the code changes have rolled out (as it sounds like they have), then all the code is already working and green; adding the credentials will only allow adding AMO results and will in no way affect talos.

This does block the rollout of bug 608347, since we cannot use the collection mode if the credentials are not available.
Zandr, can we schedule this?
Severity: critical → blocker
We need to get through the current batch of releases (including Fx4b10) this week, and will revisit this on Monday AM with an eye to sending notices for a Wednesday AM EST downtime.
(In reply to comment #17)
> We need to get through the current batch of releases (including Fx4b10) this
> week, and will revisit this on Monday AM with an eye to sending notices for a
> Wednesday AM EST downtime.

Any update?
Are we on schedule for Wednesday then?
I didn't see this in mrz's outage mail.  I hope it's a part of it.
(In reply to comment #20)
> I didn't see this in mrz's outage mail.  I hope it's a part of it.

It is not.
I don't have a go-ahead from joduinn on this yet. He has concerns, which I can't answer, about testing after the rollout and the availability of webdev resources during the push. I've asked him to comment in this bug, but until he's comfortable that the changes aren't going to burn the tree again, I'm not going to schedule the window.
joduinn, what's the schedule look like?
This was mentioned in comment #9 two weeks ago, but just to reiterate: this is blocking a Q4 goal, something that we wanted to do by the end of December. Every day that Wil asks for an update and doesn't receive a response, more Firefox users are leaving because of slow add-ons we don't know about.
There's been a lot of discussion between IT and RelEng out of band, sorry this bug looked forgotten.

I need a few things to be comfortable rolling this out, and the availability of these things affects the schedule.

1) Did we test breaking the connection to the AMO db in staging? Could talos results still be posted?

2) What can we check after deployment to ensure the tree won't turn red hours later? (That is the failure that caused all this churn.) Again, I'd like to break the db connection and verify that results can be posted successfully before we close the downtime window.

3) When will a dev resource be available to help with issues that may come up during the push? Based on that, we can pick a downtime window. There's a preference for EST morning, but if the dev resources are PST we'll come up with a different timeslot.

I expect that the fixes mean this is a low-risk push, but we burned the tree last time, so people are very gunshy this time around.
(In reply to comment #24)
> There's been a lot of discussion between IT and RelEng out of band, sorry this
> bug looked forgotten.
> 
> I need a few things to be comfortable rolling this out, and availability of
> these things affect the schedule.
> 
> 1) Did we test breaking the connection to the AMO db in staging? Could talos
> results still be posted?
> 
> 2) What can we check after deployment to ensure the tree won't turn red
> hours later? (That is the failure that caused all this churn.) Again, I'd like
> to break the db connection and verify that results can be posted successfully
> before we close the downtime window.

I don't know.  Alice did all the code/testing and she's apparently on leave now.  Both sound like things we can do when we roll it out.

> 3) When will a dev resource be available to help with issues that may come up
> during the push? Based on that, we can pick a downtime window. There's a
> preference for EST morning, but if the dev resources are PST we'll come up with
> a different timeslot.

Rhelmer has volunteered to be around since he worked on some of the code.  He doesn't have a time preference.
(In reply to comment #25)
> (In reply to comment #24)
> > There's been a lot of discussion between IT and RelEng out of band, sorry this
> > bug looked forgotten.
> > 
> > I need a few things to be comfortable rolling this out, and availability of
> > these things affect the schedule.
> > 
> > 1) Did we test breaking the connection to the AMO db in staging? Could talos
> > results still be posted?
> > 
> > 2) What can we check after deployment to ensure the tree won't turn red
> > hours later? (That is the failure that caused all this churn.) Again, I'd like
> > to break the db connection and verify that results can be posted successfully
> > before we close the downtime window.
> 
> I don't know.  Alice did all the code/testing and she's apparently on leave
> now.  Both sound like things we can do when we roll it out.


This was tested by Alice in staging (608347 comment 25). The intention of the patch is to print an error and continue if there are any problems sending results to the AMO db, including connecting to the db (per bug 620570 comment 20).
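A sketch of that "print an error and continue" behaviour; the helper, table, and column names below are illustrative assumptions, not the actual patch:

import sys

import MySQLdb

def send_amo_results(connect, rows):
    """Try to record AMO results; on any amo_db problem, print the error
    and carry on so that talos result posting is never affected."""
    try:
        db = connect()  # e.g. a lazily opened amo_db connection
        cursor = db.cursor()
        for test_name, value in rows:
            cursor.execute(
                "INSERT INTO amo_results (test_name, value) VALUES (%s, %s)",
                (test_name, value))
        db.commit()
    except MySQLdb.Error as e:
        # A broken connection or failed insert must not turn the tree red
        # or reject the rest of the submission.
        sys.stderr.write("amo_db error (ignored): %s\n" % e)

With that pattern, the post-deployment check asked for in comment 24 amounts to pointing the amo_db config at an unreachable host and confirming that talos results still post.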
(In reply to comment #26)

> This was tested by Alice in staging (608347 comment 25). The intention of the
> patch is to print an error and continue if there are any problems sending
> results to the AMO db, including connecting to the db (per bug 620570 comment
> 20).

Excellent. I'll get this scheduled for this week when rhelmer can be around.
(In reply to comment #2)
> I've no idea of context/details here, but want to avoid another tree closure
> event like happened before with graphserver-changes-for-AMO.
> 
> I've read https://bugzilla.mozilla.org/show_bug.cgi?id=620570#c21 but its
> unclear what the fixes are, and who's reviewed them. If this is impacting
> production graphserver, has this change been reviewed and approved by RelEng
> and IT? Also, should this happen in a scheduled downtime?


The relevant patches are attachment 500972 and attachment 501375; they are specifically about avoiding the problem that caused the tree closure before. More review would be great (a drive-by in the bug is fine, or I can request r? if you have someone in particular in mind).
Done.
Assignee: zandr → justdave
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard
Flags: needs-treeclosure?