Bug 623386 - update production graph code base to latest + add amo_db credentials
Status: RESOLVED FIXED
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Hardware: All
OS: All
Importance: -- blocker
Target Milestone: ---
Assigned To: Dave Miller [:justdave] (justdave@bugzilla.org)
QA Contact: matthew zeier [:mrz]
Blocks: 608347 620570
Reported: 2011-01-05 16:10 PST by alice nodelman [:alice] [:anode]
Modified: 2015-03-12 08:17 PDT
CC: 11 users
justdave: needs-downtime+
justdave: needs-treeclosure?


Description alice nodelman [:alice] [:anode] 2011-01-05 16:10:03 PST
From check-ins on bug 608347.

Update code on production graph server to latest.

Edit server/graphsdb.py with credentials for amo_db provided by anode/clouserw.
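For reference, a minimal sketch of the shape such a config edit might take; all names and values below are hypothetical placeholders, not the real file contents or credentials:

# server/graphsdb.py (hypothetical sketch; actual keys and values differ)

# Existing graph-server database settings (placeholders only).
GRAPHS_DB = {
    "host": "graphs-db.example.com",
    "user": "graphs",
    "passwd": "<graphs password>",
    "db": "graphs",
}

# New amo_db entry; IT fills in the real credentials during the downtime.
AMO_DB = {
    "host": "amo-db.example.com",
    "user": "graphs_amo",
    "passwd": "<amo password>",
    "db": "amo",
}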
Comment 1 Wil Clouser [:clouserw] 2011-01-05 16:17:35 PST
> Edit server/graphsdb.py with credentials for amo_db provided by anode/clouserw.

Whoever is handling bug 620570 has those credentials.
Comment 2 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-06 00:36:59 PST
I've no idea of the context/details here, but I want to avoid another tree closure event like the one that happened before with graphserver-changes-for-AMO.

I've read https://bugzilla.mozilla.org/show_bug.cgi?id=620570#c21, but it's unclear what the fixes are and who's reviewed them. If this is impacting the production graphserver, has this change been reviewed and approved by RelEng and IT? Also, should this happen in a scheduled downtime?
Comment 3 alice nodelman [:alice] [:anode] 2011-01-06 11:40:29 PST
The changes in bug 620570 remove the dependency that caused the tree closure last time around; the reviews and testing were done by me and rhelmer.  The connection to the amo_db has been moved out of the regular code path and is only opened when amo data is being handled, so it will no longer affect regular graph server usage.

A downtime for this work would be reasonable, but I leave that up to IT, as they will be doing the work.
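A minimal sketch of the lazy-connection pattern described above, assuming a MySQLdb-style driver; the function and variable names are illustrative, not the actual graph-server code:

import MySQLdb  # the graph server's driver at the time; illustrative here

_amo_conn = None  # nothing is opened at import time

def amo_connection(host, user, passwd, db):
    # Open the amo_db connection lazily, the first time AMO data is
    # actually handled; the regular talos path never calls this, so a
    # down amo_db cannot break normal graph-server usage.
    global _amo_conn
    if _amo_conn is None:
        _amo_conn = MySQLdb.connect(host=host, user=user,
                                    passwd=passwd, db=db)
    return _amo_conn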
Comment 4 alice nodelman [:alice] [:anode] 2011-01-06 11:44:10 PST
Sorry, the changes in bug 608347 remove the amo_db dependency.
Comment 5 alice nodelman [:alice] [:anode] 2011-01-07 14:58:46 PST
Upping severity a notch to get this scheduled.
Comment 6 Shyam Mani [:fox2mike] 2011-01-10 04:24:11 PST
DB access granted and credentials sent over to jabba via email (share as needed with whoever is handling this bug).
Comment 7 alice nodelman [:alice] [:anode] 2011-01-10 13:31:22 PST
Once the code is updated, the credentials should be added to the amo_db in server/graphsdb.py.
Comment 8 alice nodelman [:alice] [:anode] 2011-01-11 13:02:31 PST
Any ETA on rollout?
Comment 9 alice nodelman [:alice] [:anode] 2011-01-12 14:53:53 PST
Upping severity again.  Would like to get this rolled out to finish the 2010 Q4 goal.
Comment 10 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-12 17:32:06 PST
The code is already there; it just needs the config, and that's what we needed the downtime for.  Last time we touched that config, it broke stuff.
Comment 11 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-12 17:39:02 PST
Sorry, I missed the subtlety about removing stuff in comment 3.  Yeah, we need a downtime for this.  You guys can schedule it; you care more about when it's working than we do. :)
Comment 12 matthew zeier [:mrz] 2011-01-12 21:05:44 PST
Over to Zandr to get scheduled.
Comment 13 Wil Clouser [:clouserw] 2011-01-18 12:29:59 PST
When can we do this?
Comment 14 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-18 13:10:06 PST
(In reply to comment #13)
> When can we do this?

How much downtime do you need? If this impacts talos, I'd guess a few hours to rerun before/after tests and look for variance? Depending on how much downtime is needed, I note there are four code freezes happening this week, so maybe next week is best?

Also, please confirm this has been tested in staging, so we don't hit a surprise closure in production like last time.

(In reply to comment #4)
> Sorry, the changes in bug 608347 remove the amo_db dependency.

I note bug 608347 remains open. Hard to tell: is this a blocker?
Comment 15 alice nodelman [:alice] [:anode] 2011-01-18 14:12:06 PST
This has tested green in staging.

This does not affect talos numbers, no talos code is altered in this change.

This will not affect the reporting of talos results, as long as the code changes to the graph server have been rolled out.  In fact, if the code changes have rolled out (as it sounds like they have), then all the code is already working and green; adding the credentials will only enable adding amo results and will in no way affect talos.

This does block rollout of bug 608347: if the credentials are not available, we cannot use the collection mode.
Comment 16 Wil Clouser [:clouserw] 2011-01-19 12:57:30 PST
Zandr, can we schedule this?
Comment 17 Zandr Milewski [:zandr] 2011-01-20 12:52:14 PST
We need to get through the current batch of releases (including Fx4b10) this week, and will revisit this on Monday AM with an eye to sending notices for a Wednesday AM EST downtime.
Comment 18 Wil Clouser [:clouserw] 2011-01-24 12:55:37 PST
(In reply to comment #17)
> We need to get through the current batch of releases (including Fx4b10) this
> week, and will revisit this on Monday AM with an eye to sending notices for a
> Wednesday AM EST downtime.

Any update?
Comment 19 Wil Clouser [:clouserw] 2011-01-25 11:07:05 PST
Are we on schedule for Wednesday then?
Comment 20 Wil Clouser [:clouserw] 2011-01-25 20:09:17 PST
I didn't see this in mrz's outage mail.  I hope it's a part of it.
Comment 21 Zandr Milewski [:zandr] 2011-01-25 20:17:44 PST
(In reply to comment #20)
> I didn't see this in mrz's outage mail.  I hope it's a part of it.

It is not.
I don't have a go-ahead from joduinn on this yet. He has concerns surrounding testing after rollout and the availability of webdev resources during the push that I can't answer. I've asked him to comment in this bug, but until he's comfortable that the changes aren't going to burn the tree again, I'm not going to schedule the window.
Comment 22 Wil Clouser [:clouserw] 2011-01-26 16:16:49 PST
joduinn, what's the schedule look like?
Comment 23 Justin Scott [:fligtar] 2011-01-28 13:31:29 PST
This was mentioned in comment #9 two weeks ago, but to reiterate: this is blocking a Q4 goal, something that we wanted to finish by the end of December. Every day that Wil asks for an update and doesn't receive a response, more Firefox users are leaving because of slow add-ons we don't know about.
Comment 24 Zandr Milewski [:zandr] 2011-01-28 13:54:22 PST
There's been a lot of discussion between IT and RelEng out of band; sorry this bug looked forgotten.

I need a few things to be comfortable rolling this out, and the availability of these things affects the schedule.

1) Did we test breaking the connection to the AMO db in staging? Could talos results still be posted?

2) What can we check after deployment to ensure the tree won't turn red hours later? (That's the failure that caused all this churn.) Again, I'd like to break the db connection and verify that results can be posted successfully before we close the downtime window.

3) When will a dev resource be available to help with issues that may come up during the push? Based on that, we can pick a downtime window. There's a preference for EST morning, but if the dev resources are PST we'll come up with a different timeslot.

I expect that the fixes mean this is a low-risk push, but we burned the tree last time, so people are very gun-shy this time around.
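For what it's worth, items 1 and 2 could be scripted along these lines in staging (a hypothetical Python 2 smoke test, matching the era's tooling; the URL, form field, and success marker are placeholders, not the real graph-server API): with graphsdb.py pointed at an unreachable amo_db host, a talos result should still post.

import urllib
import urllib2

# Placeholder staging endpoint; not the real graph-server URL.
STAGING_URL = "http://graphs-stage.example.com/server/collect.cgi"

def talos_still_posts(payload):
    # POST a canned talos result while amo_db is unreachable; success
    # here means the regular path is unaffected by a broken AMO link.
    data = urllib.urlencode({"data": payload})
    try:
        body = urllib2.urlopen(STAGING_URL, data).read()
    except urllib2.URLError:
        return False
    return "RETURN" in body  # placeholder success check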
Comment 25 Wil Clouser [:clouserw] 2011-01-31 09:29:59 PST
(In reply to comment #24)
> There's been a lot of discussion between IT and RelEng out of band; sorry this
> bug looked forgotten.
> 
> I need a few things to be comfortable rolling this out, and the availability
> of these things affects the schedule.
> 
> 1) Did we test breaking the connection to the AMO db in staging? Could talos
> results still be posted?
> 
> 2) What can we check after deployment to ensure the tree won't turn red hours
> later? (That's the failure that caused all this churn.) Again, I'd like to
> break the db connection and verify that results can be posted successfully
> before we close the downtime window.

I don't know.  Alice did all the code/testing and she's apparently on leave now.  Both sound like things we can do when we roll it out.

> 3) When will a dev resource be available to help with issues that may come up
> during the push? Based on that, we can pick a downtime window. There's a
> preference for EST morning, but if the dev resources are PST we'll come up with
> a different timeslot.

Rhelmer has volunteered to be around since he worked on some of the code.  He doesn't have a time preference.
Comment 26 Robert Helmer [:rhelmer] 2011-01-31 09:46:53 PST
(In reply to comment #25)
> (In reply to comment #24)
> > There's been a lot of discussion between IT and RelEng out of band; sorry
> > this bug looked forgotten.
> > 
> > I need a few things to be comfortable rolling this out, and the
> > availability of these things affects the schedule.
> > 
> > 1) Did we test breaking the connection to the AMO db in staging? Could
> > talos results still be posted?
> > 
> > 2) What can we check after deployment to ensure the tree won't turn red
> > hours later? (That's the failure that caused all this churn.) Again, I'd
> > like to break the db connection and verify that results can be posted
> > successfully before we close the downtime window.
> 
> I don't know.  Alice did all the code/testing and she's apparently on leave
> now.  Both sound like things we can do when we roll it out.


This was tested by Alice in staging (608347 comment 25). The intention of the patch is to print an error and continue if there are any problems sending results to the AMO db, including connecting to the db (per bug 620570 comment 20).
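In other words, the sending side is best-effort. A rough sketch of that error-and-continue behaviour, with hypothetical names (MySQLdb-style driver and table name assumed; the real logic is in the cited patches):

import MySQLdb

def send_amo_results(results, conn_params):
    # Best-effort: any failure (including a refused connection) is
    # printed and swallowed, so talos processing continues untouched.
    try:
        conn = MySQLdb.connect(**conn_params)
        try:
            cur = conn.cursor()
            for test, value in results:
                cur.execute("INSERT INTO amo_results (test, value) "
                            "VALUES (%s, %s)", (test, value))
            conn.commit()
        finally:
            conn.close()
    except Exception as e:
        print("Error sending AMO results, continuing: %s" % e)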
Comment 27 Zandr Milewski [:zandr] 2011-01-31 09:51:10 PST
(In reply to comment #26)

> This was tested by Alice in staging (608347 comment 25). The intention of the
> patch is to print an error and continue if there are any problems sending
> results to the AMO db, including connecting to the db (per bug 620570 comment
> 20).

Excellent. I'll get this scheduled for this week when rhelmer can be around.
Comment 28 Robert Helmer [:rhelmer] 2011-01-31 09:58:56 PST
(In reply to comment #2)
> I've no idea of the context/details here, but I want to avoid another tree
> closure event like the one that happened before with graphserver-changes-for-AMO.
> 
> I've read https://bugzilla.mozilla.org/show_bug.cgi?id=620570#c21, but it's
> unclear what the fixes are and who's reviewed them. If this is impacting the
> production graphserver, has this change been reviewed and approved by RelEng
> and IT? Also, should this happen in a scheduled downtime?


The relevant patches are attachment 500972 and attachment 501375; they are specifically about avoiding the problem that caused the tree closure before. More review would be great (a drive-by in the bug is fine, or I can set r? if you have someone in particular in mind).
Comment 29 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-03 08:11:49 PST
Done.
