Bug 623386 (Closed): Opened 14 years ago, Closed 13 years ago

update production graph code base to latest + add amo_db credentials

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: anodelman, Assigned: justdave)

From check-ins on bug 608347.

Update code on production graph server to latest.

Edit server/graphsdb.py with credentials for amo_db provided by anode/clouserw.
Blocks: 608347
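For reference, a minimal sketch of what the amo_db addition to server/graphsdb.py could look like; the variable names, host, and placeholder values are assumptions for illustration, not the actual production credentials or the file's real layout:

# Credentials for the AMO database (provided by anode/clouserw). These are
# only read when AMO results are handled, so regular talos posting does not
# depend on this database being reachable.
amo_db_host = "amo-db.example.mozilla.org"   # placeholder host
amo_db_user = "amo_user"                     # placeholder user
amo_db_passwd = "********"                   # real password sent via email
amo_db_name = "amo_db"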
> Edit server/graphsdb.py with credentials for amo_db provided by anode/clouserw.

Whoever is handling bug 620570 has those credentials.
I've no idea of the context/details here, but I want to avoid another tree closure event like the one that happened before with the graphserver changes for AMO.

I've read https://bugzilla.mozilla.org/show_bug.cgi?id=620570#c21 but it's unclear what the fixes are and who has reviewed them. If this is impacting the production graphserver, has this change been reviewed and approved by RelEng and IT? Also, should this happen in a scheduled downtime?
The changes in bug 620570 remove the dependency that caused the tree closure last time around; the reviews and testing were done by me and rhelmer.  The connection to the amo_db has been moved out of the regular code path and is only opened when amo data is being handled, so it will no longer affect regular graph server usage.

A downtime for this work would be reasonable, but I leave that up to IT, as they will be doing the work.
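To illustrate the "only opened when amo data is being handled" behaviour described above, here is a rough sketch in the style of the graph server's Python code; the function and module attribute names are hypothetical (matching the placeholder names in the earlier snippet), not the actual patch:

import MySQLdb

import graphsdb  # server/graphsdb.py, holding the amo_db credentials

_amo_db = None

def get_amo_db():
    """Open the amo_db connection lazily, the first time AMO data arrives.

    The regular talos submission path never calls this, so an unreachable
    amo_db cannot block or slow down ordinary graph server usage.
    """
    global _amo_db
    if _amo_db is None:
        _amo_db = MySQLdb.connect(host=graphsdb.amo_db_host,
                                  user=graphsdb.amo_db_user,
                                  passwd=graphsdb.amo_db_passwd,
                                  db=graphsdb.amo_db_name)
    return _amo_db

With the connection isolated like this, a bad or missing amo_db config only affects AMO submissions, not regular talos posting.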
Sorry, the changes in bug 608347 remove the amo_db dependency.
Upping severity a notch to get this scheduled.
Severity: normal → major
Assignee: server-ops → jdow
DB access granted and access credentials sent over to jabba via email (share as needed to whoever is doing this bug).
Once the code is updated, the credentials should be added to the amo_db in server/graphsdb.py.
Assignee: jdow → justdave
Any ETA on rollout?
Upping severity again.  Would like to get this rolled out to finish the 2010 Q4 goal.
Severity: major → critical
The code is already there; it just needs the config, and that's what we need the downtime for.  Last time we touched that config it broke stuff.
Sorry, I missed the subtlety about removing stuff in comment 3.  Yeah, we need a downtime for this.  You guys can schedule it; you care more about when it's working than we do. :)
Flags: needs-treeclosure?
Flags: needs-downtime+
Over to Zandr to get scheduled.
Assignee: justdave → zandr
When can we do this?
(In reply to comment #13)
> When can we do this?

How much downtime do you need? If this impacts talos, I'd guess a few hours to rerun before/after tests and look for variance? Depending on how much downtime is needed, I note there are 4 codefreezes happening this week, so maybe next week is best?

Also, please confirm this has been tested in staging, so we don't hit a surprise closure in production like last time.

(In reply to comment #4)
> Sorry, the changes in bug 608347 remove the amo_db dependency.

I note bug 608347 remains open. Hard to tell: is this a blocker?
This has tested green in staging.

This does not affect talos numbers; no talos code is altered in this change.

This will not affect the reporting of talos results, as long as the code changes to the graph server have been rolled out.  In fact, if the code changes have rolled out (as it sounds like they have), then all the code is already working and green; adding the credentials will only allow adding AMO results and will in no way affect talos.

This does block the rollout of bug 608347, since we cannot use the collection mode if the credentials are not available.
Zandr, can we schedule this?
Severity: critical → blocker
We need to get through the current batch of releases (including Fx4b10) this week, and will revisit this on Monday AM with an eye to sending notices for a Wednesday AM EST downtime.
(In reply to comment #17)
> We need to get through the current batch of releases (including Fx4b10) this
> week, and will revisit this on Monday AM with an eye to sending notices for a
> Wednesday AM EST downtime.

Any update?
Are we on schedule for Wednesday then?
I didn't see this in mrz's outage mail.  I hope it's a part of it.
(In reply to comment #20)
> I didn't see this in mrz's outage mail.  I hope it's a part of it.

It is not.
I don't have a go-ahead from joduinn on this yet. He has concerns, which I can't answer, about testing after the rollout and the availability of webdev resources during the push. I've asked him to comment in this bug, but until he's comfortable that the changes aren't going to burn the tree again, I'm not going to schedule the window.
joduinn, what's the schedule look like?
This was mentioned in comment #9 two weeks ago, but just to reiterate: this is blocking a Q4 goal, something that we wanted to do by the end of December. Every day that Wil asks for an update and doesn't receive a response, more Firefox users are leaving because of slow add-ons we don't know about.
There's been a lot of discussion between IT and RelEng out of band, sorry this bug looked forgotten.

I need a few things to be comfortable rolling this out, and the availability of these things affects the schedule.

1) Did we test breaking the connection to the AMO db in staging? Could talos results still be posted?

2) What can we check after deployment to ensure the tree won't turn red hours later? (That is the failure that caused all this churn.) Again, I'd like to break the db connection and verify that results can be posted successfully before we close the downtime window.

3) When will a dev resource be available to help with issues that may come up during the push? Based on that, we can pick a downtime window. There's a preference for EST morning, but if the dev resources are PST we'll come up with a different timeslot.

I expect that the fixes mean this is a low-risk push, but we burned the tree last time, so people are very gunshy this time around.
(In reply to comment #24)
> There's been a lot of discussion between IT and RelEng out of band, sorry this
> bug looked forgotten.
> 
> I need a few things to be comfortable rolling this out, and availability of
> these things affect the schedule.
> 
> 1) Did we test breaking the connection to the AMO db in staging? Could talos
> results still be posted?
> 
> 2) What can we check after deployment to ensure the tree won't turn red
> hours later? (That is the failure that caused all this churn.) Again, I'd like
> to break the db connection and verify that results can be posted successfully
> before we close the downtime window.

I don't know.  Alice did all the code/testing and she's apparently on leave now.  Both sound like things we can do when we roll it out.

> 3) When will a dev resource be available to help with issues that may come up
> during the push? Based on that, we can pick a downtime window. There's a
> preference for EST morning, but if the dev resources are PST we'll come up with
> a different timeslot.

Rhelmer has volunteered to be around since he worked on some of the code.  He doesn't have a time preference.
(In reply to comment #25)
> (In reply to comment #24)
> > There's been a lot of discussion between IT and RelEng out of band, sorry this
> > bug looked forgotten.
> > 
> > I need a few things to be comfortable rolling this out, and availability of
> > these things affect the schedule.
> > 
> > 1) Did we test breaking the connection to the AMO db in staging? Could talos
> > results still be posted?
> > 
> > 2) What can we check after deployment to ensure the tree won't turn red
> > hours later? (That is the failure that caused all this churn.) Again, I'd like
> > to break the db connection and verify that results can be posted successfully
> > before we close the downtime window.
> 
> I don't know.  Alice did all the code/testing and she's apparently on leave
> now.  Both sound like things we can do when we roll it out.


This was tested by Alice in staging (608347 comment 25). The intention of the patch is to print an error and continue if there are any problems sending results to the AMO db, including connecting to the db (per bug 620570 comment 20).
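A sketch of that "print an error and continue" behaviour; the helper, table, and column names below are illustrative assumptions, not the actual patch:

import sys

import MySQLdb

def send_amo_results(connect, rows):
    """Try to record AMO results; on any amo_db problem, print the error
    and carry on so that talos result posting is never affected."""
    try:
        db = connect()  # e.g. a lazily opened amo_db connection
        cursor = db.cursor()
        for test_name, value in rows:
            cursor.execute(
                "INSERT INTO amo_results (test_name, value) VALUES (%s, %s)",
                (test_name, value))
        db.commit()
    except MySQLdb.Error as e:
        # A broken connection or failed insert must not turn the tree red
        # or reject the rest of the submission.
        sys.stderr.write("amo_db error (ignored): %s\n" % e)

With that pattern, the post-deployment check asked for in comment 24 amounts to pointing the amo_db config at an unreachable host and confirming that talos results still post.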
(In reply to comment #26)

> This was tested by Alice in staging (608347 comment 25). The intention of the
> patch is to print an error and continue if there are any problems sending
> results to the AMO db, including connecting to the db (per bug 620570 comment
> 20).

Excellent. I'll get this scheduled for this week when rhelmer can be around.
(In reply to comment #2)
> I've no idea of context/details here, but want to avoid another tree closure
> event like happened before with graphserver-changes-for-AMO.
> 
> I've read https://bugzilla.mozilla.org/show_bug.cgi?id=620570#c21 but its
> unclear what the fixes are, and who's reviewed them. If this is impacting
> production graphserver, has this change been reviewed and approved by RelEng
> and IT? Also, should this happen in a scheduled downtime?


The relevant patches are attachment 500972 and attachment 501375; they are specifically about avoiding the problem that caused the tree closure before. More review would be great (a drive-by in the bug is fine, or I can request r? if you have someone in particular in mind).
Done.
Assignee: zandr → justdave
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard
Flags: needs-treeclosure?