Bug 600859 - Push Socorro 1.7.4 to production
Status: RESOLVED FIXED
10/07/2010
Product: Infrastructure & Operations
Classification: Other
Component: WebOps: Other
: other
: All Other
: -- minor
Assigned To: Justin Dow [:jabba]
: matthew zeier [:mrz]
Reported: 2010-09-30 08:56 PDT by Laura Thomson :laura
Modified: 2013-10-09 10:29 PDT
mzeier: needs‑downtime+


Attachments
a patch of the changes to the daily csv for 1.7.4 (1.74 KB, patch)
2010-09-30 10:19 PDT, K Lars Lohn [:lars] [:klohn]

Description Laura Thomson :laura 2010-09-30 08:56:34 PDT
Just seeking a release window for this at present, so I'm filing early as per our postmortem discussions.

Overview: We have two changes, bug 596689 and bug 600246, which are needed for Firefox beta 7 according to chofmann.  These changes only affect CSV file generation (a cron job).  (We had planned on a further 4 bugs, but these will not be ready in time for beta 7, so there will be a 1.7.5 for those, ready in approximately two weeks.)

User facing downtime: none

Risk/rollback plan:  If there is any problem with the CSVs generated, we can roll back the cron job to 1.7.3 and run it again, so risk is very low.

Timeline:  The changes have been tested locally.  Staging these bugs is blocked by bug 598752 which I expect to clear up today.  (Waiting on a mod_python box.)  

Suggested window: 10/5 or 10/7 in the regular maintenance window.
Comment 1 matthew zeier [:mrz] 2010-09-30 08:59:12 PDT
Suggesting 10/07 only because there's a sumo push on Tuesday.  I don't know if there are conflicting resources for that though.

aravind/jabba?
Comment 2 christian 2010-09-30 09:50:17 PDT
Just a quick comment: in the future it'd be nice to get diffs attached to the bugs so those of us not familiar with the code can see how invasive the changes being pushed are. 10/7 would be fine with me, as I can rely on the web interface if something goes wrong with csv generation (and I am pretty sure the fixes are fairly trivial).
Comment 3 K Lars Lohn [:lars] [:klohn] 2010-09-30 10:19:47 PDT
Created attachment 479820
a patch of the changes to the daily csv for 1.7.4

Here's a patch that shows the minor differences between 1.7.3 and 1.7.4.
Comment 4 chris hofmann 2010-09-30 16:27:03 PDT
It's not really clear to me why this needs to go in a regular maintenance window.  It seems like this is a really good change to decouple from any other work, since it only involves the .csv files that a handful of people and I use each day.

laura did a good job of outlining minimal risk, and *zero* downtime, and rollback plan in comment 0.   

our history of doing these .csv changes indicates that detaching them from other socorro updates actually works out best, since the problems we have run into in the past result from forgetting to restart the cron jobs, or from delays in the changes going into effect because of regressions from other web, database, or report processing load issues.

let's spend the two minutes to push the script changes and have them get picked up in the next cron job run so I can start analyzing more of the data we need to ship a good firefox 4 this weekend.
Comment 5 Laura Thomson :laura 2010-09-30 16:28:46 PDT
chofmann: the only thing is that the changes aren't tested/staged yet.
Comment 6 chris hofmann 2010-09-30 18:04:44 PDT
the best test is to run the script, and run it against the production db.

think of this more like "hey aravind, can you run some sql so we can look at some data from the database",  rather than a "release"

the only difference beyond that is that this script runs every night on a cron job.
Comment 7 matthew zeier [:mrz] 2010-09-30 20:17:29 PDT
(In reply to comment #4)
> It's not really clear to me why this needs to go in a regular maintenance
> window.  It seems like this is a really good change to decouple from any other
> work since it only involves the .csv files that me and a handful of people use
> each day. 
> 
> laura did a good job of outlining minimal risk, and *zero* downtime, and
> rollback plan in comment 0.   

Until a site has a good track record of doing risk-free pushes, I want these in scheduled, announced windows.  Everyone in IT knows to be around during the Tuesday/Thursday times in case they need to be called in.
Comment 8 chris hofmann 2010-09-30 21:24:45 PDT
(In reply to comment #7)  
> 
> Until a site has a good track record of doing risk-free pushes, I want these in
> scheduled, announced windows.  Everyone in IT knows to be around during the
> Tuesday/Thursday times in case they need to be called in.

I would agree for changes that are integral to the operation of the site.  This one is not.  I'm not sure it really falls into the bucket of something I'd call a "push" or a revision to the operation of the site.  It's just a script that runs some sql on a cron job.

Can we just move this update to the script over to the production side, run it by hand, and produce some csv files that I can start looking at?  That would be fine to suit my initial needs of trying to get a better handle on the video cards, drivers, OSes, and .dlls coming into play in 3d, direct 2d, and hardware accelerated graphics related crashes as soon as possible.

Then we could wait to push the new script update into place where the cron job runs it at the next announced window.
Comment 9 K Lars Lohn [:lars] [:klohn] 2010-10-01 09:46:02 PDT
the changes to the Socorro system to accommodate this enhancement touch two apps: the dailyUrl cron (should probably be renamed to dailyCsv), as well as the processor.  The processor change is very minor, but I think that is what pushes this change over the threshold to require a release.

The change to the processor is necessary to meet the requirements of getting the number of cores into 'cpu_info' field.  The number of cores, while available in the metadata, has not, until now, been recorded in the database.  I modified the processor to read that data and merge it with the 'cpu_info' database column rather than adding a new database column.  So the change is minor to the processor.
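
A minimal sketch of the merge described above, assuming the core count arrives as a separate metadata value. The function name, the " | " separator, and the sample values are illustrative assumptions, not the actual Socorro processor code.

```python
def merge_core_count(cpu_info, core_count, sep=" | "):
    """Suffix the number of cores to an existing cpu_info string.

    Hypothetical helper: the name, signature, and separator are assumptions
    made for illustration only.
    """
    if core_count is None:
        # No core count in the metadata: leave the column value unchanged.
        return cpu_info
    if not cpu_info:
        # Nothing recorded yet: store the core count on its own.
        return str(core_count)
    return f"{cpu_info}{sep}{core_count}"


print(merge_core_count("GenuineIntel family 6 model 23 stepping 10", 2))
```

Reusing the existing column this way avoids a schema change, at the cost of consumers having to split the value again if they want the core count separately.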

If you can do without the number of cores information, the scope of the problem shrinks to affecting only the 'dailyUrl' cron.  In that case, I would advocate that it would not require a full-blown release to get into production.

I regret that this project has lost some of the agility that it had a year and a half ago when changes like this could go from concept to production in a matter of hours.  The introduction of formal procedures inevitably will slow things down from my "cowboy" actions in the past. The goal is, of course, to improve predictability and reliability; the downside is the reduction in agility.
Comment 10 chris hofmann 2010-10-01 10:28:23 PDT
ok, I buy the argument that changes to the processor cross the line and make this change "release worthy."  I should have looked closer at the diffs to see that's the approach that was taken.

> I regret that this project has lost some of the agility it had...

me too.  I think there is a middle ground here that we need to be striving for. 

We definitely need to get *much* more agile on pushing reports and tools used to analyze the data! 

Problems with recent releases show that we need to get more systematic (and accept the corresponding reduced agility) around changes to configuration, system loading, incoming report processing, and data management kinds of changes. 

I hope I can convince everyone not to conflate the two.  I'd like to see us push for a quarterly goal of pushing 25 or 50 new reporting and analysis tool enhancements that would help us to use the data we have much more effectively.  Many of these have been delayed or deferred as part of the work to enhance the backend processing and data storage.  I keep hoping we can shift the balance soon.

does this seem achievable?

I'd really like to see this get pushed on Tuesday.  Is there any way this can be done in parallel with the unrelated SUMO change Matt talked about in comment 1, or is there a place to arbitrate the competing push priorities between projects?
Comment 11 Laura Thomson :laura 2010-10-05 09:14:26 PDT
Staging and testing is now complete.

The tag is http://socorro.googlecode.com/svn/tags/releases/1.7.4_r2552_20101005/

This release is ready to go into the starting gate.  mrz, any chance tonight or are we looking at Thursday?
Comment 12 matthew zeier [:mrz] 2010-10-05 09:27:35 PDT
Thursday's preferable.
Comment 13 Justin Fitzhugh 2010-10-06 21:21:22 PDT
are we *100%* confident in our rollback plan (and would we lose any data if it's executed)?  is it documented?  how much downtime is acceptable before we initiate the backout plan?
Comment 14 matthew zeier [:mrz] 2010-10-06 21:25:07 PDT
To clarify, here's what I need before 4pm:

1. Deployment plan.  What steps does Ops run? 

2. Rollback steps.  What steps does Ops run?

3. How do we know this push was a success?  If it's not, roll back in 30 mins?
Comment 15 Laura Thomson :laura 2010-10-07 05:21:25 PDT
1.  Update (and restart where needed) each Socorro component.  (There are only changes to processor and one cron job, but we like to keep the codebase in sync.)

The tag is http://socorro.googlecode.com/svn/tags/releases/1.7.4_r2552_20101005/
Update by switching to this tag.

In detail:
- Update and restart collectors
- Update and restart monitor
- Update and restart processors
- Update and restart web services
- Update cron jobs
- Update web app and purge caches (memcache, lb)

2.  To rollback, perform the same steps but using the tag
http://socorro.googlecode.com/svn/tags/releases/1.7.3_20100916/

3.  This release only affects the output from the dailyUrl cron.  After this has run (not sure when it's scheduled) if there are any problems with the output (chofmann is the main user, so he will confirm), perform steps in 2. and manually re-execute cron job.
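
The deployment and rollback steps above can be sketched as one ordered plan that differs only in the tag. This is an illustrative sketch: the function name and step wording are assumptions, and it only prints the steps rather than running anything against real hosts.

```python
RELEASE_TAG = "http://socorro.googlecode.com/svn/tags/releases/1.7.4_r2552_20101005/"
ROLLBACK_TAG = "http://socorro.googlecode.com/svn/tags/releases/1.7.3_20100916/"

# (component, follow-up action) in the order given in the plan.
COMPONENTS = [
    ("collectors", "restart collectors"),
    ("monitor", "restart monitor"),
    ("processors", "restart processors"),
    ("web services", "restart web services"),
    ("cron jobs", None),  # no restart listed in the plan
    ("web app", "purge caches (memcache, lb)"),
]


def build_plan(tag, rerun_daily_url=False):
    """Return the ordered list of steps for switching every component to `tag`."""
    steps = []
    for component, follow_up in COMPONENTS:
        steps.append(f"switch {component} to {tag}")
        if follow_up is not None:
            steps.append(follow_up)
    if rerun_daily_url:
        # Rollback only: regenerate the CSVs with the previous code.
        steps.append("manually re-execute the dailyUrl cron job")
    return steps


for number, step in enumerate(build_plan(ROLLBACK_TAG, rerun_daily_url=True), start=1):
    print(f"{number}. {step}")
```

Keeping a single component list for both directions means the rollback cannot drift out of sync with the deployment order.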
Comment 16 chris hofmann 2010-10-07 07:37:11 PDT
we also need to monitor the processors a bit more closely after the update and restart, since the change there has a small chance of impacting performance with the code that creates a new table (that's why this qualifies as a "release").
Comment 17 K Lars Lohn [:lars] [:klohn] 2010-10-07 08:20:00 PDT
point of clarification: we will monitor the processor immediately after restart.  We can detect success nearly instantly by looking in the database at any newly processed report: data in the 'cpu_info' column will contain the number of cores suffixed to the end.  chofmann is not correct about the addition of a new table; it is the processing of that one column that is different.  

Should the whole process "go pear shaped" and the processor fail to work at all (extremely unlikely), we will know immediately.  Data loss would be confined to a very small number: sum(aProcessor.numberOfThreads for aProcessor in ListOfAllProcessors).
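
The bound above can be made runnable as follows. This is a sketch that assumes each processor thread holds at most one report in flight; the processor names and thread counts are made-up sample values, not the production configuration.

```python
from collections import namedtuple

# Stand-in for a processor record; only numberOfThreads matters here.
Processor = namedtuple("Processor", ["name", "numberOfThreads"])


def worst_case_reports_lost(list_of_all_processors):
    """Upper bound on lost reports: one in-flight report per processor thread."""
    return sum(a_processor.numberOfThreads for a_processor in list_of_all_processors)


# Hypothetical fleet, for illustration only.
fleet = [
    Processor("processor01", 4),
    Processor("processor02", 4),
    Processor("processor03", 8),
]
print(worst_case_reports_lost(fleet))  # 16
```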
Comment 18 Justin Dow [:jabba] 2010-10-07 20:09:02 PDT
The update is done and lars and laura confirmed over irc that everything is working properly.
Comment 19 Laura Thomson :laura 2010-10-07 20:11:30 PDT
Webapp, collection, processing all working as expected.

chofmann: please verify the csvs are as expected.
Comment 20 chris hofmann 2010-10-08 07:27:54 PDT
initial look at the expanded cpu info and app notes additions to the .csv file looks great and will really help!

in the previous day's .csv we got limited cpu info of

awk -F'\t' '{print $13}' 20101006*  | sort | uniq -c | more
5385 \N
 377 amd64
   1 cpu_name
3061 ppc
362027 x86

now we get expanded info

awk -F'\t' '{print $13}' 20101007*  | sort | uniq -c | sort -nr | more
41703 x86 | GenuineIntel family 6 model 23 stepping 10
36847 x86 | GenuineIntel family 6 model 15 stepping 13
13767 x86 | GenuineIntel family 6 model 23 stepping 6
10817 x86 | GenuineIntel family 15 model 4 stepping 1
10342 x86 | GenuineIntel family 15 model 2 stepping 9
10329 x86 | GenuineIntel family 6 model 15 stepping 11
9855 x86 | AuthenticAMD family 15 model 107 stepping 2
9829 x86 | GenuineIntel family 15 model 4 stepping 9
8820 x86 | GenuineIntel family 6 model 15 stepping 6
8127 \N
<long tail snipped>
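
For anyone who would rather do this tally in Python than in awk, a rough equivalent is sketched here. It assumes tab-separated rows with the cpu field in column 13, as in the awk commands above; the function name and the sample rows are invented for illustration.

```python
from collections import Counter


def tally_column(lines, column):
    """Count the values found in a 1-based column of tab-separated lines.

    Returns (value, count) pairs, most frequent first, like
    `sort | uniq -c | sort -nr`.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column:
            counts[fields[column - 1]] += 1
    return counts.most_common()


# Made-up rows: twelve filler fields followed by the cpu field.
sample_rows = [
    "\t".join(["x"] * 12 + [cpu]) for cpu in ["x86", "x86", "amd64", "\\N"]
]
for value, count in tally_column(sample_rows, 13):
    print(count, value)
```

To run it on a real file, pass an open file object, for example `tally_column(open(path), 13)`, where `path` is whichever daily csv is being inspected.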

For the new app_notes entries we see 22659 of 380041 reports had app notes entries

awk -F'\t' '$26 !~ /\\N/ {print $26}' 20101007* | sort | uniq -c | sort -nr | more
 798 AdapterVendorID: 0000, AdapterDeviceID: 0000
 563 AdapterVendorID: 8086, AdapterDeviceID: 2a42
 476 AdapterVendorID: 8086, AdapterDeviceID: 2a02
 403 AdapterVendorID: 1106, AdapterDeviceID: 3344
 360 AdapterVendorID: 1039, AdapterDeviceID: 6330
 355 AdapterVendorID: 8086, AdapterDeviceID: 29c2
 329 AdapterVendorID: 10de, AdapterDeviceID: 03d0
 329 AdapterVendorID: 10de, AdapterDeviceID: 0322
 319 AdapterVendorID: 0000, AdapterDeviceID: 0000\n
 310 AdapterVendorID: 8086, AdapterDeviceID: 2772
 307 AdapterVendorID: 8086, AdapterDeviceID: 0046
 299 AdapterVendorID: 8086, AdapterDeviceID: 2a42\n
 245 AdapterVendorID: 10de, AdapterDeviceID: 0622
 224 AdapterVendorID: 1002, AdapterDeviceID: 5975
 215 AdapterVendorID: 10de, AdapterDeviceID: 0402
 212 AdapterVendorID: 10de, AdapterDeviceID: 0326
 171 AdapterVendorID: 1106, AdapterDeviceID: 3371
 171 AdapterVendorID: 10de, AdapterDeviceID: 0641
 162 AdapterVendorID: 10de, AdapterDeviceID: 0640
<long tail snipped>
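
Several adapters appear twice in the list above, once with and once without a trailing literal \n. A sketch of how those rows could be merged before analysis, assuming the trailing \n is only a formatting difference; the regular expression and function names are illustrative, and the sample counts are copied from the output above.

```python
import re
from collections import Counter

# Matches the vendor/device pair and ignores anything after it (such as "\n").
ADAPTER_RE = re.compile(
    r"AdapterVendorID: ([0-9a-fA-F]{4}), AdapterDeviceID: ([0-9a-fA-F]{4})"
)


def adapter_key(app_notes):
    """Return (vendor_id, device_id) in lower case, or None if not present."""
    match = ADAPTER_RE.search(app_notes)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


def merge_adapter_counts(counted_notes):
    """Sum (count, app_notes) pairs per adapter, folding formatting variants."""
    merged = Counter()
    for count, note in counted_notes:
        key = adapter_key(note)
        if key is not None:
            merged[key] += count
    return merged


sample = [
    (798, "AdapterVendorID: 0000, AdapterDeviceID: 0000"),
    (563, "AdapterVendorID: 8086, AdapterDeviceID: 2a42"),
    (319, "AdapterVendorID: 0000, AdapterDeviceID: 0000\\n"),
    (299, "AdapterVendorID: 8086, AdapterDeviceID: 2a42\\n"),
]
merged = merge_adapter_counts(sample)
print(merged[("0000", "0000")])  # 1117
print(merged[("8086", "2a42")])  # 862

# Share of reports with app notes entries, from the totals quoted above.
print(round(100 * 22659 / 380041, 2))  # 5.96
```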

It will take more work to do some statistical analysis on how these generations of cpu families and their age, and graphics cards, correlate to various signatures and general "crashiness", but now we have some data to chew on.

one small cosmetic bug that might affect blcary's and others' parsing of the .csv file:

column header for the cpu_name field has changed to ?column?

we can fix that in a follow-up bug.
