Closed Bug 832353 Opened 11 years ago Closed 10 years ago

Create a method for releng to clone the production graphs DB to the staging graphs DB

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

Details

(Whiteboard: [2014q1] February)

I would like to have a way to clone the production DB to the staging DB as things get out of sync.

Is there a way for releng to trigger such process?

Thanks in advance!
Armen,

Do you want the cloning to be regularly? (e.g. daily, weekly, monthly) Or just have some kind of button to refresh whenever you need to?
Some of my coworkers where not comfortable with frequent cloning.
I think that monthly and manual would be best.
Noted. When would you like the first cloning to happen?
The 1st of every month should be good enough. Thanks Sheeri!
Well, I missed 2/1. Are folks OK with me cloning sometime this week? I can work up some scripts and run them manually this week, and set them to auto-run on the 1st of the month.

Should I do this during the evening, so it doesn't interfere with anything, perhaps? It should only take one evening, but I'm just getting back from some time off so I don't know quite when I'll be all caught up.
Any time this week should be fine.
We don't touch the staging DB regularly.
Just let us know what the window will be and I will let the rest of the team know.
OK, I'll work on it for tonight. How about after 6 pm Pacific time? I haven't tested how long it will take, but probably just a few hours.
It should not cause any sort of downtime, right?

Are you concern it could affect the performance of reading/writing?

After 6pm works for us.
FTR load starts slowing down after 6pm, there will nevertheless be a lot of jobs running for few hours (since there is a lot of backlog) posting to the graph server past 6pm. Not sure if it affects in anyway what you're doing but worth pointing it out if you needed to know.

Thanks!
Will the import from production mess up the load? the import drops tables and creates them, then adding the data.
IIUC only staging DB will be unusable.
That's fine.
 did you want the db dropped and re-created only with the tables in the backup? or just import the backup, which drops and re-creates the tables, but not the whole db (so if there are tables not backed up, like they were created sometime today, they'll be gone)
[17:59:56] <armenzg>	 sheeri: first time to drop everything and recreate
[18:00:06] <armenzg>	 after that I just care that the data is up-to-date
Drive by comment - if reimporting is slow, perhaps we can neglect the actual test results from production. AIUI the motivation here is having tables like os_list, machines, branches, and tests in sync - the config that we modify as requirements change.
Hrm, sounds like everything except test_runs and test_run_values, then? Or can I also exclude tests? (the first 2 are the large tables).

I had an error last night trying to import the backup, which is large mostly due to those first 2 tables:


[root@stage1 scabral]# time mysql graphs_stage2_prod < graphs_mozilla_org_new.2013.02.06.sql
ERROR 1197 (HY000) at line 4541: Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; increase this mysqld variable and try again

real    50m11.244s
user    2m6.103s
sys     0m7.322s
[root@stage1 scabral]# 

Line 4541 imports into test_run_values....so skipping that would be tremendous. I'll work on some scripts for that.
Taking out just those 2 tables means that the backup goes from 21G to 20M.

I did the import:

[root@stage1 scabral]# time mysql graphs_stage2_prod < graphs_mozilla_org_new.2013.02.07.sql 

real    0m19.040s
user    0m0.590s
sys     0m0.044s

Can you double-check that everything looks good? Should I truncate test_run_values on stage (test_runs is already truncated).
We're fine with truncation.

We don't really need the data of the tests on staging as we don;t have a way of visualizing it. I think http://graphs.allizom.org/graph.html was to do it but I don't think we have it properly piped.

In other words, we only care about having consistency so our patches on staging can apply on production. We only care about what nthomas mentions.
Awesome, well, it was done as a one-off yesterday, and I'll set the scripts to truncate for the automation.
Today I am setting and testing our db logical export script to take parameters like --ignore-tables so we can do this properly....
I have a script that creates the backup with the appropriate truncation on backup5.db.scl3 and copies it to stage1.db.scl3.

I can continue tomorrow with a script that will import the logical export.

What time should I set the script to run? I can set it to run the first Monday of the month in the morning before folks are working on it, but I'd need to know when that is.
We have nthomas on New Zealand's timezone.
So perhaps the last Saturday of every month works best; do you think this could work?

Thanks for your help!
Sure, that could work. Shall I have an import done tomorrow, since we haven't done one in 2 months? Then I can set up an auto-import for the last Saturday of this month.
Whiteboard: April 2013
(In reply to Sheeri Cabral [:sheeri] from comment #21)
> Sure, that could work. Shall I have an import done tomorrow, since we
> haven't done one in 2 months? Then I can set up an auto-import for the last
> Saturday of this month.

Yep. That should work. Many thanks!
Sorry I didn't reply earlier, there was a typo in the script when I ran it from cron (path issue).

I just fixed the script and am running it by hand, to copy the file to stage.
OK, all the scripts are in place, checked by running by hand, and I have put the cron in script for the 2nd Saturday of the month, and so let's check on Monday to make sure it has recent data.
Armen -
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
This script had stopped working (i was checking cron errors), and I got it to work again, but it's importing a full production db, not one without tests and test_runs. Working on making this right again.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
I figured out what's wrong - I need to be able to pass in another argument to the underlying backup script, for the location of the smaller backup, but the underlying backup script only takes 2 args. And since it's a perl script from 2001, it's using shift to do that. I want to do this the right way and use GetOpts, but it's more than I can do tonight, so I"ll have to look in on it in the morning.
Had to yum install bcmath for the digest page to work. Had to add date.timezone='UTC' in php.ini to get rid of warnings when using the date() function.

Checksum conversion from hex to dec in "Anemometer.php" line 641 is wrong - firstly, it's using capital letters, whereas MySQL does not, so I was getting that it didn't know what "a" was. Once I did that, calculations were still wrong, e.g. 

checksum hex - ebd4f01d29940465bd2fad27544946b5 
actual decimal - 105866025350050960000
decimal calculated by anemomoeter - 313474216032254030094408162781080143541

I changed all instances of 
        $checksum = $this->translate_checksum(get_var('checksum'));
to
        $checksum = get_var('checksum');
in anemometer.php

And the digest page now works!
Clearly that last comment is in the wrong bug...sorry!
Whiteboard: April 2013 → [2014q1]
These notes are mostly for me to be able to pick back up if interrupted, but lurkers are always welcome!

Current cron & script:
[root@backup3.db.scl3 cron.d]# more backups_devtools 
MAILTO="infra-dbnotices@mozilla.com"
30 5 * * * root /data/backups/bin/make_and_copy_graphs_to_stage.sh

But that calls db-sqldump.

I have changed db-sqldump (the underlying script) to take in parameters:
--db=s instead of the db hard-coded as the first argument
--extra=s instead of the extra stuff hard-coded as the second argument
--targetdir to be the target directory
--target to be the target name

Need to change: make_and_copy_graphs_to_stage & and all the other scripts that call db-sqldump, to use the new format.

targetdir for make_and_copy_graphs_to_stage will be /data/backups, target is graphs_small in this case, so it doesn't overwrite the regular backup.
Fixed the underlying scripts, now to test them out...
Testing worked out, and I changed the target on the import script to be the proper name and tested that. I will wait until Friday to close this out, just to make sure I haven't messed up all the logical export backups.
The backups are getting extra exports for the other databases on devtools, so there's some minimal tweaking that needs to be done.

But the meat is done and working - the last error we had was 1/23, so things have been importing smoothly since then.
Thanks for keep beating at it!
Nagios alerts went off saying sql backups haven't been running for a few days in scl3. Tracked it back to this bug.

When the script db-sqldump was updated in update 81353, the input option name changed from "db" to "cluster".

Unfortunately, you didn't also change the wrapper script (run-all-sql-backups) that iteratively loops through the execution of db-sqldump, so it still was passing a flag called "db" to the script.

Adjusted the wrapper script in change 81444: "fixed db parameter to be cluster parameter, following in the footsteps of change 81353 which renamed it to cluster inside the executing script, db-sqldump"

bash-3.2$ svn diff -r 81353:81444 run-all-sql-backups
Index: run-all-sql-backups
===================================================================
--- run-all-sql-backups	(revision 81353)
+++ run-all-sql-backups	(revision 81444)
@@ -21,7 +21,7 @@
 date >> $LOG
 /usr/bin/perl /data/backups/bin/stop-sql-threads >> $LOG
 for cluster in $MYSQL_CLUSTERS; do
-  /usr/bin/perl /data/backups/bin/db-sqldump --db=$cluster --targetdir=/data/backups >> $LOG
+  /usr/bin/perl /data/backups/bin/db-sqldump --cluster=$cluster --targetdir=/data/backups >> $LOG
 done
 /usr/bin/perl /data/backups/bin/start-sql-threads >> $LOG
 date >> $LOG
Backups are all good after Brandon's catch last week....so we can close this now!
Status: REOPENED → RESOLVED
Closed: 11 years ago10 years ago
Resolution: --- → FIXED
Whiteboard: [2014q1] → [2014q1] February
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.