Closed
Bug 993665
Opened 11 years ago
Closed 11 years ago
Datazilla Repository Update
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jeads, Assigned: cturra)
References
Details
A new datazilla repository is ready to push to stage. This release includes the following changes:
1.) The alerts specification of "Device Type: tarako, Product: B2G, v1.3, prerelease" was removed from https://github.com/mozilla/datazilla/blob/master/datazilla/settings/ingestion_alerts.json. This will be put back in place once test data is ready.
2.) An updated data cycling protocol that addresses the issues described in Bug 990771. The protocol can be found here:
https://github.com/mozilla/datazilla/blob/master/datazilla/controller/admin/management/commands/cycle_data.py
https://github.com/mozilla/datazilla/blob/master/datazilla/model/base.py#L1385
https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/perftest.json#L2
https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/objectstore.json#L146
This includes a new cron job that launches the data cycling protocol. All test data in the objectstore and perftest schemas that is more than 6 months old will be deleted nightly. The SQL statements carrying out the deletes use nested selects to reduce the join costs and have their LIMIT set to 10000 rows. In a single cron run, 50 iterations are allowed, so a maximum of 500000 rows from any given table would be deleted in a single nightly run.
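For illustration, the batching works roughly like the minimal sketch below (a sketch only, not the production code: the real statements use nested selects and live in the JSON files linked above; table and column names cannot be bound as placeholders, so they are interpolated here).

import MySQLdb

ROW_LIMIT = 10000      # rows removed per DELETE statement
MAX_ITERATIONS = 50    # iterations allowed per nightly cron run
SIX_MONTHS_SECONDS = 6 * 30 * 24 * 60 * 60

def cycle_table(conn, table, date_column, now):
    """Delete rows older than six months in LIMIT-bounded batches.

    Sketch only: the production SQL is defined in objectstore.json and
    perftest.json and is driven by the cycle_data management command.
    """
    cutoff = now - SIX_MONTHS_SECONDS
    cursor = conn.cursor()
    for _ in range(MAX_ITERATIONS):
        cursor.execute(
            "DELETE FROM %s WHERE %s < %%s LIMIT %d" % (table, date_column, ROW_LIMIT),
            (cutoff,),
        )
        conn.commit()
        if cursor.rowcount < ROW_LIMIT:
            break   # nothing older than the cutoff is left

With 50 iterations and a 10000-row limit, this caps each nightly run at 500000 rows per table, matching the numbers above.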
To test, please update the repository, then run the following command on the admin node from the command line and provide any output from the command.
python manage.py cycle_data --project talos --iterations 2 --debug
Assignee
Comment 1•11 years ago
code update complete.
the cycle_data command failed however because the database user does not have delete table permissions.
_mysql_exceptions.OperationalError: (1142, "DELETE command denied to user 'datazilla_stage'@'10.22.70.209' for table 'objectstore'")
i am going to needinfo :sheeri for the permissions changes required here. this will need to be updated for both the stage and prod db users.
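For reference, once the grant lands it can be confirmed with something like the sketch below (assumptions: connection parameters are placeholders, and SHOW GRANTS is run as the application user).

import MySQLdb

# Sketch: confirm the stage user has DELETE once the grant is in place.
# Host, user, password and schema below are placeholders.
conn = MySQLdb.connect(host="datazilla-rw-vip.db.scl3.mozilla.com",
                       user="datazilla_stage", passwd="<redacted>",
                       db="talos_objectstore_1")
cursor = conn.cursor()
cursor.execute("SHOW GRANTS FOR CURRENT_USER()")
for (grant,) in cursor.fetchall():
    print grant   # expect DELETE to appear for the talos_* schemas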
Assignee: server-ops-webops → cturra
Flags: needinfo?(scabral)
OS: Mac OS X → All
Hardware: x86 → All
Comment 2•11 years ago
delete privileges granted for stage, looking at prod now...
Flags: needinfo?(scabral)
Comment 3•11 years ago
prod perms granted to the datazilla_moz user.
Assignee
Comment 4•11 years ago
due to how long this takes to process, i am getting a 'lost connection to mysql server during query' exception returned. i have timed it and this error happens after exactly 5 minutes.
i'm going to need some time to track down what is killing these connections b/c at first glance, the load balancer does not have any obvious connection limits for this.
Reporter
Comment 5•11 years ago
Given that it's exactly 5 minutes, it's probably an automated process that kills long-running database queries once they hit a 5 minute run time. I think we need some help from sheeri on this. One option we could try is to reduce the execution time by lowering the limit on the deletes in the sql. They're set to 10,000 rows at the moment, but for tables with one-to-many relationships with "*perftest*.test_run" the number of rows actually touched could be higher. It would be useful to know which table we're stalling on. Next time you run this, set DEBUG=True and we should be able to see which table is causing the problem on stdout.
sheeri what do you think about reducing the sql limit to 1000 and trying again?
Flags: needinfo?(scabral)
Comment 6•11 years ago
One query deleting 10,000 rows shouldn't take 5 minutes, it should be much shorter than that...so I'm not sure reducing to 1,000 rows would work, but it's worth a try as a workaround until Chris has time to look into this.
We do not run query killers on the machines, and MySQL only kills connections if the connection is idle - not during a query.
Flags: needinfo?(scabral)
Comment 7•11 years ago
Found it - both cturra and I only looked at the "connection management" section of the Pool configuration. There's also a "Virtual Server" configuration, which states:
Connection management settings control how the virtual server manages connections from the remote client.
Connection failures are logged for no connections.
On backend errors the client is sent: Default
New connection timeout is set to 10 seconds.
Existing connection timeout is set to 300 seconds.
For cturra, the IP is:
https://internal.zlb.ops.scl3.mozilla.com:9090/apps/zxtm/?name=datazilla-rw-db&section=Virtual%20Servers%3AEdit
I've changed the existing connection timeout on the rw VIP to 0, which disables the timeout (i.e. unlimited).
Assignee
Comment 8•11 years ago
good catch :sheeri - apparently heartbleed has gotten the best of me also.
i just let this run for almost 1.5 hours before i killed it. i'm not entirely sure what it's doing, but that seems like an excessively long amount of time to do its task. :jeads - thoughts?
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data.log ; date
Thu Apr 10 09:32:49 PDT 2014
/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py:317: Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
cursor.execute(sql, kwargs['placeholders'])
^C
Thu Apr 10 10:50:45 PDT 2014
Flags: needinfo?(jeads)
Comment 9•11 years ago
If it's the talos_objectstore_1 database, the objectstore table is 57G in size...
Also if you're looking at dates based on date_loaded, there's no index for that, so that may be why it's taking so long.
Reporter
Comment 10•11 years ago
Ok that's a problem. There was no use case in the application for indexing objectstore.date_loaded, but it sure would have come in handy now. An alternative strategy I could implement would be to identify the range of ids in talos_perftest_1.test_run that need to be deleted based on the indexed test_run.date_run (https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/template_schema/schema_perftest.sql.tmpl#L554) and use that in a where clause to match the test_run_id in talos_objectstore_1.objectstore (https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/template_schema/schema_objectstore.sql.tmpl#L17).
Would it be possible to add an index to date_loaded? Or perhaps that's not practical at this point given the table size. If so, I will implement the strategy suggested above.
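Roughly, that alternative strategy would look something like the sketch below (a sketch only, not the final implementation; the table and column names follow the linked schema files, and the helper name is illustrative).

def cycle_by_test_run_ids(perftest_conn, objectstore_conn, cutoff, limit=10000):
    """Sketch of the id-range strategy described above.

    Picks stale test_run ids via the indexed test_run.date_run column, then
    deletes the matching objectstore rows by test_run_id, avoiding a scan of
    the unindexed objectstore.date_loaded column.
    """
    pt_cursor = perftest_conn.cursor()
    pt_cursor.execute(
        "SELECT id FROM test_run WHERE date_run < %s ORDER BY id ASC LIMIT %s",
        (cutoff, limit),
    )
    ids = [row[0] for row in pt_cursor.fetchall()]
    if not ids:
        return 0
    placeholders = ",".join(["%s"] * len(ids))
    os_cursor = objectstore_conn.cursor()
    os_cursor.execute(
        "DELETE FROM objectstore WHERE test_run_id IN (%s)" % placeholders,
        ids,
    )
    objectstore_conn.commit()
    return os_cursor.rowcount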
Flags: needinfo?(jeads) → needinfo?(scabral)
Comment 11•11 years ago
We can add an index but the table will be offline while the index is added, which could be a while.
One way to fake this, if date_loaded is sequential, is to look for the autoincrement ID closest to the date you want and delete by id, just this one time.
Then once you have the table you want, add the index in, which will rebuild the table (so it'll be offline), but the table will be smaller, so there's less to rebuild.
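A sketch of how the cutoff id could be found without an index on date_loaded (assumptions: ids and date_loaded both increase over time, the low/high id bounds are known, and the helper name is hypothetical):

def find_id_cutoff(conn, cutoff_timestamp, lo, hi):
    """Binary-search the primary key for the last id loaded before the cutoff.

    Each probe is a cheap primary-key range lookup, so no index on
    date_loaded is needed. Old rows can then be removed in batches, e.g.
    DELETE FROM objectstore WHERE id < <cutoff id> LIMIT 10000, and the
    date_loaded index added once the table is smaller.
    """
    cursor = conn.cursor()
    while lo < hi:
        mid = (lo + hi + 1) // 2
        cursor.execute(
            "SELECT date_loaded FROM objectstore WHERE id >= %s "
            "ORDER BY id ASC LIMIT 1",
            (mid,),
        )
        row = cursor.fetchone()
        if row is None or row[0] >= cutoff_timestamp:
            hi = mid - 1   # rows from mid onward are too new (or absent)
        else:
            lo = mid       # a row at or after mid is still older than the cutoff
    return lo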
Flags: needinfo?(scabral)
Reporter
Comment 12•11 years ago
Good suggestion, thanks sheeri! So it looks like id 3,000,000 corresponds to a date_loaded timestamp of 1381396223 (10/10/2013), which is close enough to 6 months ago for me. I've hard-coded this id into the sql statement here: https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/objectstore.json#L154. There are 2,620,009 ids in the objectstore that are < 3,000,000, so at 50 iterations and a 10000 row limit we should be deleting 500000 rows every 24 hours, which adds up to 5.2 days until we have removed the excess. At that point, if all goes well, I would like to schedule the index addition and, if it can be done in the same operation, an increase of max_data_length so we have some breathing room, like the other data tables in talos_perftest_1. I will address this in a separate bug when the time comes.
So, cturra, could you please update the repository again and try the now notoriously failing command:
python manage.py cycle_data --project talos --iterations 2 --debug
This command is also cycling the data in talos_perftest_1 so it will take longer than what would be expected for just cycling the single objectstore table.
Flags: needinfo?(cturra)
Comment 13•11 years ago
No worries! I've had to do this kind of thing before. It's what makes me say that if you have a date in your table, you probably want an index on it, because at some point you're going to be looking at a date range :D
Assignee
Comment 14•11 years ago
:jeads - i updated the repo and ran this command. after just over an hour of seeing nothing written to stdout, i ran an strace on the process and saw the following repeating:
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data.log ; date
Fri Apr 11 14:53:21 PDT 2014
[root@datazillaadm.private.scl3 datazilla]# strace -p 28657
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
link("/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazillaadm.private.scl3.mozilla.com-689b1700.24761", "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/cycle_data_talos.lock") = -1 EEXIST (File exists)
stat("/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazillaadm.private.scl3.mozilla.com-689b1700.24761", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
link("/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazillaadm.private.scl3.mozilla.com-689b1700.24761", "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/cycle_data_talos.lock") = -1 EEXIST (File exists)
stat("/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazillaadm.private.scl3.mozilla.com-689b1700.24761", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
...
[root@datazillaadm.private.scl3 datazilla]# git log -1 | grep commit
commit 86d36e70ab7edca980656a831f527176e8e50468
thoughts?
Flags: needinfo?(cturra) → needinfo?(jeads)
Assignee
Comment 15•11 years ago
wait. i really should have looked at that strace more closely - it's not doing anything b/c the lockfile check is returning true! i just deleted cycle_data_talos.lock and have kicked off the test again. will report back when i have more.
Flags: needinfo?(jeads)
Assignee
Comment 16•11 years ago
time to share an update. after removing the existing connection timeout limit on the load balancer, i am still seeing 'Lost connection to MySQL server during query' errors. however, they now happen much longer after the job kicks off - 2 hours to be exact.
in my most recent run, the command was started at 11:38:12 PDT and errored out at Apr 14 13:38:28 PDT:
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data.log ; date
Mon Apr 14 11:38:12 PDT 2014
/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py:319: Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
cursor.execute(sql)
Traceback (most recent call last):
File "manage.py", line 13, in <module>
execute_from_command_line(sys.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 443, in execute_from_command_line
utility.execute()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 382, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 196, in run_from_argv
self.execute(*args, **options.__dict__)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 232, in execute
output = self.handle(*args, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 371, in handle
return self.handle_noargs(**options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/base.py", line 137, in handle_noargs
Starting for projects: talos
self.handle_project(p, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/cycle_data.py", line 45, in handle_project
sql_targets = ptm.cycle_data(sql_targets)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1417, in cycle_data
min_date, 'perftest', perftest_sql_to_execute, sql_targets
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1446, in _execute_table_deletes
debug_show=self.DEBUG,
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/RDBSHub.py", line 71, in wrapper
return func(self, **kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 136, in execute
return self.__execute(sql, kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 300, in __execute
self.__cursor_execute(sql, kwargs, cursor)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 317, in __cursor_execute
cursor.execute(sql, kwargs['placeholders'])
File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 173, in execute
self.errorhandler(self, exc, value)
File "/usr/lib64/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
Mon Apr 14 13:38:28 PDT 2014
Reporter
Comment 17•11 years ago
cturra was there anything written to /home/cturra/cycle_data.log? If so, could you attach it? It looks like we've deleted 40,000 rows from talos_objectstore_1.objectstore. I was expecting 20,000 from the command you ran, the extra 20,000 is probably from the previous command you tried. This leads me to think the problem is with one of the tables in talos_perftest_1 and the data cycling is working fine on the objectstore, if DEBUG=True in local.py when you ran this, I should be able to determine the last table that the cycle_data command was operating on.
Flags: needinfo?(cturra)
Assignee
Comment 18•11 years ago
(In reply to Jonathan Eads ( :jeads ) from comment #17)
> cturra was there anything written to /home/cturra/cycle_data.log? If so,
> could you attach it?
the only thing written is: "Starting for projects: talos"
> It looks like we've deleted 40,000 rows from
> talos_objectstore_1.objectstore. I was expecting 20,000 from the command you
> ran, the extra 20,000 is probably from the previous command you tried. This
> leads me to think the problem is with one of the tables in talos_perftest_1
> and the data cycling is working fine on the objectstore, if DEBUG=True in
> local.py when you ran this, I should be able to determine the last table
> that the cycle_data command was operating on.
i just ran the command once again with `DEBUG = True` and received the following:
Mon Apr 14 15:27:45 PDT 2014
Starting for projects: talos
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_objectstore_1 host_type:master_host proc:generic.db_control.disable_foreign_key_checks
Executing SQL:SET FOREIGN_KEY_CHECKS = 0
Execution Time:7.4196e-04 sec
Traceback (most recent call last):
File "manage.py", line 13, in <module>
execute_from_command_line(sys.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 443, in execute_from_command_line
utility.execute()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 382, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 196, in run_from_argv
self.execute(*args, **options.__dict__)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 232, in execute
output = self.handle(*args, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 371, in handle
return self.handle_noargs(**options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/base.py", line 137, in handle_noargs
self.handle_project(p, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/cycle_data.py", line 45, in handle_project
sql_targets = ptm.cycle_data(sql_targets)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1413, in cycle_data
min_date, 'objectstore', objectstore_sql_to_execute, sql_targets
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1440, in _execute_table_deletes
debug_show=self.DEBUG,
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/RDBSHub.py", line 71, in wrapper
return func(self, **kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 136, in execute
return self.__execute(sql, kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 291, in __execute
tmsg = t.timeit(1)
File "/usr/lib64/python2.6/timeit.py", line 193, in timeit
timing = self.inner(it, self.timer)
File "/usr/lib64/python2.6/timeit.py", line 99, in inner
_func()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 288, in timewrapper
self.__cursor_execute(sql, kwargs, cursor)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 319, in __cursor_execute
cursor.execute(sql)
File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 175, in execute
if not self._defer_warnings: self._warning_check()
File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 89, in _warning_check
warn(w[-1], self.Warning, 3)
_mysql_exceptions.Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
Mon Apr 14 15:27:49 PDT 2014
Flags: needinfo?(cturra)
Reporter
Comment 19•11 years ago
Ok, that actually worked on the objectstore. Exactly 10,000 rows were deleted. As a test, I've commented out the disable/enable foreign key step and all deletes on the perftest schema.
Could you update the repository again and try running the command?
I'm hoping to see the command exit on its own; if so, I will start adding functionality back until we determine the cause of the hang.
Assignee
Comment 20•11 years ago
same traceback, but doesn't appear to be doing the `disable_foreign_key_checks`
[root@datazillaadm.private.scl3 datazilla]# git log -1
commit 879cd2b780726ca44da0ee201ac619b5171532a2
Author: Jonathan Eads <superjeads@gmail.com>
Date: Mon Apr 14 15:41:46 2014 -0700
commented out perftest table deletes and disable/enable foreing keys
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data-041414.log ; date
Mon Apr 14 15:55:40 PDT 2014
Traceback (most recent call last):
File "manage.py", line 13, in <module>
...
_mysql_exceptions.Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
Mon Apr 14 15:55:47 PDT 2014
Comment 21•11 years ago
FWIW the warning doesn't stop the query (at least, not on the MySQL side), so that's OK. We can get rid of the warning by changing to MIXED or ROW-based binary logging.
Assignee
Comment 22•11 years ago
i am no longer seeing mysql warnings, but we're still seeing the connection close after 2 hours (stdout from the last run below). while this was executing, i ran an `strace` on the process, which sat at the following *until* the connection was closed:
[root@datazillaadm.private.scl3 ~]# strace -p 28818 -f
Process 28818 attached - interrupt to quit
read(4,
i'd love to get your thoughts on where we go from here :jeads? is it possible to process these in smaller batches?
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data-042514.log ; date
Fri Apr 25 14:29:49 PDT 2014
Starting for projects: talos
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_perftest_1 host_type:master_host proc:perftest.selects.get_test_run_rows_to_cycle
Executing SQL:SELECT id FROM test_run WHERE date_run < %s ORDER BY id ASC LIMIT 5000
Execution Time:1.1465e-02 sec
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_objectstore_1 host_type:master_host proc:objectstore.deletes.cycle_objectstore_by_test_run_ids
Executing SQL:DELETE FROM objectstore WHERE test_run_id IN (%s,...... *a LOT of %s
Execution Time:2.4937e-01 sec
Traceback (most recent call last):
File "manage.py", line 13, in <module>
execute_from_command_line(sys.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 443, in execute_from_command_line
utility.execute()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 382, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 196, in run_from_argv
self.execute(*args, **options.__dict__)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 232, in execute
output = self.handle(*args, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 371, in handle
return self.handle_noargs(**options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/base.py", line 137, in handle_noargs
self.handle_project(p, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/cycle_data.py", line 45, in handle_project
sql_targets = ptm.cycle_data(sql_targets)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1441, in cycle_data
min_date, 'perftest', perftest_sql_to_execute, sql_targets
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1463, in _execute_table_deletes
debug_show=self.DEBUG
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/RDBSHub.py", line 71, in wrapper
return func(self, **kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 136, in execute
return self.__execute(sql, kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 291, in __execute
tmsg = t.timeit(1)
File "/usr/lib64/python2.6/timeit.py", line 193, in timeit
timing = self.inner(it, self.timer)
File "/usr/lib64/python2.6/timeit.py", line 99, in inner
_func()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 288, in timewrapper
self.__cursor_execute(sql, kwargs, cursor)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 317, in __cursor_execute
cursor.execute(sql, kwargs['placeholders'])
File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 173, in execute
self.errorhandler(self, exc, value)
File "/usr/lib64/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
Fri Apr 25 16:29:55 PDT 2014
Flags: needinfo?(jeads)
Reporter
Comment 23•11 years ago
Hmmm, I'm surprised the db is having problems with that. On Monday I will reduce the rows deleted to something very small and see if that resolves the issue. Thanks for giving this another try; I will let you know when the modifications are ready.
Flags: needinfo?(jeads)
Reporter
Comment 24•11 years ago
Ok I've modified the row count for the data cycling to be 100 rows. Please update the repository and run the command again.
Flags: needinfo?(cturra)
Assignee
Comment 25•11 years ago
woohoo! it finished. maybe we should pick a reasonably large number of rows and try again?
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data-050114.log ; date
Thu May 1 15:34:29 PDT 2014
Starting for projects: talos
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_perftest_1 host_type:master_host proc:perftest.selects.get_test_run_rows_to_cycle
Executing SQL:SELECT id FROM test_run WHERE date_run < %s ORDER BY id ASC LIMIT 100
Execution Time:1.3880e-01 sec
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_objectstore_1 host_type:master_host proc:objectstore.deletes.cycle_objectstore_by_test_run_ids
Executing SQL:DELETE FROM objectstore WHERE test_run_id IN (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
Execution Time:2.3140e-02 sec
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_perftest_1 host_type:master_host proc:perftest.deletes.cycle_test_aux_data
Executing SQL:DELETE FROM test_aux_data WHERE test_run_id IN (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
Execution Time:6.2075e+01 sec
Thu May 1 15:35:38 PDT 2014
Flags: needinfo?(cturra) → needinfo?(jeads)
Reporter
Comment 26•11 years ago
Super! Ok, so I increased the cycling to 1000 rows. That seems reasonable looking at the execution times. In crontab.txt I increased the iterations to 100 to account for the reduction in row count, to make sure we keep up with the data load.
Could you update the repository and run the command one more time for good measure? If it completes in a reasonable amount of time, go ahead and push the repository and new cron command to production (https://github.com/mozilla/datazilla/blob/master/crontab.txt#L11) and we can put this well-seasoned bug to rest...
Once this is in production make sure to set DEBUG = False in local.py so we're not writing out all the debug info.
Flags: needinfo?(jeads)
Assignee
Comment 27•11 years ago
maybe 1000 is too much? i have tried running this a couple of times with the same result - it's not finishing. the following is the latest cycle_data run i kicked off (almost 24 hours ago):
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data-050114-1000.log ; date
Mon May 5 12:35:12 PDT 2014
Reporter
Comment 28•11 years ago
Ok, I reduced the limit down to 250. There's no reason for a delete of 1000 rows to take ~24 hours, so there's likely a problem with the database index or something. Please give it another try, and a sincere thank you for your persistence.
Assignee
Comment 29•11 years ago
i spent some time last night playing with this more, and according to `strace` the reason the previous cycle_data runs were not exiting was that a lock file was present. the process was looping on the lock file check.
after removing the lock file, i was able to test [1] runs with: 250, 500, 1000 and even 2500! that last 2500 run took from 9:59am -> 10:48am.
when we set up the cron here, i think it'd be best to confirm no lock file is present before running the cycle_data command. thoughts on the database select size and frequency?
[1] i manually updated the select statement on line 344 of perftest.json => https://github.com/mozilla/datazilla/blob/0e2a9903b4f93d81830bd0760daa1de6f6f6b098/datazilla/model/sql/perftest.json#L344
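Along those lines, a wrapper for the cron entry might look something like the sketch below (a hypothetical sketch: the lock path matches the one seen in the strace above, and a production version would also want to decide whether a leftover lock is genuinely stale).

import os
import subprocess
import sys

# Hypothetical pre-flight check for the cron entry: refuse to start a new
# cycle_data run while a previous run's lock file is still present.
LOCK = "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/cycle_data_talos.lock"

if os.path.exists(LOCK):
    sys.exit("lock file present: %s -- not starting cycle_data" % LOCK)

subprocess.call([
    "python", "manage.py", "cycle_data",
    "--project", "talos", "--iterations", "50",
])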
Flags: needinfo?(jeads)
Reporter
Comment 30•11 years ago
Good catch, that makes a lot more sense! I think confirming that there are no lock files present when installing a new cron or testing manage.py commands would be a very good idea.
I set the limit in the select to 500 and reduced the iteration count passed in the cron job to 50, so a maximum of 25,000 rows could be deleted every 24 hours. Based on the execution times you're seeing for 100-2500, I think that's a reasonable place to start. There's an explicit sleep step between database deletes to let other queries through in the event of a long run. I will keep an eye on the database once this is in production and make sure that we keep up with the data generation.
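In outline, the nightly run then behaves something like the sketch below (helper names and the sleep interval are illustrative, not the actual implementation; the real loop lives in base.py's cycle_data).

import time

ITERATIONS = 50   # passed to cycle_data by the cron job
ROW_LIMIT = 500   # LIMIT on the test_run select

def nightly_cycle(ptm, cutoff):
    """Illustrative outline: at most 50 x 500 = 25,000 test runs cycled per
    night, with a pause between batches so other queries get a turn."""
    for _ in range(ITERATIONS):
        deleted = ptm.cycle_batch(cutoff, limit=ROW_LIMIT)  # hypothetical helper
        if deleted == 0:
            break            # nothing older than six months remains
        time.sleep(5)        # illustrative pause between delete batches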
Flags: needinfo?(jeads)
Assignee
Comment 31•11 years ago
i have applied the cycle_data cron referenced at:
https://github.com/mozilla/datazilla/blob/74c7d439957d88f0723c5fab3f62f58a2ebf5000/crontab.txt#L11
let's poke at the service/db tomorrow morning to make sure everything ran smoothly.
Flags: needinfo?(jeads)
Assignee
Comment 32•11 years ago
the first cycle_data cron job ran last night. did it appear to do what it should have on the db?
May 16 00:00:01 datazillaadm CROND[8229]: (root) CMD ($PYTHON_ROOT/python $DATAZILLA_HOME/manage.py cycle_data --project talos --iterations 50 > /dev/null 2>&1)
Reporter
Comment 33•11 years ago
The cycle_data command looks like it's running fine. The oldest data in talos_objectstore_1 is now from Wed, 20 Feb 2013, which would indicate that data is being cycled. I should see the oldest timestamp steadily approaching a maximum of 6 months ago and the data_length reported below either staying at approximately the same size or decreasing.
mysql> select table_schema,table_name,max_data_length,data_length from information_Schema.tables where table_name='objectstore' and table_schema like 'talos%';
+---------------------+-------------+-----------------+-------------+
| table_schema | table_name | max_data_length | data_length |
+---------------------+-------------+-----------------+-------------+
| talos_objectstore_1 | objectstore | 68719484928 | 65060069376 |
+---------------------+-------------+-----------------+-------------+
1 row in set (0.36 sec)
I would like to validate that the cycle data rate is removing data faster than we're adding it. So, I will run the same query tomorrow and confirm that the data_length and oldest timestamp are moving in the right direction.
Flags: needinfo?(jeads)
Reporter
Comment 34•11 years ago
So it doesn't look like any data cycling occurred last night on talos_objectstore_1.
mysql> select table_schema,table_name,max_data_length,data_length from information_Schema.tables where table_name='objectstore' and table_schema like 'talos%';
+---------------------+-------------+-----------------+-------------+
| table_schema | table_name | max_data_length | data_length |
+---------------------+-------------+-----------------+-------------+
| talos_objectstore_1 | objectstore | 68719484928 | 65384742912 |
+---------------------+-------------+-----------------+-------------+
1 row in set (0.17 sec)
The oldest data is still from Wed, 20 Feb 2013 and it looks like data_length has increased. I checked the talos_perftest_1.test_run table; the data cycling protocol uses this table to identify ids that are out of range to delete from the rest of the schema and from the objectstore. The oldest date in that table is from Sun, 13 Jan 2013, which is older than what's in the objectstore. So it's possible that talos_objectstore_1 is out of sync with the perftest schema by a month. If the cycle data protocol is running regularly, this should work itself out in a few days.
I will check again tomorrow, if data has been removed from talos_perftest_1.test_run then we're good. If not, there might be some problem preventing the cycle_data protocol from running.
Reporter
Comment 35•11 years ago
Ok, so it looks like the cycle_data protocol is indeed running and working! The oldest date in talos_perftest_1.test_run is now Sun, 16 Jan 2013. The objectstore and perftest databases are likely out of sync from all of the tests we did. So we're able to remove about 3 days of data per night at the current iteration count and limit size; I would like to try increasing that a bit. I will do that in another bug, since we have some more datazilla changes we need to land.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard