Closed
Bug 993665
Opened 11 years ago
Closed 11 years ago
Datazilla Repository Update
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jeads, Assigned: cturra)
References
Details
A new datazilla repository is ready to push to stage. This release includes the following changes:
1.) The alerts specification of "Device Type: tarako, Product: B2G, v1.3, prerelease" was removed from https://github.com/mozilla/datazilla/blob/master/datazilla/settings/ingestion_alerts.json. This will be put back in place once test data is ready.
2.) An updated data cycling protocol that addresses the issues described in Bug 990771. The protocol can be found here:
https://github.com/mozilla/datazilla/blob/master/datazilla/controller/admin/management/commands/cycle_data.py
https://github.com/mozilla/datazilla/blob/master/datazilla/model/base.py#L1385
https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/perftest.json#L2
https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/objectstore.json#L146
This includes a new cron job that launches the data cycling protocol. All test data in the objectstore and perftest schemas that is more than 6 months old will be deleted nightly. The SQL statements carrying out the deletes use nested selects to reduce the join costs and have their LIMIT set to 10000 rows. In a single cron run, 50 iterations are allowed, so a maximum of 500000 rows from any given table would be deleted in a single nightly run.
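For illustration, the batching works roughly like the minimal sketch below (a sketch only, not the production code: the real statements use nested selects and live in the JSON files linked above; table and column names cannot be bound as placeholders, so they are interpolated here).

import MySQLdb

ROW_LIMIT = 10000      # rows removed per DELETE statement
MAX_ITERATIONS = 50    # iterations allowed per nightly cron run
SIX_MONTHS_SECONDS = 6 * 30 * 24 * 60 * 60

def cycle_table(conn, table, date_column, now):
    """Delete rows older than six months in LIMIT-bounded batches.

    Sketch only: the production SQL is defined in objectstore.json and
    perftest.json and is driven by the cycle_data management command.
    """
    cutoff = now - SIX_MONTHS_SECONDS
    cursor = conn.cursor()
    for _ in range(MAX_ITERATIONS):
        cursor.execute(
            "DELETE FROM %s WHERE %s < %%s LIMIT %d" % (table, date_column, ROW_LIMIT),
            (cutoff,),
        )
        conn.commit()
        if cursor.rowcount < ROW_LIMIT:
            break   # nothing older than the cutoff is left

With 50 iterations and a 10000-row limit, this caps each nightly run at 500000 rows per table, matching the numbers above.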
To test, please update the repository, then run the following command on the admin node from the command line and provide any output from the command.
python manage.py cycle_data --project talos --iterations 2 --debug
Assignee
Comment 1•11 years ago
code update complete.
the cycle_data command failed however because the database user does not have delete table permissions.
_mysql_exceptions.OperationalError: (1142, "DELETE command denied to user 'datazilla_stage'@'10.22.70.209' for table 'objectstore'")
i am going to needinfo :sheeri for the permissions changes required here. this will need to be updated for both the stage and prod db users.
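For reference, once the grant lands it can be confirmed with something like the sketch below (assumptions: connection parameters are placeholders, and SHOW GRANTS is run as the application user).

import MySQLdb

# Sketch: confirm the stage user has DELETE once the grant is in place.
# Host, user, password and schema below are placeholders.
conn = MySQLdb.connect(host="datazilla-rw-vip.db.scl3.mozilla.com",
                       user="datazilla_stage", passwd="<redacted>",
                       db="talos_objectstore_1")
cursor = conn.cursor()
cursor.execute("SHOW GRANTS FOR CURRENT_USER()")
for (grant,) in cursor.fetchall():
    print grant   # expect DELETE to appear for the talos_* schemas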
Assignee: server-ops-webops → cturra
Flags: needinfo?(scabral)
OS: Mac OS X → All
Hardware: x86 → All
Comment 2•11 years ago
delete privileges granted for stage, looking at prod now...
Flags: needinfo?(scabral)
Comment 3•11 years ago
prod perms granted to the datazilla_moz user.
Assignee
Comment 4•11 years ago
due to how long this takes to process, i am getting a 'lost connection to mysql server during query' exception returned. i have timed it and this error happens after exactly 5 minutes.
i'm going to need some time to track down what is killing these connections b/c at first glance, the load balancer does not have any obvious connection limits for this.
Reporter
Comment 5•11 years ago
Given that it's exactly 5 minutes, it's probably an automated process that kills long-running database queries once they hit a 5 minute run time. I think we need some help from sheeri on this. One option we could try is to reduce the execution time by lowering the limit on the deletes in the sql. They're set to 10,000 rows at the moment, but for tables with one-to-many relationships with "*perftest*.test_run" the number of rows actually touched could be higher. It would be useful to know which table we're stalling on. Next time you run this, set DEBUG=True and we should be able to see which table is causing the problem on stdout.
sheeri what do you think about reducing the sql limit to 1000 and trying again?
Flags: needinfo?(scabral)
Comment 6•11 years ago
One query deleting 10,000 rows shouldn't take 5 minutes, it should be much shorter than that...so I'm not sure reducing to 1,000 rows would work, but it's worth a try as a workaround until Chris has time to look into this.
We do not run query killers on the machines, and MySQL only kills connections if the connection is idle - not during a query.
Flags: needinfo?(scabral)
Comment 7•11 years ago
Found it - both cturra and I only looked at the "connection management" section of the Pool configuration. There's also a "Virtual Server" configuration, which states:
Connection management settings control how the virtual server manages connections from the remote client.
Connection failures are logged for no connections.
On backend errors the client is sent: Default
New connection timeout is set to 10 seconds.
Existing connection timeout is set to 300 seconds.
For cturra, the IP is:
https://internal.zlb.ops.scl3.mozilla.com:9090/apps/zxtm/?name=datazilla-rw-db&section=Virtual%20Servers%3AEdit
I've changed the existing connection timeout on the rw VIP to 0, which disables the timeout (i.e. unlimited).
Assignee
Comment 8•11 years ago
good catch :sheeri - apparently heartbleed has gotten the best of me also.
i just let this run for almost 1.5 hours before i killed it. i'm not entirely sure what it's doing, but that seems like an excessively long amount of time to do its task. :jeads - thoughts?
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data.log ; date
Thu Apr 10 09:32:49 PDT 2014
/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py:317: Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
cursor.execute(sql, kwargs['placeholders'])
^C
Thu Apr 10 10:50:45 PDT 2014
Flags: needinfo?(jeads)
Comment 9•11 years ago
If it's the talos_objectstore_1 database, the objectstore table is 57G in size...
Also if you're looking at dates based on date_loaded, there's no index for that, so that may be why it's taking so long.
Reporter
Comment 10•11 years ago
Ok that's a problem. There was no use case in the application for indexing objectstore.date_loaded, but it sure would have come in handy now. An alternative strategy I could implement would be to identify the range of ids in talos_perftest_1.test_run that need to be deleted based on the indexed test_run.date_run (https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/template_schema/schema_perftest.sql.tmpl#L554) and use that in a where clause to match the test_run_id in talos_objectstore_1.objectstore (https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/template_schema/schema_objectstore.sql.tmpl#L17).
Would it be possible to add an index to date_loaded? Or perhaps that's not practical at this point given the table size. If so, I will implement the strategy suggested above.
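Roughly, that alternative strategy would look something like the sketch below (a sketch only, not the final implementation; the table and column names follow the linked schema files, and the helper name is illustrative).

def cycle_by_test_run_ids(perftest_conn, objectstore_conn, cutoff, limit=10000):
    """Sketch of the id-range strategy described above.

    Picks stale test_run ids via the indexed test_run.date_run column, then
    deletes the matching objectstore rows by test_run_id, avoiding a scan of
    the unindexed objectstore.date_loaded column.
    """
    pt_cursor = perftest_conn.cursor()
    pt_cursor.execute(
        "SELECT id FROM test_run WHERE date_run < %s ORDER BY id ASC LIMIT %s",
        (cutoff, limit),
    )
    ids = [row[0] for row in pt_cursor.fetchall()]
    if not ids:
        return 0
    placeholders = ",".join(["%s"] * len(ids))
    os_cursor = objectstore_conn.cursor()
    os_cursor.execute(
        "DELETE FROM objectstore WHERE test_run_id IN (%s)" % placeholders,
        ids,
    )
    objectstore_conn.commit()
    return os_cursor.rowcount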
Flags: needinfo?(jeads) → needinfo?(scabral)
Comment 11•11 years ago
We can add an index but the table will be offline while the index is added, which could be a while.
One way to fake this, if date_loaded is sequential, is to look for the autoincrement ID closest to the date you want and delete by id, just this one time.
Then once you have the table you want, add the index in, which will rebuild the table (so it'll be offline), but the table will be smaller, so there's less to rebuild.
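A sketch of how the cutoff id could be found without an index on date_loaded (assumptions: ids and date_loaded both increase over time, the low/high id bounds are known, and the helper name is hypothetical):

def find_id_cutoff(conn, cutoff_timestamp, lo, hi):
    """Binary-search the primary key for the last id loaded before the cutoff.

    Each probe is a cheap primary-key range lookup, so no index on
    date_loaded is needed. Old rows can then be removed in batches, e.g.
    DELETE FROM objectstore WHERE id < <cutoff id> LIMIT 10000, and the
    date_loaded index added once the table is smaller.
    """
    cursor = conn.cursor()
    while lo < hi:
        mid = (lo + hi + 1) // 2
        cursor.execute(
            "SELECT date_loaded FROM objectstore WHERE id >= %s "
            "ORDER BY id ASC LIMIT 1",
            (mid,),
        )
        row = cursor.fetchone()
        if row is None or row[0] >= cutoff_timestamp:
            hi = mid - 1   # rows from mid onward are too new (or absent)
        else:
            lo = mid       # a row at or after mid is still older than the cutoff
    return lo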
Flags: needinfo?(scabral)
Reporter
Comment 12•11 years ago
Good suggestion, thanks sheeri! So it looks like id 3,000,000 corresponds to a date_loaded timestamp of 1381396223 (10/10/2013), which is close enough to 6 months ago for me. I've hard-coded this id into the sql statement here: https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/objectstore.json#L154. There are 2,620,009 ids in the objectstore that are < 3,000,000, so at 50 iterations and a 10000 row limit we should be deleting 500000 rows every 24 hours, which adds up to 5.2 days until we have removed the excess. At that point, if all goes well, I would like to schedule the index addition and, if it can be done in the same operation, an increase of max_data_length so we have some breathing room, like the other data tables in talos_perftest_1. I will address this in a separate bug when the time comes.
So, cturra, could you please update the repository again and try the now notoriously failing command:
python manage.py cycle_data --project talos --iterations 2 --debug
This command is also cycling the data in talos_perftest_1 so it will take longer than what would be expected for just cycling the single objectstore table.
Flags: needinfo?(cturra)
Comment 13•11 years ago
No worries! I've had to do this kind of thing before. It's what makes me say that if you have a date in your table, you probably want an index on it, because at some point you're going to be looking at a date range :D
Assignee
Comment 14•11 years ago
:jeads - i updated the repo and ran this command. after just over an hour of seeing nothing written to stdout, i ran an strace on the process and saw the following repeating:
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data.log ; date
Fri Apr 11 14:53:21 PDT 2014
[root@datazillaadm.private.scl3 datazilla]# strace -p 28657
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
link("/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazillaadm.private.scl3.mozilla.com-689b1700.24761", "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/cycle_data_talos.lock") = -1 EEXIST (File exists)
stat("/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazillaadm.private.scl3.mozilla.com-689b1700.24761", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
link("/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazillaadm.private.scl3.mozilla.com-689b1700.24761", "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/cycle_data_talos.lock") = -1 EEXIST (File exists)
stat("/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazillaadm.private.scl3.mozilla.com-689b1700.24761", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
...
[root@datazillaadm.private.scl3 datazilla]# git log -1 | grep commit
commit 86d36e70ab7edca980656a831f527176e8e50468
thoughts?
Flags: needinfo?(cturra) → needinfo?(jeads)
Assignee
Comment 15•11 years ago
wait. i really should have looked at that strace more closely - it's not doing anything b/c the lockfile check is returning true! i just deleted cycle_data_talos.lock and have kicked off the test again. will report back when i have more.
Flags: needinfo?(jeads)
Assignee
Comment 16•11 years ago
time to share an update. after removing the existing connection timeout limit on the load balancer, i am still seeing 'Lost connection to MySQL server during query' errors. however, they now happen much longer after the job kicks off - 2 hours to be exact.
in my most recent run, the command was started at 11:38:12 PDT and errored out at Apr 14 13:38:28 PDT:
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data.log ; date
Mon Apr 14 11:38:12 PDT 2014
/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py:319: Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
cursor.execute(sql)
Traceback (most recent call last):
File "manage.py", line 13, in <module>
execute_from_command_line(sys.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 443, in execute_from_command_line
utility.execute()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 382, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 196, in run_from_argv
self.execute(*args, **options.__dict__)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 232, in execute
output = self.handle(*args, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 371, in handle
return self.handle_noargs(**options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/base.py", line 137, in handle_noargs
Starting for projects: talos
self.handle_project(p, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/cycle_data.py", line 45, in handle_project
sql_targets = ptm.cycle_data(sql_targets)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1417, in cycle_data
min_date, 'perftest', perftest_sql_to_execute, sql_targets
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1446, in _execute_table_deletes
debug_show=self.DEBUG,
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/RDBSHub.py", line 71, in wrapper
return func(self, **kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 136, in execute
return self.__execute(sql, kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 300, in __execute
self.__cursor_execute(sql, kwargs, cursor)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 317, in __cursor_execute
cursor.execute(sql, kwargs['placeholders'])
File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 173, in execute
self.errorhandler(self, exc, value)
File "/usr/lib64/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
Mon Apr 14 13:38:28 PDT 2014
Reporter
Comment 17•11 years ago
cturra was there anything written to /home/cturra/cycle_data.log? If so, could you attach it? It looks like we've deleted 40,000 rows from talos_objectstore_1.objectstore. I was expecting 20,000 from the command you ran, the extra 20,000 is probably from the previous command you tried. This leads me to think the problem is with one of the tables in talos_perftest_1 and the data cycling is working fine on the objectstore, if DEBUG=True in local.py when you ran this, I should be able to determine the last table that the cycle_data command was operating on.
Flags: needinfo?(cturra)
Assignee
Comment 18•11 years ago
(In reply to Jonathan Eads ( :jeads ) from comment #17)
> cturra was there anything written to /home/cturra/cycle_data.log? If so,
> could you attach it?
the only thing written is: "Starting for projects: talos"
> It looks like we've deleted 40,000 rows from
> talos_objectstore_1.objectstore. I was expecting 20,000 from the command you
> ran, the extra 20,000 is probably from the previous command you tried. This
> leads me to think the problem is with one of the tables in talos_perftest_1
> and the data cycling is working fine on the objectstore, if DEBUG=True in
> local.py when you ran this, I should be able to determine the last table
> that the cycle_data command was operating on.
i just ran the command once again with `DEBUG = True` and received the following:
Mon Apr 14 15:27:45 PDT 2014
Starting for projects: talos
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_objectstore_1 host_type:master_host proc:generic.db_control.disable_foreign_key_checks
Executing SQL:SET FOREIGN_KEY_CHECKS = 0
Execution Time:7.4196e-04 sec
Traceback (most recent call last):
File "manage.py", line 13, in <module>
execute_from_command_line(sys.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 443, in execute_from_command_line
utility.execute()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 382, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 196, in run_from_argv
self.execute(*args, **options.__dict__)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 232, in execute
output = self.handle(*args, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 371, in handle
return self.handle_noargs(**options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/base.py", line 137, in handle_noargs
self.handle_project(p, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/cycle_data.py", line 45, in handle_project
sql_targets = ptm.cycle_data(sql_targets)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1413, in cycle_data
min_date, 'objectstore', objectstore_sql_to_execute, sql_targets
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1440, in _execute_table_deletes
debug_show=self.DEBUG,
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/RDBSHub.py", line 71, in wrapper
return func(self, **kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 136, in execute
return self.__execute(sql, kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 291, in __execute
tmsg = t.timeit(1)
File "/usr/lib64/python2.6/timeit.py", line 193, in timeit
timing = self.inner(it, self.timer)
File "/usr/lib64/python2.6/timeit.py", line 99, in inner
_func()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 288, in timewrapper
self.__cursor_execute(sql, kwargs, cursor)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 319, in __cursor_execute
cursor.execute(sql)
File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 175, in execute
if not self._defer_warnings: self._warning_check()
File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 89, in _warning_check
warn(w[-1], self.Warning, 3)
_mysql_exceptions.Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
Mon Apr 14 15:27:49 PDT 2014
Flags: needinfo?(cturra)
Reporter
Comment 19•11 years ago
Ok, that actually worked on the objectstore. Exactly 10,000 rows were deleted. As a test, I've commented out the disable/enable foreign key step and all deletes on the perftest schema.
Could you update the repository again and try running the command?
I'm hoping to see the command exit on its own; if so, I will start adding functionality back until we determine the cause of the hang.
Assignee
Comment 20•11 years ago
same traceback, but doesn't appear to be doing the `disable_foreign_key_checks`
[root@datazillaadm.private.scl3 datazilla]# git log -1
commit 879cd2b780726ca44da0ee201ac619b5171532a2
Author: Jonathan Eads <superjeads@gmail.com>
Date: Mon Apr 14 15:41:46 2014 -0700
commented out perftest table deletes and disable/enable foreing keys
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data-041414.log ; date
Mon Apr 14 15:55:40 PDT 2014
Traceback (most recent call last):
File "manage.py", line 13, in <module>
...
_mysql_exceptions.Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
Mon Apr 14 15:55:47 PDT 2014
Comment 21•11 years ago
FWIW the warning doesn't stop the query (at least, not on the MySQL side), so that's OK. We can get rid of the warning by changing to MIXED or ROW-based binary logging.
Assignee
Comment 22•11 years ago
i am no longer seeing mysql warnings, but we're still seeing the connection close after 2 hours (stdout from the last run below). while this was executing, i ran an `strace` on the process, which sat at the following *until* the connection was closed:
[root@datazillaadm.private.scl3 ~]# strace -p 28818 -f
Process 28818 attached - interrupt to quit
read(4,
i'd love to get your thoughts on where we go from here :jeads? is it possible to process these in smaller batches?
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data-042514.log ; date
Fri Apr 25 14:29:49 PDT 2014
Starting for projects: talos
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_perftest_1 host_type:master_host proc:perftest.selects.get_test_run_rows_to_cycle
Executing SQL:SELECT id FROM test_run WHERE date_run < %s ORDER BY id ASC LIMIT 5000
Execution Time:1.1465e-02 sec
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_objectstore_1 host_type:master_host proc:objectstore.deletes.cycle_objectstore_by_test_run_ids
Executing SQL:DELETE FROM objectstore WHERE test_run_id IN (%s,...... *a LOT of %s
Execution Time:2.4937e-01 sec
Traceback (most recent call last):
File "manage.py", line 13, in <module>
execute_from_command_line(sys.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 443, in execute_from_command_line
utility.execute()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/__init__.py", line 382, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 196, in run_from_argv
self.execute(*args, **options.__dict__)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 232, in execute
output = self.handle(*args, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/django/core/management/base.py", line 371, in handle
return self.handle_noargs(**options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/base.py", line 137, in handle_noargs
self.handle_project(p, **options)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/controller/admin/management/commands/cycle_data.py", line 45, in handle_project
sql_targets = ptm.cycle_data(sql_targets)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1441, in cycle_data
min_date, 'perftest', perftest_sql_to_execute, sql_targets
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/datazilla/model/base.py", line 1463, in _execute_table_deletes
debug_show=self.DEBUG
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/RDBSHub.py", line 71, in wrapper
return func(self, **kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 136, in execute
return self.__execute(sql, kwargs)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 291, in __execute
tmsg = t.timeit(1)
File "/usr/lib64/python2.6/timeit.py", line 193, in timeit
timing = self.inner(it, self.timer)
File "/usr/lib64/python2.6/timeit.py", line 99, in inner
_func()
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 288, in timewrapper
self.__cursor_execute(sql, kwargs, cursor)
File "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/vendor/datasource/bases/SQLHub.py", line 317, in __cursor_execute
cursor.execute(sql, kwargs['placeholders'])
File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 173, in execute
self.errorhandler(self, exc, value)
File "/usr/lib64/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
Fri Apr 25 16:29:55 PDT 2014
Flags: needinfo?(jeads)
Reporter
Comment 23•11 years ago
Hmmm, I'm surprised the db is having problems with that. On Monday I will reduce the rows deleted to something very small and see if that resolves the issue. Thanks for giving this another try; I will let you know when the modifications are ready.
Flags: needinfo?(jeads)
Reporter
Comment 24•11 years ago
Ok I've modified the row count for the data cycling to be 100 rows. Please update the repository and run the command again.
Flags: needinfo?(cturra)
Assignee
Comment 25•11 years ago
woohoo! it finished. maybe we should pick a reasonably large number of rows and try again?
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data-050114.log ; date
Thu May 1 15:34:29 PDT 2014
Starting for projects: talos
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_perftest_1 host_type:master_host proc:perftest.selects.get_test_run_rows_to_cycle
Executing SQL:SELECT id FROM test_run WHERE date_run < %s ORDER BY id ASC LIMIT 100
Execution Time:1.3880e-01 sec
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_objectstore_1 host_type:master_host proc:objectstore.deletes.cycle_objectstore_by_test_run_ids
Executing SQL:DELETE FROM objectstore WHERE test_run_id IN (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
Execution Time:2.3140e-02 sec
datasource.hubs.MySQL.MySQL debug message:
host:datazilla-rw-vip.db.scl3.mozilla.com db:talos_perftest_1 host_type:master_host proc:perftest.deletes.cycle_test_aux_data
Executing SQL:DELETE FROM test_aux_data WHERE test_run_id IN (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
Execution Time:6.2075e+01 sec
Thu May 1 15:35:38 PDT 2014
Flags: needinfo?(cturra) → needinfo?(jeads)
Reporter
Comment 26•11 years ago
Super! Ok, so I increased the cycling to 1000 rows. That seems reasonable looking at the execution times. In crontab.txt I increased the iterations to 100 to account for the reduction in row count, to make sure we keep up with the data load.
Could you update the repository and run the command one more time for good measure? If it completes in a reasonable amount of time, go ahead and push the repository and new cron command to production (https://github.com/mozilla/datazilla/blob/master/crontab.txt#L11) and we can put this well-seasoned bug to rest...
Once this is in production make sure to set DEBUG = False in local.py so we're not writing out all the debug info.
Flags: needinfo?(jeads)
Assignee
Comment 27•11 years ago
maybe 1000 is too much? i have tried running this a couple of times with the same result - it's not finishing. the following is the latest cycle_data run i kicked off (almost 24 hours ago):
[root@datazillaadm.private.scl3 datazilla]# date; python manage.py cycle_data --project talos --iterations 2 --debug | tee -a /home/cturra/cycle_data-050114-1000.log ; date
Mon May 5 12:35:12 PDT 2014
Reporter
Comment 28•11 years ago
Ok, I reduced the limit down to 250. There's no reason for a delete of 1000 rows to take ~24 hours, so there's likely a problem with the database index or something. Please give it another try, and a sincere thank you for your persistence.
Assignee
Comment 29•11 years ago
i spent some time last night playing with this more, and according to `strace` the reason the previous cycle_data runs were not exiting was that a lock file was present. the process was looping on the lock file check.
after removing the lock file, i was able to test [1] runs with: 250, 500, 1000 and even 2500! that last 2500 run took from 9:59am -> 10:48am.
when we set up the cron here, i think it'd be best to confirm no lock file is present before running the cycle_data command. thoughts on the database select size and frequency?
[1] i manually updated the select statement on line 344 of perftest.json => https://github.com/mozilla/datazilla/blob/0e2a9903b4f93d81830bd0760daa1de6f6f6b098/datazilla/model/sql/perftest.json#L344
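Along those lines, a wrapper for the cron entry might look something like the sketch below (a hypothetical sketch: the lock path matches the one seen in the strace above, and a production version would also want to decide whether a leftover lock is genuinely stale).

import os
import subprocess
import sys

# Hypothetical pre-flight check for the cron entry: refuse to start a new
# cycle_data run while a previous run's lock file is still present.
LOCK = "/data/datazilla-stage/src/datazilla.allizom.org/datazilla/cycle_data_talos.lock"

if os.path.exists(LOCK):
    sys.exit("lock file present: %s -- not starting cycle_data" % LOCK)

subprocess.call([
    "python", "manage.py", "cycle_data",
    "--project", "talos", "--iterations", "50",
])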
Flags: needinfo?(jeads)
Reporter
Comment 30•11 years ago
Good catch, that makes a lot more sense! I think confirming that there are no lock files present when installing a new cron or testing manage.py commands would be a very good idea.
I set the limit in the select to 500 and reduced the iteration count passed in the cron job to 50, so a maximum of 25,000 rows could be deleted every 24 hours. Based on the execution times you're seeing for 100-2500, I think that's a reasonable place to start. There's an explicit sleep step between database deletes to let other queries through in the event of a long run. I will keep an eye on the database once this is in production and make sure that we keep up with the data generation.
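In outline, the nightly run then behaves something like the sketch below (helper names and the sleep interval are illustrative, not the actual implementation; the real loop lives in base.py's cycle_data).

import time

ITERATIONS = 50   # passed to cycle_data by the cron job
ROW_LIMIT = 500   # LIMIT on the test_run select

def nightly_cycle(ptm, cutoff):
    """Illustrative outline: at most 50 x 500 = 25,000 test runs cycled per
    night, with a pause between batches so other queries get a turn."""
    for _ in range(ITERATIONS):
        deleted = ptm.cycle_batch(cutoff, limit=ROW_LIMIT)  # hypothetical helper
        if deleted == 0:
            break            # nothing older than six months remains
        time.sleep(5)        # illustrative pause between delete batches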
Flags: needinfo?(jeads)
Assignee
Comment 31•11 years ago
i have applied the cycle_data cron referenced at:
https://github.com/mozilla/datazilla/blob/74c7d439957d88f0723c5fab3f62f58a2ebf5000/crontab.txt#L11
let's poke at the service/db tomorrow morning to make sure everything ran smoothly.
Flags: needinfo?(jeads)
Assignee
Comment 32•11 years ago
the first cycle_data cron job ran last night. did it appear to do what it should have on the db?
May 16 00:00:01 datazillaadm CROND[8229]: (root) CMD ($PYTHON_ROOT/python $DATAZILLA_HOME/manage.py cycle_data --project talos --iterations 50 > /dev/null 2>&1)
Reporter
Comment 33•11 years ago
The cycle_data command looks like it's running fine. The oldest data in talos_objectstore_1 is now from Wed, 20 Feb 2013, which would indicate that data is being cycled. I should see the oldest timestamp steadily approaching a maximum of 6 months ago and the data_length reported below either staying at approximately the same size or decreasing.
mysql> select table_schema,table_name,max_data_length,data_length from information_Schema.tables where table_name='objectstore' and table_schema like 'talos%';
+---------------------+-------------+-----------------+-------------+
| table_schema | table_name | max_data_length | data_length |
+---------------------+-------------+-----------------+-------------+
| talos_objectstore_1 | objectstore | 68719484928 | 65060069376 |
+---------------------+-------------+-----------------+-------------+
1 row in set (0.36 sec)
I would like to validate that the cycle data rate is removing data faster than we're adding it. So, I will run the same query tomorrow and confirm that the data_length and oldest timestamp are moving in the right direction.
Flags: needinfo?(jeads)
Reporter
Comment 34•11 years ago
So it doesn't look like any data cycling occurred last night on talos_objectstore_1.
mysql> select table_schema,table_name,max_data_length,data_length from information_Schema.tables where table_name='objectstore' and table_schema like 'talos%';
+---------------------+-------------+-----------------+-------------+
| table_schema | table_name | max_data_length | data_length |
+---------------------+-------------+-----------------+-------------+
| talos_objectstore_1 | objectstore | 68719484928 | 65384742912 |
+---------------------+-------------+-----------------+-------------+
1 row in set (0.17 sec)
The oldest data is still from Wed, 20 Feb 2013 and it looks like data_length has increased. I checked the talos_perftest_1.test_run table; the data cycling protocol uses this table to identify ids that are out of range to delete from the rest of the schema and from the objectstore. The oldest date in that table is from Sun, 13 Jan 2013, which is older than what's in the objectstore. So it's possible that talos_objectstore_1 is out of sync with the perftest schema by a month. If the cycle data protocol is running regularly, this should work itself out in a few days.
I will check again tomorrow, if data has been removed from talos_perftest_1.test_run then we're good. If not, there might be some problem preventing the cycle_data protocol from running.
Reporter
Comment 35•11 years ago
Ok, so it looks like the cycle_data protocol is indeed running and working! The oldest date in talos_perftest_1.test_run is now Sun, 16 Jan 2013. The objectstore and perftest databases are likely out of sync from all of the tests we did. So we're able to remove about 3 days of data per night at the current iteration count and limit size; I would like to try increasing that a bit. I will do that in another bug, since we have some more datazilla changes we need to land.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard