Bug 898492 (Closed): Datazilla Object Processing Problem
Opened 12 years ago • Closed 12 years ago
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task, P1)
Tracking
(Not tracked)
RESOLVED FIXED
People
(Reporter: jeads, Assigned: cturra)
References
Details
Attachments
(3 files)
Something is going wrong with object processing in datazilla. I'm not sure of the cause; I will likely need someone to take a look at the database transaction log. It appears all objects in all project objectstores have been marked as "ready" to process, which is not correct.
The following steps need to be carried out while I'm troubleshooting:
1.) Disable all cron jobs on the admin node.
2.) If there is a "manage.py process_objects --cron_batch small --loadlimit 25" process actively running in the process list, it needs to be killed.
I will file an additional bug to assess the problems on the master database. When the situation is resolved, I will indicate in a comment on this bug that it's OK to re-enable the crons.
Assignee
Comment 1 • 12 years ago
it looks like the admin host actually became (mostly) unresponsive on 07/20, which might explain this. it was responding correctly to the checks our monitors were making; however, the virtual machine this is hosted on was reporting all sorts of out-of-memory errors. i chatted with some other folks and it looks like the OOM (Out Of Memory) killer likely killed something vital to the machine.
i have rebooted the admin host and everything has come back up successfully. i am going to leave these crons enabled to see if things sort themselves back out now. do feel free to let me know if there are any other things you want run to deal with backlog (if the current crons won't do that).
Assignee: server-ops-webops → cturra
OS: Mac OS X → All
Hardware: x86 → All
Reporter
Comment 2 • 12 years ago
(In reply to Chris Turra [:cturra] from comment #1)
> it looks like the admin host actually became (mostly) unresponsive on
> 07/20, which might explain this. it was responding correctly to the checks
> our monitors were making, however, the virtual machine this is hosted on was
> reporting all sorts of out of memory errors. i chatted with some other folks
> and it looks like the OOM (Out Of Memory) killer likely killed something
> vital to the machine.
>
> i have rebooted the admin host and everything has come back up successfully.
> i am going to leave these crons enabled to see if things sort themselves
> back out now. do feel free to let me know if there are any other things you
> want run to deal with backlog (if the current crons won't do that).
Please DO NOT enable the crons yet! If there is an active process_objects manage command running at this time, please kill it. I believe the problem is related to some corrupted talos data; if process_objects picks up that data again, the same problem will reoccur. To fix this, we're going to need to hide it in the objectstore, which will require some SQL being executed.
Comment 3 • 12 years ago
(In reply to Jonathan Eads ( :jeads ) from comment #2)
> (In reply to Chris Turra [:cturra] from comment #1)
> > it looks like the admin host actually became (mostly) unresponsive on
> > 07/20, which might explain this. it was responding correctly to the checks
> > our monitors were making, however, the virtual machine this is hosted on was
> > reporting all sorts of out of memory errors. i chatted with some other folks
> > and it looks like the OOM (Out Of Memory) killer likely killed something
> > vital to the machine.
> >
> > i have rebooted the admin host and everything has come back up successfully.
> > i am going to leave these crons enabled to see if things sort themselves
> > back out now. do feel free to let me know if there are any other things you
> > want run to deal with backlog (if the current crons won't do that).
>
> Please DO NOT enable the crons yet! If there is an active process_object
> manage command running at this time please kill it. I believe the problem is
> related to some corrupted talos data, if process_objects picks up that data
> again the same problem will reoccur, to fix this we're going to need to hide
> it in the objectstore. This will require some SQL being executed.
I've disabled all cron jobs for now and killed two process_objects processes:
[root@datazillaadm.private.scl3 ~]# ps waux | grep process
root 4427 0.0 0.0 9224 1076 ? Ss 10:06 0:00 /bin/sh -c $PYTHON_ROOT/python $DATAZILLA_HOME/manage.py process_objects --cron_batch small --loadlimit 15 && $PYTHON_ROOT/python $DATAZILLA_HOME/manage.py process_objects --cron_batch small --loadlimit 15
root 5457 1.6 1.7 356540 33876 ? S 10:07 0:00 /usr/bin/python /data/datazilla/src/datazilla.mozilla.org/datazilla/manage.py process_objects --cron_batch small --loadlimit 15
root 5726 0.0 0.0 103240 848 pts/0 S+ 10:08 0:00 grep process
[root@datazillaadm.private.scl3 ~]# kill -9 4427
[root@datazillaadm.private.scl3 ~]# ps waux | grep process
root 5457 1.4 1.7 356540 33876 ? S 10:07 0:00 /usr/bin/python /data/datazilla/src/datazilla.mozilla.org/datazilla/manage.py process_objects --cron_batch small --loadlimit 15
root 5732 0.0 0.0 103240 848 pts/0 S+ 10:08 0:00 grep process
[root@datazillaadm.private.scl3 ~]# kill -9 5457
Comment 4 • 12 years ago
Disabled in Puppet
bburton@macbookair-00886541dab0 [10:12:50] [~/code/mozilla/sysadmins/puppet/trunk]
-> % svn ci -m "disabling cron jobs on datazillaadm, bug 898492"
Sending trunk/modules/webapp/files/datazilla/admin/etc-cron.d/datazilla.mozilla.org
Transmitting file data .
Committed revision 72018.
Please let us know when you want us to take further action.
Thanks
Reporter
Comment 5 • 12 years ago
The following steps need to be carried out.
1.) Execute the attached SQL statements. These statements add two indexes to the table `test_data_all_dimensions` in all perftest databases. This is not related to the problem we're trying to fix, but it's required by the new repository that we're pushing out. There's no data in the table, so this should run quickly.
2.) Push out the new repository to stage.
STOP HERE: I will validate that stage is working as expected. Once validated, carry out steps 3-5.
3.) Push out new repository to production
4.) Set up the new crontab.txt file in production
5.) Reactivate the cron jobs.
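For reference, the index additions in step 1 would look roughly like this. This is only a sketch: the index names and indexed columns here are assumptions, and the attached SQL statements are authoritative. It would be run once per perftest database.

```sql
-- Hypothetical sketch of the two step 1 index additions on the
-- (currently empty) test_data_all_dimensions table. Index names and
-- column choices are illustrative assumptions; see the attachment
-- for the real statements.
ALTER TABLE `test_data_all_dimensions`
  ADD INDEX `idx_test_run_id` (`test_run_id`),
  ADD INDEX `idx_page_id` (`page_id`);
```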
The new repository makes the following changes:
The cron job for processing the objects is now run individually for each project. This should prevent one project from delaying object processing in another.
The only talos jobs that get processed are tp5 jobs, and tp5 auxiliary data is no longer indexed.
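The per-project cron change described above might look roughly like this in the crontab. This is a sketch only: the `--project` flag, the project names, schedule, and paths are all assumptions based on the command lines shown earlier in this bug.

```
# Hypothetical per-project crontab entries: one process_objects run per
# project, so a backlog in one project cannot delay processing in
# another. Project names, flags, and timing are illustrative.
*/5 * * * * root $PYTHON_ROOT/python $DATAZILLA_HOME/manage.py process_objects --project talos --cron_batch small --loadlimit 15
*/5 * * * * root $PYTHON_ROOT/python $DATAZILLA_HOME/manage.py process_objects --project games --cron_batch small --loadlimit 15
```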
Reporter
Comment 6 • 12 years ago
Looks like `games_perftest_1` and `webpagetest_perftest_1` were missing the new test_data_all_dimensions table. When these projects were created, the new table had not yet been added to the repository.
Please execute the attached table creation statements.
Reporter
Comment 7 • 12 years ago
The following warning is being generated when the new table test_data_all_dimensions is populated:
"Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statements writing to a table with an auto-increment column after selecting from another table are unsafe because the order in which rows are retrieved determines what (if any) rows will be written. This order cannot be predicted and may differ on master and the slave."
The auto_increment id on the table that is triggering the warning is not really required: it's not needed as a primary key or as a method of data retrieval. This SQL script removes the id, which should eliminate the warning message.
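The fix described above amounts to something like the following. The attached script is authoritative; the column name `id` is taken from the comment, and the exact DDL here is an assumption.

```sql
-- Sketch of the attached fix: drop the auto-increment id column. It is
-- not used as a primary key or for retrieval, and with statement-based
-- binary logging (BINLOG_FORMAT = STATEMENT) the row order of
-- INSERT ... SELECT is nondeterministic, so auto-increment values could
-- differ between master and slave.
ALTER TABLE `test_data_all_dimensions` DROP COLUMN `id`;
```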
Assignee
Comment 8 • 12 years ago
as discussed on irc, all these steps have now been sorted. i will monitor the cron mail to ensure we're not causing all sorts of errors, but at first glance everything looks okay.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated • 12 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated • 6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard