Bug 898492 (Closed): Datazilla Object Processing Problem
Opened 12 years ago • Closed 12 years ago
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task, P1)
Tracking
(Not tracked)
RESOLVED FIXED
People
(Reporter: jeads, Assigned: cturra)
References
Details
Attachments
(3 files)
Something is going wrong with object processing in datazilla. I'm not sure of the cause; I will likely need someone to take a look at the database transaction log. It appears all objects in all project objectstores have been marked as "ready" to process, which is not correct.
The following steps need to be carried out while I'm troubleshooting:
1.) Disable all cron jobs on the admin node.
2.) If there is a "manage.py process_objects --cron_batch small --loadlimit 25" process actively running in the process list, it needs to be killed.
I will file an additional bug to assess the problems on the master database. When the situation is resolved, I will indicate in a comment on this bug that it's OK to re-enable the crons.
Assignee
Comment 1 • 12 years ago
it looks like the admin host actually became (mostly) unresponsive on 07/20, which might explain this. it was responding correctly to the checks our monitors were making; however, the virtual machine this is hosted on was reporting all sorts of out-of-memory errors. i chatted with some other folks and it looks like the OOM (Out Of Memory) killer likely killed something vital to the machine.
i have rebooted the admin host and everything has come back up successfully. i am going to leave these crons enabled to see if things sort themselves back out now. do feel free to let me know if there are any other things you want run to deal with backlog (if the current crons won't do that).
Assignee: server-ops-webops → cturra
OS: Mac OS X → All
Hardware: x86 → All
Reporter
Comment 2 • 12 years ago
(In reply to Chris Turra [:cturra] from comment #1)
> it looks like the admin host actually became (mostly) unresponsive on
> 07/20, which might explain this. it was responding correctly to the checks
> our monitors were making, however, the virtual machine this is hosted on was
> reporting all sorts of out of memory errors. i chatted with some other folks
> and it looks like the OOM (Out Of Memory) killer likely killed something
> vital to the machine.
>
> i have rebooted the admin host and everything has come back up successfully.
> i am going to leave these crons enabled to see if things sort themselves
> back out now. do feel free to let me know if there are any other things you
> want run to deal with backlog (if the current crons won't do that).
Please DO NOT enable the crons yet! If there is an active process_objects manage command running at this time, please kill it. I believe the problem is related to some corrupted talos data; if process_objects picks up that data again, the same problem will reoccur. To fix this, we're going to need to hide it in the objectstore, which will require some SQL being executed.
Comment 3 • 12 years ago
(In reply to Jonathan Eads ( :jeads ) from comment #2)
> (In reply to Chris Turra [:cturra] from comment #1)
> > it looks like the admin host actually became (mostly) unresponsive on
> > 07/20, which might explain this. it was responding correctly to the checks
> > our monitors were making, however, the virtual machine this is hosted on was
> > reporting all sorts of out of memory errors. i chatted with some other folks
> > and it looks like the OOM (Out Of Memory) killer likely killed something
> > vital to the machine.
> >
> > i have rebooted the admin host and everything has come back up successfully.
> > i am going to leave these crons enabled to see if things sort themselves
> > back out now. do feel free to let me know if there are any other things you
> > want run to deal with backlog (if the current crons won't do that).
>
> Please DO NOT enable the crons yet! If there is an active process_object
> manage command running at this time please kill it. I believe the problem is
> related to some corrupted talos data, if process_objects picks up that data
> again the same problem will reoccur, to fix this we're going to need to hide
> it in the objectstore. This will require some SQL being executed.
I've disabled all cron jobs for now and killed two process_objects processes:
[root@datazillaadm.private.scl3 ~]# ps waux | grep process
root 4427 0.0 0.0 9224 1076 ? Ss 10:06 0:00 /bin/sh -c $PYTHON_ROOT/python $DATAZILLA_HOME/manage.py process_objects --cron_batch small --loadlimit 15 && $PYTHON_ROOT/python $DATAZILLA_HOME/manage.py process_objects --cron_batch small --loadlimit 15
root 5457 1.6 1.7 356540 33876 ? S 10:07 0:00 /usr/bin/python /data/datazilla/src/datazilla.mozilla.org/datazilla/manage.py process_objects --cron_batch small --loadlimit 15
root 5726 0.0 0.0 103240 848 pts/0 S+ 10:08 0:00 grep process
[root@datazillaadm.private.scl3 ~]# kill -9 4427
[root@datazillaadm.private.scl3 ~]# ps waux | grep process
root 5457 1.4 1.7 356540 33876 ? S 10:07 0:00 /usr/bin/python /data/datazilla/src/datazilla.mozilla.org/datazilla/manage.py process_objects --cron_batch small --loadlimit 15
root 5732 0.0 0.0 103240 848 pts/0 S+ 10:08 0:00 grep process
[root@datazillaadm.private.scl3 ~]# kill -9 5457
Comment 4 • 12 years ago
Disabled in Puppet
bburton@macbookair-00886541dab0 [10:12:50] [~/code/mozilla/sysadmins/puppet/trunk]
-> % svn ci -m "disabling cron jobs on datazillaadm, bug 898492"
Sending trunk/modules/webapp/files/datazilla/admin/etc-cron.d/datazilla.mozilla.org
Transmitting file data .
Committed revision 72018.
Please let us know when you want us to take further action.
Thanks
Reporter
Comment 5 • 12 years ago
The following steps need to be carried out.
1.) Execute the attached SQL statements. These statements add two indexes to the table `test_data_all_dimensions` in all perftest databases. This is not related to the problem we're trying to fix, but it's required by the new repository that we're pushing out. There's no data in the table, so this should run quickly.
2.) Push out the new repository to stage.
STOP HERE: I will validate that stage is working as expected. Once validated, carry out steps 3-5.
3.) Push out new repository to production
4.) Set up the new crontab.txt file in production
5.) Reactivate the cron jobs.
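For reference, the index additions in step 1 would look roughly like this. This is only a sketch: the index names and indexed columns here are assumptions, and the attached SQL statements are authoritative. It would be run once per perftest database.

```sql
-- Hypothetical sketch of the two step 1 index additions on the
-- (currently empty) test_data_all_dimensions table. Index names and
-- column choices are illustrative assumptions; see the attachment
-- for the real statements.
ALTER TABLE `test_data_all_dimensions`
  ADD INDEX `idx_test_run_id` (`test_run_id`),
  ADD INDEX `idx_page_id` (`page_id`);
```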
The new repository makes the following changes:
The cron job for processing the objects is now run individually for each project. This should prevent one project from delaying object processing in another.
The only talos jobs that get processed are tp5 jobs, and tp5 auxiliary data is no longer indexed.
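The per-project cron change described above might look roughly like this in the crontab. This is a sketch only: the `--project` flag, the project names, schedule, and paths are all assumptions based on the command lines shown earlier in this bug.

```
# Hypothetical per-project crontab entries: one process_objects run per
# project, so a backlog in one project cannot delay processing in
# another. Project names, flags, and timing are illustrative.
*/5 * * * * root $PYTHON_ROOT/python $DATAZILLA_HOME/manage.py process_objects --project talos --cron_batch small --loadlimit 15
*/5 * * * * root $PYTHON_ROOT/python $DATAZILLA_HOME/manage.py process_objects --project games --cron_batch small --loadlimit 15
```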
Reporter
Comment 6 • 12 years ago
Looks like `games_perftest_1` and `webpagetest_perftest_1` were missing the new test_data_all_dimensions table. When these projects were created, the new table had not yet been added to the repository.
Please execute the attached table creation statements.
Reporter
Comment 7 • 12 years ago
The following warning is being generated when the new table test_data_all_dimensions is populated:
"Warning: Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statements writing to a table with an auto-increment column after selecting from another table are unsafe because the order in which rows are retrieved determines what (if any) rows will be written. This order cannot be predicted and may differ on master and the slave."
The auto_increment id on the table that is triggering the warning is not really required: it's not needed as a primary key or as a method of data retrieval. This SQL script removes the id, which should eliminate the warning message.
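The fix described above amounts to something like the following. The attached script is authoritative; the column name `id` is taken from the comment, and the exact DDL here is an assumption.

```sql
-- Sketch of the attached fix: drop the auto-increment id column. It is
-- not used as a primary key or for retrieval, and with statement-based
-- binary logging (BINLOG_FORMAT = STATEMENT) the row order of
-- INSERT ... SELECT is nondeterministic, so auto-increment values could
-- differ between master and slave.
ALTER TABLE `test_data_all_dimensions` DROP COLUMN `id`;
```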
Assignee
Comment 8 • 12 years ago
as discussed on irc, all these steps have now been sorted. i will monitor the cron mail to ensure we're not causing all sorts of errors, but at first glance everything looks okay.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated • 12 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated • 6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard