Closed Bug 421459 Opened 16 years ago Closed 16 years ago

dingbat out of space

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: samuel.sidler+old, Assigned: justin)

Details

The Talkback server, dingbat, is out of diskspace. This is an urgent issue.

The scripts that clean out the db have been failing for a few days, and the database is now full as a result.

Jay has more details.
Nagios wasn't watching this.  It is now.

The partition in question is the database store.  This'll probably require DBA involvement to clean up the damage.
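
For reference, a disk check of the kind Nagios runs can be sketched in a few lines. This is a minimal illustration, not the check that was actually deployed: the function names and the 90%/95% thresholds are assumptions, and only the 0/1/2 exit-code convention for OK/WARNING/CRITICAL comes from Nagios itself. The percentage is computed as used / (used + free), as df does, so a partition with 0 bytes available reports 100%.

```python
import shutil

# Nagios plugin exit codes (standard convention).
OK, WARNING, CRITICAL = 0, 1, 2


def classify(used_bytes, free_bytes, warn_pct=90.0, crit_pct=95.0):
    """Return (exit_code, message) for the given usage figures.

    The thresholds are hypothetical defaults, not values from this bug.
    """
    usable = used_bytes + free_bytes
    # Mirror df: percentage of the space usable by unprivileged users.
    used_pct = 100.0 if usable == 0 else 100.0 * used_bytes / usable
    if used_pct >= crit_pct:
        code, label = CRITICAL, "CRITICAL"
    elif used_pct >= warn_pct:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, "DISK %s - %.1f%% used" % (label, used_pct)


def check_path(path, warn_pct=90.0, crit_pct=95.0):
    """Check the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return classify(usage.used, usage.free, warn_pct, crit_pct)
```

A wrapper script would print the message from `check_path("/mnt/netapp/lun0")` and exit with the returned code, which is how Nagios decides whether to alert.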
Looks like the storage space for oracle on the netapp is out of space:
/dev/sdf1             493G  488G     0 100% /mnt/netapp/lun0

As Sam mentioned, the cleanup script that Ctuft set up a while back to delete
old data has failed on about 8 or 9 of the past 12 days... so I think we're in
trouble.  The script is:
oracle   22263 22262  0 23:00 ?        00:00:00 /bin/bash /data/oracle/admin/tools/run_job/run_job.sh TKDB spiral tbutil_pkg.trim_tables(2,90,30)

We thought this was a normal tablespace issue due to the error Sam noted:
03/06 22:18:05 (0844): Encountered ODBC error -1: S1000, 1653, [Oracle][ODBC][Ora]ORA-01653: unable to extend table SPIRAL.FC__KEYTABLE_1049 by 2560 in tablespace KEY_LONGRAW_DATA

But checking the datafiles, the latest datafile #17 for that tablespace has
plenty of room left, so this is definitely a disk issue and I think having more
space will solve it for now.

Justin:  Until we can find out if Chris has time to help with this issue, can
we expand the storage space so that Talkback can come back up?  If so, please
do what you can and let us know so I can restart the DB and Talkback processes.
 If not, we'll have to wait until Chris is available to help or find someone
that can jump in and investigate.  Also, could you please contact Chris in the
morning and see if he is available.  I don't have his number ready and it might
be best for you to chat with him in case we need to bring him back on
short-term contract to resolve these issues.
Appears to be an ext3 filesystem on an iSCSI mount.  ext3 is expandable if the netapp share is.  We'll have to resize the ext3 filesystem after the share size is increased.
(In reply to comment #3)
> Appears to be an ext3 filesystem on an iSCSI mount.  ext3 is expandable if the
> netapp share is.  We'll have to resize the ext3 filesystem after the share size
> is increased.

Err, we can't.  It's RHEL3.  The ext3 on RHEL3 didn't have any tools for resizing it yet.
idea...  we can unmount it from dingbat, mount it on a rhel4 box, resize it there, then move it back.
the right thing to do is to fix the cleanup scripts, not increase the netapp mount size.  jay has ctuft's contact info - if for some reason you lost it, let me know and I'll get it to you.
Severity: critical → major
Justin:  I agree, but I think that with the rate of incoming crashes and the time the current scripts take to run, we are not able to keep up.  So we will probably need more space either way.  Also, Ctuft is out until later this evening, and the longer Talkback is down, the worse off we will be.  If there is a way to increase the diskspace, I still think we should do it.
why? did this issue just come up?  seems this has only become an issue after the script stopped running for 8-10 days unnoticed.  also, why throw more resources at talkback when it will become useless in months when ff3 comes out.  msmith is looking at the scripts to see if there is something he can do in the meantime.  per ss, even if we did add space, we'd have db issues.  think we should wait to talk to ctuft about this.  sound ok?
I don't have a preference... I just wanted to make sure we don't keep the Talkback servers down for too long.

I don't think this is just a db issue, but a combination of things. Our traffic has increased recently, and it has become more difficult to run our deletion scripts in a reasonable amount of time to keep up.  If we're getting more in than we can delete any given day, this problem will continue.

That's why I think more space will help us in the short-term, while we figure out the best way to fix the db scripts...or at least make them more efficient.
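
The trade-off can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name and every number in the test values are hypothetical, since the thread gives no actual intake or deletion rates. It shows that if more data arrives per day than the trim job removes, extra space only moves the date the partition fills up.

```python
def days_until_full(free_gb, inflow_gb_per_day, trimmed_gb_per_day):
    """Estimate how many days of headroom remain.

    Returns None when the trim job keeps up with (or outpaces) intake,
    meaning the partition is not filling up at all.
    """
    net_growth = inflow_gb_per_day - trimmed_gb_per_day
    if net_growth <= 0:
        return None
    return free_gb / net_growth
```

With made-up rates of 3 GB/day in and 2 GB/day trimmed, 5 GB of free space lasts 5 days and an extra 100 GB lasts 105 days. Once the scripts trim at least as much as arrives, the function returns `None` and the amount of free space no longer matters.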

The only reason I would want a quick fix is that when the Talkback servers are down, not only do we build a backlog of crashes, but we are giving millions of users a bad user experience if they do crash.  They will error out and continue to see a pop up when Talkback tries to resend every once in a while.  We have had complaints in the past about this when we went down for a few hours... so if this runs into days of downtime, it's gonna suck for a lot of people.

If IT would rather wait to know more, Ctuft should be available in the evening, but there is no guarantee he will have an answer for us.  The diskspace issue is something that has never come up before, and I think it's just time to grow the space.  
I'll follow up with tuft and get a good answer tonight.
Assignee: server-ops → justin
although talkback has to stick around for a while, traffic will go down A LOT when FF3 ships.
closing this out - ctuft handled this I think.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard