Bug 996111 - socorro1.stage.db.phx1 disk alert /pgdata
Status: RESOLVED FIXED (Closed)
Opened 11 years ago; Closed 11 years ago
Categories: Data & BI Services Team :: DB: MySQL (task)
Tracking: Not tracked
People: Reporter: mpressman; Assignee: mpressman

Description
[11:11:33] <nagios-phx1> Mon 10:11:33 PDT socorro1.stage.db.phx1.mozilla.com:Disk /pgdata is DOWNTIMESTART (WARNING): DISK WARNING - free space: /pgdata 184532 MB (8% inode=99%): (http://m.mozilla.org/Disk+/pgdata) (notify-by-email)
/dev/cciss/c0d0p1
xfs 2.0T 1.9T 180G 92% /pgdata
Comment 1 (Assignee) • 11 years ago
Prod hosts have 2.5T
Comment 2 (Assignee) • 11 years ago
This host has six 300GB drives configured in RAID 6. Prod hosts have six 900GB drives configured as RAID 1+0.
We could immediately get ~300GB back by changing stage to RAID 5, without any DC time or additional hardware expenditure.
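If it helps, the migration can be kicked off online with hpacucli; a minimal sketch, assuming the /pgdata array is logicaldrive 1 on the controller in slot 3 (confirm with the show config output first):
]$ sudo hpacucli ctrl all show config
]$ sudo hpacucli ctrl slot=3 logicaldrive 1 modify raid=5
The controller then transforms the array in the background while the logical drive stays online.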
Comment 3 (Assignee) • 11 years ago
One more thought regarding going to RAID 5: the stage host is refreshed weekly, so the extra security of being able to lose two disks before losing data is not a strong argument. Additionally, since the data is pulled from prod anyway, it can be recreated in just a couple of hours should the disks need replacing after a failure.
Updated • 11 years ago
Flags: needinfo?(sdeckelmann)
Comment 4 • 11 years ago
(In reply to Matt Pressman [:mpressman] from comment #2)
> This host has 6 300GB drives configured in raid 6. Prod hosts have six 900GB
> drives configured as raid 1+0
>
> We could immediately get ~ 300GB by changing stage to raid 5 without any DC
> time or additional hardware expenditures
Ok! Let's do it. Thanks for the suggestion, Matt.
Can we schedule this for off-business hours?
Flags: needinfo?(sdeckelmann)
Comment 5 • 11 years ago
It would also be good to know how long the stage refresh takes with RAID 5.
Comment 6 (Assignee) • 11 years ago
Yep, we can schedule this for after hours, no problem.
Comment 7 • 11 years ago
(In reply to Matt Pressman [:mpressman] from comment #6)
> Yep, we can schedule to do this after hours, no problem
Great! Let us know when. I ack'd the latest nagios alert.
Comment 8 • 11 years ago
<nagios-phx1:#sysadmins> Mon 10:01:25 PDT [1132]
socorro1.stage.db.phx1.mozilla.com:Disk - All is CRITICAL: DISK CRITICAL -
free space: /pgdata 76683 MB (3% inode=99%): (http://m.mozilla.org/Disk+-+All)
I've acked this.
Comment 9 (Assignee) • 11 years ago
The RAID level has been modified to RAID 5. I've set the rebuild priority to high to help speed the process along, but unfortunately there is no way to tell exactly when it will complete. Once the transformation is done, the next step will be to grow the filesystem; we will then be able to use the extra disk and should have close to the space the prod hosts have.
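For the record, roughly the commands involved, assuming the controller in slot 3 and logicaldrive 1 as shown in the config output further down (the equivalent knob for expansions/migrations on these controllers is expandpriority, so the exact setting may differ):
]$ sudo hpacucli ctrl slot=3 modify rebuildpriority=high
]$ sudo hpacucli ctrl slot=3 logicaldrive 1 show
The Status line of the logicaldrive output reports transformation progress as a percentage while it runs.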
Comment 11 • 11 years ago
For reference: on 4/21 it had 3% free when pir acked it. I acked it again today (somehow it went off again) at 1%. How much disk space do we expect this to have free when it finishes rebuilding?
Comment 12 • 11 years ago
According to Matt, it was 75% done last night (source: idonethis). It should be done today. We'll need a sysadmin to check whether the RAID rebuild is complete and, once it is, grow the disk space. This is critically important and needs to be done today.
Comment 13 • 11 years ago
For a general update:
]$ sudo hpacucli ctrl all show config
Smart Array P400 in Slot 3 (sn: PAFGL0T9SZK1XO)
logicaldrive 1 (2.2 TB, RAID 5, OK)
array A (SAS, Unused Space: 686753 MB)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 600 GB, OK)
Smart Array P410i in Slot 0 (Embedded) (sn: 5001438006B62D10)
logicaldrive 1 (136.7 GB, RAID 1, OK)
array A (SAS, Unused Space: 0 MB)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
So it is ready to be extended. However, we would prefer to take the system into single-user mode instead of growing it live (unless really needed). Information on how to grow: https://mana.mozilla.org/wiki/display/SYSADMIN/Extend+%28Grow%29+HP+Storage+RAID+Array
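The mana page has the full procedure; the gist, as a sketch assuming slot 3 / logicaldrive 1 is the /pgdata array, is to extend the logical drive into the unused space and then grow the filesystem:
]$ sudo hpacucli ctrl slot=3 logicaldrive 1 modify size=max forced
]$ sudo xfs_growfs /pgdata
The xfs_growfs step assumes the partition underneath has already been enlarged to cover the new logical drive size.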
Comment 14 • 11 years ago
On Monday 4/21 it was at 76683 MB, or about 74.9 GB free. We were down to 39 GB today, so we're using roughly 7 GB per day (about 36 GB over the ~5 days since Monday). Plenty of room even if we did nothing until the refresh on Sunday, but of course Matt will look into this once he gets back from the doctor (he e-mailed me that the doctor is still running behind, so I had a chat with him over the phone).
Comment 15 • 11 years ago
Just a note that I did a quick cleanup of old partitions on stage and also on prod, and got down to 96% utilization on stage and 80% on prod (2.0TB), which should allow the refresh to squeak by in the event that the disks aren't expanded over the weekend.
Comment 16 (Assignee) • 11 years ago
I've expanded the logical drive containing /pgdata on the controller; logicaldrive 1 grew from 2.2 TB to 2.7 TB.
hpacucli ctrl all show config
Smart Array P400 in Slot 3 (sn: PAFGL0T9SZK1XO)
logicaldrive 1 (2.7 TB, RAID 5, OK)
array A (SAS, Unused Space: 0 MB)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 600 GB, OK)
Smart Array P410i in Slot 0 (Embedded) (sn: 5001438006B62D10)
logicaldrive 1 (136.7 GB, RAID 1, OK)
array A (SAS, Unused Space: 0 MB)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
Comment 17 (Assignee) • 11 years ago
Filesystem size is now 2.8T.
Additionally, I was unable to grow the partition and then the filesystem online. The new partition size exceeded what the DOS (MBR) partition type allows, which necessitated moving to GPT. As it happens, GPT is also what the prod disks use, so this makes stage a little more like prod. I changed the partition table to GPT, created an XFS filesystem on the new GPT partition, then found the new UUID and updated /etc/fstab.
Since we are just 6 hours away from the weekly refresh, and it will take about half that time to run one now, I will leave this as is and let the automated refresh recreate the data.
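For posterity, the offline steps were along these lines; this is a sketch only, with the device name taken from the df output below rather than an exact command history:
[root@socorro1.stage.db.phx1 etc]# parted /dev/cciss/c0d0 mklabel gpt
[root@socorro1.stage.db.phx1 etc]# parted -a optimal /dev/cciss/c0d0 mkpart primary 0% 100%
[root@socorro1.stage.db.phx1 etc]# mkfs.xfs /dev/cciss/c0d0p1
[root@socorro1.stage.db.phx1 etc]# blkid /dev/cciss/c0d0p1
[root@socorro1.stage.db.phx1 etc]# mount /pgdata
blkid supplies the UUID that goes into the /etc/fstab entry before the final mount; note that mklabel wipes the old partition table, which is why the data has to be recreated by the refresh.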
[root@socorro1.stage.db.phx1 etc]# df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/vg_dbsoc-lv_root
ext4 60G 20G 38G 35% /
tmpfs tmpfs 12G 0 12G 0% /dev/shm
/dev/sda1 ext4 485M 94M 366M 21% /boot
/dev/mapper/vg_dbsoc-lv_wal
xfs 50G 33M 50G 1% /wal
/dev/cciss/c0d0p1
xfs 2.8T 69M 2.8T 1% /pgdata
Comment 18 (Assignee) • 11 years ago
Weekly stage refresh completed; used percentage is now 73%.
[postgres@socorro1 postgresql]$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/vg_dbsoc-lv_root
ext4 60G 20G 38G 35% /
tmpfs tmpfs 12G 0 12G 0% /dev/shm
/dev/sda1 ext4 485M 94M 366M 21% /boot
/dev/mapper/vg_dbsoc-lv_wal
xfs 50G 33M 50G 1% /wal
/dev/cciss/c0d0p1
xfs 2.8T 2.0T 775G 73% /pgdata
Updated • 11 years ago
Assignee: server-ops-database → mpressman
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Data & BI Services Team