Bug 996111 - socorro1.stage.db.phx1 disk alert /pgdata
Status: RESOLVED FIXED (Closed)
Opened 11 years ago; Closed 11 years ago
Categories: Data & BI Services Team :: DB: MySQL (task)
Tracking: Not tracked
People: Reporter: mpressman; Assignee: mpressman

Description
[11:11:33] <nagios-phx1> Mon 10:11:33 PDT socorro1.stage.db.phx1.mozilla.com:Disk /pgdata is DOWNTIMESTART (WARNING): DISK WARNING - free space: /pgdata 184532 MB (8% inode=99%): (http://m.mozilla.org/Disk+/pgdata) (notify-by-email)
/dev/cciss/c0d0p1
xfs 2.0T 1.9T 180G 92% /pgdata
Comment 1 (Assignee) • 11 years ago
Prod hosts have 2.5T
Comment 2 (Assignee) • 11 years ago
This host has six 300GB drives configured in RAID 6. Prod hosts have six 900GB drives configured as RAID 1+0.
We could immediately get ~300GB back by changing stage to RAID 5, without any DC time or additional hardware expenditure.
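If it helps, the migration can be kicked off online with hpacucli; a minimal sketch, assuming the /pgdata array is logicaldrive 1 on the controller in slot 3 (confirm with the show config output first):
]$ sudo hpacucli ctrl all show config
]$ sudo hpacucli ctrl slot=3 logicaldrive 1 modify raid=5
The controller then transforms the array in the background while the logical drive stays online.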
Comment 3 (Assignee) • 11 years ago
One more thought regarding going to RAID 5: the stage host is refreshed weekly, so the extra security of being able to lose two disks before losing data is not a strong argument. Additionally, since the data is pulled from prod anyway, it can be recreated in just a couple of hours should the disks need replacing after a failure.
Updated • 11 years ago
Flags: needinfo?(sdeckelmann)
Comment 4 • 11 years ago
(In reply to Matt Pressman [:mpressman] from comment #2)
> This host has 6 300GB drives configured in raid 6. Prod hosts have six 900GB
> drives configured as raid 1+0
>
> We could immediately get ~ 300GB by changing stage to raid 5 without any DC
> time or additional hardware expenditures
Ok! Let's do it. Thanks for the suggestion, Matt.
Can we schedule this for off-business hours?
Flags: needinfo?(sdeckelmann)
Comment 5 • 11 years ago
It would also be good to know how long the stage refresh takes with RAID 5.
Comment 6 (Assignee) • 11 years ago
Yep, we can schedule this for after hours, no problem.
Comment 7 • 11 years ago
(In reply to Matt Pressman [:mpressman] from comment #6)
> Yep, we can schedule to do this after hours, no problem
Great! Let us know when. I ack'd the latest nagios alert.
Comment 8 • 11 years ago
<nagios-phx1:#sysadmins> Mon 10:01:25 PDT [1132]
socorro1.stage.db.phx1.mozilla.com:Disk - All is CRITICAL: DISK CRITICAL -
free space: /pgdata 76683 MB (3% inode=99%): (http://m.mozilla.org/Disk+-+All)
I've acked this.
Comment 9 (Assignee) • 11 years ago
The RAID level has been modified to RAID 5. I've set the rebuild priority to high to help speed the process along, but unfortunately there is no way to tell exactly when it will complete. Once the transformation is done, the next step will be to grow the filesystem; we will then be able to use the extra disk and should have close to the space the prod hosts have.
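For the record, roughly the commands involved, assuming the controller in slot 3 and logicaldrive 1 as shown in the config output further down (the equivalent knob for expansions/migrations on these controllers is expandpriority, so the exact setting may differ):
]$ sudo hpacucli ctrl slot=3 modify rebuildpriority=high
]$ sudo hpacucli ctrl slot=3 logicaldrive 1 show
The Status line of the logicaldrive output reports transformation progress as a percentage while it runs.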
Comment 11 • 11 years ago
For reference: on 4/21 it had 3% free when pir acked it. I acked it again today (somehow it went off again) at 1%. How much disk space do we expect this to have free when it finishes rebuilding?
Comment 12 • 11 years ago
According to Matt, it was 75% done last night (source: idonethis). It should be done today. We'll need a sysadmin to check whether the RAID rebuild is complete and, once it is, grow the disk space. This is critically important and needs to be done today.
Comment 13 • 11 years ago
For a general update:
]$ sudo hpacucli ctrl all show config
Smart Array P400 in Slot 3 (sn: PAFGL0T9SZK1XO)
logicaldrive 1 (2.2 TB, RAID 5, OK)
array A (SAS, Unused Space: 686753 MB)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 600 GB, OK)
Smart Array P410i in Slot 0 (Embedded) (sn: 5001438006B62D10)
logicaldrive 1 (136.7 GB, RAID 1, OK)
array A (SAS, Unused Space: 0 MB)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
So it is ready to be extended. However, we would prefer to take the system into single-user mode instead of growing it live (unless really needed). Information on how to grow: https://mana.mozilla.org/wiki/display/SYSADMIN/Extend+%28Grow%29+HP+Storage+RAID+Array
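The mana page has the full procedure; the gist, as a sketch assuming slot 3 / logicaldrive 1 is the /pgdata array, is to extend the logical drive into the unused space and then grow the filesystem:
]$ sudo hpacucli ctrl slot=3 logicaldrive 1 modify size=max forced
]$ sudo xfs_growfs /pgdata
The xfs_growfs step assumes the partition underneath has already been enlarged to cover the new logical drive size.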
Comment 14 • 11 years ago
On Monday 4/21 it was at 76683 MB, or about 74.9 GB free. We were down to 39 GB today, so we're using roughly 7 GB per day (about 36 GB over the ~5 days since Monday). Plenty of room even if we did nothing until the refresh on Sunday, but of course Matt will look into this once he gets back from the doctor (he e-mailed me that the doctor is still running behind, so I had a chat with him over the phone).
Comment 15 • 11 years ago
Just a note that I did a quick cleanup of old partitions on stage and also on prod, and got down to 96% utilization on stage and 80% on prod (2.0TB), which should allow the refresh to squeak by in the event that the disks aren't expanded over the weekend.
Comment 16 (Assignee) • 11 years ago
I've expanded the logical drive containing /pgdata on the controller; logicaldrive 1 grew from 2.2 TB to 2.7 TB.
hpacucli ctrl all show config
Smart Array P400 in Slot 3 (sn: PAFGL0T9SZK1XO)
logicaldrive 1 (2.7 TB, RAID 5, OK)
array A (SAS, Unused Space: 0 MB)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 600 GB, OK)
Smart Array P410i in Slot 0 (Embedded) (sn: 5001438006B62D10)
logicaldrive 1 (136.7 GB, RAID 1, OK)
array A (SAS, Unused Space: 0 MB)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
Comment 17 (Assignee) • 11 years ago
Filesystem size is now 2.8T.
Additionally, I was unable to grow the partition and then the filesystem online. The new partition size exceeded what the DOS (MBR) partition type allows, which necessitated moving to GPT. As it happens, GPT is also what the prod disks use, so this makes stage a little more like prod. I changed the partition table to GPT, created an XFS filesystem on the new GPT partition, then found the new UUID and updated /etc/fstab.
Since we are just 6 hours away from the weekly refresh, and it will take about half that time to run one now, I will leave this as is and let the automated refresh recreate the data.
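For posterity, the offline steps were along these lines; this is a sketch only, with the device name taken from the df output below rather than an exact command history:
[root@socorro1.stage.db.phx1 etc]# parted /dev/cciss/c0d0 mklabel gpt
[root@socorro1.stage.db.phx1 etc]# parted -a optimal /dev/cciss/c0d0 mkpart primary 0% 100%
[root@socorro1.stage.db.phx1 etc]# mkfs.xfs /dev/cciss/c0d0p1
[root@socorro1.stage.db.phx1 etc]# blkid /dev/cciss/c0d0p1
[root@socorro1.stage.db.phx1 etc]# mount /pgdata
blkid supplies the UUID that goes into the /etc/fstab entry before the final mount; note that mklabel wipes the old partition table, which is why the data has to be recreated by the refresh.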
[root@socorro1.stage.db.phx1 etc]# df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/vg_dbsoc-lv_root
ext4 60G 20G 38G 35% /
tmpfs tmpfs 12G 0 12G 0% /dev/shm
/dev/sda1 ext4 485M 94M 366M 21% /boot
/dev/mapper/vg_dbsoc-lv_wal
xfs 50G 33M 50G 1% /wal
/dev/cciss/c0d0p1
xfs 2.8T 69M 2.8T 1% /pgdata
Comment 18 (Assignee) • 11 years ago
Weekly stage refresh completed; used percentage is now 73%.
[postgres@socorro1 postgresql]$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/vg_dbsoc-lv_root
ext4 60G 20G 38G 35% /
tmpfs tmpfs 12G 0 12G 0% /dev/shm
/dev/sda1 ext4 485M 94M 366M 21% /boot
/dev/mapper/vg_dbsoc-lv_wal
xfs 50G 33M 50G 1% /wal
/dev/cciss/c0d0p1
xfs 2.8T 2.0T 775G 73% /pgdata
Updated • 11 years ago
Assignee: server-ops-database → mpressman
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Data & BI Services Team