Closed Bug 1008452 Opened 11 years ago Closed 11 years ago

socorro[1,2,3].db.phx1 disk are at 90%

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mpressman, Assigned: selenamarie)

References

Details

Attachments

(2 files)

There is around 265GB left, but we need to either remove data or increase disk capacity.
This is moving very fast. Disk space is being consumed rapidly: in just the last few days we have used about 50GB. There is currently ~200GB free on the prod hosts.
The plan is to get larger disks. selenamarie mentioned getting 900GB or even 1.2T disks. I will check and see the largest we can get for the chassis.
socorro[1-2].db.phx1 are DL360 G7s. They have 6 x 900GB drives in RAID 1+0 for the array backing the /pgdata partition.

socorro3.db.phx1 is a blade, but it has 6 x 900GB drives attached in RAID 1+0 for the array backing the /pgdata partition.
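As a sanity check on the figures above, the drive counts from this comment line up with the 2.5T /pgdata partition seen in the df output later in this bug. A sketch (the 6 x 900GB numbers come from this comment; the unit math is the only thing added here):

```sql
-- RAID 1+0 mirrors drive pairs, so usable capacity is half the raw total.
-- 6 drives x 900 GB (decimal, as drives are marketed) / 2 = 2.7 TB raw,
-- which df reports as ~2.5T because df uses binary units (1T = 1024^4 bytes).
SELECT pg_size_pretty((6 * 900 * 1000000000::bigint / 2));
```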
I just received word that the 900GB drives are the largest that can be installed.
Severity: normal → major
Assignee: server-ops-database → mpressman
Will run the purge procedure: https://mana.mozilla.org/wiki/display/websites/quarterly+Postgres+data+purge+procedure?src=search

:selenamarie - How far back should I purge?
Flags: needinfo?(sdeckelmann)
Current disk usage before purge:

[postgres@socorro1 ~]$ df -hT
Filesystem                   Type   Size  Used Avail Use% Mounted on
/dev/mapper/vg_dbsoc-lv_root ext4   201G  101G   98G  51% /
tmpfs                        tmpfs   36G     0   36G   0% /dev/shm
/dev/sda1                    ext4   485M   90M  370M  20% /boot
/dev/mapper/vg_dbsoc-lv_wal  xfs     50G  4.4G   46G   9% /wal
/dev/sdb1                    xfs    2.5T  2.4T  158G  94% /pgdata

breakpad size:

[postgres@socorro1 ~]$ psql -c "select pg_size_pretty(pg_database_size(datname)) from pg_database where datname = 'breakpad'" breakpad
 pg_size_pretty
----------------
 2335 GB

Post purge:

[postgres@socorro1 ~]$ df -hT
Filesystem                   Type   Size  Used Avail Use% Mounted on
/dev/mapper/vg_dbsoc-lv_root ext4   201G  101G   98G  51% /
tmpfs                        tmpfs   36G     0   36G   0% /dev/shm
/dev/sda1                    ext4   485M   90M  370M  20% /boot
/dev/mapper/vg_dbsoc-lv_wal  xfs     50G  4.4G   46G   9% /wal
/dev/sdb1                    xfs    2.5T  2.3T  184G  93% /pgdata

breakpad size:

[postgres@socorro1 ~]$ psql -c "select pg_size_pretty(pg_database_size(datname)) from pg_database where datname = 'breakpad'" breakpad
 pg_size_pretty
----------------
 2308 GB

:selenamarie is looking into the bloat monitor, as she mentioned there is quite a bit of bloat that could be cleaned up.
:selenamarie was able to further reduce disk usage, down to 91%:

[postgres@socorro1 ~]$ df -hT
Filesystem                   Type   Size  Used Avail Use% Mounted on
/dev/mapper/vg_dbsoc-lv_root ext4   201G  101G   98G  51% /
tmpfs                        tmpfs   36G     0   36G   0% /dev/shm
/dev/sda1                    ext4   485M   90M  370M  20% /boot
/dev/mapper/vg_dbsoc-lv_wal  xfs     50G  4.4G   46G   9% /wal
/dev/sdb1                    xfs    2.5T  2.3T  229G  91% /pgdata

[postgres@socorro1 ~]$ psql -c "select pg_size_pretty(pg_database_size(datname)) from pg_database where datname = 'breakpad'" breakpad
 pg_size_pretty
----------------
 2264 GB
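A quick way to spot the likeliest bloat candidates is the dead-tuple counters in the standard statistics views. This is a rough sketch, not the bloat monitor referenced above — an accurate bloat estimate needs something like the pgstattuple extension or the check_postgres bloat query:

```sql
-- Tables with the most dead tuples are the first candidates for VACUUM
-- or for reclaiming space; dead_pct is only a coarse proxy for bloat.
SELECT relname, n_dead_tup, n_live_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```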
:selenamarie pointed to a list of indexes that are candidates for removal. She has identified about 150GB of unused indexes at https://bugzilla.mozilla.org/show_bug.cgi?id=828711#c4. The plan is to leave these alone for the time being, but in a pinch they can be dropped to free up more space.
Attached file indexes_1008452.txt
List of currently unused indexes and their sizes. This is provided as an example, as these are only unused as of right now; there is the potential for a rarely used or new query to use them. Prior to any removal, the list should be re-checked to verify they aren't being used.

select relname, sum(pg_relation_size(indexrelname::regclass)::bigint)
from pg_stat_all_indexes
where schemaname = 'public' and idx_scan = 0
group by relname
order by 2 desc;
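Building on the query in this attachment, the eventual DROP statements can be generated the same way. A hedged sketch — note that idx_scan counters are per-server, so the same check should be re-run on every replica (each node may use different indexes) before anything is actually dropped:

```sql
-- Generate DROP INDEX statements for indexes never scanned on this node,
-- largest first; review the output by hand rather than piping it to psql.
SELECT 'DROP INDEX ' || quote_ident(indexrelname) || ';' AS drop_stmt,
       pg_size_pretty(pg_relation_size(indexrelname::regclass)) AS size
FROM pg_stat_all_indexes
WHERE schemaname = 'public'
  AND idx_scan = 0
ORDER BY pg_relation_size(indexrelname::regclass) DESC;
```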
We're getting a warning about disk use. mpressman says there are 152 GB left and we're using that up at 20 GB a day. That gives us enough runway to have Selena come back and double-check the unused indexes before we start dropping them to save space. If we get under 100 GB free before then, mpressman will drop them.

"If I get hit by a bus and selenamarie doesn't make it back, my plan would be to drop those indexes on a replica, vacuum to clean up, then let it catch up to the master and fail over."
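The runway math in this comment, spelled out using the figures given above (152 GB free, 20 GB/day growth, 100 GB drop-threshold):

```sql
-- ~7.6 days until the disk fills at the current burn rate,
-- ~2.6 days until free space falls under the 100 GB threshold.
SELECT round(152.0 / 20, 1)         AS days_until_full,
       round((152.0 - 100) / 20, 1) AS days_until_drop_threshold;
```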
Dropping indexes now to recover space. Current disk usage:

[postgres@socorro1 ~]$ df -h /dev/sdb1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       2.5T  2.4T  133G  95% /pgdata
[postgres@socorro1 ~]$ df /dev/sdb1
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/sdb1      2635873796 2496924348  138949448  95% /pgdata
indexes with idx_scan=0

[correlations_module]
ix_correlations_module_signature_id

[reports_duplicates]
reports_duplicates_leader

[reports_clean_20121001]
reports_clean_20121001_release_channel reports_clean_20121001_process_type reports_clean_20121001_os_name reports_clean_20121001_arch_cores reports_clean_20121001_signature_id reports_clean_20121001_os_version_id reports_clean_20121001_hang_id reports_clean_20121001_flash_version_id reports_clean_20121001_domain_id reports_clean_20121001_address_id

[reports_clean_20120924]
reports_clean_20120924_release_channel reports_clean_20120924_process_type reports_clean_20120924_os_name reports_clean_20120924_arch_cores reports_clean_20120924_hang_id reports_clean_20120924_flash_version_id reports_clean_20120924_domain_id reports_clean_20120924_address_id reports_clean_20120924_signature_id reports_clean_20120924_os_version_id

[reports_clean_20130114]
reports_clean_20130114_domain_id reports_clean_20130114_release_channel reports_clean_20130114_process_type reports_clean_20130114_hang_id reports_clean_20130114_flash_version_id reports_clean_20130114_address_id reports_clean_20130114_signature_id reports_clean_20130114_os_version_id reports_clean_20130114_os_name reports_clean_20130114_arch_cores reports_clean_20130114_sig_prod_date

[reports_clean_20121105]
reports_clean_20121105_release_channel reports_clean_20121105_process_type reports_clean_20121105_os_name reports_clean_20121105_arch_cores reports_clean_20121105_hang_id reports_clean_20121105_flash_version_id reports_clean_20121105_domain_id reports_clean_20121105_address_id reports_clean_20121105_signature_id reports_clean_20121105_os_version_id

[reports_clean_20120625]
reports_clean_20120625_release_channel reports_clean_20120625_process_type reports_clean_20120625_os_name reports_clean_20120625_arch_cores reports_clean_20120625_hang_id reports_clean_20120625_flash_version_id reports_clean_20120625_domain_id reports_clean_20120625_address_id reports_clean_20120625_signature_id reports_clean_20120625_os_version_id

[reports_clean_20130121]
reports_clean_20130121_domain_id reports_clean_20130121_release_channel reports_clean_20130121_process_type reports_clean_20130121_hang_id reports_clean_20130121_flash_version_id reports_clean_20130121_address_id reports_clean_20130121_signature_id reports_clean_20130121_os_version_id reports_clean_20130121_os_name reports_clean_20130121_arch_cores reports_clean_20130121_sig_prod_date reports_clean_20130121_product_version_id

[reports_clean_20120702]
reports_clean_20120702_release_channel reports_clean_20120702_process_type reports_clean_20120702_os_name reports_clean_20120702_arch_cores reports_clean_20120702_hang_id reports_clean_20120702_flash_version_id reports_clean_20120702_domain_id reports_clean_20120702_address_id reports_clean_20120702_signature_id reports_clean_20120702_os_version_id

[reports_clean_20130408]
reports_clean_20130408_domain_id reports_clean_20130408_release_channel reports_clean_20130408_process_type reports_clean_20130408_hang_id reports_clean_20130408_flash_version_id reports_clean_20130408_address_id reports_clean_20130408_signature_id reports_clean_20130408_os_version_id reports_clean_20130408_os_name reports_clean_20130408_arch_cores reports_clean_20130408_sig_prod_date

[reports_clean_20130311]
reports_clean_20130311_domain_id reports_clean_20130311_release_channel reports_clean_20130311_process_type reports_clean_20130311_hang_id reports_clean_20130311_flash_version_id reports_clean_20130311_address_id reports_clean_20130311_signature_id reports_clean_20130311_os_version_id reports_clean_20130311_os_name reports_clean_20130311_arch_cores

[reports_clean_20121022]
reports_clean_20121022_release_channel reports_clean_20121022_process_type reports_clean_20121022_os_name reports_clean_20121022_arch_cores reports_clean_20121022_hang_id reports_clean_20121022_flash_version_id reports_clean_20121022_domain_id reports_clean_20121022_address_id reports_clean_20121022_signature_id reports_clean_20121022_os_version_id
Disk usage has dropped below the 95% threshold and is now back in warning territory:

[postgres@socorro1 ~]$ df -h /dev/sdb1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       2.5T  2.3T  164G  94% /pgdata
[postgres@socorro1 ~]$ df /dev/sdb1
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/sdb1      2635873796 2464677840  171195956  94% /pgdata

Rather than continue to put the dropped indexes into comments, I will put the more verbose output into an attachment. That way I can be a little quicker by just copying and pasting from my shell buffer.
~44GB recovered since before dropping indexes:

[postgres@socorro1 ~]$ df -h /dev/sdb1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       2.5T  2.3T  179G  93% /pgdata
Excellent! I suspect processed JSON is taking up a large amount of data. Laura suggested removing anything older than a day, and Rhelmer concurs. The only thing we can think of that relies on it is the correlations report, which only looks at the last day or two.

That can wait until tomorrow though. Selena can confirm our suspicion.
Blocks: 1013995
(In reply to Chris Lonnen :lonnen from comment #19)
> Excellent! I suspect processed JSON is taking up a large amount of data.
> Laura suggested removing anything older than a day, and Rhelmer concurs.
> The only thing we can think of that relies on it is the correlations
> report, which only looks at the last day or two.
>
> That can wait until tomorrow though. Selena can confirm our suspicion.

We have processed_json from the last year, so removing everything but the last day would probably free close to 1/3 of our total disk utilization.

I'm putting together a collection of issues involving processed_json. If we can't get movement on the Socorro processor and UI tickets related to this in the near future (next 3 weeks?), then we can drop the partitions.

I'm going to attempt to measure the difference in size of processed JSON with and without the pipe-dump format. We currently double our storage of output from MDSW because of the transition to JSON output from MDSW.
I'd like to just put this here - and I apologize if it's obvious - but if there's a determination that some of the data has a lifecycle (e.g. processed_json doesn't need to be kept for more than a day or so), can there be another issue to automate the cleaning of this data so we don't run into this issue again? (also so that refreshes, backups and restores are as lean as they can be).
@sheeri -- it doesn't truly have a lifecycle that short. There are new features underway that will use it, but the only thing relying on it in production right now looks at 24 - 48 hours of data. Clearing older data is a cheap way to alleviate the immediate pressure before new features land while we wait for new hardware.
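A sketch of the automated cleanup Sheeri asked about, assuming the weekly processed_crashes_YYYYMMDD partition naming seen elsewhere in this bug. The 60-day window is a placeholder, not an agreed retention policy, and TRUNCATE (rather than DROP) mirrors what was done manually here so the empty partitions remain available for backfill:

```sql
-- Truncate processed_crashes partitions whose date suffix is older than
-- the retention window. Intended to be run from cron via psql.
DO $$
DECLARE
  part record;
BEGIN
  FOR part IN
    SELECT relname
    FROM pg_class
    WHERE relname ~ '^processed_crashes_\d{8}$'
      AND relkind = 'r'
      AND to_date(substring(relname from '\d{8}$'), 'YYYYMMDD')
            < now() - interval '60 days'
  LOOP
    EXECUTE format('TRUNCATE %I', part.relname);
  END LOOP;
END $$;
```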
Assignee: mpressman → sdeckelmann
I'm going to selectively drop a set of tables containing raw and processed JSON. Here's where we're at before the clean-out:

       relname        | pg_size_pretty
----------------------+----------------
 raw_crashes_20131125 | 5690 MB
 raw_crashes_20130617 | 5065 MB
 raw_crashes_20130923 | 4869 MB
 raw_crashes_20140203 | 4770 MB
 raw_crashes_20140210 | 4769 MB
 raw_crashes_20140317 | 4767 MB
 raw_crashes_20140217 | 4739 MB
 raw_crashes_20131104 | 4685 MB
 raw_crashes_20131118 | 4659 MB
 raw_crashes_20130624 | 4658 MB
 raw_crashes_20140120 | 4620 MB
 raw_crashes_20131111 | 4618 MB
 raw_crashes_20140127 | 4596 MB
 raw_crashes_20140324 | 4550 MB
 raw_crashes_20140224 | 4474 MB
 raw_crashes_20140113 | 4441 MB
 raw_crashes_20140505 | 4440 MB
 raw_crashes_20140310 | 4439 MB
 raw_crashes_20140303 | 4429 MB
 raw_crashes_20130916 | 4415 MB
 raw_crashes_20131028 | 4412 MB
 raw_crashes_20130909 | 4391 MB
 raw_crashes_20131021 | 4387 MB
 raw_crashes_20140106 | 4372 MB
 raw_crashes_20131209 | 4370 MB
 raw_crashes_20140512 | 4368 MB
 raw_crashes_20130729 | 4346 MB
 raw_crashes_20130930 | 4337 MB
 raw_crashes_20130805 | 4335 MB
 raw_crashes_20131202 | 4326 MB
 raw_crashes_20131216 | 4323 MB
 raw_crashes_20130701 | 4321 MB
 raw_crashes_20131014 | 4310 MB
 raw_crashes_20130902 | 4304 MB
 raw_crashes_20140407 | 4268 MB
 raw_crashes_20130722 | 4266 MB
 raw_crashes_20131007 | 4253 MB
 raw_crashes_20140331 | 4248 MB
 raw_crashes_20130715 | 4246 MB
 raw_crashes_20140428 | 4237 MB
 raw_crashes_20130812 | 4227 MB
 raw_crashes_20140421 | 4221 MB
 raw_crashes_20130819 | 4211 MB
 raw_crashes_20130708 | 4187 MB
 raw_crashes_20130826 | 4183 MB
 raw_crashes_20131230 | 4157 MB
 raw_crashes_20140414 | 4157 MB
 raw_crashes_20131223 | 4091 MB
 raw_crashes_20140519 | 1818 MB
 raw_crashes_20140526 | 8192 bytes
(50 rows)

breakpad=# select relname, pg_size_pretty(pg_table_size(relname::text)) from pg_class where relname ~ 'processed_crashes_' and relkind = 'r' order by pg_table_size(relname::text) desc;
          relname           | pg_size_pretty
----------------------------+----------------
 processed_crashes_20140512 | 117 GB
 processed_crashes_20140217 | 111 GB
 processed_crashes_20140505 | 110 GB
 processed_crashes_20140317 | 109 GB
 processed_crashes_20140324 | 108 GB
 processed_crashes_20140224 | 104 GB
 processed_crashes_20140303 | 103 GB
 processed_crashes_20140310 | 102 GB
 processed_crashes_20140428 | 101 GB
 processed_crashes_20140331 | 100 GB
 processed_crashes_20140407 | 99 GB
 processed_crashes_20140421 | 98 GB
 processed_crashes_20140414 | 96 GB
 processed_crashes_20140519 | 50 GB
 processed_crashes_20140210 | 49 GB
 processed_crashes_20140203 | 6848 kB
 processed_crashes_20140127 | 6248 kB
 processed_crashes_20140120 | 4040 kB
 processed_crashes_20140113 | 3672 kB
 processed_crashes_20140106 | 3392 kB
 processed_crashes_20131230 | 2784 kB
 processed_crashes_20131209 | 2464 kB
 processed_crashes_20131118 | 2384 kB
 processed_crashes_20131223 | 2192 kB
 processed_crashes_20131104 | 1816 kB
 processed_crashes_20131202 | 1568 kB
 processed_crashes_20131216 | 1496 kB
 processed_crashes_20131111 | 1344 kB
 processed_crashes_20131125 | 1328 kB
 processed_crashes_20131014 | 1128 kB
 processed_crashes_20131028 | 968 kB
 processed_crashes_20131021 | 904 kB
 processed_crashes_20131007 | 576 kB
 processed_crashes_20130923 | 544 kB
 processed_crashes_20130916 | 504 kB
 processed_crashes_20130930 | 376 kB
 processed_crashes_20130909 | 336 kB
 processed_crashes_20130902 | 168 kB
 processed_crashes_20140526 | 8192 bytes
 processed_crashes_20130819 | 8192 bytes
 processed_crashes_20130826 | 8192 bytes
 processed_crashes_20130805 | 8192 bytes
 processed_crashes_20130729 | 8192 bytes
 processed_crashes_20130722 | 8192 bytes
 processed_crashes_20130715 | 8192 bytes
 processed_crashes_20130708 | 8192 bytes
 processed_crashes_20130701 | 8192 bytes
 processed_crashes_20130624 | 8192 bytes
 processed_crashes_20130617 | 8192 bytes
 processed_crashes_20130812 | 8192 bytes
(50 rows)
I get that this particular data set will soon be used; what I don't get is why there's data going back almost a year (June 17 2013). Have these features been "coming soon" since a year ago? If so, how can we be assured that they're actually coming now?

Can we turn off that data collection, or purge it, until such time as the data is actually being used? Since it only uses a few days' worth of information, it shouldn't be a big loss to wait to store this data. (I don't know where y'all are in your cycle of buying new hardware, and if that will help this matter.)

On another note: are there any other data sets that aren't being used yet, that should be addressed?
I'm going to drop everything prior to April 2014. Leaving 2 months of data in there for playing around with. Next big step is to get rid of the pdumps :)
(In reply to Sheeri Cabral [:sheeri] from comment #24)
> I get that this particular data set will soon be used, what I don't get is
> why there's data going back almost a year (June 17 2013). Have these
> features been "coming soon" since a year ago? If so, how can we be assured
> that they're actually coming now?

The data has only been present for processed crashes since February.
Cleanup complete! Didn't need to touch raw_crashes after all.

breakpad=# select relname, pg_size_pretty(pg_table_size(relname::text)) from pg_class where relname ~ 'processed_crashes_201402' and relkind = 'r' ;
          relname           | pg_size_pretty
----------------------------+----------------
 processed_crashes_20140210 | 49 GB
 processed_crashes_20140217 | 111 GB
 processed_crashes_20140224 | 104 GB
 processed_crashes_20140203 | 6848 kB
(4 rows)

breakpad=# select 'TRUNCATE relname', pg_size_pretty(pg_table_size(relname::text)) from pg_class where relname ~ 'processed_crashes_201402' and relkind = 'r' ;
     ?column?     | pg_size_pretty
------------------+----------------
 TRUNCATE relname | 49 GB
 TRUNCATE relname | 111 GB
 TRUNCATE relname | 104 GB
 TRUNCATE relname | 6848 kB
(4 rows)

breakpad=# select 'TRUNCATE ' ||relname, pg_size_pretty(pg_table_size(relname::text)) from pg_class where relname ~ 'processed_crashes_201402' and relkind = 'r' ;
              ?column?               | pg_size_pretty
-------------------------------------+----------------
 TRUNCATE processed_crashes_20140210 | 49 GB
 TRUNCATE processed_crashes_20140217 | 111 GB
 TRUNCATE processed_crashes_20140224 | 104 GB
 TRUNCATE processed_crashes_20140203 | 6848 kB
(4 rows)

breakpad=# select 'TRUNCATE ' || relname || ';', pg_size_pretty(pg_table_size(relname::text)) from pg_class where relname ~ 'processed_crashes_201402' and relkind = 'r' ;
               ?column?               | pg_size_pretty
--------------------------------------+----------------
 TRUNCATE processed_crashes_20140210; | 49 GB
 TRUNCATE processed_crashes_20140217; | 111 GB
 TRUNCATE processed_crashes_20140224; | 104 GB
 TRUNCATE processed_crashes_20140203; | 6848 kB
(4 rows)

breakpad=# select 'TRUNCATE ' || relname || ';' from pg_class where relname ~ 'processed_crashes_201402' and relkind = 'r' ;
               ?column?
--------------------------------------
 TRUNCATE processed_crashes_20140210;
 TRUNCATE processed_crashes_20140217;
 TRUNCATE processed_crashes_20140224;
 TRUNCATE processed_crashes_20140203;
(4 rows)

breakpad=# TRUNCATE processed_crashes_20140210;
TRUNCATE TABLE
breakpad=# TRUNCATE processed_crashes_20140217;
TRUNCATE TABLE
breakpad=# TRUNCATE processed_crashes_20140224;
TRUNCATE TABLE
breakpad=# select pg_size_pretty(pg_database_size('breakpad'));
 pg_size_pretty
----------------
 2059 GB
(1 row)

breakpad=# ^Z
[1]+  Stopped                 psql breakpad
[postgres@socorro1 ~]$ df
Filesystem                    1K-blocks       Used  Available Use% Mounted on
/dev/mapper/vg_dbsoc-lv_root  209809888  105454092  102224240  51% /
tmpfs                          37116508          0   37116508   0% /dev/shm
/dev/sda1                        495844      91476     378768  20% /boot
/dev/mapper/vg_dbsoc-lv_wal    52403200    4587836   47815364   9% /wal
/dev/sdb1                    2635873796 2181320344  454553452  83% /pgdata
[postgres@socorro1 ~]$ fg
psql breakpad
breakpad=# select 'TRUNCATE ' || relname || ';', pg_size_pretty(pg_table_size(relname::text)) from pg_class where relname ~ 'processed_crashes_201403' and relkind = 'r' ;
               ?column?               | pg_size_pretty
--------------------------------------+----------------
 TRUNCATE processed_crashes_20140303; | 103 GB
 TRUNCATE processed_crashes_20140324; | 108 GB
 TRUNCATE processed_crashes_20140331; | 100 GB
 TRUNCATE processed_crashes_20140310; | 102 GB
 TRUNCATE processed_crashes_20140317; | 109 GB
(5 rows)

breakpad=# select 'TRUNCATE ' || relname || ';' from pg_class where relname ~ 'processed_crashes_201403' and relkind = 'r' ;
               ?column?
--------------------------------------
 TRUNCATE processed_crashes_20140303;
 TRUNCATE processed_crashes_20140324;
 TRUNCATE processed_crashes_20140331;
 TRUNCATE processed_crashes_20140310;
 TRUNCATE processed_crashes_20140317;
(5 rows)

breakpad=# TRUNCATE processed_crashes_20140303;
TRUNCATE TABLE
breakpad=# TRUNCATE processed_crashes_20140324;
TRUNCATE TABLE
breakpad=# TRUNCATE processed_crashes_20140331;
TRUNCATE TABLE
breakpad=# TRUNCATE processed_crashes_20140310;
TRUNCATE TABLE
breakpad=# TRUNCATE processed_crashes_20140317;
TRUNCATE TABLE
breakpad=# select pg_size_pretty(pg_database_size('breakpad'));
 pg_size_pretty
----------------
 1535 GB
(1 row)

breakpad=# ^Z
[1]+  Stopped                 psql breakpad
[postgres@socorro1 ~]$ df
Filesystem                    1K-blocks       Used  Available Use% Mounted on
/dev/mapper/vg_dbsoc-lv_root  209809888  105454092  102224240  51% /
tmpfs                          37116508          0   37116508   0% /dev/shm
/dev/sda1                        495844      91476     378768  20% /boot
/dev/mapper/vg_dbsoc-lv_wal    52403200    4587836   47815364   9% /wal
/dev/sdb1                    2635873796 1632207104 1003666692  62% /pgdata

I left the tables there in the event that we backfill or individuals request old-ish crashes be processed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
(In reply to Sheeri Cabral [:sheeri] from comment #21)
> I'd like to just put this here - and I apologize if it's obvious - but if
> there's a determination that some of the data has a lifecycle (e.g.
> processed_json doesn't need to be kept for more than a day or so), can there
> be another issue to automate the cleaning of this data so we don't run into
> this issue again? (also so that refreshes, backups and restores are as lean
> as they can be).

FWIW, one issue is that the UI can only be switched to using this data easily (without a fallback to hbase) if we have data in those tables for a long enough time to cover the whole timespan of data we expect to see in the UI (I think that's 6 months atm). Looks like with the current rate of data consumption, that switch is some time out for processed crash data (but raw crash data looks better, I guess).
Product: mozilla.org → Data & BI Services Team