Closed Bug 545006 Opened 12 years ago Closed 11 years ago

prod socorro hardware install max RAM on Postgres server

Categories

(Socorro :: General, task)

All
Other
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ozten, Assigned: morgamic)

References

Details

(Whiteboard: [investigating])

This is a tracking bug for work to install max RAM.
Blocks: 544943
System can take 64GB max.  Upgrading to 8x 8GB DIMMs is $13k.  Is this the right direction right now?
Assignee: server-ops → mrz
Researching hardware platform & real requirements.
Component: Server Operations: Web Content Push → Server Operations
Whiteboard: [investigating]
Assignee: mrz → morgamic
Aravind - could you give me the size of tables on disk for:
1. the three latest partitions of the reports table
2. the overall size of the productdims and osdims tables (or anything w/ *dims in it)
the recent reports tables are between 116970 and 121298 relpages, so that would be between 118MB to 112MB in size.

For the other stuff..

breakpad=# SELECT relname, relpages FROM pg_class where relname like '%dim%' ORDER BY relpages DESC limit 10;
                        relname                         | relpages
--------------------------------------------------------+----------
 top_crashes_by_signature_osdims_key                    |    94792
 top_crashes_by_signature_productdims_key               |    94669
 top_crashes_by_signature_window_end_productdims_id_idx |    91967
 top_crashes_by_url_urldims_key                         |    13258
 top_crashes_by_url_osdims_key                          |    12924
 top_crashes_by_url_productdims_key                     |    12893
 urldims_url_domain_key                                 |     7620
 urldims                                                |     6290
 urldims_pkey1                                          |     1319
 osdims_name_version_key                                |       16
(10 rows)

That seems like the biggest one (top_crashes_by_signature_osdims_key) is 92MB in size.
Isn't 1 relpage 8KB?
Whoa.. my math was totally off..  

The recent reports tables are around between 2030 MB and 2302 MB each.

The other stuff is

breakpad=# SELECT pg_size_pretty(pg_total_relation_size('top_crashes_by_signature_osdims_key'));
 pg_size_pretty
----------------
 741 MB
(1 row)

breakpad=# SELECT pg_size_pretty(pg_total_relation_size('top_crashes_by_signature_productdims_key'));
 pg_size_pretty
----------------
 740 MB
(1 row)

breakpad=# SELECT pg_size_pretty(pg_total_relation_size('top_crashes_by_url_urldims_key '));
 pg_size_pretty
----------------
 105 MB
(1 row)

breakpad=# SELECT pg_size_pretty(pg_total_relation_size('top_crashes_by_url_osdims_key'));
 pg_size_pretty
----------------
 102 MB
(1 row)
Thanks Aravind - can you run the same query for %report% too
I did, thats how I got those earlier numbers (The recent reports tables are around between 2030 MB and 2302 MB each.)
09:31 < jberkus> aravind actually needs to run pg_total_relation_size()
09:31 < jberkus> which will include the indexes

Can you run that?
Actually, he did on the 2nd run.

However, we need to know *how many* recent reports partitions there are at 2GB each.  That'll give us our sizing; that much data needs to fit into around 60% of RAM.

Alternately, if you had some local disk instead of that slow-as-molasses SAN, hitting disk wouldn't be quite the kiss of death it is now.  Overall, if this system is going to remain in use for the forseeable future, then I think a new server is called for, one with 64-128GB of RAM and decent local disk.
(In reply to comment #10)
> Actually, he did on the 2nd run.
> 
> However, we need to know *how many* recent reports partitions there are at 2GB
> each.  That'll give us our sizing; that much data needs to fit into around 60%
> of RAM.
> 

That has to come from webdev/engineering (how many recent reports partitions are relevant).. We can safely assume that the 2GB number is a lower bound, because we expect the number of reports processed per week to only go up.

Also, in the long term, I'd think that these reports partitions will go away entirely, since this data will be in hadoop?
Aravind,

Morgamic says that you'll still need the cached data for reporting for the most recent 4 weeks, even if the data is coming from hadoop instead of matview generation.

But, once you have your policy, the math is simple:
Total frequently accessed data <= ( RAM * 65% )
Going by that, it would appear that we need about 4 weeks of reports - that's around 8 GB + matview stuff.  The two largest matview tables look like they are around 1.5 GB together.  I don't know enough about how much these tables could grow or how many more such tables we plan on adding.  It seem like we should be able to handle the existing data with the current memory of 16 GB (65% of that is 10.4GB).

I suspect we are missing something else here.

It does seem like we are just at the limit of the system with the current memory on the box.  I don't have a good way of estimating the future growth of the matview tables without engineering/webdev input.
Component: Server Operations → Socorro
Product: mozilla.org → Webtools
Target Milestone: --- → 1.7
-> 1.7.  If the 1.6 push fixes perf, we'll likely skip the expense of new RAM.
This quote seems pertinent:

"""Alternately, if you had some local disk instead of that slow-as-molasses SAN,
hitting disk wouldn't be quite the kiss of death it is now.  Overall, if this
system is going to remain in use for the forseeable future, then I think a new
server is called for, one with 64-128GB of RAM and decent local disk."""

I've never seen a server that seems so slow for such simple queries before. Can we do a trial with this database on a local-disk machine before we go spending $gazillions on RAM? Also, 64-128GB RAM seems extremely large. I would expect something more along the lines of 8-32GB RAM to work out.

I'm asking morgamic for the report from the Postgres consultants so I can sanity check the things I'm saying here.
The slowness is a result of the data not fitting in RAM and disk being incredibly slow.  If you can speed up disk, more RAM isn't necessary.

Also, it would be worth looking at how the matviews are set up.  Multi-GB matviews kind of defeats the purpose of having matviews at all; there should be a way to make them more refined and smaller.  i.e. make the data smaller.
So we're in agreement. Ditch the SAN. Spend the money we would've spent on RAM getting some decent disks which will last us far longer and give us more benefit. It's a sensible financial choice.

It does sound like the material views could use some re-thinking, but I'm not sure I'm prepared to go into that right now.
Yes.  It needs to be good disk, though, like a 6-8 drive HW RAID, or a pair of SSDs.
Here's one possibility mrz and I pow-wowed about.

HP DL360G5 Quickspecs (http://h18000.www1.hp.com/products/quicKspecs/12476_div/12476_div.HTML): it has Up to six SAS/SATA hot plug drive bays. So we can get 6 HP fast drives, set them up in a local array, and see if it works better.

We have 300GB of space used now, so 4TB of total space (6x1.5TB/2) should last quite some time. If anyone has an intelligent guess on how long 4TB would last, please chime in.
6 drive bays but 2 are in use.  Largest capacity SAS drive is 300GB.  Assuming it's mirrored, it's only 600GB.

I'd vote against SATA for any number of reasons here.
Moving to SATA drives might not be a huge improvement due to the lack of control queueing (i.e. can't read and write at the same time).  Also, the outer tracks of those big drives tend to be sloooooooow.

If 2 of the drive bays are already in use, can they be reconfigured to support the transaction log?  If we did that, then putting 4 SAS drives in the remaining bays would probably be an improvement over current.  However, see capacity issue above.

Finally, given that you guys are moving to hadoop for your raw processing, wouldn't it make more sense to have hadoop generate highly targeted, compact matviews -- and get rid of "reports" entirely -- than to try to solve this issue with hardware?  Just a thought.
Morgamic?  Are we going to get more RAM?
We don't need more RAM if we aren't hitting the reports table hard and rely more on matviews as Josh alluded to.
FWIW, I'm working with another company which is using PostgreSQL as the reporting engine in front of Hadoop.  As such, I'm working on some performance optimizations for the use case where PostgreSQL is basically a "cache".  I would love to talk over your eventual design for the application in this light.
The PHX hardware will take care of this requirement; re-open if incorrect.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.