Closed Bug 1134621 (opened 10 years ago, closed 10 years ago)

Stage's DB disk usage has increased 63GB (23%) in the last 48 hours

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

At 3pm UTC on the 18th, stage db1 was using 271GB on disk (out of 394GB). As of the same time today (48 hours later), it's up to 334GB (and similar for stage db2). That's a 23% increase in usage; total disk utilisation is now at 85% and is causing alerts.

What's causing this sudden spike? Taskcluster is submitting more data now (and just to stage, not prod), but even so, this is quite excessive.

Things to check/do:
1) We recently lowered the data lifecycle on stage from 4 months to 2 months - are there issues with this?
2) Given that stage has 400GB of total disk and prod has 1TB, stage should have a data lifecycle of less than half of prod's, whereas at the moment it's exactly half (2 months vs 4 months). We should perhaps lower it to 6 weeks.
3) On prod, Sheeri reduced the length of time we keep logs for, to reduce their size on disk. I seem to recall we haven't done the same for stage. We should.
Depends on: 1131603
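For reference, one way to narrow down which schemas account for growth like this is a per-schema size query against MySQL's information_schema. This is just a rough sketch (it reports logical sizes, which won't exactly match on-disk usage):

-- Approximate per-schema size (data + indexes), largest first.
-- Note: logical sizes only; fragmentation/free space isn't included,
-- so the totals won't exactly match `du` on the datadir.
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
GROUP BY table_schema
ORDER BY size_gb DESC
LIMIT 20;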
1) I manually truncated some of the objectstore tables on stage last week, to buy us time before the disk size was increased. As such, we'll be filling those back up to the 2 month point now, making the per-day growth rate look higher than it will be once we've stabilised. That said, on prod, 4 months of objectstore was only about 25GB, so this would only account for ~0.5GB in the last 48 hours (excluding the Taskcluster influence).

2) The stage data lifecycle is being reduced from 60 days to 45 days in bug 1131603. It makes sense to have a figure less than half of prod's, seeing as stage only has 40% of the disk of prod.

3) The current oldest job in mozilla_inbound_jobs_1.job has a submit_timestamp of '1418399931', which is equivalent to 12/12/2014 @ 3:58pm (UTC), i.e. ~67 days ago. It also has a job id of '1', so it looks like expiration has never run on stage (though stage has only been working for 7 days longer than the current data expiration period, so this won't gain a huge amount without #2 above as well).
Stage:

> SELECT id, from_unixtime(push_timestamp) FROM mozilla_inbound_jobs_1.result_set ORDER BY id LIMIT 10

+ ---- + ------------------------------- +
| id   | from_unixtime(push_timestamp)   |
+ ---- + ------------------------------- +
| 1    | 2014-12-12 15:54:15             |
| 2    | 2014-12-12 13:52:56             |
| 3    | 2014-12-12 09:50:46             |
| 4    | 2014-12-12 12:46:45             |
| 5    | 2014-12-12 12:17:45             |
| 6    | 2014-12-12 13:13:14             |
| 7    | 2014-12-12 12:57:26             |
| 8    | 2014-12-12 14:41:39             |
| 9    | 2014-12-12 15:29:17             |
| 10   | 2014-12-12 13:57:49             |
+ ---- + ------------------------------- +
10 rows

Prod DB looks fine in comparison.

Looking at New Relic for the last 7 days...

Prod:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors?search[name]=cycle-data&search[sort_by]=count&search[type]=all&tw[end]=1424362279&tw[start]=1423757479&utf8=%E2%9C%93
-> 1 exception: "MySQL server has gone away"

Stage:
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/3133695802/similar_errors?original_error_id=3133695802&tw[end]=1424362233&tw[start]=1423757433
-> 7 exceptions, i.e. one every time it ran: 'NoneType' object has no attribute 'open'

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/newrelic-2.44.0.36/newrelic/hooks/application_celery.py", line 66, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/tasks.py", line 35, in cycle_data
    call_command('cycle_data')
  File "/usr/lib/python2.7/site-packages/django/core/management/__init__.py", line 161, in call_command
    return klass.execute(*args, **defaults)
  File "/usr/lib/python2.7/site-packages/django/core/management/base.py", line 255, in execute
    output = self.handle(*args, **options)
  File "/usr/lib64/python2.7/site-packages/newrelic-2.44.0.36/newrelic/api/function_trace.py", line 110, in literal_wrapper
    return wrapped(*args, **kwargs)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/management/commands/cycle_data.py", line 70, in handle
    num_deleted, project))
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 39, in __exit__
    self.disconnect()
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 125, in disconnect
    dhub.disconnect()
  File "/usr/lib/python2.7/site-packages/datasource/bases/SQLHub.py", line 245, in disconnect
    if self.connection[host_type]['con_obj'].open:
AttributeError: 'NoneType' object has no attribute 'open'

This is the exception I commented on in bug 1134140 comment 1.
Bingo:

[2015-02-19 00:00:00,065: INFO/MainProcess] Received task: cycle-data[388a6118-90b3-4393-b9e3-3c4a206bbbea]
...
[2015-02-19 00:00:00,141: WARNING/Worker-12] datasource.hubs.MySQL: try_to_connect OperationalError encountered on attempt 0. Database:accessibility_jobs_1
[2015-02-19 00:00:00,141: WARNING/Worker-12] Error detected was: (1049, "Unknown database 'accessibility_jobs_1'")
[2015-02-19 00:00:01,147: WARNING/Worker-12] datasource.hubs.MySQL: try_to_connect OperationalError encountered on attempt 1. Database:accessibility_jobs_1
[2015-02-19 00:00:01,147: WARNING/Worker-12] Error detected was: (1049, "Unknown database 'accessibility_jobs_1'")
...
[2015-02-19 00:00:19,262: WARNING/Worker-12] datasource.hubs.MySQL: try_to_connect OperationalError encountered on attempt 19. Database:accessibility_jobs_1
[2015-02-19 00:00:19,262: WARNING/Worker-12] Error detected was: (1049, "Unknown database 'accessibility_jobs_1'")
[2015-02-19 00:00:20,270: ERROR/MainProcess] Task cycle-data[388a6118-90b3-4393-b9e3-3c4a206bbbea] raised unexpected: AttributeError("'NoneType' object has no attribute 'open'",)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/newrelic-2.44.0.36/newrelic/hooks/application_celery.py", line 66, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/tasks.py", line 35, in cycle_data
    call_command('cycle_data')
  File "/usr/lib/python2.7/site-packages/django/core/management/__init__.py", line 115, in call_command
    return klass.execute(*args, **defaults)
  File "/usr/lib/python2.7/site-packages/django/core/management/base.py", line 338, in execute
    output = self.handle(*args, **options)
  File "/usr/lib64/python2.7/site-packages/newrelic-2.44.0.36/newrelic/api/function_trace.py", line 110, in literal_wrapper
    return wrapped(*args, **kwargs)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/management/commands/cycle_data.py", line 70, in handle
    num_deleted, project))
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 41, in __exit__
    self.disconnect()
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 126, in disconnect
    dhub.disconnect()
  File "/usr/lib/python2.7/site-packages/datasource/bases/SQLHub.py", line 245, in disconnect
    if self.connection[host_type]['con_obj'].open:
AttributeError: 'NoneType' object has no attribute 'open'

bc mentioned getting "OperationalError 2006's since about Django 1.5/1.6" in:
https://github.com/jeads/datasource/pull/17

We actually have that fix in the repo - but because the packages in pure.txt are installed globally, and the global path comes before our own vendor directory on the Python path, I believe we're not using the fix. Either uninstalling datasource from site-packages or pip2.7 installing the new revision should fix the problem.
Depends on: 1134140
Depends on: 1134804
No longer depends on: 1134804
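For reference, a quick way to confirm which per-repository schemas actually exist on the host (and that accessibility_jobs_1 really is absent) is to query information_schema directly; a minimal sketch:

-- List the per-repository job schemas present on this host; anything the
-- datasource config references that doesn't appear here (e.g.
-- accessibility_jobs_1) will produce the repeated 1049 errors above.
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name LIKE '%\_jobs\_%'
ORDER BY schema_name;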
So after a slight 2+2=5 diversion (end of comment 4), which at least resulted in fixing another problem (bug 1134740) even if it wasn't directly related, I finally read the error at the start of comment 4 properly and have resolved the problem causing the cycle-data exception, in bug 1134804. I've manually triggered a cycle-data using |python2.7 manage.py cycle_data| on stage, which of course is taking quite a while (particularly given we've just dropped the lifecycle from 60 days to 45), but it is succeeding so far without exceptions. I'm presuming we'll need a defrag after this.
Depends on: 1134804
No longer depends on: 1134140
Assignee: nobody → emorley
Status: NEW → ASSIGNED
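For reference, a rough way to estimate how much a defrag of the kind mentioned above might reclaim is to look at InnoDB's per-table data_free; a hedged sketch, not necessarily what the DBAs use:

-- Tables with the most reclaimable space, largest first.
-- data_free is only an estimate of what a table rebuild might free.
SELECT table_schema,
       table_name,
       ROUND(data_free / 1024 / 1024 / 1024, 1) AS free_gb
FROM information_schema.tables
WHERE data_free > 1024 * 1024 * 1024
ORDER BY data_free DESC
LIMIT 20;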
I ran several manual cycle-data runs, with increasingly smaller date ranges, to clear the backlog. cycle-data now completes with the 45-day interval without timing out, and data expiration after 45 days (for stage) is definitely working:

> SELECT id, from_unixtime(submit_timestamp) FROM mozilla_inbound_jobs_1.job ORDER BY id LIMIT 5

+ -------- + --------------------------------- +
| id       | from_unixtime(submit_timestamp)   |
+ -------- + --------------------------------- +
| 535077   | 2015-01-06 13:58:41               |
| 535078   | 2015-01-06 13:58:41               |
| 535079   | 2015-01-06 13:58:41               |
| 535080   | 2015-01-06 13:58:41               |
| 535081   | 2015-01-06 13:58:41               |
+ -------- + --------------------------------- +
5 rows

Sheeri, we're now ready to defrag treeherder-stage, if you think it needs it - thanks :-)
Flags: needinfo?(scabral)
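A complementary check, sketched here for illustration: counting rows older than the cycle interval should come back as zero once expiration is keeping up.

-- Should return 0 once cycle-data is keeping up with the 45 day lifecycle.
SELECT COUNT(*) AS stale_rows
FROM mozilla_inbound_jobs_1.job
WHERE submit_timestamp < UNIX_TIMESTAMP(NOW() - INTERVAL 45 DAY);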
Started some defrag half an hour ago. So far we've regained 7GB:

cedar_jobs_1: 1.1G -> 935M
cypress_jobs_1: 1.5G -> 776M
mozilla_esr31_jobs_1: 1.8G -> 227M
gaia_try_jobs_1: 2.6G -> 1.9G
mozilla_beta_jobs_1: 6.5G -> 3.9G
mozilla_aurora_jobs_1: 12G -> 7.7G

and we're working on the larger dbs now, which should gain more space.
Flags: needinfo?(scabral)
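For reference, the per-table rebuilds behind a defrag like this are typically done with OPTIMIZE TABLE or an equivalent online rebuild; the exact commands used here aren't stated in the bug, so this is just an illustrative sketch (table name included as an example):

-- For InnoDB, OPTIMIZE TABLE maps to a full table rebuild
-- (equivalent to ALTER TABLE ... FORCE) and reclaims free space;
-- depending on the MySQL version it may block writes, so run off-peak.
OPTIMIZE TABLE mozilla_aurora_jobs_1.job;
-- Repeat for each of the large tables in the fragmented schemas.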
Update: we've gained another 15G:

b2g_inbound_jobs_1: 20G -> 13G
mozilla_central_jobs_1: 22G -> 14G
Update: 39G total freed up. Here's the latest:

try_jobs_1: 38G -> 29G
fx_team_jobs_1: 40G -> 27G

We still have to defragment the mozilla_inbound_jobs_1 db, but we want to wait until tomorrow before 9am Pacific to do so (even though it is stage).
That's great - thank you :-)
Because we have the room, I have set expire_logs_days back to 2.
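For reference, that change would typically look something like the following; a sketch, not the exact commands used here:

-- Runtime change: keep only ~2 days of binary logs.
SET GLOBAL expire_logs_days = 2;
-- Expired binlogs are removed at the next log rotation; to trigger one now:
FLUSH LOGS;
-- Also set expire_logs_days = 2 under [mysqld] in my.cnf so the setting
-- persists across server restarts.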
I have defragmented mozilla_inbound_jobs_1 and it went from 100G to 77G. w00t!
Thank you :-)
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED