Closed Bug 1189062 Opened 9 years ago Closed 9 years ago

Data on S3 has doubled in size

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Unassigned)

Details

On the 27th a new edge node was added to aggregate submissions into files to be stored on S3. Adding that node shrank the average file size by about 2x, which in turn is causing the v4 aggregator to take far too long to process a single day and ultimately fail.

Can we do something about it? Maybe aggregating for longer periods?
Priority: -- → P1
Flags: needinfo?(whd)
Flags: needinfo?(whd) → needinfo?(mreid)
I will try to batch file reads on the analysis job side and see if things improve.
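For illustration, a minimal sketch of what batching the reads could look like with boto, assuming the analysis job currently fetches each small S3 object individually; the batch size and the process() step are placeholders, not the actual v4 aggregator code:

> # Hypothetical sketch only: group many small S3 objects into larger read
> # batches so per-file overhead is amortized. The batch size and the
> # process() step are placeholders, not the actual v4 aggregator code.
> from boto import connect_s3
>
> BATCH_SIZE = 32  # assumed value; tune to the job's memory budget
>
> def batched(iterable, n):
>     # Yield lists of up to n items from iterable.
>     batch = []
>     for item in iterable:
>         batch.append(item)
>         if len(batch) == n:
>             yield batch
>             batch = []
>     if batch:
>         yield batch
>
> s3 = connect_s3()
> bucket = s3.get_bucket('net-mozaws-prod-us-west-2-pipeline-data')
>
> for keys in batched(bucket.list('telemetry-2/20150728'), BATCH_SIZE):
>     # Fetch a whole batch of small files before handing them to the
>     # aggregation step, instead of processing one tiny file at a time.
>     payloads = [key.get_contents_as_string() for key in keys]
>     # process(payloads)  # placeholder for the actual analysis step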
On the 27th I updated the DWL configuration to use a c3.4xlarge (objects ending with ip-172-31-16-184), but the old DWL was re-enabled some time after that by cron and processed its entire backfill from Kafka (objects ending with ip-172-31-14-40). I've disabled the old DWL, and we should remove all the ip-172-31-14-40 objects created after the ip-172-31-16-184 objects started to appear.

I'm guessing the behavior :rvitillo is seeing is due to the doubling of the S3 data, which would also double the already large number of small files, throwing off the average file size. The configuration of the old and new DWLs is identical, and removing the redundant data should be sufficient to fix things up.
okay. it's going to take me a bit to find out exactly which files need deleting, but i'll update here when i've figured it out.
'a bit' being 1-4 hours of attention, on the morrow
Yeah, the size of the data has more than doubled:

s3cmd du -H s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150721/
2T       s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150721/

s3cmd du -H s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150728/
5T       s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150728/
Summary: Files on S3 are too small → Data on S3 has doubled in size
Flags: needinfo?(mreid)
I've determined the list of files to remove by piping the output of this Python script to a file:

> from boto import connect_s3
> 
> s3 = connect_s3()
> bucket = s3.get_bucket('net-mozaws-prod-us-west-2-pipeline-data')
> 
> # Find the timestamp of the first object written by the new DWL
> # (ip-172-31-16-184) on the 27th; key names end in <ts>_<host>.
> first = float('inf')
> for key in bucket.list('telemetry-2/20150727'):
>     ts, host = key.key.rsplit('/', 1).pop().split('_')
>     if host == 'ip-172-31-16-184':
>         first = min(first, float(ts))
> 
> # On the 27th, only old-DWL objects written after the new DWL came up
> # are redundant.
> for key in bucket.list('telemetry-2/20150727'):
>     ts, host = key.key.rsplit('/', 1).pop().split('_')
>     if host == 'ip-172-31-14-40' and float(ts) > first:
>         print('s3://%s/%s' % (bucket.name, key.key))
> 
> # From the 28th onward, every old-DWL object is redundant.
> for day in ('20150728', '20150729', '20150730', '20150731'):
>     for key in bucket.list('telemetry-2/%s' % day):
>         _, host = key.key.rsplit('/', 1).pop().split('_')
>         if host == 'ip-172-31-14-40':
>             print('s3://%s/%s' % (bucket.name, key.key))
94959 files were found
rvitillo and mreid have approved the list of 94959 files for deletion. starting delete now.
delete completed.
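For reference, a minimal sketch of how a list of s3:// URLs like the one above could be bulk-deleted with boto; the file name 'to_delete.txt' is assumed, and the actual deletion mechanism used here is not recorded in the bug:

> # Hypothetical sketch only: bulk-delete the s3:// URLs collected above.
> # Assumes the script's output was piped to 'to_delete.txt', one URL per
> # line; the actual tool used for the delete is not recorded in this bug.
> from boto import connect_s3
>
> s3 = connect_s3()
> bucket = s3.get_bucket('net-mozaws-prod-us-west-2-pipeline-data')
> prefix = 's3://%s/' % bucket.name
>
> with open('to_delete.txt') as f:
>     keys = [line.strip()[len(prefix):] for line in f if line.strip()]
>
> # Multi-object delete accepts up to 1000 keys per request, so chunk the list.
> for i in range(0, len(keys), 1000):
>     bucket.delete_keys(keys[i:i + 1000])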
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard