Data on S3 has doubled in size

RESOLVED FIXED

Status

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P1
Severity: normal
Reported: 3 years ago
Modified: 3 years ago

People

(Reporter: rvitillo, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details
On the 27th a new edge node was added to aggregate submissions into files to be stored on S3. Adding that node shrank the average file size by about 2x, which in turn is causing the v4 aggregator to take far too long to process a single day and ultimately fail.

Can we do something about it? Maybe aggregating for longer periods?
(Reporter)

Updated

3 years ago
Priority: -- → P1
(Reporter)

Updated

3 years ago
Flags: needinfo?(whd)
(Reporter)

Updated

3 years ago
Flags: needinfo?(whd) → needinfo?(mreid)
(Reporter)

Comment 1

3 years ago
I will try to batch file reads on the analysis job side and see if things improve.
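The batching idea mentioned above can be sketched as a simple grouping step: instead of issuing one read per small S3 file, the analysis job would pull keys in fixed-size groups. This is a minimal illustration with hypothetical key names, not the actual analysis-job code.

```python
def batch(keys, size):
    """Yield successive groups of at most `size` keys, so the analysis
    job issues one read pass per group instead of one per tiny S3 file."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

# Hypothetical example: 10 small S3 keys grouped into batches of 4.
keys = ['telemetry-2/20150728/file%02d' % n for n in range(10)]
batches = list(batch(keys, 4))
```

The batch size would be tuned so that each group amortizes the per-request S3 overhead that the smaller files introduced.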

Comment 2

3 years ago
On the 27th I updated the DWL configuration to use a c3.4xlarge (objects ending with ip-172-31-16-184), but the old DWL was re-enabled some time after that by cron and processed its entire backfill from Kafka (objects ending with ip-172-31-14-40). I've disabled the old DWL, and we should remove all the ip-172-31-14-40 objects written after the ip-172-31-16-184 objects started to appear.

I'm guessing the behavior :rvitillo is seeing is due to the doubling of S3 data, which would also double the large number of smaller files, throwing off the average. The configuration of the old and new DWLs is exactly the same, and removing the redundant data should be sufficient to fix things up.
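The object naming convention used throughout this bug (names ending in `<timestamp>_<hostname>`) can be captured in a small parser that reports which DWL wrote a given object; the example key below is hypothetical but follows that scheme.

```python
def writer_host(key):
    """Object names end with <timestamp>_<hostname>; return the hostname
    of the DWL instance that wrote the object."""
    filename = key.rsplit('/', 1)[-1]
    ts, host = filename.split('_')
    return host

# Hypothetical key following the naming scheme from this bug:
writer_host('telemetry-2/20150728/1438067000.0_ip-172-31-14-40')
# → 'ip-172-31-14-40'
```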
Comment 3

3 years ago
Okay. It's going to take me a bit to find out exactly which files need deleting, but I'll update here when I've figured it out.
Comment 4

3 years ago
"A bit" being 1-4 hours of attention, on the morrow.
(Reporter)

Comment 5

3 years ago
Yeah, the size of the data has more than doubled:

s3cmd du -H s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150721/
2T       s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150721/

s3cmd du -H s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150728/
5T       s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150728/
(Reporter)

Updated

3 years ago
Summary: Files on S3 are too small → Data on S3 has doubled in size

Updated

3 years ago
Flags: needinfo?(mreid)
Comment 6

3 years ago
I've determined the list of files to remove by piping the output of this Python script to a file:

> from boto import connect_s3
> 
> s3 = connect_s3()
> bucket = s3.get_bucket('net-mozaws-prod-us-west-2-pipeline-data')
> 
> # Object names end with <timestamp>_<hostname>; find when the new DWL
> # (ip-172-31-16-184) first started writing on the 27th.
> first = float('inf')
> for key in bucket.list('telemetry-2/20150727'):
>     ts, host = key.key.rsplit('/', 1).pop().split('_')
>     if host == 'ip-172-31-16-184':
>         first = min(first, float(ts))
> 
> # On the 27th, old-DWL objects are redundant only after that point.
> for key in bucket.list('telemetry-2/20150727'):
>     ts, host = key.key.rsplit('/', 1).pop().split('_')
>     if host == 'ip-172-31-14-40' and float(ts) > first:
>         print('s3://%s/%s' % (bucket.name, key.key))
> 
> # From the 28th onward, every old-DWL object is redundant.
> for day in ('20150728', '20150729', '20150730', '20150731'):
>     for key in bucket.list('telemetry-2/%s' % day):
>         _, host = key.key.rsplit('/', 1).pop().split('_')
>         if host == 'ip-172-31-14-40':
>             print('s3://%s/%s' % (bucket.name, key.key))
94959 files were found.
rvitillo and mreid have approved the list of 94959 files for deletion. Starting the delete now.
Delete completed.

Updated

3 years ago
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED