Closed Bug 1189062 Opened 9 years ago Closed 9 years ago

Data on S3 has doubled in size

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Unassigned)

Details

On the 27th a new edge node was added to aggregate submissions into files to be stored on S3. Adding that node shrank the average file size by about 2x, which in turn is causing the v4 aggregator to take far too long to process a single day and ultimately fail.

Can we do something about it? Maybe aggregating for longer periods?
Priority: -- → P1
Flags: needinfo?(whd)
Flags: needinfo?(whd) → needinfo?(mreid)
I will try to batch file reads on the analysis job side and see if things improve.
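For illustration, a minimal sketch of what batching the reads could look like with boto, assuming the analysis job currently fetches each small S3 object individually; the batch size and the process() step are placeholders, not the actual v4 aggregator code:

> # Hypothetical sketch only: group many small S3 objects into larger read
> # batches so per-file overhead is amortized. The batch size and the
> # process() step are placeholders, not the actual v4 aggregator code.
> from boto import connect_s3
>
> BATCH_SIZE = 32  # assumed value; tune to the job's memory budget
>
> def batched(iterable, n):
>     # Yield lists of up to n items from iterable.
>     batch = []
>     for item in iterable:
>         batch.append(item)
>         if len(batch) == n:
>             yield batch
>             batch = []
>     if batch:
>         yield batch
>
> s3 = connect_s3()
> bucket = s3.get_bucket('net-mozaws-prod-us-west-2-pipeline-data')
>
> for keys in batched(bucket.list('telemetry-2/20150728'), BATCH_SIZE):
>     # Fetch a whole batch of small files before handing them to the
>     # aggregation step, instead of processing one tiny file at a time.
>     payloads = [key.get_contents_as_string() for key in keys]
>     # process(payloads)  # placeholder for the actual analysis step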
On the 27th I updated the DWL configuration to use a c3.4xlarge (objects ending with ip-172-31-16-184), but the old DWL was re-enabled some time after that by cron and processed its entire backfill from Kafka (objects ending with ip-172-31-14-40). I've disabled the old DWL, and we should remove all the ip-172-31-14-40 objects created after the ip-172-31-16-184 objects started to appear.

I'm guessing the behavior :rvitillo is seeing is due to the doubling of the S3 data, which would also double the already large number of small files, throwing off the average file size. The configuration of the old and new DWLs is identical, and removing the redundant data should be sufficient to fix things up.
okay. it's going to take me a bit to find out exactly which files need deleting, but i'll update here when i've figured it out.
'a bit' being 1-4 hours of attention, on the morrow
Yeah, the size of the data has more than doubled:

s3cmd du -H s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150721/
2T       s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150721/

s3cmd du -H s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150728/
5T       s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20150728/
Summary: Files on S3 are too small → Data on S3 has doubled in size
Flags: needinfo?(mreid)
I've determined the list of files to remove by piping the output of this Python script to a file:

> from boto import connect_s3
> 
> s3 = connect_s3()
> bucket = s3.get_bucket('net-mozaws-prod-us-west-2-pipeline-data')
> 
> # Find the timestamp of the first object written by the new DWL
> # (ip-172-31-16-184) on the 27th; key names end in <ts>_<host>.
> first = float('inf')
> for key in bucket.list('telemetry-2/20150727'):
>     ts, host = key.key.rsplit('/', 1).pop().split('_')
>     if host == 'ip-172-31-16-184':
>         first = min(first, float(ts))
> 
> # On the 27th, only old-DWL objects written after the new DWL came up
> # are redundant.
> for key in bucket.list('telemetry-2/20150727'):
>     ts, host = key.key.rsplit('/', 1).pop().split('_')
>     if host == 'ip-172-31-14-40' and float(ts) > first:
>         print('s3://%s/%s' % (bucket.name, key.key))
> 
> # From the 28th onward, every old-DWL object is redundant.
> for day in ('20150728', '20150729', '20150730', '20150731'):
>     for key in bucket.list('telemetry-2/%s' % day):
>         _, host = key.key.rsplit('/', 1).pop().split('_')
>         if host == 'ip-172-31-14-40':
>             print('s3://%s/%s' % (bucket.name, key.key))
94959 files were found
rvitillo and mreid have approved the list of 94959 files for deletion. starting delete now.
delete completed.
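For reference, a minimal sketch of how a list of s3:// URLs like the one above could be bulk-deleted with boto; the file name 'to_delete.txt' is assumed, and the actual deletion mechanism used here is not recorded in the bug:

> # Hypothetical sketch only: bulk-delete the s3:// URLs collected above.
> # Assumes the script's output was piped to 'to_delete.txt', one URL per
> # line; the actual tool used for the delete is not recorded in this bug.
> from boto import connect_s3
>
> s3 = connect_s3()
> bucket = s3.get_bucket('net-mozaws-prod-us-west-2-pipeline-data')
> prefix = 's3://%s/' % bucket.name
>
> with open('to_delete.txt') as f:
>     keys = [line.strip()[len(prefix):] for line in f if line.strip()]
>
> # Multi-object delete accepts up to 1000 keys per request, so chunk the list.
> for i in range(0, len(keys), 1000):
>     bucket.delete_keys(keys[i:i + 1000])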
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard