Bug 1422332
Opened 7 years ago
Closed 6 years ago
Only zip extract the files that are suspiciously new
Categories: (Socorro :: Symbols, task)
Tracking: (Not tracked)
Status: RESOLVED WONTFIX
People: (Reporter: peterbe, Unassigned)
See https://cdn-2916.kxcdn.com/cache/37/63/376317ffbee7922e3779914fb49aebaa.png
This is a measurement of executing a Python function that takes a file buffer and sends it into `zipfile.ZipFile(file_buffer).extractall(tmp_dir)`.
It takes a LOT of time: sometimes 30 seconds, and it is not unusual for it to hover around 15 seconds.
A possible solution is to iterate over the zip file's contents (names and sizes) first, and use that list to decide which files are new and need to be extracted to disk.
In terms of implementation, you could simply replace...:

tmpdir = '/some/tmp/dir'
dump_and_extract(file_buffer, tmpdir)
for root, _, files in os.walk(tmpdir):
    for file in files:
        was_it_new = process(file)
With something like this...:

new_names = []
for name, size in extract_zip_file_names_and_sizes(file_buffer):
    if not has_in_s3(name, size):
        new_names.append(name)
tmpdir = '/some/tmp/dir'
dump_and_extract_some_select(new_names, file_buffer, tmpdir)
for root, _, files in os.walk(tmpdir):
    for file in files:
        was_it_new = process(file)
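
Neither extract_zip_file_names_and_sizes nor dump_and_extract_some_select exists yet; the names above are placeholders. A minimal sketch of what they could look like, using only the stdlib zipfile module and assuming file_buffer is a seekable file object:

import zipfile

def extract_zip_file_names_and_sizes(file_buffer):
    # Only the central directory is read; nothing gets extracted.
    with zipfile.ZipFile(file_buffer) as zf:
        for member in zf.infolist():
            yield member.filename, member.file_size

def dump_and_extract_some_select(names, file_buffer, tmpdir):
    # extractall() accepts an optional `members` list, so the members
    # we already know are in S3 never touch the disk.
    with zipfile.ZipFile(file_buffer) as zf:
        zf.extractall(tmpdir, members=names)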
For this to work, we need to know if it's worth it. Load testing of the Stage server depends on using archives of old uploads to Prod.
We need to start observing real Prod traffic and see how common it is for a zip file to come in where only, say, 40% of its files are new to S3.
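
For the has_in_s3(name, size) check above, a minimal sketch using boto3 (the bucket name is hypothetical):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-symbols-bucket"  # hypothetical bucket name

def has_in_s3(name, size):
    # A HEAD request tells us whether the key exists and how big it is,
    # without downloading the object.
    try:
        response = s3.head_object(Bucket=BUCKET, Key=name)
    except ClientError as exception:
        if exception.response["Error"]["Code"] == "404":
            return False
        raise
    # Same key but a different size means the file changed and should
    # still be extracted and re-uploaded.
    return response["ContentLength"] == size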
Reporter
Comment 1 • 7 years ago
By the way, once you've created a zipfile.ZipFile instance, iterating through it to get a list of names and sizes is dirt cheap: the central directory is parsed once when the ZipFile is constructed, so infolist() just returns records that are already in memory.
For example, using this code:

import time
import zipfile

zf = zipfile.ZipFile(file_buffer)
sizes = []
t0 = time.time()
for member in zf.infolist():
    sizes.append((member.filename, member.file_size))
t1 = time.time()
print("GETTING A LIST OF", len(sizes), "TOOK", t1 - t0)
I ran this on a 547MB file and that whole loop took 0.03ms. Milliseconds!
Reporter
Comment 2 • 7 years ago
I ran a sample analysis on the most recent 116 uploads coming into Prod.
30% of the files across all the .zip files could be skipped. Meaning, if we apply the strategy laid out in this bug, we save ourselves 30% of the unzipping work.
Another thing we can do is switch to c5.4xlarge instances and get a 10% boost too.
https://www.peterbe.com/plog/unzip-benchmark-on-aws-ec2-c3.large-vs-c4.large (see last update)
Probably not worth the hassle.
Reporter
Comment 3 • 6 years ago
Another option is to spring for an EC2 c5d instance type, which has local SSD drives.
Another option is to just up the c5 instance type to one that has enough RAM to unzip the whole symbols.zip in memory and not have to mess with disk as a buffer at all.
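
If the whole archive fits in RAM, a minimal sketch of skipping the disk buffer, assuming process() could be changed to accept a name and a bytes payload instead of a path (a hypothetical signature):

import zipfile

def process_all_in_memory(file_buffer):
    with zipfile.ZipFile(file_buffer) as zf:
        for member in zf.infolist():
            payload = zf.read(member.filename)  # bytes in memory, no disk I/O
            was_it_new = process(member.filename, payload)  # hypothetical process()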
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX