Closed
Bug 1399192
Opened 7 years ago
Closed 7 years ago
Upload by using disk for inboxing archive files
Categories
(Socorro :: Symbols, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: peterbe, Assigned: peterbe)
Details
Attachments
(3 files)
At the moment, when we get a symbol upload, we look inside the archive file to validate that it looks sane, then we upload it to S3 (aka the "inbox" file), record that, and hand it off to the Celery worker. It turns out (https://bugzilla.mozilla.org/show_bug.cgi?id=1399140) that that S3 upload is potentially too slow for our architecture. Writing it to regular local disk is a non-starter because our Celery workers don't share a file system with the web heads (well, they do in local docker dev). With AWS EFS we can share disk between the web heads and the Celery workers without incurring the cost of uploading the payload to S3 for temporary storage.
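To illustrate the idea, here is a minimal sketch of the "write to a shared inbox directory instead of S3" step. The function name and signature are hypothetical, not Tecken's actual API; the write-then-rename dance is just one way to avoid a worker seeing a partial file.

```python
import os
import tempfile

def save_to_inbox(file_obj, filename, inbox_dir):
    """Write an uploaded archive into the shared (EFS-backed) inbox
    directory so a Celery worker on another host can pick it up.
    Hypothetical sketch; names are illustrative."""
    os.makedirs(inbox_dir, exist_ok=True)
    # Write to a temp file first, then rename into place, so a worker
    # never observes a half-written archive.
    fd, tmp_path = tempfile.mkstemp(dir=inbox_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(file_obj.read())
    final_path = os.path.join(inbox_dir, filename)
    os.rename(tmp_path, final_path)
    return final_path
```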
Comment 1•7 years ago
Comment 2•7 years ago
Can you create an EFS config on Dev, Stage and Prod? The env var is called DJANGO_UPLOAD_INBOX_DIRECTORY [0]. The directory doesn't have to exist as long as Django can 1) create it if it doesn't exist and 2) create a temp file in it if it does exist. The PR mentioned above assumes it to be set. There's a dockerflow health check that also checks that it can write to that directory. [0] https://github.com/mozilla-services/tecken/pull/395/files#diff-eead337ed57b42e703fe8f51dc925c61R164
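The two requirements on the directory can be sketched as a small check function. This is a hypothetical standalone version, not the actual dockerflow check in the PR above:

```python
import os
import tempfile

def check_upload_inbox_directory(directory):
    """Return None if the inbox directory is usable, else an error string.
    Hypothetical sketch of the health check described above."""
    try:
        # 1) create the directory if it doesn't exist
        os.makedirs(directory, exist_ok=True)
        # 2) prove we can create a temp file in it
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError as exc:
        return f"Cannot write to {directory}: {exc}"
    return None
```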
Assignee: nobody → peterbe
Flags: needinfo?(miles)
Comment 3•7 years ago
Commit pushed to master at https://github.com/mozilla-services/tecken

https://github.com/mozilla-services/tecken/commit/98580bee35c8c2b8fd9fde11d4b3f35da993c229

bug 1399192 - Upload by using disk for inboxing archive files (#395)

* bug 1399192 - Upload by using disk for inboxing archive files
* adding docs
* adding a dockerflow test check_upload_inbox_directory
* adding temp_upload_inbox_directory
* use tempfile
Comment 4•7 years ago
So I've tested it on Dev. I uploaded a bunch of files and kept an eye on the logs, looking for this line: https://github.com/mozilla-services/tecken/blob/98580bee35c8c2b8fd9fde11d4b3f35da993c229/tecken/upload/views.py#L243

After analyzing the log files I find these results:

MEAN   40.2MB/sec
MEDIAN 28.2MB/sec

It's not immediately obvious which statistic to use, but let's focus on the median. It's only 40% faster than S3 uploads! Sigh.

Also, at the time of writing I keep getting so many Gunicorn timeouts that I went ahead and changed the default Gunicorn timeout to 120 seconds: https://github.com/mozilla-services/tecken/pull/400/files
Basically, I've not yet been able to write a file bigger than 400MB.
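The kind of log analysis described above can be sketched like this. The log format and regex are assumptions for illustration; the real log line is the one linked in views.py:

```python
import re
import statistics

def throughputs_from_log(lines):
    """Pull 'NN.NMB/sec' figures out of log lines and return
    (mean, median). The pattern is a guess at the log format."""
    rates = [
        float(m.group(1))
        for line in lines
        if (m := re.search(r"([\d.]+)MB/sec", line))
    ]
    return statistics.mean(rates), statistics.median(rates)
```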
Comment 5•7 years ago
Hmm... Another frustrating thing is: how the heck did this happen? https://sentry.prod.mozaws.net/operations/symbols-dev/issues/648668/

It basically means Django wrote a file to EFS disk and, as soon as it considered itself done, kicked off the Celery task. And when the Celery task started, it complained that the file didn't exist. Are we writing to disk in an async way?? If so, it could mean that the time it takes to package up the Celery task and for the Celery task to start is actually *faster* than it takes EFS to *fully* write the file.

An easy band-aid would be something like this (note the attempt counter has to live outside the loop, and the exception is FileNotFoundError):

    attempts = 1
    while True:
        try:
            with open(upload.inbox_filepath, 'rb') as f:
                buf.write(f.read())
            break
        except FileNotFoundError:
            if attempts >= 5:
                raise
            logger.debug(f'File not found. Trying again in {attempts} seconds')
            time.sleep(attempts)
            attempts += 1
Comment 6•7 years ago
EFS has fairly strong consistency guarantees. Copied from [0]:

"Amazon EFS provides the open-after-close consistency semantics that applications expect from NFS. Amazon EFS provides stronger consistency guarantees than open-after-close semantics depending on the access pattern. Applications that perform synchronous data access and perform non-appending writes will have read-after-write consistency for data access."

So, in theory we should be good here. The speeds aren't great, but it seems like EFS gives speed credits based on standing usage of the filesystem, whereas we're mostly using it as "swap" space.

One possibility I was considering was that there could have been extra tecken instances lying around that didn't have the EFS change but _did_ have Celery instances that could claim the task. However, upon looking into it I didn't see any extras. This also shouldn't be a problem going forward.

I _do_ think adding some kind of retry / brief delay logic as you mentioned above would be a good workaround.

[0] http://docs.aws.amazon.com/efs/latest/ug/using-fs.html#consistency
Flags: needinfo?(miles)
Comment 7•7 years ago
Comment 8•7 years ago
Commit pushed to master at https://github.com/mozilla-services/tecken

https://github.com/mozilla-services/tecken/commit/d21ffd8e4453d2496684de949e5afef80ac3dc8e

bug 1399192 - retry around filenotfounderror (#402)
Comment 9•7 years ago
Comment 10•7 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #4)
> So I've tested it on Dev. I upload a bunch of files and keep an eye on the
> logs looking for this line:
> https://github.com/mozilla-services/tecken/blob/
> 98580bee35c8c2b8fd9fde11d4b3f35da993c229/tecken/upload/views.py#L243
>
> After analyzing the log files I find these results:
>
> MEAN   40.2MB/sec
> MEDIAN 28.2MB/sec
>
> It's not immediately obvious which statistic to use but let's focus on the
> median. It's only 40% faster than S3 uploads!
> Sigh.
>
> Also, at the time of writing I keep getting so many Gunicorn timeouts that I
> went ahead and changed the default Gunicorn timeout to 120 seconds
> https://github.com/mozilla-services/tecken/pull/400/files
> Basically, I've not yet been able to write a file bigger than 400MB.

Update after having successfully uploaded 120 files (no idea how many have failed, but it's a LOT more):

MEAN   26.4MB/sec
MEDIAN 24.9MB/sec
Comment 11•7 years ago
Commit pushed to master at https://github.com/mozilla-services/tecken

https://github.com/mozilla-services/tecken/commit/b28165a06416cd9801f254e961f08050bcb48785

bug 1399192 - remove filenotfounderror retry in task (#403)
Comment 12•7 years ago
In terms of code, this is pretty complete. It works, but it doesn't work well enough. Let's do https://bugzilla.mozilla.org/show_bug.cgi?id=1405073 instead.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED