Closed Bug 1399192 Opened 7 years ago Closed 7 years ago

Upload by using disk for inboxing archive files

Categories

(Socorro :: Symbols, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: peterbe)

Details

Attachments

(3 files)

At the moment, when we get a symbol upload, we look into the archive file to validate that it looks sane, then we upload it into S3 (aka the "inbox" file), record that, and hand the task off to the Celery worker. 

It turns out (https://bugzilla.mozilla.org/show_bug.cgi?id=1399140) that this S3 upload is potentially too slow for our architecture. Writing it to regular local disk is an absolute no, because our Celery workers don't share a file system with the web heads. (Well, they do in local Docker dev.)

With AWS EFS we can share a disk between the web heads and the Celery workers without incurring the cost of uploading the payload to S3 for temporary storage.
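Concretely, the flow we're after is roughly this (a sketch for illustration only, not the actual tecken code; the settings attribute and the Celery task name are assumptions)::

 import os

 from django.conf import settings

 # Hypothetical Celery task name, for illustration only.
 from tecken.upload.tasks import upload_inbox_upload


 def store_inbox_file(upload_obj, file_buffer):
     # Shared EFS mount, configured via DJANGO_UPLOAD_INBOX_DIRECTORY.
     directory = settings.UPLOAD_INBOX_DIRECTORY
     os.makedirs(directory, exist_ok=True)
     filepath = os.path.join(directory, f'{upload_obj.id}.zip')
     with open(filepath, 'wb') as destination:
         destination.write(file_buffer.read())
     # The web head and the Celery worker see the same EFS mount, so the
     # worker can open `filepath` directly instead of fetching an inbox
     # copy from S3.
     upload_inbox_upload.delay(upload_obj.id, filepath)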
Can you create an EFS config on Dev, Stage, and Prod? The env var is called DJANGO_UPLOAD_INBOX_DIRECTORY [0].
The directory doesn't have to exist ahead of time, as long as Django can 1) create it if it doesn't exist and 2) create a temp file in it if it does exist. 

The PR mentioned above assumes it is set. There's a Dockerflow health check that also verifies it can write to that directory. 
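For reference, a minimal sketch of what such a Dockerflow-style check could look like (the real check in the PR is called check_upload_inbox_directory; the settings name and the error id below are assumptions, not the actual code)::

 import os
 import tempfile

 from django.conf import settings
 from django.core import checks


 @checks.register()
 def check_upload_inbox_directory(app_configs, **kwargs):
     errors = []
     # Hypothetical settings name, derived from DJANGO_UPLOAD_INBOX_DIRECTORY.
     directory = settings.UPLOAD_INBOX_DIRECTORY
     try:
         # 1) create the directory if it doesn't exist...
         os.makedirs(directory, exist_ok=True)
         # 2) ...and make sure we can write a temp file in it.
         with tempfile.NamedTemporaryFile(dir=directory):
             pass
     except OSError as exception:
         errors.append(checks.Error(
             f'Cannot write to {directory!r}: {exception}',
             id='tecken.upload.E001',  # made-up id for illustration
         ))
     return errors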


[0] https://github.com/mozilla-services/tecken/pull/395/files#diff-eead337ed57b42e703fe8f51dc925c61R164
Assignee: nobody → peterbe
Flags: needinfo?(miles)
Commit pushed to master at https://github.com/mozilla-services/tecken

https://github.com/mozilla-services/tecken/commit/98580bee35c8c2b8fd9fde11d4b3f35da993c229
bug 1399192 - Upload by using disk for inboxing archive files (#395)

* bug 1399192 - Upload by using disk for inboxing archive files

* adding docs

* adding a dockerflow test check_upload_inbox_directory

* adding temp_upload_inbox_directory

* use tempfile
So I've tested it on Dev. I upload a bunch of files and keep an eye on the logs looking for this line: https://github.com/mozilla-services/tecken/blob/98580bee35c8c2b8fd9fde11d4b3f35da993c229/tecken/upload/views.py#L243

After analyzing the log files I find these results::

 MEAN 40.2MB/sec
 MEDIAN 28.2MB/sec

It's not immediately obvious which statistic to use but let's focus on the median. It's only 40% faster than S3 uploads!
Sigh. 
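(For what it's worth, the aggregation itself is trivial; something along these lines, where the log file name and the exact line format are assumptions)::

 import re
 import statistics

 speeds = []
 with open('upload.log') as logfile:  # hypothetical log file name
     for line in logfile:
         # Assumes the interesting lines contain something like "... 28.2MB/s".
         match = re.search(r'(\d+(?:\.\d+)?)MB/s', line)
         if match:
             speeds.append(float(match.group(1)))

 print(f'MEAN   {statistics.mean(speeds):.1f}MB/sec')
 print(f'MEDIAN {statistics.median(speeds):.1f}MB/sec')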

Also, at the time of writing I keep getting so many Gunicorn timeouts that I went ahead and changed the default Gunicorn timeout to 120 seconds https://github.com/mozilla-services/tecken/pull/400/files
Basically, I've not yet been able to write a file bigger than 400MB.
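(The timeout change itself boils down to something like this in a gunicorn config module; only the 120-second value comes from the PR, the other settings are made-up examples)::

 # Illustrative gunicorn.conf.py; gunicorn's default timeout is 30 seconds.
 bind = '0.0.0.0:8000'
 workers = 4
 timeout = 120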
Hmm... Another frustrating thing: how the heck did this happen?
https://sentry.prod.mozaws.net/operations/symbols-dev/issues/648668/

It basically means Django wrote a file to the EFS disk and, as soon as it considered itself done, kicked off the Celery task. And when the Celery task started, it complained that the file didn't exist.

Are we writing to disk in an async way? If so, it could mean that the time it takes to package up the Celery task and for the Celery task to start is actually *faster* than the time it takes EFS to *fully* write the file. An easy band-aid would be something like this::

 import logging
 import time

 logger = logging.getLogger('tecken')

 # `upload` and `buf` come from the surrounding view code.
 attempts = 1
 while True:
     try:
         with open(upload.inbox_filepath, 'rb') as f:
             buf.write(f.read())
         break
     except FileNotFoundError:
         # Give EFS a moment to make the freshly written file visible.
         if attempts >= 5:
             raise
         logger.debug(f'File not found. Trying again in {attempts} seconds')
         time.sleep(attempts)
         attempts += 1
EFS has fairly strong consistency guarantees. Copied from [0]:

"Amazon EFS provides the open-after-close consistency semantics that applications expect from NFS. Amazon EFS provides stronger consistency guarantees than open-after-close semantics depending on the access pattern. Applications that perform synchronous data access and perform non-appending writes will have read-after-write consistency for data access."

So, in theory we should be good here. The speeds aren't great, but it seems like EFS accrues burst throughput credits based on how much data is stored in the filesystem, whereas we're mostly using it as "swap" space.
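One way to stay on the "synchronous data access" side of that guarantee (a sketch, not necessarily what tecken should do) is to flush and fsync the inbox file before kicking off the Celery task::

 import os

 def write_inbox_file_durably(filepath, payload):
     """Write `payload` (bytes) and make sure it has actually hit the
     NFS/EFS mount before anyone is told the file exists."""
     with open(filepath, 'wb') as destination:
         destination.write(payload)
         destination.flush()
         os.fsync(destination.fileno())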

One possibility that I was considering was that there could have been extra tecken instances lying around that didn't have the EFS change but _did_ have Celery instances that could claim the task. However, upon looking into it I didn't see any extras. This also shouldn't be a problem going forward.

I _do_ think adding some kind of retry / brief delay logic as you mentioned above would be a good workaround.

[0] http://docs.aws.amazon.com/efs/latest/ug/using-fs.html#consistency
Flags: needinfo?(miles)
(In reply to Peter Bengtsson [:peterbe] from comment #4)
> So I've tested it on Dev. I upload a bunch of files and keep an eye on the
> logs looking for this line:
> https://github.com/mozilla-services/tecken/blob/
> 98580bee35c8c2b8fd9fde11d4b3f35da993c229/tecken/upload/views.py#L243
> 
> After analyzing the log files I find these results::
> 
>  MEAN 40.2MB/sec
>  MEDIAN 28.2MB/sec
> 
> It's not immediately obvious which statistic to use but let's focus on the
> median. It's only 40% faster than S3 uploads!
> Sigh. 
> 
> Also, at the time of writing I keep getting so many Gunicorn timeouts that I
> went ahead and changed the default Gunicorn timeout to 120 seconds
> https://github.com/mozilla-services/tecken/pull/400/files
> Basically, I've not yet been able to write a file bigger than 400MB.

Update after having successfully uploaded 120 files (no idea how many have failed, but it's a LOT more)::


 MEAN 26.4MB/sec
 MEDIAN 24.9MB/sec
In terms of code, this is pretty complete. It works, but it doesn't work well enough. 
Let's do https://bugzilla.mozilla.org/show_bug.cgi?id=1405073 instead.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED