Closed Bug 1399192 Opened 7 years ago Closed 7 years ago

Upload by using disk for inboxing archive files

Categories

(Socorro :: Symbols, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: peterbe)

Details

Attachments

(3 files)

At the moment, when we get a symbol upload, we look into the archive file to validate that it looks sane, then we upload it into S3 (aka the "inbox" file), record that, and hand the task off to the Celery worker. 

It turns out (https://bugzilla.mozilla.org/show_bug.cgi?id=1399140) that this S3 upload is potentially too slow for our architecture. Writing it to regular local disk is an absolute no, because our Celery workers don't share a file system with the web heads. (Well, they do in local Docker dev.)

With AWS EFS we can share a disk between the web heads and the Celery workers without incurring the cost of uploading the payload to S3 for temporary storage.
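Concretely, the flow we're after is roughly this (a sketch for illustration only, not the actual tecken code; the settings attribute and the Celery task name are assumptions)::

 import os

 from django.conf import settings

 # Hypothetical Celery task name, for illustration only.
 from tecken.upload.tasks import upload_inbox_upload


 def store_inbox_file(upload_obj, file_buffer):
     # Shared EFS mount, configured via DJANGO_UPLOAD_INBOX_DIRECTORY.
     directory = settings.UPLOAD_INBOX_DIRECTORY
     os.makedirs(directory, exist_ok=True)
     filepath = os.path.join(directory, f'{upload_obj.id}.zip')
     with open(filepath, 'wb') as destination:
         destination.write(file_buffer.read())
     # The web head and the Celery worker see the same EFS mount, so the
     # worker can open `filepath` directly instead of fetching an inbox
     # copy from S3.
     upload_inbox_upload.delay(upload_obj.id, filepath)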
Can you create an EFS config on Dev, Stage, and Prod? The env var is called DJANGO_UPLOAD_INBOX_DIRECTORY [0].
The directory doesn't have to exist ahead of time, as long as Django can 1) create it if it doesn't exist and 2) create a temp file in it if it does exist. 

The PR mentioned above assumes it is set. There's a Dockerflow health check that also verifies it can write to that directory. 
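For reference, a minimal sketch of what such a Dockerflow-style check could look like (the real check in the PR is called check_upload_inbox_directory; the settings name and the error id below are assumptions, not the actual code)::

 import os
 import tempfile

 from django.conf import settings
 from django.core import checks


 @checks.register()
 def check_upload_inbox_directory(app_configs, **kwargs):
     errors = []
     # Hypothetical settings name, derived from DJANGO_UPLOAD_INBOX_DIRECTORY.
     directory = settings.UPLOAD_INBOX_DIRECTORY
     try:
         # 1) create the directory if it doesn't exist...
         os.makedirs(directory, exist_ok=True)
         # 2) ...and make sure we can write a temp file in it.
         with tempfile.NamedTemporaryFile(dir=directory):
             pass
     except OSError as exception:
         errors.append(checks.Error(
             f'Cannot write to {directory!r}: {exception}',
             id='tecken.upload.E001',  # made-up id for illustration
         ))
     return errors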


[0] https://github.com/mozilla-services/tecken/pull/395/files#diff-eead337ed57b42e703fe8f51dc925c61R164
Assignee: nobody → peterbe
Flags: needinfo?(miles)
Commit pushed to master at https://github.com/mozilla-services/tecken

https://github.com/mozilla-services/tecken/commit/98580bee35c8c2b8fd9fde11d4b3f35da993c229
bug 1399192 - Upload by using disk for inboxing archive files (#395)

* bug 1399192 - Upload by using disk for inboxing archive files

* adding docs

* adding a dockerflow test check_upload_inbox_directory

* adding temp_upload_inbox_directory

* use tempfile
So I've tested it on Dev. I upload a bunch of files and keep an eye on the logs looking for this line: https://github.com/mozilla-services/tecken/blob/98580bee35c8c2b8fd9fde11d4b3f35da993c229/tecken/upload/views.py#L243

After analyzing the log files I find these results::

 MEAN 40.2MB/sec
 MEDIAN 28.2MB/sec

It's not immediately obvious which statistic to use but let's focus on the median. It's only 40% faster than S3 uploads!
Sigh. 
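(For what it's worth, the aggregation itself is trivial; something along these lines, where the log file name and the exact line format are assumptions)::

 import re
 import statistics

 speeds = []
 with open('upload.log') as logfile:  # hypothetical log file name
     for line in logfile:
         # Assumes the interesting lines contain something like "... 28.2MB/s".
         match = re.search(r'(\d+(?:\.\d+)?)MB/s', line)
         if match:
             speeds.append(float(match.group(1)))

 print(f'MEAN   {statistics.mean(speeds):.1f}MB/sec')
 print(f'MEDIAN {statistics.median(speeds):.1f}MB/sec')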

Also, at the time of writing I keep getting so many Gunicorn timeouts that I went ahead and changed the default Gunicorn timeout to 120 seconds https://github.com/mozilla-services/tecken/pull/400/files
Basically, I've not yet been able to write a file bigger than 400MB.
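(The timeout change itself boils down to something like this in a gunicorn config module; only the 120-second value comes from the PR, the other settings are made-up examples)::

 # Illustrative gunicorn.conf.py; gunicorn's default timeout is 30 seconds.
 bind = '0.0.0.0:8000'
 workers = 4
 timeout = 120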
Hmm... Another frustrating thing: how the heck did this happen?
https://sentry.prod.mozaws.net/operations/symbols-dev/issues/648668/

It basically means Django wrote a file to the EFS disk and, as soon as it considered itself done, kicked off the Celery task. And when the Celery task started, it complained that the file didn't exist.

Are we writing to disk in an async way? If so, it could mean that the time it takes to package up the Celery task and for the Celery task to start is actually *faster* than the time it takes EFS to *fully* write the file. An easy band-aid would be something like this::

 import logging
 import time

 logger = logging.getLogger('tecken')

 # `upload` and `buf` come from the surrounding view code.
 attempts = 1
 while True:
     try:
         with open(upload.inbox_filepath, 'rb') as f:
             buf.write(f.read())
         break
     except FileNotFoundError:
         # Give EFS a moment to make the freshly written file visible.
         if attempts >= 5:
             raise
         logger.debug(f'File not found. Trying again in {attempts} seconds')
         time.sleep(attempts)
         attempts += 1
EFS has fairly strong consistency guarantees. Copied from [0]:

"Amazon EFS provides the open-after-close consistency semantics that applications expect from NFS. Amazon EFS provides stronger consistency guarantees than open-after-close semantics depending on the access pattern. Applications that perform synchronous data access and perform non-appending writes will have read-after-write consistency for data access."

So, in theory we should be good here. The speeds aren't great, but it seems like EFS accrues burst throughput credits based on how much data is stored in the filesystem, whereas we're mostly using it as "swap" space.
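One way to stay on the "synchronous data access" side of that guarantee (a sketch, not necessarily what tecken should do) is to flush and fsync the inbox file before kicking off the Celery task::

 import os

 def write_inbox_file_durably(filepath, payload):
     """Write `payload` (bytes) and make sure it has actually hit the
     NFS/EFS mount before anyone is told the file exists."""
     with open(filepath, 'wb') as destination:
         destination.write(payload)
         destination.flush()
         os.fsync(destination.fileno())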

One possibility that I was considering was that there could have been extra tecken instances lying around that didn't have the EFS change but _did_ have Celery instances that could claim the task. However, upon looking into it I didn't see any extras. This also shouldn't be a problem going forward.

I _do_ think adding some kind of retry / brief delay logic as you mentioned above would be a good workaround.

[0] http://docs.aws.amazon.com/efs/latest/ug/using-fs.html#consistency
Flags: needinfo?(miles)
(In reply to Peter Bengtsson [:peterbe] from comment #4)
> So I've tested it on Dev. I upload a bunch of files and keep an eye on the
> logs looking for this line:
> https://github.com/mozilla-services/tecken/blob/
> 98580bee35c8c2b8fd9fde11d4b3f35da993c229/tecken/upload/views.py#L243
> 
> After analyzing the log files I find these results::
> 
>  MEAN 40.2MB/sec
>  MEDIAN 28.2MB/sec
> 
> It's not immediately obvious which statistic to use but let's focus on the
> median. It's only 40% faster than S3 uploads!
> Sigh. 
> 
> Also, at the time of writing I keep getting so many Gunicorn timeouts that I
> went ahead and changed the default Gunicorn timeout to 120 seconds
> https://github.com/mozilla-services/tecken/pull/400/files
> Basically, I've not yet been able to write a file bigger than 400MB.

Update after having successfully uploaded 120 files (no idea how many have failed, but it's a LOT more)::


 MEAN 26.4MB/sec
 MEDIAN 24.9MB/sec
In terms of code, this is pretty complete. It works, but it doesn't work well enough. 
Let's do https://bugzilla.mozilla.org/show_bug.cgi?id=1405073 instead.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED