Closed Bug 1624408 Opened 6 months ago Closed 3 months ago

Downloading TMX through wget/curl interrupts mid-way

Categories

(Webtools :: Pontoon, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: jotes)

Details

Attachments

(1 file)

When downloading TMX files from Pontoon, the download is interrupted mid-way when using wget (and possibly curl).

Doing so in the browser works.

Let's find out why.

When the command line is used to download the file, we seem to be hitting the Heroku router error H18:

Mar 23 19:44:22 mozilla-pontoon heroku/router sock=backend at=error code=H18 desc="Server Request Interrupted"
method=GET path="/de/all-projects/de.all-projects.tmx" host=pontoon.mozilla.org
request_id=9a3949f1-13d3-48dc-89a8-17542c64b94d fwd="109.182.195.45" dyno=web.1 connect=1ms service=52243ms
status=503 bytes= protocol=https

This is what the curl output looks like:

Leopold:Downloads mathjazz$ curl -o de.all-projects.tmx https://pontoon.mozilla.org/de/all-projects/de.all-projects.tmx
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.5M    0 12.5M    0     0   243k      0 --:--:--  0:00:52 --:--:--  205k
curl: (18) transfer closed with outstanding read data remaining

Downloading via the browser is much faster. I wonder if we're hitting the request timeout when using the command line.

Relevant code:
https://github.com/mozilla/pontoon/blob/master/pontoon/base/views.py#L683

Right, the files are truncated: they're neither valid TMX nor well-formed XML.

tail -n 3 de.all-projects.tmx*

==> de.all-projects.tmx <==
		<tu tuid="firefox-os-20:apps/sms/sms.properties:thread-header-textmany" srclang="en-US">
			<tuv xml:lang="en-US">
				<seg>{{name}} (+{{n}})</seg>


==> de.all-projects.tmx.1 <==
		<tu tuid="focus-for-android:app.po:preference_privacy_stealth_summaryhide-webpages-when-switching-apps-and-block-taking-screenshots" srclang="en-US">
			<tuv xml:lang="en-US">
				<seg>Hide webpages when 

==> de.all-projects.tmx.2 <==
			<tuv xml:lang="en-US">
				<seg>spans {{0}} columns</seg>
			</tu
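
A quick way to tell whether a downloaded file was cut off is to check that it parses as XML. The following is a minimal sketch using only the Python standard library; the helper name is_well_formed is made up for illustration, and the check covers XML well-formedness only, not validity against the TMX format.

```python
import xml.etree.ElementTree as ET


def is_well_formed(path):
    """Return True if the file at `path` parses as well-formed XML.

    A download that was interrupted mid-way ends inside an open element,
    so the parser raises ParseError before reaching the end of the file.
    """
    try:
        # iterparse streams the file, so a large TMX is not held in memory.
        for _event, element in ET.iterparse(path):
            element.clear()  # drop already-parsed content to save memory
    except ET.ParseError:
        return False
    return True
```

Running this over each downloaded file should flag all three of the truncated files shown above.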

The motivation for using wget or curl is to download the .tmx for all locales at once.

Priority: -- → P2

:mathjazz
Can I take this bug?

I'm trying to reproduce this problem locally, and my version of curl (7.65.3) downloads uncorrupted .tmx files. Maybe the recent update to Django 2 helped somehow (?).
The transfer speed is still slow in my case (Chrome is much faster).

Assigned.

I hit the same error:

Leopold:pontoon mathjazz$ curl -o de.all-projects.tmx https://pontoon.mozilla.org/de/all-projects/de.all-projects.tmx
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.3M    0 12.3M    0     0   241k      0 --:--:--  0:00:52 --:--:--  218k
curl: (18) transfer closed with outstanding read data remaining
Assignee: nobody → poke
Status: NEW → ASSIGNED

Hey,

I wrote a small Gist with my thoughts about this issue: https://gist.github.com/jotes/3bf97a2542153b2ad0dc24f1bffa6f59

The problem is caused by the transfer speed between the client and the server. When the transfer of a TMX file takes more than 50 seconds, Gunicorn terminates the worker that is streaming the file to the client.
By comparison, curl --compressed (which requests a gzip-compressed response) works fine, and the data is transferred in a few seconds.
I tried a couple of things to fix this issue (e.g. using an asynchronous Gunicorn worker), but most of them brought no visible improvement.
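
As a rough sanity check, the figures from the curl output earlier in this bug line up with that explanation. The sketch below assumes curl's "M" and "k" suffixes are 1024-based and takes the 50-second limit from the comment above; the function and constant names are made up for illustration.

```python
# Limit quoted in the comment above; not read from any real configuration.
WORKER_TIMEOUT_SECONDS = 50


def transfer_seconds(size_mib, speed_kib_per_s):
    """Estimate how long a transfer takes at a constant average speed."""
    return size_mib * 1024 / speed_kib_per_s


# 12.5M received at an average of 243k/s, as reported by curl.
elapsed = transfer_seconds(12.5, 243)
print(round(elapsed, 1))  # 52.7
print(elapsed > WORKER_TIMEOUT_SECONDS)  # True
```

The estimate of about 52 seconds matches both the time curl reports and the service=52243ms in the router log, so the transfer was cut off just past the 50-second mark.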

I think there are two solutions that are worth considering for now:

  • The low-hanging fruit: increase Gunicorn's worker timeout. Unfortunately, I don't know how big a change that is for Mozilla's Pontoon instance.
  • A harder one: introduce a CDN (AWS CloudFront/S3?) and periodically upload the TMX files there.

I've created a small PR that can help with assessing the first solution.

We've increased the timeout to 120 seconds:
https://github.com/mozilla/pontoon/pull/1643
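
For reference, Gunicorn can read its settings from a Python configuration file, and timeout is the setting that controls how long a worker may run before it is killed and restarted. The fragment below is only a sketch of what a 120-second timeout looks like in such a file (conventionally gunicorn.conf.py); it is an assumption that this matches how the linked PR applies the change, which may use the equivalent --timeout command-line option instead.

```python
# gunicorn.conf.py (illustrative fragment, not taken from the Pontoon repository)

# Seconds a worker may go without responding before Gunicorn kills and restarts it.
# 120 is the value mentioned in the comment above.
timeout = 120
```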

We've updated the docs to include the note about curl --compressed:
https://github.com/mozilla-l10n/localizer-documentation/pull/180
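
For anyone scripting the download for several locales, the effect of curl --compressed can be approximated in Python by sending an Accept-Encoding: gzip header and decompressing the response. This is a minimal sketch, not the documented method: the URL pattern is inferred from the de example in this bug, and the helper names tmx_url and download_tmx are made up for illustration.

```python
import gzip
import urllib.request

BASE_URL = "https://pontoon.mozilla.org"


def tmx_url(locale, base_url=BASE_URL):
    """Build the all-projects TMX URL for a locale (pattern assumed from the 'de' example)."""
    return f"{base_url}/{locale}/all-projects/{locale}.all-projects.tmx"


def download_tmx(locale, dest_path, timeout=120):
    """Download one locale's TMX with gzip requested; return the decompressed size in bytes."""
    request = urllib.request.Request(
        tmx_url(locale),
        headers={"Accept-Encoding": "gzip"},  # same idea as curl --compressed
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
        # urllib does not decompress automatically, so do it when the server used gzip.
        if response.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    with open(dest_path, "wb") as fh:
        fh.write(body)
    return len(body)
```

Calling download_tmx("de", "de.all-projects.tmx") in a loop over locale codes would then cover the "all locales at once" use case mentioned earlier.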

Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED