Closed Bug 894368 Opened 11 years ago Closed 11 years ago

lower final verify parallelization

Categories

(Release Engineering :: Release Automation: Other, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: pmoore)

References

Details

Attachments

(3 files, 2 obsolete files)

We keep getting 408 timeouts.
Attached patch: lower it (Splinter Review)
Attachment #776364 - Flags: review?(jhopkins)
Comment on attachment 776364 [details] [diff] [review]
lower it

r+ to the release/final-verification.sh portion of the patch
Attachment #776364 - Flags: review?(jhopkins) → review+
Comment on attachment 776364 [details] [diff] [review]
lower it

I landed the final-verification.sh part of this.
Attachment #776364 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I propose we halve the number of processes, to 48...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reduced further, from 96 to 48 parallel processes.

(I haven't marked this as a replacement for the previous patch, since the previous patch is already live, and it also contains other changes besides final-verification.sh)
Attachment #779151 - Flags: review?(jhopkins)
Attachment #779151 - Attachment is patch: true
Attachment #779151 - Attachment mime type: text/x-patch → text/plain
Assignee: bhearsum → pmoore
Just a note: this can be tested directly from any machine running Linux or OS X, even outside of the network (e.g. a laptop with an internet connection, no need to be on VPN), using:

./final-verification.sh mozBeta-firefox-linux.cfg mozBeta-firefox-linux64.cfg mozBeta-firefox-mac64.cfg mozBeta-firefox-win32.cfg

I haven't tested this on cygwin.
Attachment #779151 - Flags: review?(jhopkins) → review+
This additional patch handles the following situation:

Currently we are getting HTTP/408 errors, which we believe are due to capacity problems on the CDN. When several parallel requests are fired off to get the HTTP headers for the individual mar files, this can overload the servers; the front-end on the CDN side then returns an HTTP/408 when the backend does not respond with the headers in time (currently 20 seconds). This is a server-side timeout, implemented on the front-end of the CDN. The consequence is that curl receives the HTTP/408 and its associated headers, considers them a valid set of headers for the query, and returns exit code 0, indicating all is ok. It also does not use its internal retry mechanism (curl --retry), since that is not triggered for HTTP/4xx responses.

This patch catches this specific case of handling HTTP/408 responses, and since "curl --retry" does not retry, this patch provides a loop in bash for retrying.

In addition, I have tidied up the logging, so that if we do not get an HTTP 200 OK response for the mar file, the "FAILURE" entry in the log file will say so. Currently it reports HTTP/408 errors as "mar file incorrect size", because the Content-Length of the HTTP/408 response is 0 bytes, which it compares against the expected mar file size. Now, if there is a non-200 response, the FAILURE message will say so, so the logs should be less misleading.

Any questions, let me know! :)
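The retry approach described above can be sketched as a minimal simulation (this is not the actual patch: the hypothetical fetch_headers function stands in for the real curl invocation, returning HTTP/408 twice before succeeding, since curl itself exits 0 on a 408 and "curl --retry" does not cover HTTP/4xx):

```shell
# Simulation of the bash-side retry for HTTP/408 responses.
# fetch_headers is a hypothetical stand-in for the real curl call,
# which exits 0 on a 408 and is not retried by "curl --retry".
mar_headers_file="$(mktemp)"
calls=0
fetch_headers() {
    calls=$((calls + 1))
    if [ "${calls}" -lt 3 ]
    then
        # first two attempts: server-side timeout from the CDN front-end
        echo 'HTTP/1.1 408 Request Timeout' > "${mar_headers_file}"
    else
        echo 'HTTP/1.1 200 OK' > "${mar_headers_file}"
    fi
}

fetch_headers
attempts=1
# retry (up to ~50 attempts) while the response is still a 408
while [ "$((++attempts))" -lt 50 ] && grep -F 'HTTP/1.1 408 Request Timeout' "${mar_headers_file}" >/dev/null
do
    fetch_headers
done
echo "succeeded after ${calls} attempts"
```

Run with bash; the loop exits as soon as the headers no longer contain a 408, mirroring the structure of the patch under review.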
Attachment #783154 - Flags: review?(bhearsum)
Comment on attachment 783154 [details] [diff] [review]
Specifically handles HTTP/408 retries

John, you're more plugged into this than me. Can you review this?
Attachment #783154 - Flags: review?(bhearsum) → review?(jhopkins)
Attachment #783154 - Attachment is patch: true
Attachment #783154 - Attachment mime type: text/x-patch → text/plain
Reattaching with i) accidental debug output removed and ii) a more intelligent patch name.
Attachment #783154 - Attachment is obsolete: true
Attachment #783154 - Flags: review?(jhopkins)
Attachment #783161 - Flags: review?(jhopkins)
Comment on attachment 783161 [details] [diff] [review]
bug894368_http_408_bash_loop.patch

Shouldn't we be looking at the result of $mar_file_curl_exit_code before and during the retry loop?

When you attach an updated patch, can you please let me know what testing you've done?

A couple of nits:
(1)
+while [ "$((++attempts))" -lt 50 ] && grep -F 'HTTP/1.1 408 Request Timeout' "${mar_headers_file}" >/dev/null

This is mixing two styles of condition testing.  Personally I'd prefer to use either brackets or no brackets but not a mix of the two styles.  eg.

while test "$((++attempts))" -lt 50 && grep ...

(2)
On the same line, is -F really needed?
Attachment #783161 - Flags: review?(jhopkins) → review-
Hi John,

thanks for your comments! Please have a look to my remarks about them.

> "Shouldn't we be looking at the result of $mar_file_curl_exit_code before and during the retry loop?"

No. In the loop, $mar_file_curl_exit_code will always be 0, otherwise we would not get 'HTTP/1.1 408 Request Timeout'. An HTTP/408 response is treated as a success by curl. If a non-zero exit code occurs, 'HTTP/1.1 408 Request Timeout' will also not appear in the retrieved header file, so the loop will already exit. It is therefore redundant to also check the curl exit code.

> "Personally I'd prefer to use either brackets or no brackets but not a mix of the two styles."

I totally agree that it would be confusing to mix styles. However, I disagree that this is a mix of styles. Furthermore, I consider that changing to 'test' would itself introduce a mix of styles.

Explanation
===========

If I had written:

test <expr1> && [ <expr2> ] && test <expr3>

then I would have used both 'test ...' and '[ ... ]' in the same list. They are synonymous with each other, and to use both, would be a mix of styles.

However, I consistently use '[' for testing expressions, and do not mix it with 'test'. If I change it to 'test', I will then have the same command in two different forms ('test' and '[') in the same script (e.g. 'test' would clash with the usage of '[' on lines 24 and 26). They are the same bash built-in, and are synonymous.

The key point here is that a list is made up of pipelines, which *can* be expressions. grep is not an expression, it is a command, and therefore does not use '['.

Since you wrote that this is a 'personal' preference, I hope this won't block the patch, as my preference is to leave it as it is. :)

> On the same line, is -F really needed?

In its current form, yes; because '.' means "match any character", rather than a literal dot. An alternative construction would be to escape the '.' in the string.

Thank you for your patience in reading this!
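To illustrate the point about -F, here is a small sketch (with a hypothetical bogus input line): without -F, the '.' in the pattern is a regex metacharacter matching any single character, so a string like 'HTTP/1x1' would match too.

```shell
# Without -F, '.' matches any character, so 'HTTP/1x1' slips through:
echo 'HTTP/1x1 408 Request Timeout' | grep 'HTTP/1.1 408 Request Timeout' >/dev/null \
    && echo "regex pattern matches the bogus line"
# With -F the pattern is a fixed string, and the bogus line does not match:
echo 'HTTP/1x1 408 Request Timeout' | grep -F 'HTTP/1.1 408 Request Timeout' >/dev/null \
    || echo "fixed-string pattern does not"
```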
Hi John,

Ideally I'd like to get the patch landed before the next beta - how shall we proceed?

Pete
Comment on attachment 783161 [details] [diff] [review]
bug894368_http_408_bash_loop.patch

requesting a second opinion from Hal
Attachment #783161 - Flags: review- → review?(hwine)
Comment on attachment 783161 [details] [diff] [review]
bug894368_http_408_bash_loop.patch

Review of attachment 783161 [details] [diff] [review]:
-----------------------------------------------------------------

A couple of nits, but nothing that important. Okay to commit as is from me.

r+

::: release/test-mar-url.sh
@@ +7,5 @@
>  mar_file_curl_exit_code=$?
>  
> +# Bug 894368 - HTTP 408's are not handled by the "curl --retry" mechanism; in this case retry in bash
> +attempts=1
> +while [ "$((++attempts))" -lt 50 ] && grep -F 'HTTP/1.1 408 Request Timeout' "${mar_headers_file}" >/dev/null

nit: most everything else redirects stderr to the same place as stdout. That would be '&>/dev/null' here.

@@ +18,5 @@
>  # check file size matches what was written in update.xml
>  # strip out dos line returns from header if they occur
>  mar_actual_size="$(sed -e "s/$(printf '\r')//" -n -e 's/^Content-Length: //p' "${mar_headers_file}" | tail -1)"
>  mar_actual_url="$(sed -e "s/$(printf '\r')//" -n -e 's/^Location: //p' "${mar_headers_file}" | tail -1)"
> +http_response_code="$(sed -e "s/$(printf '\r')//" -n -e '/^HTTP\//p' "${mar_headers_file}" | tail -1)"

nit: our version of 'sed' understands '\r' directly (as part of POSIX), so the printf isn't needed (and adds visual complexity). (I know final-verification.sh does that as well, so this is more of an fyi.)

style: even though it "costs" an extra process, I'd be inclined to use grep instead of the 2nd sed expression. It took me a bit to understand it wasn't changing the line, just printing verbatim.

@@ +30,4 @@
>  then
>      echo "$(date):  FAILURE: Could not retrieve http header for mar file from ${mar_url}" > "$(mktemp -t log.XXXXXXXXXX)"
>      echo "NO_MAR_FILE ${mar_url} ${mar_headers_file} ${mar_file_curl_exit_code} ${mar_actual_url}" > "$(mktemp -t failure.XXXXXXXXXX)"
> +elif [ -n "${http_response_code}" ] && ! echo "${http_response_code}" | grep '200 OK' >/dev/null

A bash idiom for the echo ... pipe is:
 "$var" == "${var/200 OK/}"
which is true if '200 OK' is NOT in the original variable. That would leave the entire line as one test:
  [ "${http_response_code}" == "${http_response_code/200 OK/}" ]

Not all folks like the bash ${} extensions like '/' (as it inverts the logic), but it's a big reduction in special characters in this case, imo.
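A quick sketch of the suggested idiom (with a hypothetical value): "${var/200 OK/}" deletes the first occurrence of '200 OK', so the variable equals its own substitution exactly when '200 OK' is absent.

```shell
# "${http_response_code/200 OK/}" removes the first '200 OK', if any;
# equality with the original therefore means '200 OK' was NOT present.
http_response_code='HTTP/1.1 404 Not Found'   # hypothetical value
if [ "${http_response_code}" == "${http_response_code/200 OK/}" ]
then
    echo "not a 200 OK response"
fi
```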
Attachment #783161 - Flags: review?(hwine) → review+
(In reply to Pete Moore [:pete][:pmoore] from comment #13)
> > "Shouldn't we be looking at the result of $mar_file_curl_exit_code before and during the retry loop?"
> 
> No. In the loop, $mar_file_curl_exit_code will always be 0, otherwise we
> would not get 'HTTP/1.1 408 Request Timeout'. A HTTP/408 response is treated
> as a success by curl. If a non zero exit code occurs, 'HTTP/1.1 408 Request
> Timeout' will also not be a header in the header file retrieved, and
> therefore the loop will already exit. Therefore it is redundant to also
> check the curl exit code.

The corner case, as I see it, is only receiving a partial payload after good (HTTP 200) headers, perhaps due to a local system error (curl exit code 23). However, there are other tests for that (content length), and the exit code is printed in any failure situation.

In other words, I think we're okay here with this addition. It would overly complicate the code to test for "allowable exit codes" at all the call points. The goal here is to retry until success, not fail on first attempt. If we really need that level of testing, we probably would benefit from a language that supports exceptions.

(If anything, adding a final test for $mar_file_curl_exit_code -eq 0 as a prerequisite for success might be considered.)
Comment on attachment 783161 [details] [diff] [review]
bug894368_http_408_bash_loop.patch

Review of attachment 783161 [details] [diff] [review]:
-----------------------------------------------------------------

Hi Hal, thanks for your review! I have some responses inline, can you take a look?

Many thanks for the very cool tips!

Pete

::: release/test-mar-url.sh
@@ +7,5 @@
>  mar_file_curl_exit_code=$?
>  
> +# Bug 894368 - HTTP 408's are not handled by the "curl --retry" mechanism; in this case retry in bash
> +attempts=1
> +while [ "$((++attempts))" -lt 50 ] && grep -F 'HTTP/1.1 408 Request Timeout' "${mar_headers_file}" >/dev/null

Sure, I'll change this.

@@ +18,5 @@
>  # check file size matches what was written in update.xml
>  # strip out dos line returns from header if they occur
>  mar_actual_size="$(sed -e "s/$(printf '\r')//" -n -e 's/^Content-Length: //p' "${mar_headers_file}" | tail -1)"
>  mar_actual_url="$(sed -e "s/$(printf '\r')//" -n -e 's/^Location: //p' "${mar_headers_file}" | tail -1)"
> +http_response_code="$(sed -e "s/$(printf '\r')//" -n -e '/^HTTP\//p' "${mar_headers_file}" | tail -1)"

re: the "nit" - I think I did this for Darwin sed compatibility, will double check if this is the case, or if Darwin handles this too, as part of POSIX (which I guess it should). Foopies are running CentOS, but I thought it was useful if it can still be run e.g. from Darwin since so many of us are using macs (and does not even need to be run from inside mozilla network).

re: the "style", it is actually changing the line, stripping out the words "Content-Length: " or "Location: " from the beginning of the line, to leave just the value - so I think I can't switch this to grep. unless I also added e.g. a "grep" and a "cut" or something similar. Let me know what you think.

@@ +30,4 @@
>  then
>      echo "$(date):  FAILURE: Could not retrieve http header for mar file from ${mar_url}" > "$(mktemp -t log.XXXXXXXXXX)"
>      echo "NO_MAR_FILE ${mar_url} ${mar_headers_file} ${mar_file_curl_exit_code} ${mar_actual_url}" > "$(mktemp -t failure.XXXXXXXXXX)"
> +elif [ -n "${http_response_code}" ] && ! echo "${http_response_code}" | grep '200 OK' >/dev/null

I love the idiom, and will keep it. But I believe I still need the [ -n "${http_response_code}" ] part, since an empty string would still match in the [ "${http_response_code}" == "${http_response_code/200 OK/}" ] part.

so the end result would be:

elif [ -n "${http_response_code}" ] && [ "${http_response_code}" == "${http_response_code/200 OK/}" ]

Do you agree Hal?
Flags: needinfo?(hwine)
Comment on attachment 783161 [details] [diff] [review]
bug894368_http_408_bash_loop.patch

Review of attachment 783161 [details] [diff] [review]:
-----------------------------------------------------------------

::: release/test-mar-url.sh
@@ +18,5 @@
>  # check file size matches what was written in update.xml
>  # strip out dos line returns from header if they occur
>  mar_actual_size="$(sed -e "s/$(printf '\r')//" -n -e 's/^Content-Length: //p' "${mar_headers_file}" | tail -1)"
>  mar_actual_url="$(sed -e "s/$(printf '\r')//" -n -e 's/^Location: //p' "${mar_headers_file}" | tail -1)"
> +http_response_code="$(sed -e "s/$(printf '\r')//" -n -e '/^HTTP\//p' "${mar_headers_file}" | tail -1)"

re the nit: It's a nit, so I don't mind either way. However, if we're setting a goal of running on darwin, then I think we need some sort of unit tests to ensure it always runs on darwin. (I've given up on that goal personally -- darwin is too different, imo.)

re the style: style is opinion - I don't mind either way. And while you are correct for the first two headers, the HTTP header is passed unmodified. (Which is inconsistent from the first two headers.) It'd be splendid if either all were consistent, or the last was very visibly different (eg by using grep).

@@ +30,4 @@
>  then
>      echo "$(date):  FAILURE: Could not retrieve http header for mar file from ${mar_url}" > "$(mktemp -t log.XXXXXXXXXX)"
>      echo "NO_MAR_FILE ${mar_url} ${mar_headers_file} ${mar_file_curl_exit_code} ${mar_actual_url}" > "$(mktemp -t failure.XXXXXXXXXX)"
> +elif [ -n "${http_response_code}" ] && ! echo "${http_response_code}" | grep '200 OK' >/dev/null

I don't agree -- this is why many folks frown on this idiom. :/

    $ v=""
    $ test "$v" == "${v/200 OK/}" && echo true || echo false
    true
    $ v="200 OK"
    $ test "$v" == "${v/200 OK/}" && echo true || echo false
    false

The logic is reversed from the way it reads in English, but you don't need to special-case the empty string.
Adding pointer to the IT approach to dealing with the 408 issue is bug 899607 for future reference.

(This also gives me a chance to clear the needs info that the splinter review process doesn't reset automatically.)
Flags: needinfo?(hwine)
See Also: → 899607
So I've incorporated most of the suggestions into the newer patch. I discussed with Hal in IRC, and we agreed that the [ -n "${http_response_code}" ] was still valid, so that has been left in there too.

Where there were points of confusion, I have done my best to add comments explaining the choices made, as this was not always clear (e.g. explaining when something was done for Darwin compatibility, or that the sed acts like a grep but has less overhead since sed is already running, etc.).

I believe all open points should be now closed, but assigning to hwine and jhopkins for a final review, in case I missed anything.

Thanks guys for your support and help with this patch!
Attachment #783161 - Attachment is obsolete: true
Attachment #787586 - Flags: review?(jhopkins)
Attachment #787586 - Flags: review?(hwine)
Attachment #787586 - Flags: review?(jhopkins) → review+
Comment on attachment 787586 [details] [diff] [review]
updated bash loop for managing http/408 explicitly

Review of attachment 787586 [details] [diff] [review]:
-----------------------------------------------------------------

comments help a lot! thanks!
Attachment #787586 - Flags: review?(hwine) → review+
Afterthought - maybe we should log when we get a 408 error and retry, e.g. to help with https://bugzilla.mozilla.org/show_bug.cgi?id=899607. At the moment we only report it if, after 50 attempts, it is still not successful - so we won't see 408 errors that were "overcome" with retries.
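One way this afterthought could be implemented is sketched below (hypothetical variable names, modelled on the retry loop in test-mar-url.sh; the initial 408 is simulated here rather than fetched): a dated INFO line per retry would make transient 408s visible in the logs even when a later attempt succeeds.

```shell
# Sketch: log each 408 before retrying, so "overcome" timeouts are visible.
mar_url="https://example.com/firefox.mar"       # hypothetical URL
mar_headers_file="$(mktemp)"
retry_log="$(mktemp -t log.XXXXXXXXXX)"
# simulate a first attempt that hit the CDN timeout
echo 'HTTP/1.1 408 Request Timeout' > "${mar_headers_file}"
attempts=1
while [ "$((++attempts))" -lt 50 ] && grep -F 'HTTP/1.1 408 Request Timeout' "${mar_headers_file}" >/dev/null
do
    echo "$(date):  INFO: HTTP/408 for ${mar_url}, retrying (attempt ${attempts})" >> "${retry_log}"
    # ... re-run the curl command here; simulate success on the retry:
    echo 'HTTP/1.1 200 OK' > "${mar_headers_file}"
done
```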
Product: mozilla.org → Release Engineering
Closing this now, as we haven't had any problems in the last 3 weeks since the patch was applied. Feel free to re-open if problems occur again.

thanks,
Pete
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED