incorrect http/200 responses from jsonrpc.cgi






(Reporter: glob, Assigned: dylan)






(1 attachment)

when the following request is sent to bmo:

Content-Length: 2
Content-Type: application/json


bugzilla returns http/500.  this isn't ideal, but it's currently expected.

however, bugzilla intermittently returns an http/200 response instead:

HTTP/1.1 200 OK
Connection: close
Date: Fri, 11 Dec 2015 13:32:18 GMT
Server: Apache
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Client-Date: Mon, 07 Dec 2015 17:42:40 GMT
Keep-Alive: timeout=5, max=1000

it's wildly variable when this happens - sometimes i'll get an http/200 after a handful of requests, sometimes it takes over a hundred.  i've seen the http/200 response from all of the backend servers.

this is very weird, and slightly concerning.  note the content-length of these invalid "ok" responses is always zero bytes.

significantly more concerning: occasionally (perhaps one in five of the http/200 responses) we see a set-cookie header:

HTTP/1.1 200 OK
Connection: close
Date: Fri, 11 Dec 2015 13:39:58 GMT
Server: Apache
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Keep-Alive: timeout=5, max=1000
Set-Cookie: BUGLIST=448493-510945-661739-667865-705617-839470-923774-1043837-1093462-1101296-1103323-1142139-1183663-1190333-890504-250667-315415-369955-531199-563881-565458-578828-676975-682026-697948-796848-895511-958564-969787-981574-995953-996754-1003476-1087418-1123986-1123987-1125285-1126996-1131187-1163631-1167896-1198975-1202496-1211270-1215800-391684-751274-559296-873055-968065; path=/; expires=Fri, 01-Jan-2038 00:00:00 GMT; secure

there's no way this cookie should be returned as a result of the request - it's generated by buglist.cgi and stores the search result.

i've only ever seen the BUGLIST cookie returned; i've never seen a session or any other cookie.  however, i'm still concerned that something weird is going on.

i've tried to replicate this issue without success under the following conditions:
- local dev environment (both mod_perl and mod_cgi)
- (bmo dev) and (bmo stage)
- on a production webhead taken out of the zeus pool, requests generated from localhost

so this appears to be something triggered by concurrent requests, or zeus.

i'm not sure what's going on, and will need some lower-level assistance to figure out if apache/mod_perl/bugzilla is generating the invalid responses, or if it's something that zeus is doing.

:atoll indicated that packet sniffing magic may shed some light on the origin of the response.

as i'm taking pto after mozlando, :dylan and :fubar are the points of contact for this issue after today.
Assignee: nobody → dylan
Did some digging here, and I'm able to replicate this direct to one of the web nodes, without Zeus:

curl -v -H 'Host:' -d '{}' -H 'Content-Type: application/json' -o /dev/null

Run that a few times and eventually I was able to get some 200s out of it. It did not take a whole lot - fewer than 20 requests for me. Given what I'm seeing, I do suspect that parallelism is relevant, so peak traffic times are probably more susceptible to this happening.

I won't paste the output here because it contains seemingly arbitrary JSON data from other users' queries. Here's an example with the variable data cut out and replaced with "arbitrary bugzilla json data here".

> POST /jsonrpc.cgi HTTP/1.1
> Host:
> User-Agent: curl/7.43.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 2
} [2 bytes data]
* upload completely sent off: 2 out of 2 bytes
< HTTP/1.1 200 OK
< Date: Wed, 06 Jan 2016 17:55:32 GMT
< Server: Apache
< malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)")arbitrary bugzilla json data here">
< X-Backend-Server:
< Content-Length: 0
< Content-Type: text/plain; charset=UTF-8

Note how this is in the *header* block, suggesting that something was trying to generate some sort of header and failed, reading the wrong query's data and generating an error instead of a valid header. The body of the response is indeed empty, hence Content-Length: 0.
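The mechanism can be sketched like this (illustrative only, not BMO's actual code): under CGI, everything a script prints before the first blank line is parsed by Apache as response headers, so stray error text emitted there produces exactly the "malformed header from script" symptom.

```perl
#!/usr/bin/perl
# Illustrative sketch of the failure mode. Apache parses every line a CGI
# prints before the first blank line as a header; a line that is not of
# the "Name: value" form is logged as "malformed header from script".
print "malformed JSON string, neither array, object, number, string or atom\n"; # stray error text, lands in the header block
print "Content-Type: text/plain; charset=UTF-8\n\n";  # the real header arrives too late
print "body\n";
```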

I see at least 2 problems here:

1) Getting data from other queries. This is the part that I think is caused by parallelism. Thread safety or shared storage space (memcache, database) perhaps.

2) Whatever is trying to generate the header that goes there is not doing enough error checking and returning something completely bonkers.

> "According to section 14.13 of the HTTP specification, the Content-Length header is the number of octets (8-bit bytes) in the 
> body of a message. If the length can be determined prior to sending, it can be very useful to include it. The most important 
> reason is that KeepAlive requests (when the same connection is used to fetch more than one object from the web server) work 
> only with responses that contain a Content-Length header"

It so happens that for BMO we:
- have KeepAlive turned on
- never set a Content-Length header (in fact, from perl most of the time we don't know it)

As Jake was seeing body-content in the headers... this makes me think that might be the problem.

gozer: The documentation says "Keep Alives work only with responses that contain a Content-Length header",
what do they mean by "Work"? I've looked at the code, and as far as I can tell there is nothing stopping you from not sending a content-length *and* the connection being kept open by the KeepAlive setting...
Flags: needinfo?(gozer)
Note that it's possible to disable keepalives to the backend nodes, at the cost of performance.
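For reference, a minimal sketch of what that would look like in the webheads' Apache config (the actual file and context on the BMO nodes may differ):

```apache
# httpd.conf (sketch) - stop holding backend connections open between requests
KeepAlive Off
```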
KeepAlive is not the issue at hand, because we see this even running curl directly against the webheads. My mistake.
After talking with gozer I have identified several other possibilities. Working through those now.
Flags: needinfo?(gozer)
Here's what I know so far:

When this happens, we see this in the logs:

Jan  8 17:00:02 apache[10370]: [Fri Jan 08 09:00:02 2016] [error] [client] malformed header from script. Bad header=malformed JSON string, neither: jsonrpc.cgi

We see other malformed headers in the log:

[client] malformed header from script. Bad header=<!DOCTYPE html PUBLIC "-//W3C/: buglist.cgi, referer:

That is why we sometimes see the buglist header.

Something could be happening in XS land, or it could be something happening in a $SIG{__DIE__} handler. I'm narrowing it down.

With a kept-alive socket, garbage printed after a response sits in the connection until the next request reads it. So turning off keep-alive would alleviate this slightly (but obviously this is still a problem! jsonrpc should not be omitting headers!)
I'm thinking this is server push + keepalives.

We should disable keep-alives. If that works, we should either call close() for server pushes, force the process that does a server push to exit, or remove the server push feature. I strongly endorse the last option.
dylan - great work; that code does indeed look suspect.

fubar - can you disable keepalives in the dev and stage configs?  we'll do a quick sanity check to make sure nothing broke, and then we can disable it on production.
Flags: needinfo?(klibby)
ok, dev and staging both have KeepAlive disabled.
Flags: needinfo?(klibby)
thanks fubar - things look good to me.  please flip the switch on prod.
Flags: needinfo?(klibby)
(In reply to Byron Jones ‹:glob› from comment #9)
> thanks fubar - things look good to me.  please flip the switch on prod.

Flags: needinfo?(klibby)
At least in part, the problem is that sentry adds a __DIE__ handler which calls ThrowTemplateError(). That function expects the http headers to already be sent! Examining the call sites, I concluded this was always the case (except in our __DIE__ handler). This patch fixes that call to ThrowTemplateError().

Meanwhile, we eventually need to stop using a __DIE__ handler. It would be better to wrap the work in an eval { } block, and only install the __DIE__ handler on dev machines.
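A sketch of the eval-based approach suggested above; the sub names (handle_request, report_to_sentry, respond_with_error) are hypothetical stand-ins, not BMO's actual functions:

```perl
# Stand-in stubs so the sketch is self-contained; in BMO these would be
# the real request handler and error-reporting paths.
sub handle_request     { die "simulated template error\n" }
sub report_to_sentry   { print STDERR "sentry: $_[0]" }
sub respond_with_error { print "Status: 500\n\n$_[0]" }

# Trap errors explicitly at the call site instead of via a global
# $SIG{__DIE__} handler, so trapped exceptions elsewhere are unaffected.
my $ok = eval { handle_request(); 1 };
unless ($ok) {
    my $error = $@ || 'unknown failure';
    report_to_sentry($error);     # explicit reporting at the catch site
    respond_with_error($error);   # headers are sent here, exactly once
}
```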

Note that bz-dev-manager adds its own __DIE__ handler -- you will need to revert that change to test this patch.

For testing, I used http://bugzilla.local/bmo/buglist.cgi?bugidtype=include&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=XML&keywords=css1%3C%00ScRiPt%20%0D%0A%3Eprompt%28964655%29%3C%2FScRiPt%3E&keywords_type=anywords&list_id=396

It remains to be seen if this is the *only* problem we're encountering with the random intermixed HTTP responses. We may need to call CORE::GLOBAL::exit() when a fatal error happens, because mod_perl2 gets into an inconsistent state.
Attachment #8707628 - Flags: review?(dkl)
Comment on attachment 8707628 [details] [diff] [review]

Review of attachment 8707628 [details] [diff] [review]:


::: Bugzilla/
@@ +336,5 @@
> +            my $cgi = Bugzilla->cgi;
> +            $cgi->close_standby_message('text/html', 'inline', 'error', 'html');
> +            Bugzilla::Error::ThrowTemplateError($message);
> +            print $cgi->multipart_final() if $cgi->{_multipart_in_progress};
> +

nit: remove extra newline
Attachment #8707628 - Flags: review?(dkl) → review+
So this will have unintended consequences because of the die handler we use to send stuff to sentry.

To explain how __DIE__ handlers work, see this code:
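The original snippet wasn't preserved here; this is a minimal reconstruction consistent with the output quoted below:

```perl
# Both handlers are global: $SIG{__DIE__} fires even when the die happens
# inside an eval block, before the eval traps the exception.
local $SIG{__DIE__}  = sub { print "die handler: $_[0]" };
local $SIG{__WARN__} = sub { print "warn handler: $_[0]" };

eval { die "error message" };   # the __DIE__ handler runs first...
warn $@;                        # ...then the trapped error is re-warned
```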

It outputs this:
die handler: error message at line 12.
warn handler: error message at line 12.

What does this mean?

Even with my fix, any code that does this:

eval { operation_that_may_die };
if ($@) {
    # ...
}

will first output an HTTP error page, and then continue on from the if block!

I'm going to examine the code to see if this happens anywhere.

Note this could even happen in code we don't control...
Okay, thankfully _in_eval() guards against this - I hadn't realized that an eval { } block adds an "(eval)" frame to the call stack.
Note this means exception handling is very slow (because we walk the stack for every exception).
Do we need a separate bug for "stop walking the stack for every exception", to track that perf issue?
I don't have proof that it is very slow, other than remembering that you're not really encouraged to use caller()... I think I'd like to land this patch and see what happens.
Aha! We need to use $^S instead of _in_eval().
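A quick sketch of why $^S works for this (standard Perl behavior, not BMO-specific code): $^S is true while code is executing inside an eval, so the handler can tell trapped exceptions from genuinely fatal ones without walking the stack.

```perl
# $^S is true inside eval { } or string eval, so a __DIE__ handler can
# distinguish trapped exceptions from genuinely fatal errors without
# calling caller() in a loop.
local $SIG{__DIE__} = sub {
    return if $^S;                # an eval will catch this; stand down
    print STDERR "fatal: $_[0]";  # nothing will catch this one; report it
};

eval { die "trapped\n" };         # handler returns early; $@ is set as usual
print "caught: $@";               # prints "caught: trapped"
```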
I thought maybe $^S wasn't a read-only SV, but it is, so there's no fixing that under mod_perl.
Component: Infrastructure → General
QA Contact: mcote
To ssh://
   ace6728..15a6c7e  master -> master
Closed: 4 years ago
Resolution: --- → FIXED
Did we re-enable keepalives as part of shipping the fix in comment 19, or does this fix require *both* the comment 19 fix *and* keepalives off, in tandem?
keepalives have NOT been re-enabled, but I don't know if that is still required.
Flags: needinfo?(dylan)
Resolution: FIXED → ---
I'm fairly certain we can re-enable them. 98% sure.
Flags: needinfo?(dylan)
re-enabled keepalives on dev and stage. please test and let me know if I can re-enable on prod.
Looks good. We never were able to trigger this on dev or stage though.
ok, re-enabled on prod.
This seems to be resolved.
Closed: 4 years ago
Resolution: --- → FIXED
Group: bmo-infra
Group: infra
Group: bmo-infra