incorrect http/200 responses from jsonrpc.cgi

Status: RESOLVED FIXED
Product: bugzilla.mozilla.org
Component: General
Priority: --
Severity: major
Opened: 2 years ago
Last modified: 2 months ago

People: (Reporter: glob, Assigned: dylan)
Tracking: Production
Attachments: (1 attachment)

(Reporter)

Description

2 years ago
when the following request is sent to bmo:

--
POST https://bugzilla.mozilla.org/jsonrpc.cgi
Content-Length: 2
Content-Type: application/json

{}
--

bugzilla returns http/500.  this isn't ideal but it's currently expected.


however bugzilla.mozilla.org intermittently returns a http/200 response:

--
HTTP/1.1 200 OK
Connection: close
Date: Fri, 11 Dec 2015 13:32:18 GMT
Server: Apache
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Client-Date: Mon, 07 Dec 2015 17:42:40 GMT
Keep-Alive: timeout=5, max=1000
X-Backend-Server: web5.bugs.scl3.mozilla.com
--

it's wildly variable when this may happen - sometimes i'll get a http/200 after a handful of requests, sometimes it's over a hundred.  i've seen the http/200 response from all of the backend servers.

this is very weird, and slightly concerning.  note the content-length of these invalid "ok" responses is always zero bytes.

significantly more concerning is occasionally (perhaps one in five of the http/200 responses) we see a set-cookie header:

--
HTTP/1.1 200 OK
Connection: close
Date: Fri, 11 Dec 2015 13:39:58 GMT
Server: Apache
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Keep-Alive: timeout=5, max=1000
Set-Cookie: BUGLIST=448493-510945-661739-667865-705617-839470-923774-1043837-1093462-1101296-1103323-1142139-1183663-1190333-890504-250667-315415-369955-531199-563881-565458-578828-676975-682026-697948-796848-895511-958564-969787-981574-995953-996754-1003476-1087418-1123986-1123987-1125285-1126996-1131187-1163631-1167896-1198975-1202496-1211270-1215800-391684-751274-559296-873055-968065; path=/; expires=Fri, 01-Jan-2038 00:00:00 GMT; secure
X-Backend-Server: web5.bugs.scl3.mozilla.com
--

there's no way this cookie should be returned as a result of the request - it's generated by buglist.cgi and stores the search result.

i've only ever seen the BUGLIST cookie returned; i've never seen a session or any other cookie.  however i'm still concerned that something weird is going on.


i've tried to replicate this issue without success under the following conditions:
- local dev environment (both mod_perl and mod_cgi)
- bugzilla-dev.allizom.org (bmo dev) and bugzilla.allizom.org (bmo stage)
- on a production webhead taken out of the zeus pool, requests generated from localhost

so this appears to be something triggered by concurrent requests, or zeus.


i'm not sure what's going on, and will need some lower-level assistance to figure out if apache/mod_perl/bugzilla is generating the invalid responses, or if it's something that zeus is doing.

:atoll indicated that packet sniffing magic may shed some light on the origin of the response.

as i'm taking pto after mozlando, :dylan and :fubar are the points of contact for this issue after today.
(Assignee)

Updated

2 years ago
Assignee: nobody → dylan

Comment 1

2 years ago
Did some digging here, and I'm able to replicate this direct to one of the web nodes, without Zeus:

curl -v -H 'Host: bugzilla.mozilla.org' http://10.22.82.29:443/jsonrpc.cgi -d '{}' -H 'Content-Type: application/json' -o /dev/null

Run that a few times and eventually I was able to get some 200s out of it. It did not take a whole lot of tries, fewer than 20 for me. Given what I'm seeing, I do suspect that parallelism is relevant, so peak traffic times are probably more susceptible to this happening.
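A small wrapper makes the repro less tedious (a sketch; `hammer` is a name I made up, and the URL and count in the example are placeholders):

```shell
# hammer: POST the empty JSON object N times and tally the HTTP
# status codes seen, so intermittent 200s stand out at a glance.
hammer() {
    url=$1
    n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        curl -s -o /dev/null -w '%{http_code}\n' \
            -H 'Content-Type: application/json' -d '{}' "$url"
        i=$((i + 1))
    done | sort | uniq -c
}

# e.g. hammer 'https://bugzilla.mozilla.org/jsonrpc.cgi' 50
# would be expected to print mostly 500s, with the occasional bogus 200
```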

I won't paste the output here because it contains seemingly arbitrary JSON data from other users' queries. Here's an example with the variable data cut out and replaced with "arbitrary bugzilla json data here".

> POST /jsonrpc.cgi HTTP/1.1
> Host: bugzilla.mozilla.org
> User-Agent: curl/7.43.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 2
>
} [2 bytes data]
* upload completely sent off: 2 out of 2 bytes
< HTTP/1.1 200 OK
< Date: Wed, 06 Jan 2016 17:55:32 GMT
< Server: Apache
< malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)")arbitrary bugzilla json data here">
< X-Backend-Server: web4.bugs.scl3.mozilla.com
< Content-Length: 0
< Content-Type: text/plain; charset=UTF-8

Note how this is in the *header* block, suggesting that something was trying to generate some sort of header and failed, reading the wrong query's data and generating an error instead of a valid header. The body of the response is indeed empty, hence Content-Length: 0.


I see at least 2 problems here:

1) Getting data from other queries. This is the part that I think is caused by parallelism. Thread safety or shared storage space (memcache, database) perhaps.

2) Whatever is trying to generate the header that goes there is not doing enough error checking and returning something completely bonkers.
(Assignee)

Comment 2

2 years ago
http://modperlbook.org/html/16-2-2-Content-Length-Header.html

> "According to section 14.13 of the HTTP specification, the Content-Length header is the number of octets (8-bit bytes) in the 
> body of a message. If the length can be determined prior to sending, it can be very useful to include it. The most important 
> reason is that KeepAlive requests (when the same connection is used to fetch more than one object from the web server) work 
> only with responses that contain a Content-Length header"

It so happens that for BMO we:
- have KeepAlive turned on
- and we never ever set a Content-Length header (in fact, from perl most of the time we don't know it)

As Jake was seeing body-content in the headers... this makes me think that might be the problem.

gozer: The documentation says "Keep Alives work only with responses that contain a Content-Length header",
what do they mean by "Work"? I've looked at the code, and as far as I can tell there is nothing stopping you from not sending a content-length *and* the connection being kept open by the KeepAlive setting...
Flags: needinfo?(gozer)
Note that it's possible to disable keepalives to the backend nodes, at the cost of performance.
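For reference, that is a stock httpd knob (a sketch; where it goes depends on the backend vhost config, which isn't shown here):

```apache
# Disable HTTP keep-alive on the backend entirely:
KeepAlive Off

# Or keep it enabled but bound connection reuse instead:
# MaxKeepAliveRequests 100
# KeepAliveTimeout 5
```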
(Assignee)

Comment 4

2 years ago
KeepAlive is not the issue at hand -- because we see this even running curl against the webheads. My mistake.
After talking with gozer I have identified several other possibilities. Working through those now.
Flags: needinfo?(gozer)
(Assignee)

Comment 5

2 years ago
Here's what I know so far:

When this happens, we see this in the logs:

Jan  8 17:00:02 web2.bugs.scl3.mozilla.com apache[10370]: [Fri Jan 08 09:00:02 2016] [error] [client 96.58.191.108] malformed header from script. Bad header=malformed JSON string, neither: jsonrpc.cgi

We see other malformed headers in the log:

[client 202.67.42.34] malformed header from script. Bad header=<!DOCTYPE html PUBLIC "-//W3C/: buglist.cgi, referer: https://bugzilla.mozilla.org/query.cgi

That is why we sometimes see the BUGLIST cookie header.

Something could be happening in XS land, or it could be something happening in a $SIG{__DIE__} handler. I'm narrowing it down.

With a kept-alive socket, garbage printed by one request will just sit there until the next request comes in. So turning off keep-alive would alleviate this slightly (but obviously this is still a problem! jsonrpc should not be omitting headers!)
(Assignee)

Comment 6

2 years ago
I'm thinking this is server push + keepalives.
https://github.com/mozilla/webtools-bmo-bugzilla/blob/master/buglist.cgi#L730-L747

should disable keep alives. If that works, we should either call close() for server pushes, force the process
that does a server push to exit, or remove the server push feature. I strongly endorse the last option.
(Reporter)

Comment 7

2 years ago
dylan - great work; that code does indeed look suspect.

fubar - can you disable keepalives in the dev and stage configs?  we'll do a quick sanity check to make sure nothing broke, and then we can disable it on production.
Flags: needinfo?(klibby)
ok, dev and staging both have KeepAlive disabled.
Flags: needinfo?(klibby)
(Reporter)

Comment 9

2 years ago
thanks fubar - things look good to me.  please flip the switch on prod.
Flags: needinfo?(klibby)
(In reply to Byron Jones [:glob] from comment #9)
> thanks fubar - things look good to me.  please flip the switch on prod.

done.
Flags: needinfo?(klibby)
(Assignee)

Comment 11

2 years ago
Created attachment 8707628 [details] [diff] [review]
1231918_1.patch

At least in part, the problem is that sentry adds a __DIE__ handler which calls ThrowTemplateError(). That function expects the http headers to already be sent! Examining the call sites, I concluded this was always the case (except in our __DIE__ handler). This patch fixes that call to ThrowTemplateError().

Meanwhile, we need to eventually not use a __DIE__ handler. It would be better to have an eval { } block in mod_perl.pl, and only install the __DIE__ handler on dev machines.

Note that bz-dev-manager adds its own __DIE__ handler -- you will need to revert that change to test this patch.

For testing, I used http://bugzilla.local/bmo/buglist.cgi?bugidtype=include&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=XML&keywords=css1%3C%00ScRiPt%20%0D%0A%3Eprompt%28964655%29%3C%2FScRiPt%3E&keywords_type=anywords&list_id=396

It remains to be seen if this is the *only* problem we're encountering for the random intermixed HTTP responses. We may need to call CORE::GLOBAL::exit() when a fatal error happens, because mod_perl2 gets into an inconsistent state.
Attachment #8707628 - Flags: review?(dkl)
Comment on attachment 8707628 [details] [diff] [review]
1231918_1.patch

Review of attachment 8707628 [details] [diff] [review]:
-----------------------------------------------------------------

r=dkl

::: Bugzilla/Sentry.pm
@@ +336,5 @@
> +            my $cgi = Bugzilla->cgi;
> +            $cgi->close_standby_message('text/html', 'inline', 'error', 'html');
> +            Bugzilla::Error::ThrowTemplateError($message);
> +            print $cgi->multipart_final() if $cgi->{_multipart_in_progress};
> +

nit: remove extra newline
Attachment #8707628 - Flags: review?(dkl) → review+
(Assignee)

Comment 13

2 years ago
So this will have unintended consequences because of the die handler we use to send stuff to sentry.

To explain how __DIE__ handlers work, see this code:

https://gist.github.com/dylanwh/8e46c22bda50eb21e60f

It outputs this:
die handler: error message at eval.pl line 12.
warn handler: error message at eval.pl line 12.

What does this mean?

Even with my fix, any code that does this:

eval { operation_that_may_die };
if ($@) {
    # ...
}

will first output an HTTP error page, and then continue on from the if block!

I'm going to examine the code to see if this happens anywhere.

Note this could even happen in code we don't control...
(Assignee)

Comment 14

2 years ago
Okay, thankfully _in_eval() guards against this. (I didn't realize an eval { } block adds an "(eval)" frame to the call stack.)
Note this means exception handling is very slow (because we walk the stack for every exception)
Do we need a separate bug for "stop walking the stack for every exception", to track that perf issue?
(Assignee)

Comment 16

2 years ago
I don't have proof that it is very slow, other than remembering that you're not really encouraged to use caller()... I think I'd like to land this patch and see what happens.
(Assignee)

Comment 17

2 years ago
Aha! We need to use $^S instead of _in_eval().
(Assignee)

Comment 18

2 years ago
I thought maybe that wasn't a read-only SV, but it is, so there's no fixing that under mod_perl.
(Reporter)

Updated

2 years ago
Component: Infrastructure → General
QA Contact: mcote
(Reporter)

Comment 19

a year ago
To ssh://gitolite3@git.mozilla.org/webtools/bmo/bugzilla.git
   ace6728..15a6c7e  master -> master
(Assignee)

Updated

a year ago
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Did we re-enable keepalives as part of shipping the fix in comment 19, or does this fix require *both* the comment 19 change *and* keepalives staying off, in tandem?
keepalives have NOT been re-enabled, but I don't know if that is still required.
Status: RESOLVED → REOPENED
Flags: needinfo?(dylan)
Resolution: FIXED → ---
(Assignee)

Comment 22

a year ago
I'm fairly certain we can re-enable them. 98% sure.
Flags: needinfo?(dylan)
re-enabled keepalives on dev and stage. please test and let me know if I can re-enable on prod.
(Assignee)

Comment 24

a year ago
Looks good. We never were able to trigger this on dev or stage though.
ok, re-enabled on prod.
(Assignee)

Comment 26

a year ago
This seems to be resolved.
Status: REOPENED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
(Assignee)

Updated

4 months ago
Group: bmo-infra
(Reporter)

Updated

4 months ago
Group: infra
(Assignee)

Updated

2 months ago
Group: bmo-infra