Closed
Bug 1266990
Opened 9 years ago
Closed 9 years ago
Autophone is receiving 502 Proxy Errors for fx-team and mozilla-central
Categories
(Testing Graveyard :: Autophone, defect, P1)
Testing Graveyard
Autophone
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bc, Assigned: gbrown)
References
Details
Attachments
(2 files, 1 obsolete file)
386 bytes,
text/html
|
Details | |
2.09 KB,
patch
|
bc
:
review+
|
Details | Diff | Splinter Review |
Beginning at 3:23 AM PDT Autophone began experiencing the following errors for fx-team and mozilla-central:
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request /api/project/fx-team/jobs/ POST /api/project/fx-team/jobs/
Reason: Error reading from remote server
Reporter | ||
Updated•9 years ago
|
Attachment #8744626 -
Attachment mime type: text/plain → text/html
Reporter | ||
Comment 1•9 years ago
|
||
If it helps, fx-team began receiving these errors at 3:23 AM while mozilla-central began at 8:10 AM.
Comment 2•9 years ago
|
||
The error is from Zeus, albeit it's "Error reading from remote server" so presumably Zeus is just excluded the nodes from the pool due to too many 500s or similar.
There are some errors on New Relic, not sure if they are related:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/4eb47b-b37c2803-0a45-11e6-8950-c81f66b8ceca
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3fca17-ae4ccd9f-0a45-11e6-8950-c81f66b8ceca
The last prod deploy was on the 13th April.
The gunicorn error logs are also showing lots of:
[2016-04-24 18:00:00 +0000] [8842] [CRITICAL] WORKER TIMEOUT (pid:18401)
[2016-04-24 18:05:17 +0000] [8842] [CRITICAL] WORKER TIMEOUT (pid:20785)
[2016-04-24 18:06:57 +0000] [8842] [CRITICAL] WORKER TIMEOUT (pid:20834)
[2016-04-24 18:07:59 +0000] [8842] [CRITICAL] WORKER TIMEOUT (pid:20879)
(The max request time is set to ~30s)
Updated•9 years ago
|
Component: Treeherder → Treeherder: Infrastructure
Priority: -- → P1
QA Contact: laura
Reporter | ||
Comment 3•9 years ago
|
||
The root cause of this is a set of jobs with an extraordinarily large job collection to be submitted. Converted to json these job collections were many megabytes in length. The current implementation in Autophone is to store the job collections in a database queue as a json string. These strings appear to have been truncated and invalid json stored in the database.
I have deleted the offending jobs from the queues on the two machines where this happened and it appears the back log of treeherder job collections is being submitted.
We can move this from Treeherder to Autophone if you wish or we can leave this here to improve error handling / messaging regarding client failure such as this.
gbrown: we need to better handle the queued treeherder jobs.
Reporter | ||
Comment 4•9 years ago
|
||
Taking this into autophone's responsibility. Treeherder should definitely have more informative errors but this issue is all ours.
Component: Treeherder: Infrastructure → Autophone
Product: Tree Management → Testing
QA Contact: laura
Version: --- → Trunk
Reporter | ||
Updated•9 years ago
|
Assignee: nobody → bob
Reporter | ||
Updated•9 years ago
|
Assignee: bob → gbrown
Status: NEW → ASSIGNED
![]() |
Assignee | |
Comment 5•9 years ago
|
||
(In reply to Bob Clary [:bc:] from comment #3)
> The root cause of this is a set of jobs with an extraordinarily large job
> collection to be submitted. Converted to json these job collections were
> many megabytes in length. The current implementation in Autophone is to
> store the job collections in a database queue as a json string. These
> strings appear to have been truncated and invalid json stored in the
> database.
I wonder what happened there.
I used custom test code to generate a TreeherderJobCollection with 1000000 jobs; len(tjc.to_json()) was 532888362 bytes. I stored the 500 MB+ job string in the database, retrieved it from the database and successfully json.loads()'d it - no problem, and the retrieved string was not truncated.
![]() |
Assignee | |
Comment 6•9 years ago
|
||
I don't understand what happened here. If a job was truncated in the database, I would expect json.loads() to throw here: https://github.com/mozilla/autophone/blob/master/jobs.py#L406 -- but then we never would have submitted to treeherder.
...and the sqlite limit of 10^9 characters seems very permissive.
I've opted for imposing an arbitrary 10M limit on the size of the treeherder job collection - if it's bigger than that, we warn and discard the job before putting it in the database.
I've also added some logging to try to get better diagnostics in future.
Attachment #8749427 -
Flags: review?(bob)
Reporter | ||
Comment 7•9 years ago
|
||
Comment on attachment 8749427 [details] [diff] [review]
put arbitrary cap on max size of treeherder job collection json; improve logging
Review of attachment 8749427 [details] [diff] [review]:
-----------------------------------------------------------------
I'm sorry that I misled you as to the cause of the issue and led you down the wrong path.
I think the root issue is with too many error messages being sent with the completed log. See https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py#L488
I think an appropriate fix would be to limit the number of error messages which are sent with the completed job. 100 should be more than enough. We could add a counter to the loop and when we exceed the maximum number of error messages, break out of the loop. This will limit the size of the completed job and allow us to complete the job successfully without dropping it on the floor. We should append a final error message in this case however notifying the user that the error messages have been truncated.
A warning to the log would be appropriate. I almost want to say that we would want an email notification, but we could potentially be back to the flood of autophone messages again. If the test is reporting large numbers of failure, it should be noticed during sheriffing and there shouldn't be a need for email notifications.
::: autophonetreeherder.py
@@ +120,1 @@
> return False
I can see that you are looking for a truncated string coming back from the database and a reporting the tail of the string to indicate that. I actually don't think this is necessary.
I believe my original comment about the string being truncated was incorrect and that the most probable problem we experienced before was the huge request we sent which was rejected by the proxy.
::: jobs.py
@@ +23,5 @@
> SQL_MAX_RETRIES = 10
> + # An arbitrary limit on the number of characters written to a database field.
> + # sqlite should allow 10^9, but a lower limit seems safer and more likely to
> + # catch unexpected conditions.
> + SQL_MAX_LENGTH = 10000000
As I said, I believe I was wrong about the field truncation.
@@ +382,5 @@
> now = datetime.datetime.now().isoformat()
> + job_collection_json = job_collection.to_json()
> + if len(job_collection_json) > self.SQL_MAX_LENGTH:
> + logger.warning('jobs.new_treeherder_job: job discarded for excessive length %d' % len(job_collection_json))
> + return
I definitely don't like dropping the job with just a warning to the log.
Attachment #8749427 -
Flags: review?(bob)
![]() |
Assignee | |
Comment 8•9 years ago
|
||
I am glad to (hopefully) address the root cause.
This submits at most 100 errors to treeherder. If errors are truncated, errors_truncated is set to True; I don't know if that is advantageous, but it seems reasonable and does no harm: https://treeherder.allizom.org/#/jobs?repo=mozilla-central&filter-tier=3&filter-searchStr=autophone&revision=1579b9e2e50f3a27ad02d58cc9170c91e0973fec&selectedJob=3499599
I also added the request length to the email when treeherder submission fails. That won't tell us much, but it is one other easy piece of information that might help us to understand root cause in the future.
Attachment #8749427 -
Attachment is obsolete: true
Attachment #8751506 -
Flags: review?(bob)
Reporter | ||
Comment 9•9 years ago
|
||
Comment on attachment 8751506 [details] [diff] [review]
truncate excessive errors
Review of attachment 8751506 [details] [diff] [review]:
-----------------------------------------------------------------
::: autophonetreeherder.py
@@ +524,4 @@
> # numbers of the errors.
> if line:
> error_lines.append({"line": line, "linenumber": 1})
> + if len(error_lines) >= 100:
In general I would have preferred a counter rather than calculating the length of the array each time, but len(list) is O(1) and I don't think it matters here.
@@ -541,4 @@
> 'result': t.test_result.status
> },
> ],
> - 'errors_truncated': False
Too bad I didn't handle this case originally. :-(
Attachment #8751506 -
Flags: review?(bob) → review+
![]() |
Assignee | |
Comment 10•9 years ago
|
||
Blocks: autophone-deployments
Reporter | ||
Comment 11•9 years ago
|
||
deployed 2016-05-12 14:00:00
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•3 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•