Closed Bug 1284505 Opened 9 years ago Closed 9 years ago

During a recent prod push, zeus was failing and an error occurred. The push script should possibly fail early in these cases

Categories: bugzilla.mozilla.org :: Infrastructure, defect
Production; Priority: not set; Severity: major
Status: RESOLVED FIXED
People: Reporter: dkl; Assignee: Unassigned
Possibly caused by bug 1284456, today's bmo prod push generated the following error text in the output:

[13:48:47] syntax error at line 1, column 0, byte 0 at /usr/lib64/perl5/XML/Parser.pm line 187.
[13:48:47] 500 Connect failed: connect: Connection refused; Connection refused
[13:49:18] NOTICE: web1.bugs.scl3.mozilla.com hasn't drained in 30s. Please verify active connections!
[13:49:18] NOTICE: Push script will continue without draining node in 60s if still in wait
[13:49:49] Active connections on web1.bugs.scl3.mozilla.com: 13
[13:49:49] ERROR: Timeout waiting for web1.bugs.scl3.mozilla.com to fully drain!
[13:49:49] + ssh web1.bugs.scl3.mozilla.com /etc/init.d/httpd restart
[13:49:55] Stopping httpd: [ OK ]
[13:49:55] Starting httpd: [Tue Jul 05 13:49:55 2016] [warn] WARNING: HOME is not set, using root: /\n
[13:49:56] [Tue Jul 05 13:49:56 2016] [warn] NameVirtualHost *:80 has no VirtualHosts
[13:49:56] [ OK ]
[13:49:57] + ssh web1.bugs.scl3.mozilla.com "cd /data/www/bugzilla.mozilla.org; contrib/clear-templates.pl"
[13:49:59] clearing ./template_cache
[13:50:00] Waiting 10 seconds for httpd to initialize...
[13:50:10] + /root/bin/zxtmpool undrain web1.bugs.scl3.mozilla.com
[13:50:11]
[13:50:11] syntax error at line 1, column 0, byte 0 at /usr/lib64/perl5/XML/Parser.pm line 187.
[13:50:11] 500 Connect failed: connect: Connection refused; Connection refused
[13:50:11] Active connections on web1.bugs.scl3.mozilla.com: 15
[13:50:12] + restarthttpd https://bugzilla.mozilla.org/ web2.bugs.scl3.mozilla.com wait
[13:50:12] + /root/bin/zxtmpool drain web2.bugs.scl3.mozilla.com

It was mentioned that this happened when the script was trying to remove the particular webhead from service to do the code update, and the operation failed. I believe we should fail the entire operation in this case in the future, as updating code on a live webhead is not preferable.

dkl
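The fail-early behavior proposed above could look something like the following sketch. This is not the actual push script; `zxtmpool_drain` is a stub standing in for `/root/bin/zxtmpool drain <node>` (which, per the log, failed for web1), and `push_all` is an illustrative helper name.

```shell
#!/bin/sh
# Hypothetical sketch: abort the whole push as soon as a node refuses to
# drain, rather than continuing to restart httpd on a node that is still
# serving traffic.

zxtmpool_drain() {
    # Stub: pretend web1 refuses to drain, as in the log above.
    [ "$1" != "web1.bugs.scl3.mozilla.com" ]
}

push_all() {
    for node in "$@"; do
        if ! zxtmpool_drain "$node"; then
            echo "ERROR: could not drain $node; aborting push"
            return 1
        fi
        echo "updating code on $node"
    done
}

push_all web2.bugs.scl3.mozilla.com web1.bugs.scl3.mozilla.com \
    || echo "push aborted"
```

The key difference from the current behavior is the `return 1` on drain failure: today the script logs a NOTICE and proceeds after a timeout, restarting httpd on a webhead that still has active connections.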
See Also: → 1284289
See Also: 1284289 → 1284289, 1284487
(In reply to David Lawrence [:dkl] from comment #0)
> It was mentioned that this happened when the script was trying to remove
> the particular webhead from service to do the code update, and the
> operation failed.
> I believe we should fail the entire operation in this case in the future,
> as updating code on a live webhead is not preferable.

Aside from the zlb1 issue, there's a misunderstanding of the deploy process. The call to zxtmpool is only for restarting httpd, to prevent the restart from interrupting in-flight access from a client. It is not connected to deploying code to the web heads. Given how the deploy command (which update-bugzilla-* calls) works, there isn't any way to connect the two short of a hardhat on BMO.
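The drain/restart/undrain flow being described (and visible in the log timestamps) can be sketched as below. The real script queries the ZLB via zxtmpool; here `connections` is a stub counter, and the retry counts are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of the graceful-restart flow from the log: drain the node at the
# load balancer, wait for active connections to reach zero, then restart
# httpd and undrain. All commands are stubbed; this is not the real script.

active_connections=3  # stub standing in for a ZLB connection-count query

connections() { echo "$active_connections"; }

wait_for_drain() {
    node="$1"; tries="$2"
    while [ "$tries" -gt 0 ]; do
        n=$(connections "$node")
        [ "$n" -eq 0 ] && return 0
        active_connections=$((active_connections - 1))  # stub decay
        tries=$((tries - 1))
    done
    echo "NOTICE: $node hasn't drained; continuing anyway"
    return 1
}

wait_for_drain web1.bugs.scl3.mozilla.com 10 \
    && echo "drained, safe to restart httpd"
# -> drained, safe to restart httpd
```

The point of the comment above is that this sequence protects in-flight client requests during the httpd restart; the code deploy itself happens through a separate path and cannot easily be made conditional on the drain succeeding.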
I think we have two choices:

1) stop the update entirely, which leaves new code on all of the nodes (I'm skeptical about automating a rollback), doesn't restart httpd or the push/jobqueue daemons, and does not clear memcached
2) continue the deploy process without draining web nodes, possibly interrupting user access

thoughts?
I should note that this assumes that SOAP queries should ONLY go to zlb1. That's the only flow we have open, but if we can query the others then we can fix that...
zxtmpool updated to use external.zlb CNAME
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED