Closed Bug 1151284 Opened 10 years ago Closed 10 years ago

APAC and EU wireless upgrade

Categories

(Infrastructure & Operations :: Change Requests, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: van, Assigned: van)

Details

current controller software has critical bugs that have been fixed in the latest update. we need to upgrade these controllers' software ASAP. refer to 1150113 for more info.

date, time, duration of maintenance: April 17, 10PM PDT, 12 hours
system(s) affected: APAC and EU (par1, lon1, tpe1, akl1, ber1) wireless controllers and APs
end-user impact: wireless will be unavailable for ~30min (60min) while the APs reboot to install their new code release
maintenance plan and timeline: https://mana.mozilla.org/wiki/display/NETOPS/Juniper+wireless+upgrade#Juniperwirelessupgrade-WLC%28MSS:MobilitySystemSoftware%29
rollback plan / rollback point: if any issue is noticed (instability, connection issues, etc.), restart the controllers to their backup partition and investigate
notification mechanisms: Oncall, whistlepig
who will be point, who else will be involved: Van, Arzhel
Flags: cab-review+
Flags: cab-review+ → cab-review?
Reviewed and approved in 4/8 CAB
Flags: cab-review? → cab-review+
APAC upgrade completed without issues. EU upgrade completed but we still have 1 AP down (wap302.ops.lon1). The AP boots but won't download the image. It stops/crashes at the 5 minute mark. I am working with JTAC to resolve the issue or RMA the device depending on outcome of their review of our logs. I've disabled PoE on the AP's interface and will try to reenable tomorrow to see if issue resolves. (Hoping that it just requires a long drain as several reboot attempts didn't work.)
attempted to bring the AP back online this morning. still the same issue: the AP tries to download the image for 5 minutes, then gives up. 2015-0411-0058 opened for WLA RMA.

the AP is displaying some unusual behavior.

packet loss while downloading the image:
--- 10.246.1.2 ping statistics ---
500 packets transmitted, 403 packets received, 0 errors, 19% packet loss

packet gain after it gives up:
--- 10.246.1.2 ping statistics ---
500 packets transmitted, 742 packets received, 0 errors, 0% packet loss

*wifi2.ops.lon1.mozilla.net# show log trace match 4632 -300
APM_RF Apr 11 18:12:41.332410 DEBUG AUTORF_ERROR: autorf_reset_radio_state: AP 4632: ap not in configured state
APM_RF Apr 11 18:12:41.332319 DEBUG AUTORF_ERROR: autorf_reset_radio_state: AP 4632: ap not in configured state
SM Apr 11 18:12:41.331254 NOTICE SM-EVENT: APM reports AP 4632 is down (LB)
APM_MGR Apr 11 18:12:41.329894 NOTICE MX_SEL_REC: deleting wlc sel rec for AP 4632. Reason = "connection timer" flags=0
APM_MGR Apr 11 18:11:06.213855 DEBUG dap_mgr_lb_run: Running LB for AP(4632) on Ctrler(10.246.0.30)
WLA Apr 11 18:07:44.023415 INFO AP 4632 network: <254>Dec 31 17:00:25 syslog: dap_set_initial_state: has_wlcinfo was 1, state is 1, try_last_good is 1
WLA Apr 11 18:07:44.023333 INFO AP 4632 tapa: <318>Dec 31 17:00:23 syslog: Boot count: 146 [BsVS: na; UbP:7: 7720118081012]
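For anyone comparing the two ping summaries above, here is a minimal Python sketch that recomputes loss and surplus replies from the transmitted/received counts. The helper name `summarize_ping` and the regular expression are hypothetical and not part of any Juniper or Mozilla tooling. It assumes the summary line has the "N packets transmitted, M packets received" shape shown in this comment.

```python
import re

# Hypothetical pattern for the summary line format quoted in this bug.
PING_SUMMARY = re.compile(
    r"(?P<tx>\d+) packets transmitted, (?P<rx>\d+) packets received"
)


def summarize_ping(line):
    """Return (loss_percent, surplus_replies) for one ping summary line."""
    match = PING_SUMMARY.search(line)
    if match is None:
        raise ValueError("no ping summary found in line")
    tx = int(match.group("tx"))
    rx = int(match.group("rx"))
    if tx == 0:
        raise ValueError("no packets were transmitted")
    lost = max(tx - rx, 0)      # replies that never arrived
    surplus = max(rx - tx, 0)   # replies beyond what was sent
    return 100.0 * lost / tx, surplus


during_download = summarize_ping(
    "500 packets transmitted, 403 packets received, 0 errors, 19% packet loss"
)
after_give_up = summarize_ping(
    "500 packets transmitted, 742 packets received, 0 errors, 0% packet loss"
)
print(during_download)  # (19.4, 0)
print(after_give_up)    # (0.0, 242)
```

The first result matches the 19% that ping reported. The second shows 242 more replies than requests, which ping still reports as 0% loss. What causes the surplus is not established in this bug.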
i've also tried moving the AP interface out of the interface-range and pruning all VLANs except the ops VLAN, but unfortunately that didn't resolve the issue either. there was still a lot of packet loss during the download phase and the AP still disappears after 5 minutes. i will configure an AP in SFO1 and we'll send it to LON1 to swap out. i'll see if we can RMA wap302, but it appears to be out of warranty per the support site.
Any known issues with wap301.ops.lon1.mozilla.net ? It bounced this morning.
Just lost another WAP in London: Mon 03:30:30 PDT [5040] wap309.ops.lon1.mozilla.net (10.246.1.9) is DOWN :PING CRITICAL - Packet loss = 100%
At least some of this is locals moving an AP without telling us first :(
the clusters have been upgraded in all regions. we'll be tracking the one bad AP in bug 1154024.
Assignee: server-ops → vle
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Change Request: --- → approved
Flags: cab-review+