autophone-1 is down 2016-06-07

RESOLVED FIXED

Status

Testing
Autophone
--
major
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: bc, Assigned: van)

Tracking

(Blocks: 2 bugs)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: Case ID is 5309395481)

Attachments

(3 attachments)

At 8:32 AM PDT 2016-06-07 I received an alert warning that the pulse queue used by autophone-1 was growing.

I attempted to ssh into autophone-1 but could not make a connection.
I was able to ssh into autophone-2 and autophone-3 successfully.

It appears the server is down.
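
For reference, the manual ssh check described above could be approximated with a small reachability probe. This is only a sketch under stated assumptions: the function name `ssh_port_reachable` is hypothetical, the default port 22 is assumed, and the probe only tests whether the TCP port accepts a connection. It does not perform an actual ssh login.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only checks that something is listening on the
    port, not that an ssh session can actually be established.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling this for each of the three hosts would distinguish a host that refuses or times out from the ones that still accept connections, which matches the manual check done here.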
van: Can you look into this as soon as reasonably possible? If we need to initiate a service call, I'd like to get started as soon as we can.
Flags: needinfo?(vle)
FYI, this doesn't appear to be a case where a reboot was initiated and the host failed to complete rebooting. I did not receive any notification emails that a device had disconnected nor that the server was going to reboot. This appears to be a spontaneous failure.
(Assignee)

Comment 3

2 years ago
the server no longer boots up. powers on briefly for 1-3 seconds then shuts off. ive opened a case with HP and a CE will contact me for the hardware (motherboard) replacement.

For ETA for onsite CE you have to contact our onsite co-ordination team & the number is +1-844 806 3433 Opt 4.1.5.2

will update as soon as i get more info.
Whiteboard: Case ID is 5309395481
(Assignee)

Comment 4

2 years ago
the server no longer boots up. powers on briefly for 1-3 seconds then shuts off. performed various troubleshooting steps without success.

ive opened a case with HP and a CE will contact me for the hardware (motherboard) replacement.

For ETA for onsite CE you have to contact our onsite co-ordination team & the number is +1-844 806 3433 Opt 4.1.5.2

will update as soon as i get more info.
Assignee: nobody → vle
(Assignee)

Comment 5

2 years ago
it looks like the tech will receive the system board tomorrow at 5pm and will contact me for an escort into mtv2's QA lab for the replacement. will update again once ive touched base with the HP tech.

Dear Mr Van Le, 

Thank you for contacting Hewlett Packard Enterprise for your service request (5309395481).
We have scheduled your onsite task (5309395481-531). The onsite delivery will occur before 2016-06-08 17:00:00. We will notify you again when a technician is on the way.

If you have any questions or wish to reschedule please contact us.

Yours sincerely,
Hewlett Packard Enterprise

ref:_00Dd0bUlK._50027k7
Created attachment 8761301 [details] [diff] [review]
bug-1278595-v1.patch

HP has not confirmed they will be on site to replace the faulty system board.

In the meantime we can move the devices from autophone-1 to autophone-2,3 temporarily.

van: if we have to move the devices, this is the plan.

We only have 6 USB2 and 6 USB3 ports per machine for a total of 12. This means we don't have enough ports for all of the devices. We'll disable the Nexus S devices to free up ports.

Disconnect from autophone-2
nexus-s-7
nexus-s-8
nexus-s-9

Disconnect from autophone-3
nexus-s-3

Move from autophone-1 to autophone-2
nexus-4-1
nexus-4-2
nexus-4-5
nexus-4-6
nexus-5-1
nexus-6-1
nexus-9-1

Move from autophone-1 to autophone-3
nexus-6p-1
nexus-6p-2
nexus-6p-3

This will use all available ports on autophone-2,3.
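
As a sanity check on the port arithmetic in the plan above, here is a minimal sketch. The 12-port limit and the device lists come from this comment; the dictionary layout and the helper names (`max_devices_before_move`, `fits`) are hypothetical, and the number of devices currently attached to each host is not stated here, so it is left as an input.

```python
PORTS_PER_HOST = 6 + 6  # 6 USB2 + 6 USB3 ports per machine, as stated above

# Device lists copied from the plan; the dict structure is illustrative only.
PLAN = {
    "autophone-2": {
        "disconnect": ["nexus-s-7", "nexus-s-8", "nexus-s-9"],
        "move_in": ["nexus-4-1", "nexus-4-2", "nexus-4-5", "nexus-4-6",
                    "nexus-5-1", "nexus-6-1", "nexus-9-1"],
    },
    "autophone-3": {
        "disconnect": ["nexus-s-3"],
        "move_in": ["nexus-6p-1", "nexus-6p-2", "nexus-6p-3"],
    },
}


def max_devices_before_move(host_plan, ports=PORTS_PER_HOST):
    """Most devices a host may have attached beforehand for the plan to fit."""
    return ports - len(host_plan["move_in"]) + len(host_plan["disconnect"])


def fits(current_count, host_plan, ports=PORTS_PER_HOST):
    """True if the host stays within its port budget after the move."""
    after = current_count - len(host_plan["disconnect"]) + len(host_plan["move_in"])
    return after <= ports


for host, host_plan in PLAN.items():
    print(host, max_devices_before_move(host_plan))
```

With these lists the plan fits only if autophone-2 currently has at most 8 devices attached and autophone-3 at most 10, which is consistent with the statement that the move uses every available port.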
Van: ping me when/if you are ready to perform the move. I'll shut down the phones prior to you starting.
van noticed deformities in nexus-4-{2,3}. He has removed them, removed their wifi credentials, and will send them to servicedesk for recycling. I've removed them from production and rebooted autophone-{2,3}. We'll deal with the loss of coverage at a later time.

https://github.com/mozilla/autophone/commit/f8419ac86ec208cd3c4aa0b3ebb46bdcda19273d
(Assignee)

Comment 11

2 years ago
:bc, i'm on site in mtv2 with the HP tech but unfortunately a new CPU didn't resolve the issue. the tech is going to have to escalate this once again. we'll have to pick this up once we come back from london.

i should expect a call from HP's tier 2 after the week of london. we're unsure what the next steps are, maybe they'll swap the whole chassis or send a different tech.
Flags: needinfo?(vle)
(Assignee)

Updated

2 years ago
Flags: needinfo?(vle)
(Assignee)

Comment 12

2 years ago
to follow up, the tech replaced almost everything but the server fails to POST - this is why we're getting escalated to tier 2.

day 1 - replaced system board
day 2 - replaced PSU backplane
day 3 - replaced CPU
Created attachment 8762618 [details] [diff] [review]
bug-1278595-2.patch

https://github.com/mozilla/autophone/commit/dce9b57595674930140266b662fd108fd1e8c47c

cover nexus-4-{2,3} with nexus-4-4 (spare) and nexus-4-7 (try) and disable inactive nexus-s devices in test manifests.

updated and deployed 2016-06-14 08:58 UTC+1 (00:58 PDT)
(Assignee)

Comment 14

2 years ago
for whatever reason, even after informing both techs that I would be out of town for the all-hands, they still scheduled an on-site with me on Monday 6/13. when speaking with the tech in London, i had rescheduled for today and called him this morning but he did not pick up. 

i went ahead and contacted HP support to make sure i get contacted today or tomorrow. hopefully this gets straightened out ASAP.


[Monday, June 20, 2016 8:18 AM] -- Satish P says:
I see there was an email sent to you on June 10th 

"We have scheduled your onsite task (5309395481-533).The onsite delivery will occur before 2016-06-13 17:00:00. We will notify you again when a technician is on the way".

[Monday, June 20, 2016 8:19 AM] -- Van Le says:
i contacted the new tech
[Monday, June 20, 2016 8:19 AM] -- Van Le says:
and i also told jonathan that i was out of town


[Monday, June 20, 2016 8:20 AM] -- Satish P says:
Part is not sent to site yet and the case is still pending with that case.

[Monday, June 20, 2016 8:21 AM] -- Satish P says:
Please give me some more time on this

[Monday, June 20, 2016 8:39 AM] -- Satish P says:
Thank you for staying connected. 

I see since this case was processed initially as critical down server and was been elevated to L2 support and have shipped various parts with onsite tech and have tried swaping on server and issue still persists. 

Parts that was carried along with onsite tech was : 
501533-001 : 2 
532479-001 : 1 
536406-001 : 1 
610524-001: system board 

Since the issue still exists this case will be moved to senior research team and you will get more emails from research team who has the authority to ship more parts to fix the issue. 

For reference, your Case ID is 5309672280
(Assignee)

Comment 15

2 years ago
number for HP CE onsite coordination team (844)806-3433, in case i don't hear back today.

i've already emailed the level 2 HP CE team for an update.
(Assignee)

Comment 16

2 years ago
going to meet up with HP tech around 1230 at mtv2 for DIMM swaps. i don't think it's the memory but it's pretty much the only thing we haven't swapped out.
(Assignee)

Comment 17

2 years ago
hp tech came on site to test different DIMMs on the system board. as expected, this did not resolve the issue. 

the next step for them is to replace everything at once as replacing 1 item at a time could be frying/shorting a random component. the tech has put in the request and will need L2 approval for new everything and will contact me once received.
(Assignee)

Comment 18

2 years ago
didn't hear from HP so I gave them a call. they're going to approve replacing all modules/hardware at once. he'll be on site ~11am tomorrow 6/24/16. i'll be there to escort him.

:bc, since they're replacing all the hardware, do you want to replace the USB card as well? i believe you ordered a spare and this will rule out the card contributing to the initial meltdown.
(Assignee)

Updated

2 years ago
Flags: needinfo?(vle) → needinfo?(bob)
(Assignee)

Updated

2 years ago
Flags: needinfo?(vle)
Please do replace it. We should have 2 spares.
Flags: needinfo?(bob)
Created attachment 8765403 [details] [diff] [review]
bug-1278595-3.patch

r=self. All the nexus-s devices are connected to autophone-1 now. We'll just keep it that way for now.
Attachment #8765403 - Flags: review+
https://github.com/mozilla/autophone/commit/7bae920edb118a7f64e2632deda482f8e5526f96

deployed 2016-06-27 03:30

Van, is there anything else we need to do ?
(Assignee)

Comment 22

2 years ago
>Van, is there anything else we need to do ?

if you're good, then we're good. 

Friday's events:

1) tech swapped out cpu, memory, and motherboard
2) server failed to POST.
3) tried swapping in and out multiple parts without success
4) tech escalated and we waited for further instructions.
5) while waiting, tech decided to try a spare PSU backplane he had destined for another customer.
6) server booted up, passed POST. it appears the first PSU backplane the tech received was DoA.
7) replaced USB 3.0 card and reconnected phones.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Flags: needinfo?(vle)
Resolution: --- → FIXED