Closed
Bug 1475258
Opened 7 years ago
Closed 6 years ago
Intermittent DNS resolution errors
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: fubar, Unassigned)
In MDC1 and MDC2 on the w10 moonshots, we're seeing intermittent DNS resolution errors, both for internal and external sites:
“dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.”
“NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,”
For a while, we were also running into an issue where the DNS search data from DHCP wasn't coming through, though I think that's stopped.
This may be an issue with the moonshots, but I'd like to know if there's anything on the Infoblox side that's showing any problems, or if there are any missing configuration settings that might contribute to this.
Comment 1•7 years ago
Can you please provide source addresses of the hosts experiencing these failures?
Comment 2•7 years ago
I'd love to do some traffic captures if there is any way to replicate this.
Reporter | ||
Comment 3•7 years ago
cknowles reminds me to be good and add timestamps and hostnames (and mid-air'ed with the above!):
Jul 11 02:39:17 T-W1064-MS-214.mdc1.mozilla.com generic-worker: 2018/07/11 09:32:35 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 03:27:35 T-W1064-MS-269.mdc1.mozilla.com generic-worker: 2018/07/11 09:44:15 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 05:06:32 T-W1064-MS-212.mdc1.mozilla.com generic-worker: 2018/07/11 11:11:02 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 05:08:07 T-W1064-MS-213.mdc1.mozilla.com generic-worker: 2018/07/11 12:05:47 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 05:18:39 T-W1064-MS-207.mdc1.mozilla.com generic-worker: 2018/07/11 11:15:09 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 05:48:25 T-W1064-MS-255.mdc1.mozilla.com generic-worker: 2018/07/11 11:45:06 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
and...
Jul 11 05:48:25 T-W1064-MS-255.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:30 T-W1064-MS-029.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:41 T-W1064-MS-016.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:48 T-W1064-MS-034.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:49 T-W1064-MS-017.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:49 T-W1064-MS-295.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:53:52 T-W1064-MS-129.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:53:54 T-W1064-MS-169.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
I can come up with plenty more; papertrail is full of them.
Comment 4•7 years ago
If I start a traffic capture at any time and have it run for 5 minutes, is it safe to assume there will be some errors?
Reporter | ||
Comment 5•7 years ago
The TC lookup error is less frequent, but the NTP-related one looks like a pretty steady stream, and 5-10 minutes should definitely catch some.
Comment 6•7 years ago
I ended up accidentally capturing 30 minutes' worth. Going to start digging in.
Comment 7•7 years ago
My previous traffic capture was for MDC2; I'm going to do a lengthy one for MDC1 now. I thought that some of these hosts were in MDC2.
:fubar,
If you could provide some logs of failing hosts in MDC2 that I could use to correlate that would be helpful as well.
Comment 8•7 years ago
I took about a 10 minute capture of MDC1. Working on analyzing it now.
Comment 9•7 years ago
The following filter seems to be a good way to find unresolved queries:
dns && ip.dst_host == "10.49.40.166" && dns.flags == 0x8583
Comment 10•7 years ago
dns && ip.dst_host == "10.49.40.166" && dns.flags > 0x8580 seems to show more errors
Comment 11•7 years ago
After much digging and discussion with :marko, :cknowles, and :digi, we are seeing dynamic updates failing in the logs. Then the host seems to fall off the network for a while. Then the host seems to figure things out and send its logs to papertrail.
The error returned from Infoblox is:
"Flags: 0xa885 Dynamic update response, Refused"
Looking at the logs inside of papertrail, the timestamps of when the logs were received do not match the timestamps of when the events occurred.
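For reference, the flag words quoted in comments 9 through 11 can be unpacked by hand. The sketch below is illustrative and was not part of the original investigation. It assumes the standard 16-bit DNS header layout (RFC 1035, with the UPDATE opcode from RFC 2136); the function name `decode_dns_flags` is made up for this example, and 0x8183 is a hypothetical value that does not appear in the captures.

```python
# Sketch only: unpack the 16-bit DNS header flags word.
# Bit layout assumed from RFC 1035: QR(1) Opcode(4) AA TC RD RA Z(3) RCODE(4).
OPCODES = {0: "QUERY", 4: "NOTIFY", 5: "UPDATE"}
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def decode_dns_flags(flags: int) -> dict:
    """Return the fields of interest from a DNS flags word (hypothetical helper)."""
    opcode = (flags >> 11) & 0xF
    rcode = flags & 0xF
    return {
        "response": bool(flags & 0x8000),
        "opcode": OPCODES.get(opcode, str(opcode)),
        # The AA bit is only meaningful for ordinary queries; in UPDATE
        # messages (RFC 2136) this bit belongs to the reserved Z field.
        "authoritative": bool(flags & 0x0400),
        "rcode": RCODES.get(rcode, str(rcode)),
    }


# 0x8583, 0x8580 and 0xa885 are quoted in this bug; 0x8183 is hypothetical.
for value in (0x8583, 0x8580, 0xA885, 0x8183):
    print(f"{value:#06x}", decode_dns_flags(value))
```

Under these assumptions 0x8583 decodes as an authoritative NXDOMAIN response and 0xa885 as a refused UPDATE response, which agrees with the Wireshark description quoted above. A numeric test such as `dns.flags > 0x8580` would not match a non-authoritative NXDOMAIN like the hypothetical 0x8183, so filtering on the reply code field (`dns.flags.rcode != 0` in Wireshark) may be a more direct way to catch every error response.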
Reporter | ||
Comment 12•7 years ago
Thanks for the digging! I know Q had noticed last week that we were trying to do dynamic dns and was going to look into disabling it, so that's something.
> Then the host seems to fall off the network for a while.
Magic 8-ball says... next stop, looking at the chassis switches?
Comment 13•7 years ago
There are two timestamps in each line: one is from papertrail, and one is from generic-worker.
Jul 12 11:56:41 T-W1064-MS-177.mdc1.mozilla.com generic-worker: 2018/07/12 17:52:51 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known
The generic-worker timestamp is 17:52:51 UTC (the node is set to UTC), which would correlate to 10:52:51 PDT in papertrail. However, around the time of the generic-worker timestamp the node is not reporting to papertrail:
Jul 12 10:51:32 T-W1064-MS-177.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-177) has initiated the restart of computer T-W1064-MS-177 on behalf of user T-W1064-MS-177\GenericWorker for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully#015
Jul 12 10:51:35 T-W1064-MS-177.mdc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly. It has done this 1 time(s).#015
Jul 12 11:56:41 T-W1064-MS-177.mdc1.mozilla.com Microsoft-Windows-WinRM: The WinRM service is not listening for WS-Management requests. User Action If you did not intentionally stop the service, use the following command to see the WinRM configuration: winrm enumerate winrm/config/listener#015
It looks like this particular node was offline for that hour.
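The lag between the event and its arrival in papertrail can be worked out from the first line quoted above. This is a minimal sketch, assuming that the leading syslog-style timestamp is papertrail's receive time displayed in PDT (UTC-7), that the generic-worker timestamp is UTC as stated, and that the year is 2018. `delivery_lag` and the regular expression are hypothetical helpers written for this example, and parsing the month name with `%b` assumes an English locale.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: papertrail showed receive times in Pacific Daylight Time.
PDT = timezone(timedelta(hours=-7), "PDT")

# Log line copied from this comment.
LINE = (
    "Jul 12 11:56:41 T-W1064-MS-177.mdc1.mozilla.com generic-worker: "
    "2018/07/12 17:52:51 Error: Post https://queue.taskcluster.net/v1/"
    "claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup "
    "queue.taskcluster.net: getaddrinfow: No such host is known"
)

PATTERN = re.compile(
    r"^(?P<received>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"generic-worker: (?P<event>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})"
)


def delivery_lag(line: str, year: int = 2018):
    """Return (host, event time in PDT, receive time minus event time)."""
    m = PATTERN.match(line)
    if m is None:
        return None
    received = datetime.strptime(
        f"{year} {m['received']}", "%Y %b %d %H:%M:%S"
    ).replace(tzinfo=PDT)
    event = datetime.strptime(m["event"], "%Y/%m/%d %H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    return m["host"], event.astimezone(PDT), received - event


host, event_pdt, lag = delivery_lag(LINE)
print(host, event_pdt.time(), lag)
```

For this line the sketch gives an event time of 10:52:51 PDT and a lag of 1:03:50, consistent with the node being offline for about an hour.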
It seems like the nodes are going offline but still trying to perform tasks. Then, once they are back online, they send the older logs to papertrail. I am going to dive into the event logs and try to figure out what is happening while they are offline.
These do seem to be close in time to the "Flags: 0xa885 Dynamic update response, Refused" logs.
Comment 14•7 years ago
On the nodes that have the getaddrinfow error, the message below appears in the event viewer right at the beginning of the period of no connectivity:
SingleFunc_14_0_0: Driver startup failed because the hca could not be initialized.
Interestingly, ms-042 rebooted and never came back. I had to cold boot it to bring it back online:
Jul 12 15:26:26 T-W1064-MS-042.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-042) has initiated the restart of computer T-W1064-MS-042 on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully#015
Jul 12 15:26:30 T-W1064-MS-042.mdc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly. It has done this 1 time(s).#015
The same message appeared in the event viewer at 20:27 UTC / 15:27 PDT.
Comment 15•7 years ago
Every node in papertrail that has logged "SingleFunc_14_0_0: Driver startup failed because the hca could not be initialized." is having similar issues.
Updated•7 years ago
Assignee: infra → relops
Component: Infrastructure: DNS → RelOps
QA Contact: cshields → klibby
Comment 16•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #15)
> Every node in papertrail that has logged "SingleFunc_14_0_0: Driver
> startup failed because the hca could not be initialized." is having similar
> issues.
I've created a support case with HPE (added Mark as secondary contact): case id #5330960870
Linked to the specific cartridge for the example moon-chassis-1 c42n1 T-W1064-MS-042:
Product serial number: CN67200DWH
Product number: 833105-B21
https://support.hpe.com/portal/site/hpsc/scm/caseDetails?caseID=5330960870
Description:
```
Operating system/version: Windows 10 x64 build 1703
Product: HPE ProLiant m710x Server Cartridge
Product vendor:
Problem description:
On the moonshot cartridges, we are seeing the following errors at boot in Windows 10. When these errors are reported, there is no network connection available. This problem has repeated on about 20 different cartridges, and the appearance has increased with time.
These errors are found in the eventlog (after recovered with coldboot, we can view the eventlog for the boot when there was no network):
SingleFunc_14_0_0: Driver startup failed because the hca could not be initialized.
SingleFunc_14_0_0: Port #1 is configured to IB. Since IB is not supported in this device, it will automatically be configured to Ethernet instead. Check PortType registry key.#015
We are using Mellanox driver version 5.1.11548.0
Troubleshooting steps taken:
To fix the problem, we can cold-boot the cartridge. It may take multiple tries cold-booting and in some cases we have re-installed Windows and when that completes the problem does not appear on the new boot.
```
Comment hidden (typo) |
Comment hidden (typo) |
Comment 19•7 years ago
In papertrail, the "hca could not be init.." error has occurred on 150 unique cartridges in the last two weeks, and only in MDC1:
```
$ papertrail --force-color --group MDC1 --min-time '3 weeks ago' --max-time 'now' "Driver startup failed because the hca could not be initialized." | tee win10_driver-startup-failed.log
$ grep -o "Jul [0-9]\+" win10_driver-startup-failed.log |LC_ALL=C sort | LC_ALL=C uniq -c
10 Jul 06
14 Jul 07
22 Jul 08
32 Jul 09
106 Jul 10
117 Jul 11
87 Jul 12
49 Jul 13
$ grep -o "T-W.*\.com" win10_driver-startup-failed.log | sort | uniq -c
2 T-W1064-MS-016.mdc1.mozilla.com
5 T-W1064-MS-017.mdc1.mozilla.com
21 T-W1064-MS-019.mdc1.mozilla.com
9 T-W1064-MS-021.mdc1.mozilla.com
1 T-W1064-MS-022.mdc1.mozilla.com
1 T-W1064-MS-024.mdc1.mozilla.com
1 T-W1064-MS-025.mdc1.mozilla.com
6 T-W1064-MS-026.mdc1.mozilla.com
1 T-W1064-MS-027.mdc1.mozilla.com
6 T-W1064-MS-028.mdc1.mozilla.com
9 T-W1064-MS-029.mdc1.mozilla.com
4 T-W1064-MS-030.mdc1.mozilla.com
2 T-W1064-MS-031.mdc1.mozilla.com
1 T-W1064-MS-033.mdc1.mozilla.com
9 T-W1064-MS-034.mdc1.mozilla.com
2 T-W1064-MS-036.mdc1.mozilla.com
1 T-W1064-MS-037.mdc1.mozilla.com
9 T-W1064-MS-038.mdc1.mozilla.com
5 T-W1064-MS-039.mdc1.mozilla.com
2 T-W1064-MS-041.mdc1.mozilla.com
7 T-W1064-MS-042.mdc1.mozilla.com
4 T-W1064-MS-043.mdc1.mozilla.com
8 T-W1064-MS-044.mdc1.mozilla.com
1 T-W1064-MS-061.mdc1.mozilla.com
2 T-W1064-MS-063.mdc1.mozilla.com
3 T-W1064-MS-064.mdc1.mozilla.com
3 T-W1064-MS-066.mdc1.mozilla.com
1 T-W1064-MS-068.mdc1.mozilla.com
2 T-W1064-MS-070.mdc1.mozilla.com
2 T-W1064-MS-074.mdc1.mozilla.com
2 T-W1064-MS-075.mdc1.mozilla.com
1 T-W1064-MS-076.mdc1.mozilla.com
1 T-W1064-MS-077.mdc1.mozilla.com
3 T-W1064-MS-078.mdc1.mozilla.com
1 T-W1064-MS-081.mdc1.mozilla.com
1 T-W1064-MS-082.mdc1.mozilla.com
1 T-W1064-MS-083.mdc1.mozilla.com
1 T-W1064-MS-084.mdc1.mozilla.com
1 T-W1064-MS-085.mdc1.mozilla.com
1 T-W1064-MS-087.mdc1.mozilla.com
1 T-W1064-MS-088.mdc1.mozilla.com
1 T-W1064-MS-089.mdc1.mozilla.com
1 T-W1064-MS-090.mdc1.mozilla.com
1 T-W1064-MS-106.mdc1.mozilla.com
1 T-W1064-MS-108.mdc1.mozilla.com
5 T-W1064-MS-110.mdc1.mozilla.com
3 T-W1064-MS-111.mdc1.mozilla.com
2 T-W1064-MS-112.mdc1.mozilla.com
1 T-W1064-MS-116.mdc1.mozilla.com
1 T-W1064-MS-118.mdc1.mozilla.com
2 T-W1064-MS-119.mdc1.mozilla.com
1 T-W1064-MS-120.mdc1.mozilla.com
2 T-W1064-MS-121.mdc1.mozilla.com
1 T-W1064-MS-123.mdc1.mozilla.com
1 T-W1064-MS-124.mdc1.mozilla.com
2 T-W1064-MS-125.mdc1.mozilla.com
1 T-W1064-MS-127.mdc1.mozilla.com
1 T-W1064-MS-128.mdc1.mozilla.com
2 T-W1064-MS-129.mdc1.mozilla.com
1 T-W1064-MS-131.mdc1.mozilla.com
2 T-W1064-MS-134.mdc1.mozilla.com
3 T-W1064-MS-135.mdc1.mozilla.com
1 T-W1064-MS-152.mdc1.mozilla.com
3 T-W1064-MS-153.mdc1.mozilla.com
5 T-W1064-MS-154.mdc1.mozilla.com
3 T-W1064-MS-155.mdc1.mozilla.com
2 T-W1064-MS-156.mdc1.mozilla.com
1 T-W1064-MS-157.mdc1.mozilla.com
1 T-W1064-MS-158.mdc1.mozilla.com
2 T-W1064-MS-159.mdc1.mozilla.com
1 T-W1064-MS-162.mdc1.mozilla.com
1 T-W1064-MS-164.mdc1.mozilla.com
1 T-W1064-MS-167.mdc1.mozilla.com
3 T-W1064-MS-168.mdc1.mozilla.com
1 T-W1064-MS-169.mdc1.mozilla.com
3 T-W1064-MS-172.mdc1.mozilla.com
4 T-W1064-MS-173.mdc1.mozilla.com
2 T-W1064-MS-174.mdc1.mozilla.com
3 T-W1064-MS-175.mdc1.mozilla.com
4 T-W1064-MS-176.mdc1.mozilla.com
3 T-W1064-MS-177.mdc1.mozilla.com
4 T-W1064-MS-179.mdc1.mozilla.com
2 T-W1064-MS-196.mdc1.mozilla.com
2 T-W1064-MS-197.mdc1.mozilla.com
2 T-W1064-MS-198.mdc1.mozilla.com
1 T-W1064-MS-199.mdc1.mozilla.com
7 T-W1064-MS-200.mdc1.mozilla.com
1 T-W1064-MS-201.mdc1.mozilla.com
3 T-W1064-MS-202.mdc1.mozilla.com
2 T-W1064-MS-203.mdc1.mozilla.com
3 T-W1064-MS-204.mdc1.mozilla.com
2 T-W1064-MS-205.mdc1.mozilla.com
6 T-W1064-MS-206.mdc1.mozilla.com
6 T-W1064-MS-207.mdc1.mozilla.com
5 T-W1064-MS-208.mdc1.mozilla.com
2 T-W1064-MS-211.mdc1.mozilla.com
3 T-W1064-MS-212.mdc1.mozilla.com
2 T-W1064-MS-213.mdc1.mozilla.com
3 T-W1064-MS-214.mdc1.mozilla.com
6 T-W1064-MS-215.mdc1.mozilla.com
5 T-W1064-MS-216.mdc1.mozilla.com
4 T-W1064-MS-217.mdc1.mozilla.com
1 T-W1064-MS-218.mdc1.mozilla.com
4 T-W1064-MS-219.mdc1.mozilla.com
2 T-W1064-MS-220.mdc1.mozilla.com
2 T-W1064-MS-221.mdc1.mozilla.com
4 T-W1064-MS-223.mdc1.mozilla.com
2 T-W1064-MS-224.mdc1.mozilla.com
3 T-W1064-MS-225.mdc1.mozilla.com
2 T-W1064-MS-241.mdc1.mozilla.com
2 T-W1064-MS-242.mdc1.mozilla.com
4 T-W1064-MS-243.mdc1.mozilla.com
2 T-W1064-MS-244.mdc1.mozilla.com
1 T-W1064-MS-245.mdc1.mozilla.com
3 T-W1064-MS-246.mdc1.mozilla.com
1 T-W1064-MS-247.mdc1.mozilla.com
1 T-W1064-MS-248.mdc1.mozilla.com
3 T-W1064-MS-249.mdc1.mozilla.com
2 T-W1064-MS-250.mdc1.mozilla.com
2 T-W1064-MS-251.mdc1.mozilla.com
2 T-W1064-MS-252.mdc1.mozilla.com
2 T-W1064-MS-253.mdc1.mozilla.com
20 T-W1064-MS-254.mdc1.mozilla.com
5 T-W1064-MS-255.mdc1.mozilla.com
1 T-W1064-MS-256.mdc1.mozilla.com
2 T-W1064-MS-257.mdc1.mozilla.com
3 T-W1064-MS-258.mdc1.mozilla.com
2 T-W1064-MS-259.mdc1.mozilla.com
3 T-W1064-MS-262.mdc1.mozilla.com
2 T-W1064-MS-263.mdc1.mozilla.com
3 T-W1064-MS-264.mdc1.mozilla.com
6 T-W1064-MS-265.mdc1.mozilla.com
4 T-W1064-MS-266.mdc1.mozilla.com
3 T-W1064-MS-267.mdc1.mozilla.com
3 T-W1064-MS-268.mdc1.mozilla.com
4 T-W1064-MS-269.mdc1.mozilla.com
2 T-W1064-MS-270.mdc1.mozilla.com
5 T-W1064-MS-283.mdc1.mozilla.com
3 T-W1064-MS-284.mdc1.mozilla.com
2 T-W1064-MS-285.mdc1.mozilla.com
2 T-W1064-MS-286.mdc1.mozilla.com
1 T-W1064-MS-287.mdc1.mozilla.com
4 T-W1064-MS-290.mdc1.mozilla.com
4 T-W1064-MS-291.mdc1.mozilla.com
2 T-W1064-MS-292.mdc1.mozilla.com
1 T-W1064-MS-293.mdc1.mozilla.com
4 T-W1064-MS-294.mdc1.mozilla.com
2 T-W1064-MS-295.mdc1.mozilla.com
3 T-W1064-MS-297.mdc1.mozilla.com
2 T-W1064-MS-298.mdc1.mozilla.com
```
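The same per-day and per-host tallies can be produced in one pass over the saved log. This is a rough sketch, not the commands that were actually run: `summarize` is a hypothetical helper, the patterns assume the first "Mon DD" string on each line is the papertrail timestamp, and two of the three sample lines are invented purely to exercise the counters.

```python
import re
from collections import Counter

# Unanchored on purpose, like the grep commands above, because
# --force-color may put ANSI escape codes in front of the timestamp.
DAY_RE = re.compile(r"[A-Z][a-z]{2} \d{2}")
HOST_RE = re.compile(r"T-W\S*?\.com")


def summarize(lines):
    """Count matching log lines per day and per host (hypothetical helper)."""
    per_day, per_host = Counter(), Counter()
    for line in lines:
        day = DAY_RE.search(line)
        host = HOST_RE.search(line)
        if day:
            per_day[day.group()] += 1
        if host:
            per_host[host.group()] += 1
    return per_day, per_host


SAMPLE = [
    # Verbatim from comment 25 of this bug:
    "Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Native_14_0_0: "
    "Driver startup failed because the hca could not be initialized.#015",
    # Hypothetical lines, invented only for this demonstration:
    "Jul 12 10:00:00 T-W1064-MS-042.mdc1.mozilla.com mlx4_bus: SingleFunc_14_0_0: "
    "Driver startup failed because the hca could not be initialized.#015",
    "Jul 12 10:05:00 T-W1064-MS-042.mdc1.mozilla.com mlx4_bus: SingleFunc_14_0_0: "
    "Driver startup failed because the hca could not be initialized.#015",
]

per_day, per_host = summarize(SAMPLE)
print(sorted(per_day.items()))
print(len(per_host), "unique hosts")
```

To run it against the real data, the lines of win10_driver-startup-failed.log would be passed in place of `SAMPLE`; `len(per_host)` then gives the unique cartridge count directly.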
Comment 20•7 years ago
HPE Support's first response is to apply their full driver update package:
https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1009020644&swItemId=MTX_fbd71e2c6e034bc1a1170032cd&swEnvOid=4184#tab1
Operating System(s): Microsoft Windows 10 (64-bit)
File name: mwdp_m710x_20170508.zip (521 MB)
The release notes include:
"Updated Mellanox driver to WinOF version 5.35 on all the other supported OS."
I've asked for further investigation and if this is a known fixed problem (in the updated driver).
Comment 21•7 years ago
My request for support from Mellanox was denied because we do not have a premium support contract. My request was based on the serial number f4-03-43-03-00-df-4d-31 (taken from moon-chassis-1 c1n1, t-linux64-ms-001, as I can see the serial number through lspci).
HPE's support technician assigned to my case has asked me 4 times to install the driver and test if that fixes the problem. Each time, I have asked for more investigation and for them to inquire about the problem with Mellanox. The agent stated that "The driver upgrade has fixed the issue in some of the old cases." After my fourth request, the agent has stated that they cannot contact Mellanox because they are the hardware break/fix team. I have asked for the issue to be transferred to a team that can contact Mellanox.
Comment 22•7 years ago
Q, would it be reasonable to test the new driver to see if it eliminates the network problem? If so, could you test it?
https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1009020644&swItemId=MTX_fbd71e2c6e034bc1a1170032cd&swEnvOid=4184#tab1
Flags: needinfo?(q)
Comment 23•7 years ago
I think we can give that a go and see if it helps. We would need to do it on a number of machines to catch the issue in any timely fashion.
Flags: needinfo?(q)
Comment 24•7 years ago
(In reply to Dave House [:dhouse] from comment #22)
> Q, would it be reasonable to test the new driver to see if it eliminates the
> network problem? If so, could you test it?
>
> https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1009020644&swItemId=MTX_fbd71e2c6e034bc1a1170032cd&swEnvOid=4184#tab1
Q, could you run the "HPS Report" utility on one of the machines that has seen the problem? (I keep selecting T-W1064-MS-042.wintest.releng.mdc1.mozilla.com, moon-chassis-1 c42n1, mdc1)
http://hpsreports.glb.itcs.hpe.com/HPSreports/
For escalation, the HPE agent has asked that I provide the HPS report for one of the machines. I've tried to get in through rdp and ssh, and had no success.
Comment 25•6 years ago
After updating network drivers we are now seeing these errors:
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Mellanox ConnectX-3 PRO VPI (MT04103) Network Adapter (PCI bus 14, device 0, function 0): SR-IOV cannot be enabled because FW does not support SR-IOV. In order to resolve this issue please re-burn FW, having added parameters related to SR-IOV support.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Native_14_0_0: Execution of FW command failed. op 0x34, status 0xff, errno -110, token 0xffff, in_modifier 0, op_modifier 0, in_param 465ae000.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Native_14_0_0: Driver startup failed because the hca could not be initialized.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Mellanox ConnectX-3 PRO VPI (MT04103) Network Adapter (PCI bus 14, device 0, function 0): SR-IOV cannot be enabled because FW does not support SR-IOV. In order to resolve this issue please re-burn FW, having added parameters related to SR-IOV support.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Port type registry value for device Native_14_0_0 could not be modified to value (PortType = none,auto). Previous value will be set.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Native_14_0_0: EXT_QP_MAX_RETRY_LIMIT/EXT_QP_MAX_RETRY_PERIOD registry keys were requested by user but FW does not support this feature. Please upgrade your firmware to support it. For more details, please refer to WinOF User Manual.#015
Comment 26•6 years ago
Been 6 months. Anything actionable here?
Comment 27•6 years ago
mark, dave, q, where are we with this issue?
Flags: needinfo?(q)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(dhouse)
Comment 28•6 years ago
Mark, had you tested turning off SR-IOV in the BIOS? I was thinking that (in October?) you had found this error disappeared when SR-IOV was turned off and then had CIDuty turn it off on all the machines.
Maybe my memory is wrong, however, as I could not find anything about this when searching IRC logs.
Flags: needinfo?(dhouse)
Comment 29•6 years ago
This was addressed here in Bug 1499754.
Updated•6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED