Closed Bug 1475258 Opened 7 years ago Closed 6 years ago

Intermittent DNS resolution errors

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fubar, Unassigned)

References

Details

In MDC1 and MDC2 on the w10 moonshots, we're seeing intermittent DNS resolution errors, both for internal and external sites:
“dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.”
“NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,”
For a while, we were also running into an issue where the DNS search data from DHCP wasn't coming through, though I think that's stopped. This may be an issue with the moonshots, but I'd like to know if there's anything on the infoblox side that's showing any problems, or if there are any missing configuration settings that might contribute to this.
Can you please provide source addresses of the hosts experiencing these failures?
I'd love to do some traffic captures if there is any way to replicate.
cknowles reminds me to be good and add timestamps and hostnames (and mid-air'ed with the above!):
Jul 11 02:39:17 T-W1064-MS-214.mdc1.mozilla.com generic-worker: 2018/07/11 09:32:35 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 03:27:35 T-W1064-MS-269.mdc1.mozilla.com generic-worker: 2018/07/11 09:44:15 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 05:06:32 T-W1064-MS-212.mdc1.mozilla.com generic-worker: 2018/07/11 11:11:02 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 05:08:07 T-W1064-MS-213.mdc1.mozilla.com generic-worker: 2018/07/11 12:05:47 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 05:18:39 T-W1064-MS-207.mdc1.mozilla.com generic-worker: 2018/07/11 11:15:09 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Jul 11 05:48:25 T-W1064-MS-255.mdc1.mozilla.com generic-worker: 2018/07/11 11:45:06 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
and...
Jul 11 05:48:25 T-W1064-MS-255.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:30 T-W1064-MS-029.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:41 T-W1064-MS-016.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:48 T-W1064-MS-034.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:49 T-W1064-MS-017.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:48:49 T-W1064-MS-295.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:53:52 T-W1064-MS-129.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
Jul 11 05:53:54 T-W1064-MS-169.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)#015
I can come up with plenty more, papertrail is full of them.
If I start a traffic capture at any time and have it run for 5 minutes, is it safe to assume there will be some errors?
The TC lookup error is less frequent, but the NTP-related one looks like it's a pretty steady stream, and 5-10 minutes should definitely catch some.
I ended up accidentally capturing 30 minutes' worth. Going to start digging in.
My previous traffic capture was for MDC2; I'm going to do a lengthy one for MDC1 now. I thought that some of these hosts were in MDC2. :fubar, if you could provide some logs of failing hosts in MDC2 that I could use to correlate, that would be helpful as well.
I took about a 10 minute capture of MDC1. Working on analyzing it now.
The following filter seems to be a good way to find unresolved queries: dns && ip.dst_host == "10.49.40.166" && dns.flags == 0x8583
The filter dns && ip.dst_host == "10.49.40.166" && dns.flags > 0x8580 seems to show more errors.
After much digging and discussion with :marko, :cknowles, and :digi, I'm seeing dynamic updates failing in the logs. Then the host seems to fall off the network for a while. Then the host seems to figure things out and update logs to papertrail. The error returned from InfoBlox is: "Flags: 0xa885 Dynamic update response, Refused". Looking at the logs inside of papertrail, the timestamp of when a log was received does not match the timestamp of when the event occurred.
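For reference, the flag values in the filters and in the InfoBlox response can be unpacked by hand. The following is a minimal sketch of a decoder for the 16-bit DNS header flags field, assuming the standard RFC 1035 bit layout. The function name `decode_dns_flags` and the lookup tables are hypothetical helpers for illustration and only cover the codes relevant here.

```python
# Illustrative decoder for the 16-bit DNS header flags field (RFC 1035 layout).
# Names and tables below are assumptions for this sketch, not part of any tool.

OPCODES = {0: "Query", 4: "Notify", 5: "Update"}
RCODES = {
    0: "NoError",
    1: "FormErr",
    2: "ServFail",
    3: "NXDomain",
    4: "NotImp",
    5: "Refused",
}


def decode_dns_flags(flags):
    """Split a raw flags value (e.g. 0x8583) into its named parts."""
    opcode = (flags >> 11) & 0xF  # bits 14..11
    rcode = flags & 0xF           # bits 3..0
    return {
        "response": bool(flags & 0x8000),
        "opcode": OPCODES.get(opcode, f"opcode {opcode}"),
        # For Update messages, RFC 2136 treats the next four bit positions
        # as reserved, so only opcode and rcode are meaningful there.
        "authoritative": bool(flags & 0x0400),
        "truncated": bool(flags & 0x0200),
        "recursion_desired": bool(flags & 0x0100),
        "recursion_available": bool(flags & 0x0080),
        "rcode": RCODES.get(rcode, f"rcode {rcode}"),
    }


if __name__ == "__main__":
    for value in (0x8583, 0xA885):
        print(hex(value), decode_dns_flags(value))
```

Under this decoding, 0x8583 is an authoritative query response with rcode NXDomain, and 0xa885 is an Update response with rcode Refused, which agrees with the "Dynamic update response, Refused" label. Because `dns.flags > 0x8580` compares the whole field numerically, it would presumably also match values such as 0x8582 (ServFail) and 0xa885, which may be why it shows more errors.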
Thanks for the digging! I know Q had noticed last week that we were trying to do dynamic dns and was going to look into disabling it, so that's something. > Then the host seems to fall off the network for a while. Magic 8-ball says... next stop, looking at the chassis switches?
There are two timestamps: one from papertrail and one from generic-worker.
Jul 12 11:56:41 T-W1064-MS-177.mdc1.mozilla.com generic-worker: 2018/07/12 17:52:51 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known
The generic-worker timestamp is 17:52:51 UTC (the node is set to UTC), which would correlate to 10:52:51 PDT in papertrail. However, around the time of the generic-worker timestamp the node is not reporting to papertrail:
Jul 12 10:51:32 T-W1064-MS-177.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-177) has initiated the restart of computer T-W1064-MS-177 on behalf of user T-W1064-MS-177\GenericWorker for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully#015
Jul 12 10:51:35 T-W1064-MS-177.mdc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly. It has done this 1 time(s).#015
Jul 12 11:56:41 T-W1064-MS-177.mdc1.mozilla.com Microsoft-Windows-WinRM: The WinRM service is not listening for WS-Management requests. User Action If you did not intentionally stop the service, use the following command to see the WinRM configuration: winrm enumerate winrm/config/listener#015
It looks like this particular node was offline for that hour. It seems like the nodes are going offline but still trying to perform tasks. Then, once they are back online, they send older logs to papertrail. I am going to dive into the event logs and try to figure out what is happening while a node is offline. These do seem to be close in time to the "Flags: 0xa885 Dynamic update response, Refused" logs.
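The timestamp comparison in this comment can be reproduced with a small script. This is only a sketch: it assumes papertrail displays times in PDT as a fixed UTC-7 offset, that the syslog-style prefix belongs to 2018, and that month abbreviations parse under an English locale. The helper names `worker_time_to_pdt` and `delivery_lag` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: papertrail shows PDT, modelled here as a fixed UTC-7 offset.
PDT = timezone(timedelta(hours=-7), "PDT")


def worker_time_to_pdt(stamp):
    """Convert a generic-worker timestamp (UTC, 'YYYY/MM/DD HH:MM:SS') to PDT."""
    utc = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return utc.astimezone(PDT)


def delivery_lag(received_pdt, worker_utc, year=2018):
    """Time between the event on the node and its arrival in papertrail.

    received_pdt is the syslog-style prefix, e.g. 'Jul 12 11:56:41'; it has
    no year, so the year is passed in separately (assumed 2018 here).
    """
    received = datetime.strptime(
        f"{year} {received_pdt}", "%Y %b %d %H:%M:%S"
    ).replace(tzinfo=PDT)
    return received - worker_time_to_pdt(worker_utc)


if __name__ == "__main__":
    print(worker_time_to_pdt("2018/07/12 17:52:51").strftime("%H:%M:%S %Z"))
    print(delivery_lag("Jul 12 11:56:41", "2018/07/12 17:52:51"))
```

For the T-W1064-MS-177 line, the event time converts to 10:52:51 PDT and the lag comes out to a little over an hour, consistent with the node being offline for roughly that long before flushing its older logs.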
On the nodes that have the getaddrinfow error, the message below appears in the event viewer right at the beginning of the period of no connectivity:
SingleFunc_14_0_0: Driver startup failed because the hca could not be initialized.
Interestingly, ms-042 rebooted and never came back. I had to do a cold boot to bring it back online:
Jul 12 15:26:26 T-W1064-MS-042.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-042) has initiated the restart of computer T-W1064-MS-042 on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully#015
Jul 12 15:26:30 T-W1064-MS-042.mdc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly. It has done this 1 time(s).#015
The same message appeared in the event viewer at 20:27 UTC/15:27 PDT.
Every node in papertrail that has logged "SingleFunc_14_0_0: Driver startup failed because the hca could not be initialized." is having similar issues.
Assignee: infra → relops
Component: Infrastructure: DNS → RelOps
QA Contact: cshields → klibby
(In reply to Mark Cornmesser [:markco] from comment #15)
> Every node in papertrail that has logged "SingleFunc_14_0_0: Driver
> startup failed because the hca could not be initialized." is having similar
> issues.
I've created a support case with HPE (added Mark as secondary contact): case id #5330960870
Linked to the specific cartridge for the example moon-chassis-1 c42n1 T-W1064-MS-042:
Product serial number: CN67200DWH
Product number: 833105-B21
https://support.hpe.com/portal/site/hpsc/scm/caseDetails?caseID=5330960870
Description:
```
Operating system/version: Windows 10 x64 build 1703
Product: HPE ProLiant m710x Server Cartridge
Product vendor:
Problem description:
On the moonshot cartridges, we are seeing the following errors at boot in Windows 10. When these errors are reported, there is no network connection available. This problem has repeated on about 20 different cartridges, and the appearance has increased with time.
These errors are found in the eventlog (after recovered with coldboot, we can view the eventlog for the boot when there was no network):
SingleFunc_14_0_0: Driver startup failed because the hca could not be initialized.
SingleFunc_14_0_0: Port #1 is configured to IB. Since IB is not supported in this device, it will automatically be configured to Ethernet instead. Check PortType registry key.#015
We are using Mellanox driver version 5.1.11548.0
Troubleshooting steps taken:
To fix the problem, we can cold-boot the cartridge. It may take multiple tries cold-booting and in some cases we have re-installed Windows and when that completes the problem does not appear on the new boot.
```
Looking at the "hca could not be init.." error in the last two weeks, in papertrail, it has happened on 150 unique cartridges and only in MDC1:
```
$ papertrail --force-color --group MDC1 --min-time '3 weeks ago' --max-time 'now' "Driver startup failed because the hca could not be initialized." | tee win10_driver-startup-failed.log
$ grep -o "Jul [0-9]\+" win10_driver-startup-failed.log |LC_ALL=C sort | LC_ALL=C uniq -c
10 Jul 06
14 Jul 07
22 Jul 08
32 Jul 09
106 Jul 10
117 Jul 11
87 Jul 12
49 Jul 13
[david@george relops-infra]$ grep -o "T-W.*\.com" win10_driver-startup-failed.log | sort | uniq -c
2 T-W1064-MS-016.mdc1.mozilla.com
5 T-W1064-MS-017.mdc1.mozilla.com
21 T-W1064-MS-019.mdc1.mozilla.com
9 T-W1064-MS-021.mdc1.mozilla.com
1 T-W1064-MS-022.mdc1.mozilla.com
1 T-W1064-MS-024.mdc1.mozilla.com
1 T-W1064-MS-025.mdc1.mozilla.com
6 T-W1064-MS-026.mdc1.mozilla.com
1 T-W1064-MS-027.mdc1.mozilla.com
6 T-W1064-MS-028.mdc1.mozilla.com
9 T-W1064-MS-029.mdc1.mozilla.com
4 T-W1064-MS-030.mdc1.mozilla.com
2 T-W1064-MS-031.mdc1.mozilla.com
1 T-W1064-MS-033.mdc1.mozilla.com
9 T-W1064-MS-034.mdc1.mozilla.com
2 T-W1064-MS-036.mdc1.mozilla.com
1 T-W1064-MS-037.mdc1.mozilla.com
9 T-W1064-MS-038.mdc1.mozilla.com
5 T-W1064-MS-039.mdc1.mozilla.com
2 T-W1064-MS-041.mdc1.mozilla.com
7 T-W1064-MS-042.mdc1.mozilla.com
4 T-W1064-MS-043.mdc1.mozilla.com
8 T-W1064-MS-044.mdc1.mozilla.com
1 T-W1064-MS-061.mdc1.mozilla.com
2 T-W1064-MS-063.mdc1.mozilla.com
3 T-W1064-MS-064.mdc1.mozilla.com
3 T-W1064-MS-066.mdc1.mozilla.com
1 T-W1064-MS-068.mdc1.mozilla.com
2 T-W1064-MS-070.mdc1.mozilla.com
2 T-W1064-MS-074.mdc1.mozilla.com
2 T-W1064-MS-075.mdc1.mozilla.com
1 T-W1064-MS-076.mdc1.mozilla.com
1 T-W1064-MS-077.mdc1.mozilla.com
3 T-W1064-MS-078.mdc1.mozilla.com
1 T-W1064-MS-081.mdc1.mozilla.com
1 T-W1064-MS-082.mdc1.mozilla.com
1 T-W1064-MS-083.mdc1.mozilla.com
1 T-W1064-MS-084.mdc1.mozilla.com
1 T-W1064-MS-085.mdc1.mozilla.com
1 T-W1064-MS-087.mdc1.mozilla.com
1 T-W1064-MS-088.mdc1.mozilla.com
1 T-W1064-MS-089.mdc1.mozilla.com
1 T-W1064-MS-090.mdc1.mozilla.com
1 T-W1064-MS-106.mdc1.mozilla.com
1 T-W1064-MS-108.mdc1.mozilla.com
5 T-W1064-MS-110.mdc1.mozilla.com
3 T-W1064-MS-111.mdc1.mozilla.com
2 T-W1064-MS-112.mdc1.mozilla.com
1 T-W1064-MS-116.mdc1.mozilla.com
1 T-W1064-MS-118.mdc1.mozilla.com
2 T-W1064-MS-119.mdc1.mozilla.com
1 T-W1064-MS-120.mdc1.mozilla.com
2 T-W1064-MS-121.mdc1.mozilla.com
1 T-W1064-MS-123.mdc1.mozilla.com
1 T-W1064-MS-124.mdc1.mozilla.com
2 T-W1064-MS-125.mdc1.mozilla.com
1 T-W1064-MS-127.mdc1.mozilla.com
1 T-W1064-MS-128.mdc1.mozilla.com
2 T-W1064-MS-129.mdc1.mozilla.com
1 T-W1064-MS-131.mdc1.mozilla.com
2 T-W1064-MS-134.mdc1.mozilla.com
3 T-W1064-MS-135.mdc1.mozilla.com
1 T-W1064-MS-152.mdc1.mozilla.com
3 T-W1064-MS-153.mdc1.mozilla.com
5 T-W1064-MS-154.mdc1.mozilla.com
3 T-W1064-MS-155.mdc1.mozilla.com
2 T-W1064-MS-156.mdc1.mozilla.com
1 T-W1064-MS-157.mdc1.mozilla.com
1 T-W1064-MS-158.mdc1.mozilla.com
2 T-W1064-MS-159.mdc1.mozilla.com
1 T-W1064-MS-162.mdc1.mozilla.com
1 T-W1064-MS-164.mdc1.mozilla.com
1 T-W1064-MS-167.mdc1.mozilla.com
3 T-W1064-MS-168.mdc1.mozilla.com
1 T-W1064-MS-169.mdc1.mozilla.com
3 T-W1064-MS-172.mdc1.mozilla.com
4 T-W1064-MS-173.mdc1.mozilla.com
2 T-W1064-MS-174.mdc1.mozilla.com
3 T-W1064-MS-175.mdc1.mozilla.com
4 T-W1064-MS-176.mdc1.mozilla.com
3 T-W1064-MS-177.mdc1.mozilla.com
4 T-W1064-MS-179.mdc1.mozilla.com
2 T-W1064-MS-196.mdc1.mozilla.com
2 T-W1064-MS-197.mdc1.mozilla.com
2 T-W1064-MS-198.mdc1.mozilla.com
1 T-W1064-MS-199.mdc1.mozilla.com
7 T-W1064-MS-200.mdc1.mozilla.com
1 T-W1064-MS-201.mdc1.mozilla.com
3 T-W1064-MS-202.mdc1.mozilla.com
2 T-W1064-MS-203.mdc1.mozilla.com
3 T-W1064-MS-204.mdc1.mozilla.com
2 T-W1064-MS-205.mdc1.mozilla.com
6 T-W1064-MS-206.mdc1.mozilla.com
6 T-W1064-MS-207.mdc1.mozilla.com
5 T-W1064-MS-208.mdc1.mozilla.com
2 T-W1064-MS-211.mdc1.mozilla.com
3 T-W1064-MS-212.mdc1.mozilla.com
2 T-W1064-MS-213.mdc1.mozilla.com
3 T-W1064-MS-214.mdc1.mozilla.com
6 T-W1064-MS-215.mdc1.mozilla.com
5 T-W1064-MS-216.mdc1.mozilla.com
4 T-W1064-MS-217.mdc1.mozilla.com
1 T-W1064-MS-218.mdc1.mozilla.com
4 T-W1064-MS-219.mdc1.mozilla.com
2 T-W1064-MS-220.mdc1.mozilla.com
2 T-W1064-MS-221.mdc1.mozilla.com
4 T-W1064-MS-223.mdc1.mozilla.com
2 T-W1064-MS-224.mdc1.mozilla.com
3 T-W1064-MS-225.mdc1.mozilla.com
2 T-W1064-MS-241.mdc1.mozilla.com
2 T-W1064-MS-242.mdc1.mozilla.com
4 T-W1064-MS-243.mdc1.mozilla.com
2 T-W1064-MS-244.mdc1.mozilla.com
1 T-W1064-MS-245.mdc1.mozilla.com
3 T-W1064-MS-246.mdc1.mozilla.com
1 T-W1064-MS-247.mdc1.mozilla.com
1 T-W1064-MS-248.mdc1.mozilla.com
3 T-W1064-MS-249.mdc1.mozilla.com
2 T-W1064-MS-250.mdc1.mozilla.com
2 T-W1064-MS-251.mdc1.mozilla.com
2 T-W1064-MS-252.mdc1.mozilla.com
2 T-W1064-MS-253.mdc1.mozilla.com
20 T-W1064-MS-254.mdc1.mozilla.com
5 T-W1064-MS-255.mdc1.mozilla.com
1 T-W1064-MS-256.mdc1.mozilla.com
2 T-W1064-MS-257.mdc1.mozilla.com
3 T-W1064-MS-258.mdc1.mozilla.com
2 T-W1064-MS-259.mdc1.mozilla.com
3 T-W1064-MS-262.mdc1.mozilla.com
2 T-W1064-MS-263.mdc1.mozilla.com
3 T-W1064-MS-264.mdc1.mozilla.com
6 T-W1064-MS-265.mdc1.mozilla.com
4 T-W1064-MS-266.mdc1.mozilla.com
3 T-W1064-MS-267.mdc1.mozilla.com
3 T-W1064-MS-268.mdc1.mozilla.com
4 T-W1064-MS-269.mdc1.mozilla.com
2 T-W1064-MS-270.mdc1.mozilla.com
5 T-W1064-MS-283.mdc1.mozilla.com
3 T-W1064-MS-284.mdc1.mozilla.com
2 T-W1064-MS-285.mdc1.mozilla.com
2 T-W1064-MS-286.mdc1.mozilla.com
1 T-W1064-MS-287.mdc1.mozilla.com
4 T-W1064-MS-290.mdc1.mozilla.com
4 T-W1064-MS-291.mdc1.mozilla.com
2 T-W1064-MS-292.mdc1.mozilla.com
1 T-W1064-MS-293.mdc1.mozilla.com
4 T-W1064-MS-294.mdc1.mozilla.com
2 T-W1064-MS-295.mdc1.mozilla.com
3 T-W1064-MS-297.mdc1.mozilla.com
2 T-W1064-MS-298.mdc1.mozilla.com
```
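The per-day and per-host tallies produced by the grep/sort/uniq pipeline can also be computed in one pass, which makes it easy to get the number of unique cartridges directly. This is a rough sketch, not the commands that were actually run: the function name `summarize` is hypothetical, the sample lines are made up in the shape of the papertrail output, and the patterns search anywhere in the line because a log saved with `--force-color` may start with terminal colour codes. Unlike `grep -o`, it counts at most one match per line.

```python
import re
from collections import Counter

# Month abbreviation followed by a day of month, found anywhere in the line.
DAY_RE = re.compile(r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}")
# Hostnames shaped like the ones in this bug, e.g. T-W1064-MS-042.mdc1.mozilla.com
HOST_RE = re.compile(r"T-W[\w-]+\.mdc\d\.mozilla\.com")


def summarize(lines):
    """Return (events per day, events per host) for an iterable of log lines."""
    per_day = Counter()
    per_host = Counter()
    for line in lines:
        day = DAY_RE.search(line)
        host = HOST_RE.search(line)
        if day:
            per_day[day.group(0)] += 1
        if host:
            per_host[host.group(0)] += 1
    return per_day, per_host


# Illustrative sample only; these are not real log lines from papertrail.
SAMPLE = [
    "Jul 12 15:27:01 T-W1064-MS-042.mdc1.mozilla.com Driver startup failed because the hca could not be initialized.",
    "Jul 12 16:02:44 T-W1064-MS-177.mdc1.mozilla.com Driver startup failed because the hca could not be initialized.",
    "\x1b[33mJul 13 01:10:09 T-W1064-MS-042.mdc1.mozilla.com Driver startup failed because the hca could not be initialized.",
]

if __name__ == "__main__":
    days, hosts = summarize(SAMPLE)
    print(dict(days))
    print(len(hosts), "unique hosts")
```

To run it against the saved file, something like `summarize(open("win10_driver-startup-failed.log", errors="replace"))` should work, with `len(per_host)` giving the unique cartridge count.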
HPE Support's first response is to apply their full driver update package: https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1009020644&swItemId=MTX_fbd71e2c6e034bc1a1170032cd&swEnvOid=4184#tab1
Operating System(s): Microsoft Windows 10 (64-bit)
File name: mwdp_m710x_20170508.zip (521 MB)
The release notes include: "Updated Mellanox driver to WinOF version 5.35 on all the other supported OS."
I've asked for further investigation and whether this is a known, fixed problem (in the updated driver).
My request for support from Mellanox was denied because we do not have a premium support contract. My request was based on the serial number f4-03-43-03-00-df-4d-31 (taken from moon-chassis-1 c1n1, t-linux64-ms-001, as I can see the serial number through lspci). HPE's support technician assigned to my case has asked me 4 times to install the driver and test if that fixes the problem. Each time, I have asked for more investigation and for them to inquire about the problem with Mellanox. The agent stated that "The driver upgrade has fixed the issue in some of the old cases." After my fourth request, the agent has stated that they cannot contact Mellanox because they are the hardware break/fix team. I have asked for the issue to be transferred to a team that can contact Mellanox.
Q, would it be reasonable to test the new driver to see if it eliminates the network problem? If so, could you test it? https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1009020644&swItemId=MTX_fbd71e2c6e034bc1a1170032cd&swEnvOid=4184#tab1
Flags: needinfo?(q)
I think we can give that a go and see if it helps. We would need to do it on a number of machines to catch the issue in any timely fashion.
Flags: needinfo?(q)
(In reply to Dave House [:dhouse] from comment #22)
> Q, would it be reasonable to test the new driver to see if it eliminates the
> network problem? If so, could you test it?
>
> https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1009020644&swItemId=MTX_fbd71e2c6e034bc1a1170032cd&swEnvOid=4184#tab1
Q, could you run the "HPS Report" utility on one of the machines that has seen the problem? (I keep selecting T-W1064-MS-042.wintest.releng.mdc1.mozilla.com, moon-chassis-1 c42n1, mdc1)
http://hpsreports.glb.itcs.hpe.com/HPSreports/
For escalation, the HPE agent has asked that I provide the HPS report for one of the machines. I've tried to get in through rdp and ssh, and had no success.
Flags: needinfo?(q)
After updating network drivers we are now seeing these errors:
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Mellanox ConnectX-3 PRO VPI (MT04103) Network Adapter (PCI bus 14, device 0, function 0): SR-IOV cannot be enabled because FW does not support SR-IOV. In order to resolve this issue please re-burn FW, having added parameters related to SR-IOV support.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Native_14_0_0: Execution of FW command failed. op 0x34, status 0xff, errno -110, token 0xffff, in_modifier 0, op_modifier 0, in_param 465ae000.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Native_14_0_0: Driver startup failed because the hca could not be initialized.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Mellanox ConnectX-3 PRO VPI (MT04103) Network Adapter (PCI bus 14, device 0, function 0): SR-IOV cannot be enabled because FW does not support SR-IOV. In order to resolve this issue please re-burn FW, having added parameters related to SR-IOV support.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Port type registry value for device Native_14_0_0 could not be modified to value (PortType = none,auto). Previous value will be set.#015
Aug 02 15:59:22 T-W1064-MS-022.mdc1.mozilla.com mlx4_bus: Native_14_0_0: EXT_QP_MAX_RETRY_LIMIT/EXT_QP_MAX_RETRY_PERIOD registry keys were requested by user but FW does not support this feature. Please upgrade your firmware to support it. For more details, please refer to WinOF User Manual.#015
Flags: needinfo?(q)

Been 6 months. Anything actionable here?

mark, dave, q, where are we with this issue?

Flags: needinfo?(q)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(dhouse)

Mark, had you tested turning off SR-IOV in the bios? I was thinking that (in October?) you had found this error disappeared when SR-IOV was turned off and then had CIDuty turn it off on all the machines.

Maybe my memory is wrong, however, as I could not find anything about this when searching IRC logs.

Flags: needinfo?(dhouse)

This was addressed here in Bug 1499754.

Depends on: 1499754
Flags: needinfo?(q)
Flags: needinfo?(mcornmesser)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED