Closed
Bug 1499754
Opened 7 years ago
Closed 7 years ago
Investigate mlx4_bus "SR-IOV cannot be enabled because FW does not support" error from Moonshots Win 10 nodes
Categories
(Infrastructure & Operations :: RelOps: Windows OS, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: markco, Assigned: markco)
References
Details
Oct 17 08:43:35 T-W1064-MS-035.mdc1.mozilla.com mlx4_bus: Mellanox ConnectX-3 PRO VPI (MT04103) Network Adapter (PCI bus 14, device 0, function 0): SR-IOV cannot be enabled because FW does not support SR-IOV. In order to resolve this issue please re-burn FW, having added parameters related to SR-IOV support.#015
Oct 17 08:43:35 T-W1064-MS-035.mdc1.mozilla.com mlx4_bus: Port type registry value for device Native_14_0_0 could not be modified to value (PortType = none,auto). Previous value will be set.#015
I was hoping that this was going to go away post firmware upgrade. However, it persists.
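For anyone triaging these entries in bulk, here is a minimal Python sketch that splits a line like the two above into fields and flags the SR-IOV firmware message. It assumes the plain "month day time host program: message" layout seen in the quoted lines, and it treats the trailing `#015` as an escaped carriage return added by the log pipeline. The names `parse_line` and `is_sriov_firmware_error` are made up for illustration.

```python
import re

# Layout assumed from the quoted lines: "<Mon> <day> <HH:MM:SS> <host> <program>: <message>"
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<program>[^:\s]+): (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = SYSLOG_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    message = fields["message"]
    # Drop the trailing "#015" (escaped carriage return), if present.
    if message.endswith("#015"):
        message = message[: -len("#015")]
    fields["message"] = message.rstrip()
    return fields


def is_sriov_firmware_error(fields):
    """True when the parsed line is the mlx4_bus SR-IOV firmware message."""
    return (
        fields["program"] == "mlx4_bus"
        and "SR-IOV cannot be enabled" in fields["message"]
    )
```

Calling `parse_line` on the second quoted line should give the host `T-W1064-MS-035.mdc1.mozilla.com` and a message without the `#015` suffix, and `is_sriov_firmware_error` should return False for it because it is the port type message rather than the SR-IOV one.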
Updated•7 years ago
Assignee: nobody → mcornmesser
Comment 1•7 years ago
Dhouse: Is there a support case for this issue already?
Flags: needinfo?(dhouse)
(In reply to Mark Cornmesser [:markco] from comment #1)
> Dhouse: Is there a support case for this issue already?
I don't see an existing support case for this issue. I'll create one.
I opened case# 5333373066 with HPE Support for this error message.
Flags: needinfo?(dhouse)
HPE Support quickly contacted me about this and discussed the problem. We verified that SR-IOV is set in the BIOS for the Mellanox NIC #1. We tried turning it off, and then back on after a reboot, but that did not change the error (https://papertrailapp.com/systems/2224383012/events?focus=989329252077953042&selected=989329252077953042).
The support agent will research the problem further and review the show-all log for one of the chassis (moon-chassis-1 as I had tested the bios setting on cartridge #35, T-W1064-MS-035, matching the comment#1 log).
I also checked in Papertrail and saw that this error appears on all of our cartridges running Windows, and that it appears at each boot.
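A check like this could also be scripted against an exported log file. The sketch below is only an illustration: it assumes plain-text lines where the hostname is the fourth whitespace-separated field, as in the lines quoted in the description, and `count_errors_by_host` is a hypothetical helper, not a Papertrail feature.

```python
from collections import Counter

# Text taken from the mlx4_bus message quoted in the description.
ERROR_TEXT = "SR-IOV cannot be enabled because FW does not support SR-IOV"


def count_errors_by_host(lines):
    """Count how many times each host logged the SR-IOV firmware message."""
    counts = Counter()
    for line in lines:
        if ERROR_TEXT not in line:
            continue
        parts = line.split()
        # Assumed layout: month, day, time, host, program, message...
        if len(parts) > 3:
            counts[parts[3]] += 1
    return counts
```

A host that is absent from the result never logged the message in the exported window, and a count that matches the number of boots would be consistent with the error appearing at each boot.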
Mark, the HPE support agent contacted me again; he found that the m710x does not support SR-IOV. He suggested we turn it off in the BIOS and in the driver (I found one example for turning it on: https://community.mellanox.com/docs/DOC-2242).
Would you like me to turn it off in the BIOS on a set of machines?
Flags: needinfo?(mcornmesser)
Comment 6•7 years ago
I will take a deeper look into this tomorrow. It is odd because we don't explicitly enable it anywhere.
Flags: needinfo?(mcornmesser)
Comment 7•7 years ago
Hyper-V, mentioned in the link in comment 5, is not enabled on these nodes.
The error persists when Virtualization Technology is disabled in the BIOS.
I also spent some time looking into the Mellanox registry entries, which are located here: HKEY_LOCAL_MACHINE -> SYSTEM -> CurrentControlSet -> Control -> Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0001. There is a value for SR-IOV, which was set to 1. I changed the data to 0 and rebooted, but the error persisted.
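For reference, the registry location above can be read from a script. This is a minimal Python sketch using the standard `winreg` module. The subkey path comes from this comment, but the value name is not given here, so the caller has to supply it. `*SRIOV` (the NDIS standardized keyword) is only a guess at what the driver uses, and the `0001` instance may differ per node.

```python
import sys

# Network adapter device class GUID, as quoted in this comment.
NET_CLASS_GUID = "{4d36e972-e325-11ce-bfc1-08002be10318}"


def adapter_key_path(instance="0001"):
    """Build the subkey path under HKEY_LOCAL_MACHINE for one adapter instance."""
    return "\\".join(
        ["SYSTEM", "CurrentControlSet", "Control", "Class", NET_CLASS_GUID, instance]
    )


def read_adapter_value(value_name, instance="0001"):
    """Read one value from the adapter's key. Works on Windows only."""
    if sys.platform != "win32":
        raise OSError("registry access requires Windows")
    import winreg  # imported here so the module still loads on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, adapter_key_path(instance)) as key:
        data, _value_type = winreg.QueryValueEx(key, value_name)
    return data
```

On a node, something like `read_adapter_value("*SRIOV")` would return the current data if that value name is the right one. If it is not, `QueryValueEx` raises `FileNotFoundError`, which is itself a useful hint that the name needs checking in regedit.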
I also deployed a node without the Mellanox drivers updated. This only caused additional network connection issues.
I am going to continue focusing on the registry level to see if I can find a way to turn this off. My concern is that this may be part of some of the network issues we are seeing on the Moonshots, for instance when a node reboots but does not talk to the network.
Hi Mark,
HPE Support has also suggested upgrading the Mellanox driver. Here is their latest note from Friday:
```
If possible could you kindly update the driver on the Mellanox Card and update us the status.
Moonshot Windows Deployment Pack for HPE ProLiant m710x Servers - Version:2017.05.08 (21 Jun 2017) - https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_fbd71e2c6e034bc1a1170032cd#tab3
Also if Hyper V feature is being used on the Cartridge then kindly follow the below Mellanox document and update us the status.
https://community.mellanox.com/docs/DOC-2242
Is the issue only with the Cartridge running Windows or the issue is also with the Cartridge running Linux
```
Comment 9•7 years ago
In System Utilities, under System Configuration -> Embedded LOM 1 Port 1 :..... -> Device Level Configuration, Virtualization Mode was set to SR-IOV. This setting is specifically for virtualization and allows a single device to be seen as many.
I spun up a test pool of ms-022 through ms-025 with this setting set to none. The nodes installed and passed several tests from the try push below. I think we should go through and change this setting on all the nodes.
Jmaher: Could you take a look at these tests and see if there is anything worrisome with them, please?
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7c263974f81a1d0aea696fc80c3320454f98e43
Flags: needinfo?(jmaher)
Comment 10•7 years ago
no errors here and the noise appears to be a little bit lower, thanks for checking.
Flags: needinfo?(jmaher)
Comment 11•7 years ago
Hello Mark,
We managed to change the BIOS configuration for chassis one. We re-imaged each machine after that. We are going further with chassis two.
We also configured the BIOS on a few machines that were missing from TC, before re-imaging them: 022, 023, 024, 025, 028, 029, 030, 040, 078, 114, 131, 153, 208, 264, 372, 430, 453, 461, 523, 557
To make our work easy to track, we use the following inventory [1].
You can check the Notes column to see whether the BIOS configuration is done (cell empty) or not (cell contains the required action).
[1] https://docs.google.com/spreadsheets/d/1ELwhJ5VELEN56VVBE-asGiDPaRLPxlDrlx8hZ_U_Dtk/edit#gid=562893333
We will let you know when the next chassis is done.
Comment 12•7 years ago
Updates:
Changed the BIOS configuration on the following machines:
* chassis 2: -MS-{061..078}
* another bunch of missing servers: 022, 061, 078, 124, 153, 157, 171, 243, 250, 267, 327, 341, 344, 519
Comment 13•7 years ago
Updates:
Changed the BIOS configuration and re-imaged the following machines:
T-W1064-MS-{079..168}
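The `{079..168}` notation used in these updates is shell-style brace expansion. If the full hostname list is needed outside a shell, for example to cross-check the inventory, a small Python sketch like the following could expand it. The `expand` helper is hypothetical and handles only a single zero-padded numeric range.

```python
import re

# Matches one "{start..end}" numeric range, e.g. "{079..168}".
RANGE_RE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand(pattern):
    """Expand a single {start..end} range, keeping the zero padding of start."""
    match = RANGE_RE.search(pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        prefix + str(number).zfill(width) + suffix
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand("T-W1064-MS-{079..168}")
print(len(hosts), hosts[0], hosts[-1])
```

The range is inclusive at both ends, so this batch covers 90 nodes.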
Comment 14•7 years ago
Updates:
Changed the BIOS configuration and re-imaged the following machines:
- 169-180 (chassis 4) (except 171)
- 196-225 (chassis 5) (except 208)
- 241-270 (chassis 6) (except 243, 250, 264, 267)
For those exceptions, the note in the spreadsheet had already been removed.
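To double-check which nodes a batch like this actually covers, the ranges and exceptions above can be written out as data. This is a rough Python sketch. The `T-W1064-MS-` prefix and three-digit numbering are taken from the hostnames elsewhere in this bug, and `batch_hosts` is a made-up helper name.

```python
def batch_hosts(first, last, skip=()):
    """Hostnames for nodes first..last inclusive, leaving out the skipped numbers."""
    skipped = set(skip)
    return [
        f"T-W1064-MS-{number:03d}"
        for number in range(first, last + 1)
        if number not in skipped
    ]


# Ranges and exceptions as listed in this comment.
chassis_4 = batch_hosts(169, 180, skip=[171])
chassis_5 = batch_hosts(196, 225, skip=[208])
chassis_6 = batch_hosts(241, 270, skip=[243, 250, 264, 267])

print(len(chassis_4), len(chassis_5), len(chassis_6))
```

Each list can then be compared against the Notes column of the inventory to spot nodes that were missed.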
Comment 15•7 years ago
Updates:
Changed the BIOS configuration and re-imaged the following machines:
281 – 298 (Chassis 7), except 284 and 286
316 – 345 (Chassis 8)
361 – 390 (Chassis 9)
406 – 435 (Chassis 10)
Comment 16•7 years ago
Updates:
Changed the BIOS configuration and re-imaged the following machines:
-[MDC2] T-W1064-MS-{451..480} (Chassis 11)
-[MDC2] T-W1064-MS-{496..525} (Chassis 12)
Comment 17•7 years ago
Updates:
Changed the BIOS configuration and re-imaged the following machines:
T-W1064-MS-{541..565}
Comment 18•7 years ago
(In reply to Zsolt Fay [:zfay] from comment #17)
> Updates:
>
> Changed the BIOS configuration and re-imaged the following machines:
>
> T-W1064-MS-{541..565}
Hello, did anyone continue changing the BIOS configuration for the rest of the machines, 566-570 and 581-600? I did not find any note left in the Moonshot Master Inventory document.
Comment 19•7 years ago
I have now completed the remaining machines; all Win MSs are done.
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED