The application in question would miss a substantial number of messages. A trace on the connected switch showed that all packets had been put on the wire, yet a trace with Microsoft Message Analyzer on the machine itself showed those same messages missing, so the messages were being lost before they ever reached our application; the application was probably not at fault. It also worked just fine on other machines.
So I went back to the drawing board, reviewed and double-checked everything I had learned about high-throughput multicast messaging and
- set appropriately large socket receive buffer sizes in the multicast message receiving application (a minimal sketch of this follows the list),
- activated all TCP/UDP Rx/Tx offloads in the NIC configuration,
- activated receive side scaling (RSS) and picked the maximum number of RSS queues,
- set the NICs’ receive buffers to their maximum values,
- disabled flow-control,
- turned off all power-saving features in NIC and operating system,
- used the most aggressive interrupt moderation setting, and
- updated the NIC driver to the latest version.
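Regarding the first point: the socket receive buffer is something the receiving application has to ask for itself. Here is a minimal sketch of what that looks like with a .NET UdpClient from PowerShell; the group address, port and buffer size are placeholders, not our actual values.

# Sketch only: group address, port and buffer size are placeholders.
$group  = [System.Net.IPAddress]::Parse("239.1.2.3")
$client = New-Object System.Net.Sockets.UdpClient(5000)
$client.Client.ReceiveBufferSize = 8MB   # ask the OS for a large kernel-side buffer
$client.JoinMulticastGroup($group)
# ... receive loop elided; reading ReceiveBufferSize back shows what the OS actually granted
$client.Close()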
In order to check NIC settings, I keep the following PowerShell snippet handy. It gives me the current value, the valid values, and the maximum value for each parameter of each NIC in the NIC team. And it doesn’t even require admin privileges.
(Get-NetLbfoTeam "MyNicTeam").Members | Get-NetAdapterAdvancedProperty | ft DisplayName,DisplayValue,ValidDisplayValues,NumericParameterMaxValue
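Changing a value works much the same way with the Set counterpart, although that does need elevation and can briefly reset the adapter. A sketch, assuming the display names our Intel drivers used ("Receive Buffers" and so on); other drivers name and constrain these parameters differently, so check the output of the snippet above first.

# Sketch only: display names and valid values vary by driver.
(Get-NetLbfoTeam "MyNicTeam").Members | ForEach-Object {
    Set-NetAdapterAdvancedProperty -Name $_.Name -DisplayName "Receive Buffers" -DisplayValue "4096"
    Disable-NetAdapterPowerManagement -Name $_.Name   # no power saving on the NIC
    Enable-NetAdapterRss -Name $_.Name                # make sure RSS is enabled
}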
Other useful sources:
- This post on the ServerFault blog about performance tuning Intel NICs is a bit dated and not multicast-specific, but nonetheless a great read.
- The documentation from Intel contains some useful comments on the trade-offs when you change certain parameters. Often the trade-off is performance versus memory, and since memory is easy and cheap to add, I usually opted for more performance.
But even after ensuring all parameters were at their optimal values, the problems persisted. So I spent some time setting up perfmon with a set of network-related performance counters.
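If you would rather stay in PowerShell, Get-Counter can list and sample the same counter sets that perfmon shows; a quick sketch:

# Discover the counters in the relevant set
Get-Counter -ListSet "Network Interface" | Select-Object -ExpandProperty Counter

# Take a few samples of everything in the set to see what actually moves
Get-Counter -Counter (Get-Counter -ListSet "Network Interface").Counter -MaxSamples 3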
One counter immediately jumped out: Packets Received Discarded was pretty much constant on the machines where our application worked fine. But on the machines where we noticed packet loss, this number was growing fast.
This Technet blog post has a good explanation of that performance counter and tips on how to gather it from multiple machines remotely using PowerShell.
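For reference, a minimal sketch of that kind of remote polling; the machine names are placeholders, and the target machines need to allow remote performance counter access.

# Sketch: watch the discard counter on several machines at once
$machines = "app01", "app02", "app03"
Get-Counter -ComputerName $machines -Counter "\Network Interface(*)\Packets Received Discarded" -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples } |
    Select-Object Timestamp, Path, CookedValue   # a steadily growing CookedValue is the red flag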
It turns out the machines experiencing multicast message loss had substantially smaller NIC receive buffers (512) than the machines that were working fine (2048 and 4096). Even though our setup script had correctly configured the maximum value for this parameter, that maximum was apparently still too small.
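A quick way to compare that one parameter across machines, assuming PowerShell remoting is enabled ("Receive Buffers" is the display name our drivers used; the computer names are placeholders):

# Sketch: compare the configured value and the driver-imposed maximum per machine
Invoke-Command -ComputerName "app01", "app02" {
    Get-NetAdapterAdvancedProperty -DisplayName "Receive Buffers"
} | Select-Object PSComputerName, Name, DisplayValue, NumericParameterMaxValue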
So we ended up upgrading the NICs on the cluster experiencing the problems, and the multicast message loss went away.
Upon closer examination we also noticed TCP packet loss while our multicast application was running. But because the retransmissions were mostly successful and only introduced a small delay, this had gone unnoticed before.