Recently my workplace made some pretty serious gear shifts with regard to storage. Vendor agnostic as I always strive to be, I will say that our go-to fabric vendor changed, as did our go-to disk vendor. Since the change things work, but the more I dig into what's going on under the hood, the more I feel like I'm living in a house of cards.
The change came as the result of two companies merging and coalescing on one hardware platform for all systems. Going in with blinders on, I was absolutely indifferent to the fabric change, since I've worked with the new vendor's gear in the past and took a liking to it; as for the storage vendor, they had a clean slate to start with, so I had no reservations there either. In standing up our new gear some months ago, we worked with staff from the company we merged with to build like-for-like environments and ease the merging of environments down the road. One thing that made me quite suspicious of either the gear we were installing or the practices of the other organization's personnel was their suggestion to modify B2B (buffer-to-buffer) credits on the storage-side switch ports. Scratching my head, I said that I prefer to leave port settings at their defaults and make adjustments only AFTER we've seen a symptom arise that suggests a need to deviate from them.
Fast forward about three quarters and we have three new arrays from ACME Storage, plus two directors and six pizza-box FC switches from ACME SAN. Everything *works*, but now that I've had time to breathe I'm finding things in our environment I don't like. Let's compare counters from the busiest ISL links (primaries for a MetroCluster FCVI), before and after. Note that the before counters were last reset about 18 months prior to the dismantling of the old ISL, while the counters for the new ISL port are reset daily by a script run by our peers:
| ISL | Frames Transmitted | B2B Credit Zero Errors | B2BC0 Percentage |
|-----|-------------------:|-----------------------:|-----------------:|
| Old | 2,657,260,246,611  | 2,377,700,058          | 0.089%           |
| New | 2,851,542,621      | 1,306,395,534          | 45.8%            |
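Since the two counters cover wildly different windows (18 months vs. a daily reset), the raw error counts mean nothing on their own; the only fair comparison is errors as a fraction of frames transmitted. A quick sketch of that normalization, using the counter values above (the helper name is mine, not from any switch CLI):

```python
# Normalize B2B-credit-zero counters against frames transmitted so that
# ISLs with different counter-reset intervals can be compared directly.
# The counter values are the ones quoted in the table above.

def b2bc0_percentage(frames_tx: int, b2b_credit_zero: int) -> float:
    """Return B2B-credit-zero events as a percentage of frames transmitted."""
    return 100.0 * b2b_credit_zero / frames_tx

old = b2bc0_percentage(2657260246611, 2377700058)  # old ISL, ~18 months of counters
new = b2bc0_percentage(2851542621, 1306395534)     # new ISL, counters reset daily

print(f"old ISL: {old:.3f}%")  # ~0.089%
print(f"new ISL: {new:.1f}%")  # ~45.8%
```

Same workload class, same normalization, and the new fabric is starving for credits on nearly half its frames.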
Both of these ISLs were configured with staff from the switch manufacturer onsite, per the best practices published by NetApp. In fact, on the new configuration the B2B credit allocation was padded above and beyond the 150% padding that NetApp recommends. This is the ugliest counter to look at. Other ports have seen similar increases in errors, seemingly for no good reason. Our production EMR has eight host ports allocated across the active and standby nodes, and the standby node is truly standby. Even so, I'm seeing many B2B-credit-zero errors every second on those host ports, not the storage ports.
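For context on what that 150% padding is padding: a common rule of thumb for long-distance ISLs is that a full-size (~2 KB) FC frame occupies roughly 2 km of fiber per Gbps of link speed, so you need enough credits to cover all frames in flight. A back-of-the-envelope sketch, assuming that rule of thumb (the function name and the 50 km / 8 Gbps example are mine, illustrative only, not our actual link parameters):

```python
import math

# Rough BB credit sizing for a long-distance ISL, using the rule of thumb
# that the fiber holds about (speed_gbps * distance_km) / 2 full-size FC
# frames per direction. The padding factor mirrors the 150% headroom
# mentioned above. This is an estimate, not a vendor formula.

def estimate_bb_credits(distance_km: float, speed_gbps: float,
                        padding: float = 1.5) -> int:
    """Estimate buffer-to-buffer credits needed to keep the link full."""
    frames_in_flight = (speed_gbps * distance_km) / 2.0
    return math.ceil(frames_in_flight * padding)

# A hypothetical 50 km MetroCluster ISL at 8 Gbps:
print(estimate_bb_credits(50, 8))  # 200 frames in flight * 1.5 -> 300 credits
```

The point being: even generous math like this was done for our links, and the counters still look like the table above.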
Mostly this is a pointless diatribe about changes I'm seeing. I'm truly concerned that I'm going to have to start micro-managing my FC ports in order to maintain performance and keep error counters low. If I reach that point, I will no doubt be writing another post titled "Don't buy this vendor's junk unless you like being in the business of keeping the lights on." Watch for that one.