Category Archives: Technology junk

I’m a SAN administrator. Stuff relating to system/network/storage administration goes here.

ZFS – add compression to make it perform better

Disclaimer first: I’m well aware that this is a very short-sighted disk performance analysis.

I had some stress testing I wanted to do on some new disks prior to putting them into production. Because I’m too poor/cheap to buy a SAN, adding disks will cause an outage of my services, which means that taking them back offline if something turned out to be wrong would at least double my outage window, probably more.

In addition to my meager endurance testing I also wanted to do my own analysis of how badly RAIDZ sucks. Don’t get me wrong, I love ZFS, but more than once its performance (or lack thereof) has come back to bite me. That’s why I’m adding disks to my storage system in the first place. With that in mind I created a few different types of zPools and ran Bonnie++ on them. I am starting with fifteen 146GB 15k RPM SAS disks, directly attached to FreeBSD 10.0. I had 3 different zPools I wanted to test:

  1. RAIDZ – 5 of 2+1
  2. RAIDZ – 1 of 14+1 (unrealistic to ever use, I know)
  3. RAID10 – 7 of 1+1

In each case I created a single zPool representing the disk layout I wanted to compare, then created a single ZFS filesystem inside the pool with no tuning, except for the compressed runs, where the compression property was set to lzjb. The command used to run each test was

bonnie++ -s 81920 -d /pool0/bonnie -n 0 -m diskbench -f -b -u root -x 6
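
For reference, pool creation for the three layouts looked roughly like the following. The daX device names are placeholders rather than my actual device layout, and the last line is the compressed variant of the test filesystem (for the uncompressed runs the -o option is simply dropped):

  # 5 x (2+1) RAIDZ vdevs
  zpool create pool0 raidz da0 da1 da2 raidz da3 da4 da5 raidz da6 da7 da8 raidz da9 da10 da11 raidz da12 da13 da14
  # a single 14+1 RAIDZ vdev
  zpool create pool0 raidz da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 da12 da13 da14
  # 7 mirrored pairs, with the 15th disk as a hot spare
  zpool create pool0 mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7 mirror da8 da9 mirror da10 da11 mirror da12 da13 spare da14
  # the test filesystem, with lzjb enabled for the compressed runs
  zfs create -o compression=lzjb pool0/bonnie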

The results of all six rounds of testing:

I was disappointed that the RAID10 didn’t do dramatically better than the RAIDZ 5 of 2+1. As a matter of fact I may end up choosing the latter layout, as it gives me over 400GB of additional usable space at minimal cost in redundancy and performance. On the other hand, with the RAID10 configuration I do have a hot spare. It may come down to the toss of a coin. One thing is (and always has been) certain though: I’ll definitely be compressing the data. It was for that reason that, midway through, I decided to enable compression, which is how the compressed vs. uncompressed comparison came about. As I said at the beginning of this post, these tests are barbaric and take no account of the actual workload this box will see, and the storage will ultimately be delivered to clients over NFS; but at first glance a compressed ZFS volume will outperform an uncompressed one with this particular setup.
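
For the record, enabling compression on an existing dataset is a one-liner; it only affects data written after the change, which doesn’t matter for a benchmark like Bonnie++ that writes its entire working set fresh each run (pool0/bonnie being my dataset from above):

  zfs set compression=lzjb pool0/bonnie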

The reason for the additional speed in these barbaric tests is that I go from 80GB of block data to only 2.3GB when compressed. The compression ratio is indeed insane and is no doubt the reason for the increase in speed. At the expense of some leftover CPU cycles (which I have plenty of on my production system) I spend a lot less time going to spinning disk, since such a large fraction of my useful data fits in ARC. On my production system, this 80GB test would never have gone to disk at all, as the compressed data would have fit entirely in ARC.
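
If you want to check the same numbers on your own box, ZFS reports the achieved ratio per dataset, and FreeBSD exposes the current ARC size through a sysctl (substitute your own dataset name for pool0/bonnie):

  zfs get compressratio,used pool0/bonnie
  sysctl kstat.zfs.misc.arcstats.size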

Once I get the disks attached to the production system we’ll run some real-world tests where I create a workload somewhat representative of my clients and run IOmeter in a guest VM. The results of that are yet to come.

ACS Network Status Update – Sucks

If I spent 10 minutes writing about every poor experience I’ve had with Alaska Communications I’d be busy writing for over 24 hours. The number of people I’ve interacted with at that company who were actually pleasant to work with could easily be counted on one hand, and most of them worked in customer service in Anchorage. Today I’m going to complain about just one interaction with this company, my least favorite organization on the planet (no exaggeration).

The red circle was last Friday, when ACS rolled a truck to *upgrade* the speed of my client’s Internet. The blue line is a continuous stream of data between my client and my servers in MN. It’s not obvious from the image above, but the speed *upgrade* took my client off the Internet for a total of 4 hours and 18 minutes during normal business hours. I suppose in the 1990s it was acceptable to have such a large outage for such a modest *upgrade*, but it’s 2014. I call the upgrade a modest one because they were attempting to go from a 3Mbps business class service to a 7Mbps business class service. Not long ago I upgraded my low-cost and low-priority home Internet from 6Mbps to 50Mbps; there was no outage, and the total time invested on my part was about 15 minutes. In my mind this is how upgrades should happen in 2014.

The Friday outage was split into 2 parts. The first was a lengthy 3 hours and 3 minutes, during which time it was a blackout for me: I had no knowledge of what was happening. I was E-mailed prior to the commencement of the upgrade and asked if I had any advice. I replied with a basic outline of what my client’s router expects to see in order to re-initiate the connection to the ISP, along with the disclaimer that if the technician didn’t understand my E-mail they should postpone the upgrade until we could work on it together. They proceeded, so I assumed they knew what they were doing. I was eventually contacted by an individual back at the Alaska Communications call center who needed my assistance (and understandably so) getting the office back online. I told them to reboot the router, at which point the PPPoE connection came up and all was fine; it seemed the technician had done his job OK, it was just my ddial PPP connection that wasn’t coming online. A reboot is simple enough and something that Alaska Communications themselves wouldn’t hesitate to do with most customers, it just so happens my router is slightly more intimidating looking:

After the line came up I asked the technician not to touch anything further and to leave the room. Although my client still wasn’t on the Internet, the remaining portion was up to me. I did my business, and since it was time to leave work I started my commute home. The connection stayed up for all of 15 minutes and 10 seconds. About 1 mile into my commute I got a text message saying it was down again. Normally I’d turn around to put myself back at a terminal, but this particular Friday I had a meeting after work that I didn’t want to miss. I continued on my way and did phone troubleshooting with staff at my client’s office.

After probably 45 minutes of great help from one of my client’s employees, we found that we couldn’t connect even using just a laptop and Windows 7’s Internet Connection Wizard. I turned the problem back over to Alaska Communications with a phone call to the individual who had called me earlier. That ACS employee contacted my client’s staff, had them plug the router back in, and it miraculously came online. I’m aggravated because I’m sure the ACS employee played it off like nothing was wrong the entire time, when there is no doubt in my mind he did something on his end; there is no other explanation, since we couldn’t even connect with a generic Windows 7 box. The 2nd outage, while only a mere hour, was by itself enough to make this a three-nines system when measured over a year.

This brings us past the red circle above and into the orange circle. The Internet was online all of Friday evening and all of Saturday; speeds dropped in the early morning, as you can see, but the connection did stay up. The blue circle encompasses the Sunday outage, which started about 30 minutes prior to the end of AK Daylight Saving Time. The outage lasted through the day, meaning for an entire day and then some my client could neither send nor receive E-mail. Their servers were unreachable, so no doubt some correspondence bounced. ACS doesn’t have a 24 hour business unit, or at least not one with a listed phone number, so I had to wait until they opened at 8:00AM. The good news is that even after they opened, the technical support I was able to contact was utterly useless: “I can see you’re not online, but we can’t do anything about it without someone onsite.” Certainly I agree it would be nice to have someone onsite, but unfortunately it wasn’t in the cards; the people who would normally be there on a Sunday were out of town that particular weekend. The annoying part was that he didn’t even bother to look at connection logs or anything of the sort, he just gave up at “you’re not online.”

About 30 hours passed before the connection was brought back to life Monday morning at 8:09AM AK time. This is what availability looked like for the week (Tuesday morning through today); note that the point when the *upgrade* started is clearly visible:

Since their Internet presence returned yesterday morning, my client hasn’t fallen back offline, though their bandwidth remains pretty erratic. At the end of the day, my only prayer is that I never have to work with ACS personnel again in my life. They’re one thing I’ve been unable to leave behind by leaving AK, and I wish more than anything that I could change that.

HDS – Frustrated from day 1

If you know me well you know that I couldn’t care less about what badge your gear wears, as long as it works. On the motocross track I started on red, then ended up carrying both yellow and blue in my trailer. I couldn’t tell you which color was my favorite; they all have their own special place in my heart and they all did their job well. The reason I never rode green is that I didn’t like the folks who sold it. That fact will play into the remainder of my diatribe.

Paul (my direct peer at work) and I are trying to make use of the fancy new machine that was set on our floor. HDS came in and configured our VSP. The tech promised a day to set it up; it ended up being 3.5 days to do 2. I don’t blame the tech who made the promise, as he isn’t the one who ended up doing the work; not his fault. That said, if it takes HDS 2 days to set up a paltry 30T usable array (don’t ask how much a paltry 30T costs!!!), they’re never going to make it in the datacenters of a big organization like Microsoft, Google or Amazon. You’d never guess this based upon how much they love boasting about how amazing their gear is. Mind you, this is our first Hitachi box. We had to listen to the Hitachi staff endlessly beating their own drum about how happy we were going to be getting off IBM and onto Hitachi.

Finally the tech is done with the initial setup and it’s ready for us to log in to the box and start creating pools, carving LUNs and presenting disks. Logging onto a fresh array and digging in is probably the most exciting thing for a storage admin.

Direct quote from the Hitachi Storage Navigator User Guide:

To log in to Storage Navigator as the super user:
  1. Call your local service representative to obtain the superuser ID and default password.

Yup, here we are. We have an E-mail thread going back and forth with pre-sales and the guys who spun the wrenches last week doing the initial config. Meanwhile, Paul and I are watching our deadline to have this array online creep closer and closer while the brick sits in the datacenter heating our cold air, being very green by wasting 100% of the electricity it’s consuming.

Thus far, I’m not leaping out of my chair with happiness because of our new vendor. If you happen to know the username and password to logon to my array, let me know. I certainly don’t.

Server 2012 “server specified requires restart” loop when installing WID

I was trying to create an RDS deployment, something I’ve done before without issues, but this time when trying to install the necessary roles and features I ended up in the reboot loop described in the title. Each time I tried to install, Server Manager complained: “The request to add or remove features on the specified server failed. The operation cannot be completed because the server that you specified requires a restart.”

I narrowed the problem down to the WID (Windows Internal Database) feature installation. From there I googled and found an MSDN forum thread written partially in another language, hence my own English write-up of the fix.

http://social.msdn.microsoft.com/Forums/ro-RO/e7e9bc17-14d1-43c9-809c-464f69b366cd/server-2012-windows-internal-database-error-during-installation

The post useful to me was the one by kswail, about halfway down. Adjust your domain (or domain controller, if appropriate) security policy to allow “NT SERVICE\MSSQL$MICROSOFT##WID” to log on as a service, a GPO setting that can be found under Computer Configuration > Policies > Windows Settings > Security Settings > Local Policies > User Rights Assignment. Simply adding the security principal mentioned above to that policy solved a problem that had haunted me for 2 days.
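
For what it’s worth, a quick way to confirm the policy change actually landed on the affected server before retrying the feature install is to refresh Group Policy and dump the user rights assignments; C:\temp is just an example path, and the WID account may show up as a SID in the export:

  gpupdate /force
  secedit /export /cfg C:\temp\rights.inf /areas USER_RIGHTS
  findstr SeServiceLogonRight C:\temp\rights.inf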