Tag: Rant

Vendor Support – The Good And Bad

When you go out and buy yourself new hardware or software you have the option of purchasing maintenance agreements at the same time. For software this generally provides the ability to continually upgrade to the latest and greatest product. For hardware it tends to provide onsite support for when things go wrong, along with an SLA around how quickly that support arrives and the hardware gets fixed.

I’ve been dealing with hardware and software vendors for years, so I thought I’d share a couple of stories that really depict excellent service and the stuff you never want to deal with as a customer.

 

The bad

When I started at one of my previous positions I walked into a Dell shop. If you aren’t familiar with the term, it means that all hardware was purchased through Dell, which in turn gave us steeper discounts on that hardware.

I was put in charge of the Windows team pretty early on at this company, and we started to go through a hardware refresh. I sat down with the team and started asking questions about how things were with Dell. To a person, they liked the hardware and what it delivered; however, they hated the service. There were common problems with SLAs not being met, the wrong replacement parts being delivered, and phone support being unable to provide decent assistance.

I brought these issues to the Dell account rep and explained that we were looking at a fairly significant budget spend the next year on hardware (>$1m) and that I needed to see better results from the support team over there if I was going to spend any of that money with them.

Over the next six months I thoroughly documented every engagement with their support staff. These support engagements included:

  • Server down – customer impact
  • Hardware problem, replacement part needed – non-customer impacting
  • General troubleshooting assistance required – non-customer impacting

I’m sad to say that Dell was only able to meet the 4 x 7 x 365 agreement we had for hardware support in 10% of the cases we opened. Techs would show up late (or not at all), parts would be incorrect even when the tech was onsite in time (techs did not bring the parts; they were delivered separately), and we had trouble getting anyone above a level 2 tech on the phone, whose troubleshooting ability seemed to be limited to “have you tried turning it off and back on again”.

This was several years ago and Dell might have significantly improved their support since then; however, when I left the company we did not have a single Dell server in any of the three datacenters I had built out.

 

The good

Software has bugs. We all know that and have experienced problems with vendor applications, but what happens when you run into a significant issue and how does the vendor respond?

Recently SQLSentry released a new version of Performance Advisor, their tool for monitoring and tuning SQL Server. I performed the upgrade, resumed monitoring and didn’t run into any issues.

A couple of days later I got a call from our Windows folks stating they had an alarm for high memory utilization on the monitoring server. I logged in to take a look and was shocked to see that the SQLSentry monitoring service had consumed over 5GB of memory. I bounced the service and memory usage returned to normal, but over the next couple of days it climbed again, forcing another restart.

At this point I engaged the support folks, in particular Jason Hall (blog|twitter). We started triaging the issue.

We started up perfmon and captured a few counters to file to try and localize the memory leak. This allowed us to discover that the leak was in unmanaged code, making the problem a lot tougher to track down.

The next step was to install the Windows Debugging Tools from Microsoft. With these deployed and a set of symbols downloaded, we used UMDH to capture before and after logs of the heap allocations for the monitoring service. One comparison log later and we were able to track the issue down to a leak in Microsoft’s managed wrapper for the VDS (Virtual Disk Service) subsystem, which SQLSentry uses to monitor mount points.

I’m running several multi-node, multi-instance SQL Server Failover Clustered Instances and make extensive use of mount points (current count is 136 monitored mount points).

To test and confirm that VDS was the actual issue, one of the SQLSentry developers threw together a very small 50-line application where I could hit a couple of buttons and watch memory usage. It took a bit of a tired mouse finger, but I was able to verify quickly that VDS was indeed the problem.

With the problem now fully understood, the SQLSentry team quickly built their own COM wrapper to handle mount point monitoring and provided me with a new build of the product. I went through a standard deployment and started the services back up again. A week later the service is still running at around 500MB.

 

Throughout problem triage, issue identification and resolution the support team was fully engaged, with an appropriate level of urgency at each step. Everything was handled to completion and I have been very happy with the support I received. That’s why my maintenance for this product will be renewed next year: I know the money spent is worth it.

 

TL;DR

In the past I have spent a lot of money on very high levels of support from Dell and received nothing but poor service. As a result they lost several million dollars of business.

On the flip side I spent a small amount of money on maintenance with SQLSentry and received excellent support and levels of engagement which will help retain me as a long term customer.

 

I’d be interested to hear about your experiences with these vendors, or any others.

The Importance Of Good Documentation

Believe it or not I’m not actually talking about server documentation here (for an excellent post on that go read Colleen Morrow’s The Importance of a SQL Server Inventory).

I have spent the last 12 days dealing with a single production release. It is considered a significant release, but to be honest it really isn’t. The biggest challenge has been the way the release documentation was provided and the fashion in which the scripts were built.

 

What I got

Here’s a brief example of a change request I’ve seen:

  • Change Request:
    • Update database – products (this links to a SharePoint page)
    • Use code from this location (links to a file share)
  • SharePoint page
    • Go to this location (but replace the middle part of the link with the link from the change request page)
    • Copy this subfolder to your machine
    • Follow the process on SharePoint page 2 to deploy the code
    • Once SharePoint page 2 is complete, run script X
  • SharePoint page 2
    • Run script 1
    • Run script 2
    • Run script 3

 

Pretty painful, right? Now multiply that by 8 for each of the database code deployments that needed to be completed. No fun, no fun at all.

 

What do I want?

It’s going to be a work in progress but we’ll be working with this particular dev team to put together a unified document to simplify the release structure.

Here’s what I want to see:

  • Change Request:
    • Update database – products – deployment instructions attached
  • Attachment
    • Deploy script 1 (link to script)
    • Deploy script 2 (link to script)
    • Deploy script 3 (link to script)
    • Deploy script X (link to script)
    • Rollback script (link to script)

 

The difference?

Instead of having to reference several different SharePoint locations in addition to a change control document, I now have a single document, attached to the change, which clearly defines the process for the release, the order in which scripts are to be executed, a link to each of those scripts and the relevant rollback information.
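Taken a step further, the attachment itself can be a single driver script that makes the execution order explicit. Here's a minimal sketch, assuming SQLCMD mode and hypothetical file names:

-- deploy_products.sql: run in SQLCMD mode (file names below are hypothetical)
-- Stop immediately if any script fails, so later steps never run against a broken deploy
:on error exit

PRINT 'Deploying products database changes';
:r .\01_schema_changes.sql
:r .\02_data_changes.sql
:r .\03_script_X.sql

-- rollback.sql sits alongside these files and is linked from the change request

The change request then only needs to reference this one script plus the rollback.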

It’s not something I think is too out of line to ask for, but I’ve found the folks who have been providing releases this way are extremely resistant to change. I can understand that, but to be fair, they aren’t the ones under the gun trying to put something into a production environment in a consistent and stable manner.

I’ve got lots of fun meetings coming up to talk about this.

 

What about you?

How do you get your change control documentation? Is it plainly written and easy to follow? Or do you have to have a degree in cryptography to get code into production?

Stop Bad Database Design

Every year that goes by I sit in hope that I won’t see bad database design.

Every year I am disappointed.

 

As an example, here is a table create statement that I saw the other day (table and column names have been changed to protect the innocent):

CREATE TABLE [dbo].BestestTableEVAR(
    Col1 [int] IDENTITY(1,1) NOT NULL,
    Col2 [uniqueidentifier] NULL,
    Col3 [uniqueidentifier] NOT NULL,
    Col4 [smallint] NULL,
    Col5 [smallint] NOT NULL,
    Col6 [bit] NOT NULL,
    Col7 [xml] NULL,
    Col8 [xml] NULL,
    ColA [xml] NULL,
    ColB [xml] NULL,
    ColC [datetime2](2) NULL,
    ColD [datetime2](2) NULL,
    COlE [datetime2](2) NULL,
    ColF [datetime2](2) NULL,
    CONSTRAINT [PK_BestestTableEVAR] PRIMARY KEY CLUSTERED
    (
        Col3 ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

 

So what’s wrong with this?

The clustered primary key on this table is a GUID. Is that bad? That’s an unequivocal YES! Read all that Kimberly Tripp (blog|twitter) has to say about GUIDs in database design.

What makes this all the more crazy is that the table has an identity column. That’s an ideal clustering key (narrow, static and ever increasing) sitting there ready and waiting to be used, and yet for some reason it’s not.

This table is going to fragment like crazy, it won’t scale, and performance will be hideous. Additionally, thanks to the XML columns this table can’t even be rebuilt online, so there’s no way to address the fragmentation without taking the table offline, during which time it can’t handle any transactions. That’s a real problem for a table on an OLTP system.
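If you want to see that limitation for yourself, here's a minimal sketch against the table above. On the versions current at the time of writing (anything before SQL Server 2012), an index containing LOB columns such as xml cannot be rebuilt online, so the first statement fails with an error and you're left with the offline variant:

-- Fails on SQL Server 2008/2008 R2: the clustered index covers the xml (LOB) columns,
-- and indexes containing LOB columns cannot be rebuilt online on those versions.
ALTER INDEX [PK_BestestTableEVAR] ON dbo.BestestTableEVAR REBUILD WITH (ONLINE = ON);

-- The offline rebuild works, but takes locks that block transactions while it runs.
ALTER INDEX [PK_BestestTableEVAR] ON dbo.BestestTableEVAR REBUILD WITH (ONLINE = OFF);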

 

I would go back and change some things. Let’s say you wanted to keep the table structure the same; that’s fine, but let’s be smart about the keys and indexes.

It would make sense to make the identity column the clustering key (I would also make it the primary key) and then, to keep enforcing uniqueness on Col3, the current primary key column, add a unique index.

CREATE TABLE [dbo].BestestTableEVAR(
    Col1 [int] IDENTITY(1,1) NOT NULL,
    Col2 [uniqueidentifier] NULL,
    Col3 [uniqueidentifier] NOT NULL,
    Col4 [smallint] NULL,
    Col5 [smallint] NOT NULL,
    Col6 [bit] NOT NULL,
    Col7 [xml] NULL,
    Col8 [xml] NULL,
    ColA [xml] NULL,
    ColB [xml] NULL,
    ColC [datetime2](2) NULL,
    ColD [datetime2](2) NULL,
    COlE [datetime2](2) NULL,
    ColF [datetime2](2) NULL,
    CONSTRAINT [PK_BestestTableEVAR] PRIMARY KEY CLUSTERED
    (
        Col1 ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY];

 

CREATE UNIQUE NONCLUSTERED INDEX UI_BestestTableEVAR_Col3 ON dbo.BestestTableEVAR (Col3);

Sure, we still won’t be able to rebuild the indexes online, but we won’t have the same crazy levels of fragmentation that we would have had before.
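If you want to put numbers on that, a before and after look at sys.dm_db_index_physical_stats will show the difference. A minimal check, nothing more:

-- Fragmentation and size for every index on the table
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.BestestTableEVAR'), NULL, NULL, 'LIMITED') AS ps
INNER JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
    AND i.index_id = ps.index_id;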

 

I know I’ll be seeing a lot of bad design this year and I know that I’ll be forced to push that bad design into production. Doesn’t stop me trying to change things however. Help your devs out, let them know when their design is a problem. Who knows, maybe you’ll change things for the better.

Please Don’t Use Deprecated Data Types

I know that a lot of vendors like to write for the lowest common denominator (i.e. SQL 2000), but really, folks, it’s gone too far. I’m sick of cracking open vendor code that’s certified for SQL 2008 and seeing things like IMAGE and TEXT data types. Microsoft deprecated these types back when they released SQL 2005 (see http://msdn.microsoft.com/en-US/library/ms143729(v=SQL.90).aspx under Textpointers). Why are you persisting with them six years later?
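The replacements have been there since SQL 2005: varchar(max) for TEXT, nvarchar(max) for NTEXT and varbinary(max) for IMAGE. For anyone modernizing their own schema, the change is a one-liner per column; the table and column names below are hypothetical, the point is simply the type mapping:

-- Hypothetical table and column names; the point is the type mapping.
ALTER TABLE dbo.SomeVendorTable ALTER COLUMN LegacyTextCol varchar(max) NULL;    -- was text
ALTER TABLE dbo.SomeVendorTable ALTER COLUMN LegacyNTextCol nvarchar(max) NULL;  -- was ntext
ALTER TABLE dbo.SomeVendorTable ALTER COLUMN LegacyImageCol varbinary(max) NULL; -- was image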

I bring this up because I’ve come across yet more egregious usage of these data types in vendor code. The vendor in question? Microsoft.

Yes, that’s right, the folks who deprecated these data types six years ago are still using them to a large extent within the ReportServer and ReportServerTempDB databases that support SQL Server Reporting Services. Seriously, Microsoft? Can you please get with the plan and fix this nonsense?

The following query, run against the ReportServer database, will show 14 different tables (31 columns) using a variety of NTEXT and IMAGE data types.

select st.name as TableName, t.name as DataType, sc.name as ColumnName
    from sys.types t
        inner join sys.columns sc
            on t.system_type_id = sc.system_type_id
        inner join sys.tables st
            on sc.object_id = st.object_id
where
    t.name in ('image', 'text', 'ntext')
order by
    st.name, t.name

 

I have filed a Connect item asking Microsoft to fix this. Please go vote for it at https://connect.microsoft.com/SQLServer/feedback/details/714117/ssrs-using-deprecated-data-types-in-its-databases and help us rid the world of this old stuff.

Your “Five Nines” Means Nothing To Me

There’s a lot of talk about High Availability. I know I’m a huge fan of it, in particular clustering (though I know it’s not for all situations, and the changes coming in SQL 2012 with AlwaysOn Availability Groups may mean that traditional clustering is used less and less). There are of course other HA solutions out there like Log Shipping, Mirroring and SAN replication technologies (sorry folks, I don’t consider transactional replication an HA solution unless you are talking about downstream reporting infrastructure behind load balancers).

People constantly push to achieve the magical five nines availability number. What does that mean?

Five nines means that your SQL Server must be available 99.999% of the time. This means you can have only 5.26 minutes of downtime a year.
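The arithmetic itself is simple enough; a throwaway query shows where the numbers come from:

-- Allowed downtime per year: minutes at five nines, hours at three nines
SELECT 365 * 24 * 60 * (1 - 0.99999) AS five_nines_minutes_per_year, -- ~5.26
       365 * 24 * (1 - 0.999)        AS three_nines_hours_per_year;  -- 8.76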

So let’s say that you are running clustering and mirroring; that you have load balancing with peer-to-peer replication; that you are running SAN replication between two datacenters and have log shipping setup ready to go live at the flick of a switch. You have all your hardware ducks in a row and are ready for any emergency that may crop up.

I commend you for all of that.

Now let’s say that your manager is going to bonus you on the downtime that you have. You are the DBA, a single link in the chain, and your entire bonus is going to be based on the number of nines that you can deliver over the year.

Time to sit down with your manager and ask some questions:

 

  • Does this five nines bonus plan include maintenance time?
  • How are you going to measure uptime?
  • What tools are going to be used to measure it?
  • What are the tolerances on those tools?
  • What about upstream items?

 

Let’s take a look at these one at a time.

 

Is maintenance time included?

There may be bugs in the application’s database code that require the app to be offline while changes are made. You may want to patch SQL Server with the latest service pack in order to maintain supportability. You may want to test failover and DR. If none of these things are excluded from the measurement then chances are you’ll have trouble making three nines (8.76 hours of downtime in a year), and you’ll have no chance at all of making five nines.

 

How will uptime be measured?

Is uptime calculated based upon SQL Server being up and running? Or maybe on successful connections? How about certain queries returning results? Or a website returning a page within a certain time period?

So let’s say you are running a query. What if that query gets blocked or deadlocked? SQL is up and processing other transactions but that single query is having a problem.

Successful connections? Great, but that doesn’t mean the database those connections need is actually available. The same issue applies to simply checking whether SQL Server is up. Heck, are you measuring server uptime or database uptime here?

A website returning a page? What if there’s an issue with the web server? Or a networking problem between the web server and the SQL Server?
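Even the simplest measurement embeds assumptions. Here’s a minimal heartbeat sketch, using a hypothetical dbo.Heartbeat table that the application writes to every minute; an “Up” result only proves that this one table, in this one database, answered this one query:

-- Hypothetical heartbeat check: 'Up' only means this table, in this database,
-- answered this query recently; it says nothing about the rest of the stack.
IF EXISTS (SELECT 1
           FROM dbo.Heartbeat
           WHERE LastWriteTime >= DATEADD(MINUTE, -1, SYSUTCDATETIME()))
    SELECT 'Up' AS DatabaseStatus
ELSE
    SELECT 'Stale or unreachable' AS DatabaseStatus;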

 

What tool will you use?

Are you going to buy something off the shelf? Then who’s going to spend the time configuring it for your particular environment? Isn’t that time better spent solidifying things, doing performance analysis or tuning queries?

You’re going to build your own tool? Great: how are you going to do that, and with what technologies? How is the information going to be reported upwards? Who has the final say on possible exceptions? You have the time to write this yourself? Awesome, I’m glad that you do.

 

What is your tool tolerance?

Tolerance? Yup. Let’s say that you’ve figured out all the other stuff, have agreements in place for everything and have a tool that will do the job. How does that tool work?

Let’s say it runs a query and the query fails, so it reports the database as down. When will it next attempt to query the database? In 15 minutes? If that’s the case then a single bad poll from your tool has just put you outside of that five nines target and your bonus has been dinged.

Or let’s say there was an actual problem that happened 30 seconds after the last poll, but you resolved it before the next one. Everything could have been down for over 14 minutes and your tool would never have captured it.

Some serious flaws there.

 

Upstream items

Are you the Windows, network, firewall, and datacenter admin? Are you responsible for generating the electricity that comes into the building? How about the HVAC system, is that yours? Are you checking the diesel levels in your backup generators weekly?

Each one of these upstream items can and will affect people’s ability to access your SQL Server and its databases. Unless you are responsible for them all, how can you be held accountable for perceived uptime of SQL Server?

SQL Server can be up, able to process data, with all the databases available, but without a functioning network nobody would know it and your monitoring wouldn’t pick it up.

 

 

All of these things add up to an inability to accurately measure the mythical five nines uptime creature. If we can’t get those kinds of accurate measurements, and I can’t own those upstream processes, why should I be held accountable for the number?

Here’s a better idea. Let’s look at the processes; let’s look at whether or not things are available from a user’s perspective. Let’s gauge customer interactions, build-outs and deployments. Let’s track tickets and issues. Let’s talk about the things we need to do to make things better. Let’s not go pulling a number out of the ether; it does nobody any favors.

Be A Courteous Blogger

Today I received an email; nothing unusual about that. This email happened to be from a SQL blogger. It was one of those newsletter-type things, and it even said at the bottom that I had opted in to receiving it. The only thing is that at no point did I sign up to receive a newsletter. In fact this blog is not one I have visited other than on a couple of occasions. I believe I may have commented on one of the posts I saw, and that is how the blog owner got my email address.

My guess is that the blogger in question went through the WordPress database (yes, it’s a WordPress site), pulled the email address of everyone who had posted a comment and sent out this unsolicited email.

That is absolutely the wrong way to do something like this. A strongly worded email is on its way to the blog owner. Small mercies: they included an unsubscribe link in the email, so hopefully I won’t get another.

 

The right way to do this is the exact same thing that Paul Randal (blog|twitter), Robert Davis (blog|twitter) and Brent Ozar PLF (blog|twitter) have done: let people know you have a newsletter and give folks the opportunity to opt in and receive it.

All you bloggers out there, please do the right thing.