Twitter

Entries in Rant (6)

Tuesday
Apr242012

The Importance Of Good Documentation

Believe it or not I’m not actually talking about server documentation here (for an excellent post on that go read Colleen Morrow’s The Importance of a SQL Server Inventory).

I have spent the last 12 days dealing with a single production release. It is being considered a significant release, but to be honest it really isn’t. The biggest challenge has been to do with the way that the release documentation has been provided and the fashion in which the scripts have been built.

 

What I got

Here’s a brief example of a change request I’ve seen:

  • Change Request:
    • Update database – products (this links to a Sharepoint page)
    • Use code from this location (links to a file share)
  • Sharepoint page
    • Go to this location (but replace the middle part of the link with the link from the change request page)
    • Copy this subfolder to your machine
    • Follow the process on Sharepoint page 2 to deploy the code
    • Once Sharepoint page 2 is complete run script X
  • Sharepoint page 2
    • run script 1
    • run script 2
    • run script 3

 

Pretty painful right? Now multiply that by 8 for each of the database code deployments that needed to be completed. No fun, no fun at all.

 

What do I want?

It’s going to be a work in progress but we’ll be working with this particular dev team to put together a unified document to simplify the release structure.

Here’s what I want to see:

  • Change Request:
    • Update database – products – deployment instructions attached
  • Attachment
    • Deploy script 1 (link to script)
    • Deploy script 2 (link to script)
    • Deploy script 3 (link to script)
    • Deploy script X (link to script)
    • Rollback script (link to script)

 

The difference?

Instead of having to reference several different Sharepoint locations in addition to a change control document I now have a single document, attached to the change, which clearly defines the process for the release, the order for scripts to be executed, a link to each of those scripts and the relevant rollback information.

It’s not something that I think is too out of line to provide, but I’ve found the folks who have been providing releases in this method are extremely resistant to change. I can understand that, but to be fair, they aren’t the ones under the gun trying to put something in to a production environment in a consistent and stable manner.

I’ve lots of fun meetings coming up to talk about this.

 

What about you?

How do you get your change control documentation? Is it something plainly written and easy to follow? Or do you have to have a degree in cryptography to get code in to production?

Sunday
Jan222012

High Availability–It’s Not All About The Hardware/Software

As I write this I’m sitting my my house working with the last hour of juice left in my laptop. We had a major Snow + Seattle = Bad Newssnowstorm here in the Seattle area which was followed by a big ice storm and some heavy winds, all of which have coincided to knock out power to over 200,000 homes. I’ve been without power for over 75 hours now and it’s given me a lot of time to think about multiple facets of High Availability (HA) and Disaster Recovery (DR).

HA/DR are critical items for all companies. HA in various degrees, but most certainly DR and the ability to provide some kind of business continuity model. Different companies, large and small, have different requirements around HA/DR. Some could easily remain down for a couple of days with no lingering effects, others cannot sustain downtime on their most critical systems of more than a couple of minutes. Based upon those requirements (and any budget constraints) a company could implement HA/DR using Failover Clustering, Mirroring, Log Shipping, SAN Replication, Peer-to-Peer Replication, VMotion or some other technology.

 

 

Where HA/DR often falls down

This is all good and well but what about the people that you need in order to get things up and running, the folks that you need in order to provide you with ongoing support in the event that a situation arises where you have to go to your DR site, or provide support for that critical system where a mirror has failed?

In the last 80 hours I have been called by the NOC at work stating that they have an issue. I don’t have power at home so my internet access is off. I don’t have sufficient power in my phone to tether it (don’t tell the wireless company I can do that) and it would take 4 hours sitting in the car to get sufficient power to allow me to do that. I do have a mobile broadband USB device, however I’m not getting coverage in my house right now from that carrier and I have been unable to get out of my road. In other words in the whole HA/DR side of things I’m a single point of failure. Luckily I have a colleague with power who I was able to point the NOC to in order to provide resolution to the issue.

My colleague and I live less than 5 miles apart, we are just luckily on different circuits and substations for the power company. If neither of us had been able to dial in then the problem would not have been resolved.

 

If you truly want to implement HA/DR you need to go further than just looking at your hardware infrastructure or the way that your software is setup. It means more than your DNS configuration and your Active Directory setup. There’s more to it than setting a Failover Partner in your application connection string. It’s about the people that you need to support these things!

 

 

Artificial limitations

I’ve been a DBA for about 12 years now. I’ve not been hands on with a server since doing datacenter implementation work about 5 years ago. Other than in very small shops it’s rare that a DBA touches a server, most don’t even have access to the datacenter in which their database servers reside. There has been a huge move towards setup in Colocation Datacenters (CoLo’s) over the last few years, meaning that more often than not the datacenter where you servers reside are probably not even in the same physical building that you are. There is also the current move to the Cloud initiative that’s gaining some traction. Given all of this information I ask you this:

Why are DBAs so often tied to a physical location for work?

 

This is a serious question. There are so few telecommuting options for DBAs out there. You can look on the job boards and see virtually none. There’s a lot of relocation opportunities if you want to move to another state, but it doesn’t make a lot of sense. If you are going to geographically disperse your hardware due to HA/DR needs why would you not do that same for the teams that have to support it?

So many companies point blank refuse to even entertain the idea of telecommuters from another state (heck the vast majority won’t even contemplate allowing IT support staff to work from home a couple of days a week).

The same companies that happily outsource all development processes to an unknown entity in India or China refuse to entertain the idea of allowing Windows/Unix Admins or DBAs to telecommute.

 

I feel that it is somewhat ironic that the same people that expect you to come to the office and be tied to a physical location also have you be on call so that you can provide remote support when something goes wrong.

 

 

Moving beyond the office

Forcing IT support staff to come to a physical location 5 days a week makes little to no sense to me. I can understand the desire to have folks in the same physical location but as your organization grows it really needs to find a way to get past that mentality. There are various options such as LiveMeeting or GoToMeeting that can provide a telepresence in meetings for people that work remotely. Instant Messaging can be used to constantly keep people in touch. Services such as Yammer can be used to quickly and easily provide collaborative thought outside of meetings.

Looking at my current situation, if my colleague and I had both been without power we would have had no way to provide support. Our servers sit in a CoLo that we don’t have access to so even if we could have gotten there, we would have had no hands on way to work with the servers.

Now imagine we had a third person on the team that lived and worked in another state (or that the second person was living in another state and telecommuted in). Even with power down for me (or the entire state) we would still have someone available to provide support (unless by extremely unusual circumstances they were also down). With a remote DR site they could have effected whatever steps were necessary to bring that online and check that things were working ok.

Let’s look at this from another angle. I’m in the Pacific Time Zone and get to work usually 7:00-7:30am and leave 4:30-5:00pm. If we had someone working in the Eastern Time Zone and they worked the same hours we would have coverage (without emergencies handled by being on call) from 4am PST – 4:30pm PST. That’s an extra 3 hours of standard DBA coverage. Imagine how much more work could get done. How overnight problems could be identified and resolved earlier?

We live in a global society and the company that I work for does business internationally. Would it not make a lot more sense to provide a wider margin of coverage? Think about it, it would also improve the quality of life for anyone that had to be on call. For me the worst time to get called out is about 4am as I’ll never get back to sleep. If there was someone already in the office on the east coast I wouldn’t need to get called out. The same for them with evening plans. They could go out to dinner or a movie knowing that they wouldn’t have to worry about missing a call.

 

 

TL;DR

It’s the 21st century, there are many options to prevent the need for anyone to be physically tied to an office location (certainly amongst IT staff). In order to provide higher levels of support to companies they really need to embrace telecommuting options beyond providing emergency support.

 

 

Late Note: After 80 hours PSE were able to restore power to my neighborhood. Throughout the outage PSE did a fantastic job of handling things. Ultimately over 375,000 were without power at one time or another. PSE provided constant information wherever possible via their Twitter feed and website. They brought in crews from around the country to assist in the restoration efforts. All in all I think they handled things really well. I have no complaints with how they handled any of this. (I just wish I’d been closer to the top of the restore list)

Monday
Jan022012

Stop Bad Database Design

Every year that goes by I sit in hope that I won’t see bad database design.

Every year I am disappointed.

 

As an example here is a table create statement that I saw the other day (table and column names have been changed to protect the innocent)

CREATE TABLE [dbo].BestestTableEVAR(
 Col1 [int] IDENTITY(1,1) NOT NULL,
 Col2 [uniqueidentifier] NULL,
 Col3 [uniqueidentifier] NOT NULL,
 Col4 [smallint] NULL,
 Col5 [smallint] NOT NULL,
 Col6 [bit] NOT NULL,
 Col7 [xml] NULL,
 Col8 [xml] NULL,
 ColA [xml] NULL,
 ColB [xml] NULL,
 ColC [datetime2](2) NULL,
 ColD [datetime2](2) NULL,
 COlE [datetime2](2) NULL,
 ColF [datetime2](2) NULL,
 CONSTRAINT [PK_BestestTableEVAR] PRIMARY KEY CLUSTERED 
(
 Col3 ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

 

So what’s wrong with this?

The clustered primary key on this table is a GUID. Is that bad? That’s an unequivocal YES! Read all that Kimberly Tripp (blog|twitter) has to say about GUIDs in database design.

What makes this all the more crazy is that the table has an identity column. That’s a natural clustering key ready and waiting to be used and yet for some reason it’s not.

This table is going to fragment like crazy, it won’t scale and performance will be hideous. Additionally, thanks to the XML columns this table can’t even be rebuilt online meaning there’s no way to help the fragmentation or performance without actually taking the table offline to do it, meaning it can’t handle any transactions. This is a problem in a table on an OLTP system.

 

I would go back and change some things. Let’s say you wanted to keep the table structure the same, that’s fine, but let’s be smart about the keys and indexes.

It would make sense to change the identity column to be clustered (I would also make this the primary key) and then, to ensure uniqueness on Col2 which is the current primary key a unique index is warranted.

CREATE TABLE [dbo].BestestTableEVAR(
 Col1 [int] IDENTITY(1,1) NOT NULL,
 Col2 [uniqueidentifier] NULL,
 Col3 [uniqueidentifier] NOT NULL,
 Col4 [smallint] NULL,
 Col5 [smallint] NOT NULL,
 Col6 [bit] NOT NULL,
 Col7 [xml] NULL,
 Col8 [xml] NULL,
 ColA [xml] NULL,
 ColB [xml] NULL,
 ColC [datetime2](2) NULL,
 ColD [datetime2](2) NULL,
 COlE [datetime2](2) NULL,
 ColF [datetime2](2) NULL,
 CONSTRAINT [PK_BestestTableEVAR] PRIMARY KEY CLUSTERED 
(
 Col1 ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY];
 
CREATE UNIQUE NONCLUSTERED INDEX UI_BestestTableEVAR_Col2 on BestestTableEVAR (Col2);

Sure, we still won’t be able to rebuild the indexes online, but we won’t have the same crazy levels of fragmentation that we would have had before.

 

I know I’ll be seeing a lot of bad design this year and I know that I’ll be forced to push that bad design into production. Doesn’t stop me trying to change things however. Help your devs out, let them know when their design is a problem. Who knows, maybe you’ll change things for the better.

Thursday
Dec152011

Please Don’t Use Deprecated Data Types

I know that a lot of vendors like to write for the lowest common denominator (i.e. SQL 2000) but really folks it’s gone too far. I’m sick of cracking open vendor code that’s certified for SQL 2008 and seeing things like IMAGE and TEXT data types. Microsoft deprecated these things back when they released SQL 2005 (see http://msdn.microsoft.com/en-US/library/ms143729(v=SQL.90).aspx under Textpointers). Why are you persisting these things six years later?

I bring this up because I’ve come across further egregious usage of these data types in vendor code yet again. The vendor in question? Microsoft.

Yes, that’s right, the folks that deprecated the data type six years ago is still using it to a large extent within the ReportServer and ReportServerTempDB databases that support SQL Server Reporting Services. Seriously Microsoft? Can you please get with the plan and fix this nonsense?

The following query, run against the ReportServer database will show 14 different tables (31 columns) using a variety of NTEXT and IMAGE data types.

select st.name as TableName, t.name as DataType, sc.name as ColumnName 
    from sys.types t 
        inner join sys.columns sc
            on t.system_type_id = sc.system_type_id
        inner join sys.tables st
            on sc.object_id = st.object_id
where 
    t.name in ('image', 'text', 'ntext')
order by 
    st.name, t.name

 

I have filed a Connect item asking Microsoft to fix this. Please go vote for it at https://connect.microsoft.com/SQLServer/feedback/details/714117/ssrs-using-deprecated-data-types-in-its-databases and help us rid the world of this old stuff.

Friday
Nov182011

Your "Five Nines" Means Nothing To Me

There's a lot of talk on High Availability. I know I'm a huge fan of it, in particular clustering (but I know it's not for all situations and the changes that SQL 2012 with AlwaysOn Groups may mean that traditional clustering is used less and less). There are of course other HA solutions out there like Log Shipping, Mirroring and SAN replication technologies (sorry folks, I disagree on transactional replication as an HA concept unless you are talking downstream reporting infrastructure behind load balancers).

People constantly push to achieve the magical five nines availability number. What does that mean?

Five nines means that your SQL Server must be available 99.999% of the time. This means you can have only 5.26 minutes of downtime a year.

So let's say that you are running clustering and mirroring; that you have load balancing with peer-to-peer replication; that you are running SAN replication between two datacenters and have log shipping setup ready to go live at the flick of a switch. You have all your hardware ducks in a row and are ready for any emergency that may crop up.

I commend you for all of that.

Now let's say that your manager is going to bonus you on the downtime that you have. You are the DBA, a single link in the chain, and your entire bonus is going to based on the number of nines that you can get over the year.

Time to sit down with your manager and ask some questions:

 

  • Does this five nines bonus plan include maintenance time?
  • How are you going to measure uptime?
  • What tools are going to be used to measure it?
  • What are the tolerances on those tools?
  • What about upstream items?

 

 Let's take a look at these one at a time.

 

Is maintenance time included?

There may be bugs in the database code for the application that would require an app being offline while changes are made. You may want to patch SQL Server with the latest service pack in order to maintain supportability. You may want to test failover and DR. If none of these things are excluded then chances are you'll have trouble making three nines (8.76 hours of downtime in a year) and none at all of making five nines.

 

How will uptime be measured?

Is uptime calculated based upon SQL Server being up and running? Or maybe on that there are successful connections? How about certain queries getting returned? Or a website returning a page within a certain time period?

So let's say you are running a query. What if that query gets blocked or deadlocked? SQL is up and processing other transactions but that single query is having a problem.

Successful connections? Great, doesn't mean that the database is available that they are going to use. Same issue with checking if SQL Server is available. Heck, are you measuring server uptime or database uptime here?

Website returning a page? What if there's an issue with the web server? Or a networking problem between there and the SQL database? 

 

What tool will you use?

Are you going to buy something off the shelf? Then who's going to spend the time configuring it for your particular environment? Isn't that time better spent on solidifying  things,  performance analysis? Query tuning?

You're going to make your own tool? Great, how are you going to do that, what technologies? How is the information going to be reported upwards? Who has the last say on possible exceptions? You have the time to write this yourself? Awesome, I'm glad that you do.

 

What is your tool tolerance?

Tolerance? Yup. Let's say that you've figured out all the other stuff and have agreements in place for everything and got a tool that will do the job. How does that tool work?

Let's say it runs a query and fails. It says that the database is down. When will it next attempt to query the database? In 15 minutes? If that's the case then any bad poll from your tool just put you outside of that five nines criteria and your bonus has been dinged.

Let's just say that there was an actual problem and that happened 30 seconds after the last poll but you resolved it before the next one. Everything was down for 14 minutes but your tool didn't capture that.

Some serious flaws there.

 

Upstream items

Are you the Windows, network, firewall, and datacenter admin? Are you responsible for generating the electricity that comes into the building? How about the HVAC system, is that yours? Are you checking the diesel levels in your backup generators weekly?

Each one of these upstream items can and will affect peoples ability to access your SQL Server and its databases. Unless you are responsible for them all then how can you be held accountable for perceived uptime on SQL Server?

SQL can be up, able to process data and all the databases available, but without a funcitoning network nobody would know that and your monitoring wouldn't pick it up.

 

 

All of these things add up to not being able to accurately measure the mythical five nines uptime creature. If indeed we can't get those kinds of accurate measurements and I can't own those upstream processes then why should I be held accountable for a number.

Here's a better idea. Let's look at the processes, let's look at whether or not things are available from a users perspective. Let's gauge customer interactions, build outs, deployments. Let's track tickets and issues. Let's talk about the things we need to do to make things better. Let's not go pulling a number out of the ether, it does nobody any favors.