BCP and Code Page Fun

Quick, what’s the fastest and easiest way to get data out of a SQL Server table?

Time’s up. It’s BCP.

If you aren’t familiar with BCP it is a command line utility to bulk copy data out of, and in to tables. It has been around for donkeys years (technical term), and while it doesn’t have the frills of SSIS it is fast, and lightweight. DBAs have been using it for years, but it may not be something that you’ve heard of.

Recently, while BCPing some data between servers I ran into an interesting problem which caused the data imported to be different than that exported. How does that happen? Code pages…

What is a code page?

A code page is a method of encoding a character set. There are many different code pages, but you are probably most familiar (if you work with SQL Server in the US) with code page 1252. This is the code page that SQL uses with the default install and collation settings, you may know it as SQL_Latin1_General_CP1_CI_AS.

Installation collation breakdown

SQL_Latin1_General – The language to which the sorting rules are applied
CP1 – Code page (CP1 = 1252)
CI – Case sensitivity, in this case Case Insensitive
AS – Accent Sensitivity, in this case Accent Sensitive

TechNet and the SQL Server documentation have a list of all the potential installation collations.

Using BCP

There are many arguments that can be used to adjust the way that BCP functions, certainly more than I could scope in this post. However there are a couple of very basic commands that you can use to get data out and in…

bcp <Database.Schema.Table> out <FileName> -n -T -S localhost

Let’s break this down

bcp – the executable
Database.Schema.Table – the object containing the data you want to export
out – the direction in which the data is going
filename – the file name (including path) for the data
-n – does the bulk copy using the native database data types
-T – use integrated authentication
-S – the SQL instance to connect to, in this case localhost

This will export the data to a native file (that won’t be easily readable in a text editor). It will be very efficient at doing what it does, and should give us the right result.

Let’s test this.

Export and Import Some Data

First we need to setup a couple of tables, and load a few rows of data so that we have something to work with.


CREATE TABLE CodePageTests
(
RowID INT NOT NULL,
TestString VARCHAR(100) NOT NULL
)
GO

CREATE TABLE ImportCodePageTests
(
RowID INT NOT NULL,
TestString VARCHAR(100) NOT NULL
)
GO
INSERT INTO CodePageTests (RowID, TestString)
VALUES (1, 'Copyright ©') ,
(2, 'GBP £') ,
(3, 'Registered Trademark ®') ,
(4, 'Eszett (S Sharp) ß')
GO

Now we’ll bcp the data out of the CodePageTests table in to a native file.

And bcp it back in to the import table

Now if we query the data we’ll see something isn’t quite right…

 SELECT c.RowID, c.TestString, i.TestString AS ImportedString
FROM CodePageTests c
 JOIN ImportCodePageTests i ON c.RowID = i.RowID

The copyright and trademark symbols have been changed to regular characters, even though the British Pound and Eszett symbols haven’t.

Why is this happening?

The reason this issue crops up is because of the code page that is used during the export (and the import) process.

When using the -n argument bcp will use the native data types. In this instance the © and ® characters will be read as a varchar type. You would, naturally, expect these to be exported and imported. The trouble is that bcp, instead of using the code page of the database, uses (unless otherwise specified) the code page currently set in the software that makes the call.

Here we are instantiating bcp from a command line window in a standard US configuration of Windows. As such the code page defaults to 437. You can check this by executing chcp at the command line:

If you do a comparison of the database code page of 1252, and the Windows one of 437 then you’ll see that a completely different set of characters is supported, and this is where things go off the rails.

How can this be fixed?

There are a few different options you can use to get the data out and in accurately.

Change the code page in the command window before calling bcp

By entering the code page you want to use after the chcp command you can change the default code page used in the command line window (this works just for this window and session, it does not make a permanent change). Then you can call bcp, and the data will be handled correctly.

Change the bcp command to include a code page reference

You can include the -C argument along with the number of the code page you want to use, or, if you know it is a default database install you can use ACP which is the reference to Windows code page 1252

Force bcp to use Unicode storage for certain data

By passing along the -N argument instead of -n you can have bcp examine the data and use a unicode format for character data. Using two-bytes of storage ensures that the data is accurately read and written (although it can cause problems with non-variable length character columns).

Use a format file

A format file has a large advantage as it specifies the data type and collation for each of the columns that you would be exporting or importing. You can generate xml files manually for this, or let bcp do the heavy lifting for you.

You can then use the format file in the bcp process.

Getting the data back in

To bcp the data back in again use the same method that you did for exporting it

Change the code page for the Window
Specify a code page in the bcp command
Use unicode storage for character data
Use a format file

Then you should be able to query the data and get the right results

SELECT c.RowID, c.TestString, i.TestString AS ImportedString
FROM CodePageTests c
JOIN ImportCodePageTests i ON c.RowID = i.RowID

TL;DR

While bcp is a great little executable for getting data in and out of SQL Server you have to be careful with how you use it in order to ensure that the data you are exporting and importing is accurate.

For lots more detail read all about the bcp utility on MSDN.

SirSQL.net

BCP and Code Page Fun

What is a code page?

Installation collation breakdown

Using BCP

Export and Import Some Data

Why is this happening?

How can this be fixed?

Change the code page in the command window before calling bcp

Change the bcp command to include a code page reference

Force bcp to use Unicode storage for certain data

Use a format file

Getting the data back in

TL;DR

One thought on “BCP and Code Page Fun”

Leave a comment Cancel reply

What is a code page?

Installation collation breakdown

Using BCP

Export and Import Some Data

Why is this happening?

How can this be fixed?

Change the code page in the command window before calling bcp

Change the bcp command to include a code page reference

Force bcp to use Unicode storage for certain data

Use a format file

Getting the data back in

TL;DR

Share this:

Related

One thought on “BCP and Code Page Fun”

Leave a comment Cancel reply