Reading Recommendations

I've recently been asking for some book recommendations to help both myself and my team improve our knowledge of the variety of topics we're responsible for having expertise in. Personally, I learn best by actually doing, and even better when I'm doing it for a real, productive reason, but those situations don't happen every day. So, I still think it's good to stock the shelf of the mind with ideas and information that you can pull out when you need them. You might not master a topic from reading a book, but you'll still be better prepared.

The topics I'm looking for more knowledge on include internals and performance troubleshooting for both SQL Server and Oracle, identifying and troubleshooting network-related issues (including following and reading packet traces), and disk technologies like RAID and SANs (understanding their configuration, performance expectations, etc.).

These are the recommendations I’ve received so far.

SQL Server

Oracle

Network

  • TCP/IP Bible by Rod Scrimger, Paul LaSalle, Mridula Parihar, and Meeta Gupta

Disk Technologies

  • nothing …

Since I've been asking for recommendations, I thought I'd also pass one along. Troubleshooting, in general, is something that I believe I'm fairly good at. I never really had any formal training in dissecting and resolving technical issues; it just came as second nature to me and seemed like common sense.

At work, we ask all employees in technical support (and I believe in our quality assurance department as well) to read Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems. While the concepts and steps in the book weren't groundbreaking to me, they did put a name to the steps I was already taking (like cutting the problem in half), and the "9 rules" helped me be more organized in attacking a problem. For people who want to improve their troubleshooting skills, I highly recommend this quick and easy read.

What books do you recommend for learning more about troubleshooting network or disk technologies? If you have any suggestions, I’d love to hear them.

Sometimes It Helps To Have A Picture: RAID

I've never been a hardware person. Give me a software, database or network-related issue to solve and I'm all yours. Start talking about bus speed, chipsets, overclocking or RAID and you've lost me. I'm not really sure why this is; I guess I've just never had a real need to understand hardware. But times are changing, and it's time I get at least a basic understanding.

Earlier this week, I was involved with troubleshooting a database performance issue for a customer. The problem was clearly not with our application's use of the database; it was much worse than that. The customer had not run any sort of maintenance on their database server in over 6 months – that's right, no backups, no statistics updates or index rebuilds, no CHECKDB. When they realized their error and tried to do all of this maintenance at once, their server came to a screeching halt, froze up and required a hard reboot.

A few attempts had been made to resolve the issue before I was involved. After seeing that the database server had only 2GB of memory and all of it was being allowed to be consumed by SQL Server (2005, 32-bit), they had added another 2GB and then set up AWE to allow SQL Server to use 3.5GB of the total. The customer had also moved their database transaction log from the C: drive to the D: drive after learning that we (and Microsoft) recommend that the log have its own dedicated disk. Still, the issues kept occurring.
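
For context, enabling AWE on a 32-bit instance generally looks something like this (along with granting the service account the Lock Pages in Memory right) – a sketch of that kind of configuration, not their exact settings:

-- AWE is an advanced option, so expose it first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
GO
-- Let 32-bit SQL Server 2005 address memory beyond the normal user-mode limit
-- (requires a service restart and the Lock Pages in Memory privilege)
EXEC sp_configure 'awe enabled', 1;
-- Cap SQL Server at roughly 3.5 GB (value is in MB) so the OS keeps some memory for itself
EXEC sp_configure 'max server memory (MB)', 3584;
RECONFIGURE;
GO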

The first thing I checked when I got connected was Performance Monitor – how was the server responding with just normal use? Were we looking at a memory issue, an I/O issue or a CPU issue? (Even though I don't get the internals of these items, I can still read a PerfMon log.) I also checked some key tables in the database to get an idea of the size of the solution. It was a relatively small database – 30 GB, but that was pre-sized; I'd estimate only 3-4 GB were actually being used. But back to PerfMon – memory use was low, the CPU was barely doing anything, but the average disk reads and writes per second for all three drives were off the charts. So, it's a disk issue.
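
For what it's worth, the size check doesn't take much – something along these lines shows allocated vs. actually used space (the table name is just a placeholder):

-- Database level: total allocated size plus unallocated space
EXEC sp_spaceused;
GO
-- Table level: row count and reserved/data/index space for a key table
EXEC sp_spaceused 'dbo.SomeKeyTable';  --placeholder name
GO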

I nosed around in Windows Explorer and Device Manager trying to figure out more about the disk setup. (The customer wasn’t available at the moment.) Best I could tell, there were 3 OS drives (C, D and E).

  • The C drive had the page file, TempDB, and the OS. It was also housing the transaction log backups – the only maintenance job they’d been able to get to successfully run.
  • The D drive had the database transaction log and nothing else.
  • The E drive had the database MDF file and nothing else.

That doesn’t seem so bad, right? Wrong.

After looking at Device Manager a little more, I found that C and D appeared to be partitions of the same physical drive (D-0) while E appeared to be its own physical drive (D-1).

So, one physical drive had to deal with reads and writes for: the page file, TempDB, the transaction log and the backups of that log. Ouch. If I were D-0 I’d be looking for a vacation.

Eventually, we got hold of the customer, and I explained my concern about the disk setup and asked for clarification. Were they really 2 physical drives? Was RAID involved? Were they SAN drives? (I asked these questions not really knowing what the last 2 meant, other than that if he said yes, I'd be looking for some help.)

There were 4 physical disks in the server. They were configured as RAID 1+0 and there was no room to add more disks.

After a brief look on Wikipedia, a call to a co-worker and a dry-erase-board session with my 'brain trust', I realized that this setup was even worse, performance-wise, than when I'd thought there were 2 physical disks.

RAID 1+0 (aka RAID 10) uses a minimum of 4 physical disks. The first 2 drives are mirrors of each other (Mirror Set 1), and the second 2 drives are mirrors of each other (Mirror Set 2). On top of that, the disks are striped – one block of data is written to one mirror set, the next block is written to the other, then back to the first. For fault tolerance and DR, this is pretty nice. But when every write you make in the database is being striped back and forth across those same 4 disks, and every read is coming from at least one of them, you have a problem.

An aside: With my new role at work (I’m now responsible for leading the group that provides database and performance support), I not only have a need to improve my hardware knowledge, I’ve also noticed it’s been more difficult for me to leave work at the office. In the interest of my health and sanity, I’m trying to take up some hobbies that can clear my head – but it’s freezing outside right now so a nice hike is out.

Today, I broke out my easel and did some watercolors. I present, RAID10 (as configured in this post):


So this was great. I finally understood some portion of RAID and it made complete sense to me. The next problem was getting the admin to understand. We went back and forth via email a few times – him referencing an image from some website, me referencing a different image from another website. He kept insisting that the "Logical Drives" (aka Mirror Sets) were what showed up in the OS as D-0 and D-1. I kept trying to explain the striping to him, to no avail. Finally, I decided to draw my own picture using some transparent overlays in Paint.NET. (Similar to the watercolor above.)

And that did the trick. “Oh, yeah. That would cause a problem.” Sometimes having a picture really helps. So now they are getting some drives set up on the SAN and will be relocating both the transaction log and the database. I guess this means my next learning experience will probably be about “spindles and speed”.

Exam Eve: Pre-Test Thoughts on the 70-433

It’s the eve of my 70-433 exam and I’m not feeling terribly confident about my chances of passing the test tomorrow.

I’m sure a portion of my self-doubt is just normal pre-test jitters. It’s been nearly four years since I took a Microsoft exam and I recall being convinced I had failed the last one (70-290 – Windows Server 2003 Administration) until the nice little screen popped up and said that not only had I passed, but I’d only missed two questions!

The other portion of my anxiety is because I really haven’t focused on studying as much as I had hoped – despite setting a clear goal and mapping out time to accomplish it. I’m really good at procrastinating and in the midst of these few weeks of studying I also made a quilt, painted my home study and started playing a few new video games – I’m sure these were in some ways just to avoid studying.

Here are my pre-test thoughts on the test objectives:

Implementing Tables and Views (14 percent)
I feel pretty confident about this objective. I'm strong in creating and altering tables, views and indexes. I'm not quite as confident about spatial data types or partitioning, and I'm worried there will be nit-picky questions where they slip something into a statement that isn't compatible because the table doesn't have a primary key, and I'll miss a little detail like that.

Implementing Programming Objects (16 percent)
Overall, not as confident on this one. I’m good with stored procedures, transactions and error handling. Not quite as confident about CLR and UDFs.

Working with Query Fundamentals (21 percent)
I'm probably the most confident about this section – this is what I do all the time, and the new features I learned about (MERGE, OUTPUT, GROUPING SETS, PATINDEX, etc.) are things I could find practical applications for, so I really got into trying them out. Unless there are some nit-picky trick questions, I think this section will go well.

Applying Additional Query Techniques (15 percent)
I have medium confidence here. I'm not too worried about subqueries, controlling execution plans or the collation stuff, but although I've done a lot of practicing, CTEs (especially recursive ones) and the ranking functions might stump me.

Working with Additional SQL Server Components (11 percent)
I could set up Database Mail and get it to work, but I didn't memorize all the ins and outs. Full-text search seems pretty straightforward, but I didn't spend a lot of time on it and didn't actually try it out at all. I'm not feeling great about Service Broker or change tracking either. This section probably won't go well.

Working with XML Data (12 percent)
Ugh. I dread this section. I actually had a real use for some of SQL Server's XML features recently, but I couldn't get them to do what I wanted and found the syntax – especially OPENXML and XQuery – painful enough that I think it would be easier to write C# code to convert XML data to relational data. I am comfortable with XML in general, the FOR XML options, basic XPath and schemas. I doubt that will be enough knowledge though.

Gathering Performance Information (11 percent)
This section is also something I'm pretty familiar with. Although I don't use them a lot in my current role, I spent several years needing to regularly use Profiler or read execution plans to diagnose issues. I'm not as experienced with the Database Engine Tuning Advisor (but it's pretty much just a wizard from what I can tell), and while I can figure out which DMV or catalog view to use when I need it, I certainly don't have them memorized. Overall, I'm fairly confident with this section though.

The exam is at 10am tomorrow. I’m a fast test taker – either I know it or I don’t – but I’m going to try to go slow and really use all of the time I can to go over things in detail. I should have a verdict by 1pm.

MERGE and OUTPUT in SQL Server

I spent a decent chunk of this afternoon studying for the upcoming 70-433 exam (rescheduled for December 20th). Today I focused primarily on the MERGE statement and the OUTPUT clause.

MERGE is pretty interesting – compare the contents of two tables and update one based on the content of the other. MERGE is new to SQL Server in 2008. OUTPUT is also interesting, and I wish I'd known about it previously as it would've been very handy. OUTPUT lets you see the changes you made and (optionally) store them! OUTPUT has been around in SQL Server since 2005.

I went through a tutorial on MERGE and then read up on both MERGE and OUTPUT in Books Online. Then, I closed my browser and tested myself to see what I'd learned.

Here’s what I came up with.

/* I'm opening a store that sells only things I like */

CREATE SCHEMA store
GO
CREATE TABLE store.products
( product_id INT
, product_name VARCHAR(150)
, product_desc VARCHAR(400)
, product_price MONEY)
GO
INSERT INTO store.products VALUES
(1, 'Bacon', 'Tasty Bacon', 5.49)
, (2, 'Cheese', 'Sharp Cheddar', 4.79)
, (3, 'Beer', 'Sierra Nevada Pale Ale', 8.49)
GO
SELECT * FROM store.products
GO

/* We're making some changes to our products and have another table with updates to our products list. */

CREATE TABLE store.product_updates
( product_id INT
, product_name VARCHAR(150)
, product_desc VARCHAR(400)
, product_price MONEY)
GO
INSERT INTO store.product_updates VALUES
(1, 'Bacon (Original)', 'Tasty Bacon, original recipe', 5.49)
, (3, 'Beer - IPA', 'Sierra Nevada Pale Ale', 8.49)
, (4, 'Bacon - Peppered', 'Tasty Pepper Bacon (new!)', 5.79)
, (5, 'Beer - Cheap', 'Yuengling Lager', 4.99)
GO
SELECT * FROM store.product_updates;
SELECT * FROM store.products;
GO

/* Use a Merge Statement to update our products based on the information in the product_updates table
NOTE: Not all products are listed in our source table, only products that have been changed */

MERGE store.products AS TARGET
USING store.product_updates AS SOURCE
ON (TARGET.product_id = SOURCE.product_id)
WHEN MATCHED AND
/* Update Product IDs that are matched*/
TARGET.product_name <> SOURCE.product_name
OR TARGET.product_desc <> SOURCE.product_desc
OR TARGET.product_price <> SOURCE.product_price THEN
UPDATE SET TARGET.product_name = SOURCE.product_name
, TARGET.product_desc = SOURCE.product_desc
, TARGET.product_price = SOURCE.product_price
/* WHEN NOT MATCHED BY SOURCE THEN
The product has no update, don't do anything
If we wanted to, we could delete rows that weren't found in the updates table */
WHEN NOT MATCHED BY TARGET THEN
/* This must be a new product, insert it! */
INSERT (product_id, product_name, product_desc, product_price)
VALUES (SOURCE.product_id, SOURCE.product_name, SOURCE.product_desc, SOURCE.product_price);
GO
SELECT * FROM store.products;
GO

Pretty sweet! Our products table is now updated with the appropriate changes!

Keeping a history of your changes could be very useful. In this case, we could use it to track all sorts of things – like how sales compared when prices changed, etc. In my daily life, it would be nice to have the changes recorded as a backup in case I needed to undo some particular change and didn't realize it for a few hours or days. OUTPUT comes in handy here. It's not limited to being used with MERGE either – it can be used with INSERT, UPDATE or DELETE (there's a quick UPDATE example after the MERGE demo below). It can just display the changes or it can write them out to a table.

/* What if we wanted to keep a record of changes? Use an OUTPUT clause with the MERGE statement! */

-- undo changes from previous merge
DELETE FROM store.products;
GO
INSERT INTO store.products VALUES
(1, 'Bacon', 'Tasty Bacon', 5.49)
, (2, 'Cheese', 'Sharp Cheddar', 4.79)
, (3, 'Beer', 'Sierra Nevada Pale Ale', 8.49)
GO

/* Create a table to store the historical changes */
CREATE TABLE store.product_history
(change_date DATETIME2 --when change occurred
, change_type VARCHAR(100) --insert, update or delete
, old_product_id INT --beginning product_id
, old_product_name VARCHAR(150) --beginning name
, old_product_desc VARCHAR(400) --beginning description
, old_product_price MONEY --beginning price
, new_product_id INT --new product_id
, new_product_name VARCHAR(150) --new name
, new_product_desc VARCHAR(400) --new description
, new_product_price MONEY) --new price
GO

/* Same MERGE statement with an OUTPUT clause */

MERGE store.products AS TARGET
USING store.product_updates AS SOURCE
ON (TARGET.product_id = SOURCE.product_id)
WHEN MATCHED AND
/* Update Product IDs that are matched*/
TARGET.product_name <> SOURCE.product_name
OR TARGET.product_desc <> SOURCE.product_desc
OR TARGET.product_price <> SOURCE.product_price THEN
UPDATE SET TARGET.product_name = SOURCE.product_name
, TARGET.product_desc = SOURCE.product_desc
, TARGET.product_price = SOURCE.product_price
/* WHEN NOT MATCHED BY SOURCE THEN
The product has no update, don't do anything
If we wanted to, we could delete rows that weren't found in the updates table */
WHEN NOT MATCHED BY TARGET THEN
/* This must be a new product, insert it! */
INSERT (product_id, product_name, product_desc, product_price)
VALUES (SOURCE.product_id, SOURCE.product_name, SOURCE.product_desc, SOURCE.product_price)
OUTPUT
GETDATE() --date of change
, $action --type of change
, DELETED.* --old stuff
, INSERTED.* --new stuff
INTO store.product_history;
GO
SELECT * FROM store.product_history;
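
Since OUTPUT isn't limited to MERGE, here's a minimal sketch of the same history table being fed by a plain UPDATE. Note that $action is only available with MERGE, so the change type is hard-coded (and the 10% price bump is just for illustration):

/* OUTPUT with a plain UPDATE - log a price change to the same history table */
UPDATE store.products
SET product_price = product_price * 1.10 --10% price increase
OUTPUT
GETDATE()    --date of change
, 'UPDATE'   --type of change ($action only works with MERGE)
, DELETED.*  --old values
, INSERTED.* --new values
INTO store.product_history
WHERE product_name LIKE 'Beer%';
GO
SELECT * FROM store.product_history;
GO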

On a side note, I found that Oracle has a similar statement – in fact, it's also called MERGE. Since it's a combination of an UPDATE and an INSERT, it's often informally called an UPSERT. That sounds way more fun than MERGE.

The OS really does matter.

I admit that I’m not terribly proficient in Operating Systems other than those made by Microsoft. Back in 1999 or 2000, I attempted to install Linux on a spare machine in the interest of trying out some cool GUI experience that could be skinned but after about 2 weeks of trying, I still couldn’t get the network card to work and gave up. Other than that, my non-Windows experience pretty much consists of navigating my iPod and iPad. (And sometimes those don’t even make sense to me.)

To further prove my point, the following is an actual conversation that occurs regularly within our household:

My Partner:
Hey, do you know how to do X?
[speaking of a computer related task on her MacBook Pro]

Me:
Sure! You just right-click on …. um, oh yeah.

Today's lesson is something that, in hindsight, should have been glaringly obvious. However, I often take for granted that we all live in a Windows/Microsoft world.

I was working on a side project at work to convert a SQL script originally written for Microsoft SQL Server to Oracle. I took on the challenge with joy as I have a slight affinity for SQL*Plus after years of fighting with it. The original script was a stored procedure, called from SQLCMD, that output a text file in a specific format based on certain data in the database. The format of the text file had some specific requirements based on the process that would ultimately ingest it. Essentially, based on data in the database, I was to select certain values and format them in a specific way, similar to the following:

SQL Server syntax:

SELECT
'Label1: ' + column1 + CHAR(13) + CHAR(10) +
'Label2: ' + column2 + CHAR(13) + CHAR(10)
FROM mytable;

The output in a text file looks something like this:

Label1: First Entry
Label2: First Code
Label1: Second Entry
Label2: Second Code
Label1: Third Entry

Migrating this to Oracle syntax is super easy.

SELECT
'Label1: ' || column1 || CHR(13) || CHR(10) ||
'Label2: ' || column2 || CHR(13) || CHR(10)
FROM mytable;

Plusses are replaced with double pipes; ASCII calls of CHAR() are replaced with CHR(). Simple.

So, I re-write the stored procedure (it was a bit more complex than my examples above) and create a .bat file to execute it, based on appropriate input, using SQL*Plus and a ton of parameters to make the output real, plain text.

SQL*Plus formatting parameters such as these limited my results to plain old text without headers, extra blank spaces after lines, etc.:

SET NEWPAGE 0;
SET SPACE 0;
SET LINESIZE 32767;
SET PAGESIZE 0;
SET ECHO OFF;
SET FEEDBACK OFF;
SET HEADING OFF;
SET MARKUP HTML OFF SPOOL OFF;
SET VERIFY OFF;
SET TRIMSPOOL ON;

If you haven't guessed already, those CHAR(13) and CHAR(10) … (or CHR(13) and CHR(10)) calls are important. They are the ASCII characters for, in order, Carriage Return (13) and Line Feed (10).

So, I modify the Stored Proc to the appropriate syntax, make a ton of SQL*Plus formatting calls to get plain ol’ text out and test out the script against the database copy I have in the office. Golden. Off to the customer the script goes.

[Here's where I save you the back and forth between my contact, their input guys, our project manager, etc. which went on for quite some time.]

A few hours later they report that the export file hoses (aka breaks) the input processor. After a bit of debugging on their end, it appears that the export file looks like this in a text editor with symbols on:

Label1: First Entry[CR][CR][LF]
Label2: First Code[CR][CR][LF]
Label1: Second Entry[CR][CR][LF]
Label2: Second Code[CR][CR][LF]
Label1: Third Entry[CR][CR][LF]

The input isn’t designed to handle multiple [CR]s – it wants a [CR][LF] combination at the end of each line – and nothing else.

So, I re-run the SP in my test environment and view the output in a text editor with symbols on:

Label1: First Entry[CR][LF]
Label2: First Code[CR][LF]
Label1: Second Entry[CR][LF]
Label2: Second Code[CR][LF]
Label1: Third Entry[CR][LF]

Interesting, right? My output is exactly as expected. Theirs is different. WHY?

After analyzing the differences in environments, the only thing I can come up with is that my Oracle instance is Windows-based and theirs is Unix-based. SQL*Plus is still running on Windows in both environments, but … even if you're selecting complete gibberish from a table (be it no table at all from SQLCMD, or DUAL from SQL*Plus), you'd think the results would be the same. It's only a Windows command line, after all …

Not so. And that’s my lesson learned today. Regardless of the tool used to access the database (afaik), the output is still dependent on the OS of the database host environment.

In the Windows world, the character [LF] (aka CHR(10) or CHAR(10)) is a Line Feed, and the character [CR] (aka CHR(13) or CHAR(13)) is a Carriage Return. In the Unix world (as best I could determine from a few Google searches), CHR(10) ends up translating to both a carriage return and a line feed in the output, while CHR(13) is still just a carriage return.

Although SQL*Plus was being run from Windows, the database backend was Unix and thus, so was the output. And so:

SELECT
'Label1: ' || column1 || CHR(13) || CHR(10) ||
'Label2: ' || column2 || CHR(13) || CHR(10)
FROM mytable;

Translates to the following on a Unix based database:

Label1: First Entry[CR][CR][LF]
Label2: First Code[CR][CR][LF]
Label1: Second Entry[CR][CR][LF]

I modified the original query as follows:

SELECT
'Label1: ' || column1 || CHR(10) ||
'Label2: ' || column2 || CHR(10)
FROM mytable;

And the results come out like this (against a Unix based database):

Label1: First Entry[CR][LF]
Label2: First Code[CR][LF]
Label1: Second Entry[CR][LF]

So yes, in hindsight, this should’ve been obvious – even if it’s gibberish SQL, it’s still being executed against the actual database. And, if that database uses a different OS, the results may vary.

Lesson learned.

Migration Options: DB2 to SQL Server

I did absolutely nothing for my 70-433 prep this week because I was distracted by a new project at work that really has me excited. So excited, in fact, that I've been thinking about it, working on it and even dreaming about it all week, minus a few potty breaks. Although I'm failing on my studying, I learned a ton.

The Mission:

Migrate a database from DB2 (UDB) to SQL Server (edition unknown).

Factors:

  • Portions of the schema are predefined.
  • Portions of the schema are dynamic. Many tables and their columns will vary based on configuration.
  • The database is large. (Exact size unknown, ~several hundred GBs.)
  • Time will be short. (Production DB, needs to be back online ASAP.)
  • There isn’t a DB2 option in SSMA. (Microsoft does offer a White Paper though.)

Options:

  1. Plain Old Export – Export the data to text files and use BULK INSERT to re-ingest it.
  2. Double Migration – Migrate from DB2 to Oracle using SQL Developer, then migrate from Oracle to SQL Server using SSMA.
  3. SSIS – Use SQL Server Integration Services to migrate the data.
  4. Something Else – What I ended up going with. We’ll get to that later.

What I Learned:

Since this blog is about what I learned, let’s explore the pros and cons of each of the above mentioned options. Some are obvious (like option #3) and some took a lot of digging, trial and error to realize.

Option 1: Plain Ol’ Export.

Because some of the tables are dynamically created and named, and their column names and data types are also dynamic, my migration strategy needs to deal with not only the data, but the schema as well.

A DB2 command line utility called db2look was a great solution for this. The options are fairly flexible and create the requisite DDL in a text file, ready for execution in another system. Well, sort of ready.

The db2look reference offers several options to create DDL for specific schemas or creating users, to generate views, to grant permissions and more. I went with a fairly simple approach – just create the tables for my specified schema.

db2look -d mydatabase -z myschema -i username -w password
     -e -o myddlfile.sql

The resulting file contained the DDL for all tables in my schema, including index creation and constraints. Of course, the DDL used DB2's data types, so I used a text editor to find and replace them with the ones I needed in SQL Server. I also used a text editor to comment out the index creation and constraint sections, since I don't want to create those until after the data migration, for performance reasons. This wasn't the most elegant solution, but it wasn't terribly time consuming and it got the job done well enough to move on to the data migration portion.
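
To give an idea of the find-and-replace involved, here's a made-up (but representative) example – a table the way db2look might emit it, and the SQL Server version after swapping the data types:

-- DB2 version (hypothetical table, roughly as db2look generates it)
CREATE TABLE "MYSCHEMA"."ORDERS" (
  "ORDER_ID"    INTEGER NOT NULL,
  "ORDER_NOTE"  CLOB(1M),
  "ORDER_DATE"  TIMESTAMP,
  "ORDER_TOTAL" DECIMAL(19,4)
);

-- SQL Server version after the data type find-and-replace
-- (INTEGER -> INT, CLOB -> VARCHAR(MAX), TIMESTAMP -> DATETIME; DECIMAL carries over as-is)
CREATE TABLE "MYSCHEMA"."ORDERS" (
  "ORDER_ID"    INT NOT NULL,
  "ORDER_NOTE"  VARCHAR(MAX),
  "ORDER_DATE"  DATETIME,
  "ORDER_TOTAL" DECIMAL(19,4)
);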

DB2 includes several command line utilities useful for exporting data to a file. There are features similar to the SQL*Plus spool command, but primarily I worked with the aptly named EXPORT command to dump the contents of my query to a file.

Along with the OF DEL parameter, DB2 EXPORT can be used to dump the results of any query to an ASCII text file. The chardel parameter specifies the character string delimiter if you want to use something other than the default (double quotes). Or, you can use nochardel to eliminate character string delimiters altogether. Non-default column delimiters are defined with the coldel parameter. (The default is a comma.) And the DB2 timestamp format can be modified using the timestampformat parameter. (A full listing of the export modifiers can be found here.)

After working through several kinks with the Export options, I came up with a command that looked something like this:

db2 export to mytablename.txt of del modified by chardel''
     timestampformat="yyyy-mm-dd hh:mm:ss.uuu"
     select * from mytablename

This produced export files that looked something like this:

123    'my string'    '2011-11-01 09:23:12.000'    456    ''    789

Note the two single quotes in the 5th “column”. Those would turn out to be the bane of my existence with this option. You see, that’s not a null and it’s not a space. It’s an empty string.

When I first attempted to use BULK INSERT to ingest this data, it choked on the fact that there were single quotes around the dates (which are datetime, the original type). If I chose to export without the single quotes (i.e. using the nochardel parameter), the dates imported just fine into SQL Server, but that 5th column was interpreted as a null instead of an empty string.

I went back and forth with the export format and ingestion trying to get the dates to be dates, the strings to be strings and the empty strings to NOT be nulls but I just couldn’t get it right 100% of the time.

Finally, I turned to format files in an effort to explain the delimiters and file contents to BULK INSERT.

I ended up using the chardel'' option in the DB2 export (so strings were wrapped in single quotes) and a format file that looked something like this:

9.0
6
1       SQLCHAR       0       4       ",'"     1     entryID      ""
2       SQLCHAR       0       255     "','"    2     entryName    SQL_Latin1_General_CP1_CI_AS
3       SQLCHAR       0       24      "','"    3     entryDate    ""
4       SQLCHAR       0       4       "',"     4     entryValue   ""
5       SQLCHAR       0       100     ","      5     entryString  SQL_Latin1_General_CP1_CI_AS
6       SQLCHAR       0       4       "\r\n"   6     entryAltInt  ""

This method again got me around the single quotes with the non-empty strings and the dates, but even if I included an extra quote in column 4 and 5's delimiters, I'd end up with at least one single quote in the imported data. Better than a null, but still not correct. (And even if using a format file worked, I'd have to create one for every table – including the ones I don't know about yet.)
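
For reference, pointing BULK INSERT at a format file like that looks roughly like this (the paths and names are placeholders, not my actual ones):

BULK INSERT myschema.mytablename
FROM 'C:\exports\mytablename.txt'
WITH (
    FORMATFILE = 'C:\exports\mytablename.fmt',  --the non-XML format file shown above
    TABLOCK                                     --take a table lock to speed up the load
);
GO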

Since I wasn’t the first to deal with the dreaded ‘empty string vs. NULL‘ issue unsuccessfully, I decided to put this method on hold for a while and see what other options were out there.

Option 2: Double Migration

While the SQL Server Migration Assistant (SSMA) doesn’t offer a DB2 option, Oracle’s SQL Developer does. This strategy was actually suggested by a co-worker as it would probably be the most accurate in dealing with data type conversion, empty-strings vs. nulls and a host of other issues. While it does sound like a good choice for accuracy, it sounds absolutely terrible for efficiency.

I’ve never used SQL Developer to migrate a database but I have used SSMA to migrate a database from Oracle to SQL Server. Since it was a personal test system and I didn’t really care how long it took, I didn’t bother to disable indexes or do anything else that might make performance better. Still, the GUI portion for data type mapping and the schema creation took at least an hour and the data migration took at least 4-5 hours for a 1.7 GB database.

Given those numbers, the potential database size and the fact that not one but two migrations would need to occur, I figured I was looking at about 60 days for a double migration (1.7 GB in 4 hours == 300 GB in 704 hours; times two), minus whatever time I could squeak out by disabling indexes. (Of course, I'll have to rebuild them afterward anyway, but I'll have to do that in any scenario.) Because of this, I chose to save this option as a last resort and didn't bother testing it.

Option 3: SSIS

Prior to this project, I had never used SQL Server Integration Services (SSIS) but I’d read plenty about the power of this tool. I gave myself a crash course on the UI, touched base with our resident SSIS expert for some pointers and got to work.

Since my schema was already created from testing Option 1 and since that method seemed fairly quick for that task, I chose to just attempt the data migration with SSIS.

First, I created a new SSIS project in Business Intelligence Development Studio (BIDS) 2008. I set up two data connections – one to my SQL Server 2008 "destination" and one to my DB2 UDB "source". For the DB2 connection, I installed Microsoft's OLE DB Provider for DB2, as it would be much faster than using ODBC. (I already had the IBM OLE DB Provider for DB2 but had read a few notes online indicating that the Microsoft one was faster.)

After this initial setup, I created a ‘Data Flow’ task in the Control Flow and configured the Data Flow to migrate data from the source database (DB2) to the destination database (SQL Server 2008 R2). In my first go, I selected a specific table for migration using the ‘Table or view’ option of the Data Access Mode. (Not unlike what you see here, although this is for Azure and also uses some ODBC.)

I did no data type mapping, just let the migration occur based on the data types of my self-created schema. I executed the package and found it was quite speedy against the largest table in my test database. The data looked correct too – DB2 timestamps were SQL Server datetimes, empty-strings were actually empty-strings, nulls were actually nulls. Very nice.
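
The empty-string vs. NULL piece is easy enough to spot-check after a load with a query along these lines (table and column names are placeholders):

-- Distinguish true NULLs from empty strings in the migrated data
SELECT
  COUNT(CASE WHEN some_column IS NULL THEN 1 END)         AS null_count
, COUNT(CASE WHEN some_column = '' THEN 1 END)            AS empty_string_count
, COUNT(CASE WHEN DATALENGTH(some_column) > 0 THEN 1 END) AS non_empty_count
FROM myschema.mytablename;
GO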

Next, I attempted to use SSIS to perform the migration for the entire schema. I'd noticed a "Table name or view name from variable" option. I decided to use this option, along with a For Each Loop, to call my Data Flow task and loop through a list of tables.

I set up a table name variable within the project and created a For Each Loop that iterated through the list of tables, setting the variable on each pass. Within the For Each Loop was my Data Flow task, which took data from the source (DB2) using the variable-set table name and migrated it to the destination (SQL Server 2008 R2) into the variable-set table name. This worked great for the first table and then promptly came to a screeching halt, yelling at me about not being able to find a column in my second table.

As I investigated further, noticing that the "missing" column didn't even belong in Table 2 (rather, it was the first column of Table 1), I found that the column names and data types are actually strongly typed within the SSIS package, which means they can't be overridden at run time. So, while you can use SSIS to loop through a list of tables, each of those tables needs to have the same column structure and order. Rats. Foiled again.

So, unless I want to create a separate package for each table structure – including tables and columns that I won't know about ahead of time – it looks like SSIS might not be the right choice either.

(NOTE: While trying to find my reference links from the SSIS portion, I came across this post from Strate SQL which reiterates my problem with the Data Flow task in SSIS. It looks like the article might have an alternative solution that I'll have to look into.)

Option 4: Something Else Entirely

After having noted the speed with which SSIS was capable of migrating data between the two platforms, I was curious – why wasn't there some way I could tap into this method? Or was it really the same as using BULK INSERT or bcp?

What I found is that SSIS actually uses INSERT BULK, “which is a mystery unto itself” according to this article. Thankfully, someone else had solved this “mystery” and reported that INSERT BULK is “produced by client bulk copy API. You cannot use the INSERT BULK syntax directly from client or server-side code. Instead you have to use one of the Bulk Insert APIs like .NET SqlBulkCopy …“.

Although I’ve wanted to do some “real” programming for a while, I’d never found the right project to really get me into it. It turns out, this project was just the motivation that I needed.

So, I fired up Visual C# 2010 and got to work writing a console application that would utilize the SqlBulkCopy class (in System.Data.SqlClient) to import data into SQL Server. While my code is rather simple, since it was written for work, I can't share it directly. However, the sample C# code on the MSDN page for SqlBulkCopy was my starting point. I combined that with an OleDbConnection to the DB2 database and some text reading components to loop through a list of tables and perform a migration of the entire schema.

Using this method, the entire 1.6 GB database transferred in less than 9 minutes. Following the same equation used for Option 2, this means my full migration will take approximately 28 hours (1.6 GB in 9 minutes == 300 GB in 28 hours). That’s pretty fast!

My next steps involve turning my console application into a Windows Form application, making it more configurable (input and verify connection strings, specify table list file, etc.) and possibly using this same method to generate and execute the schema creation DDL as well as the indexes and constraints after the migration is complete. I’ll also be bulking up the test database to at least 10 GB to do more performance testing of this method.

It was quite an exciting week and I'm looking forward to getting back to the project after the weekend.

Trusting The Material

My primary resource in studying for the 70-433 exam is Microsoft’s Self-Paced Training Kit for MCTS Exam 70-433. While I understand that mistakes will happen, I was quite disappointed to find that their description of GROUP BY in relation to ROLLUP and CUBE was actually wrong!

The topic of ROLLUP and CUBE in addition to GROUP BY can get a bit confusing, I’ll admit, but it was the GROUP BY portion that they got wrong in the book!

The book gives the following query (for AdventureWorks2008)  as an example:

SELECT Production.Product.ProductSubcategoryID
, AVG(Listprice) AS 'Average'
, MIN(Listprice) AS 'Minimum'
, MAX(Listprice) AS 'Maximum'
FROM Production.Product
WHERE ListPrice <> 0
GROUP BY Product.ProductSubcategoryID

The book then cites the explanation of the query results as:

“The top row, where the ProductSubcategoryID is listed as NULL, is the summary row that provides the average, minimum, and maximum list prices of products across all subcategories.”

NO. No it isn’t. It’s the average, minimum and maximum of all entries in the Production.Product table that have a ProductSubcategoryID of NULL. And here’s how to prove it.

If you query the table, you’ll see there are 209 entries with a ProductSubcategoryID of NULL. If we perform these aggregate functions on just those NULL ProductSubcategoryIDs, we get the exact same results:

SELECT Production.Product.ProductSubcategoryID
, AVG(Listprice) AS 'Average'
, MIN(Listprice) AS 'Minimum'
, MAX(Listprice) AS 'Maximum'
FROM Production.Product
WHERE ListPrice <> 0
AND ProductSubcategoryID IS NULL
GROUP BY Product.ProductSubcategoryID

Total rows that appear with the label of NULL are a product of using WITH ROLLUP or WITH CUBE with a GROUP BY, but any NULLs you see in a plain old GROUP BY are a product of the actual table data.
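
One way to see the difference for yourself is to add WITH ROLLUP along with the GROUPING() function, which returns 1 for the summary rows and 0 for rows where the NULL came from the table data:

SELECT Product.ProductSubcategoryID
, GROUPING(Product.ProductSubcategoryID) AS 'IsSummaryRow' --1 = ROLLUP summary row, 0 = actual data (even when NULL)
, AVG(Listprice) AS 'Average'
, MIN(Listprice) AS 'Minimum'
, MAX(Listprice) AS 'Maximum'
FROM Production.Product
WHERE ListPrice <> 0
GROUP BY Product.ProductSubcategoryID WITH ROLLUP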

While I'm happy that I understand this topic and I'm not worried about it on the exam (unless the test creator is also the book author/editor, uh-oh!), I am a bit concerned about the quality of the material I'm learning from. What happens if the information they provide about querying XML or full-text search is equally "off"? I could be going into this exam with completely wrong information, and that would be a massive waste.

30 Days to 70-433

Yesterday was Day 1 of my new short-term goal: take and pass the Microsoft 70-433: SQL Server 2008 Database Developer exam by December 1st.

Now, before you worry too much, please know that this is not an attempt at an exam cram or a brain dump style of test taking. My goal is to actually know the material well enough to pass the exam.

I’ve been studying for the exam off and on for nearly 10 months. Many of the concepts are not new to me, in fact I’ve worked with a lot of them for over 9 years – since I first touched SQL Server. But other concepts are new and I’ve struggled with them (and/or not given them as much attention).

I have several tools in my arsenal in preparation for this project:

Today, I took a practice exam (which came with the book) and failed miserably (42%). I plan to focus on one chapter at a time – the practice exams can be customized to focus on a single chapter as well – and master one every 4-5 days.

I’ll update the blog with my progress as I go – things that I learn and things that I struggle with alike.

Wish me luck! When this is over, I’ll only have to write a white paper for work before the end of the year!

What did you learn today?

Did your parents ask you something similar when you were growing up? Mine usually did, but perhaps that was the product of having a teacher for a mother. I was thinking about this the other day and thought, what better topic for a blog?! I mean, it’s great if people want to read and find merit or meaning in my posts, but if nothing else, I will have a record of things I’ve learned that I can fall back on when I’m old(er) and I forget.

What I learn each day varies greatly – sometimes it’s a new guitar technique or culinary inspiration. Other times it may be some uber-geeky database related trick. And, occasionally, it may be a soft skill – those that I often overlook as valuable learning experiences.

Whatever it is, I’ll try to share it here.