Hello, everyone, I have decided to do a little branching into some of those things that isn’t aren’t exactly mainstream SharePoint. But it does happen on the database administrator side of life. I had a scenario where something had to be done, and there were quite a few things in the past few weeks, and I thought this might come handy out there. So, what I am going to be doing is to do what I would call a SQL Server Series. We all have tools in our toolbox. Having different tools and methods in your toolbox means that you can choose the right tool for the job when you’re confronted with a new duty. This article is an exercise in looking at some of the tools available and using several of them to accomplish a duty. . It also illustrates some techniques for manipulating text file data that aren’t immediately intuitive.
MCSE Training – Resources (Intense)
This article and solution use the following tools.
1. DOS batch files
2. Ultra Compare by IDM Solutions
3. SQL Server 2008 Standard Edition
A while back my Manager Manager asked me to compare directories on two indistinguishable web servers to be certain that the folders and the files were really identical. The following criteria would qualify a file or directory as being different:
1. The file or directory is missing from one server
2. The date stamp is different
3. The size is different
There were several things I considered in determining how to go about the task:
1. The approach had to work against the live web servers while they were online, so intense CPU operations were something to be avoided.
2. Because the web servers are online, I need to run the comparison off-hours, so automation is important.
3. We had some tools available, but I couldn’t spend money on new ones. We had a tool called Ultra Compare, which could run in command line mode and then perform the comparison I intended to do.
4. The web servers are in the DMZ and the SQL Servers are found inside the domain. The SQL Server process could not access the file system on the web servers directly.
Batch files are productive at for file system operations, ; Ultra Compare is efficient at performing the file system comparison and SQL Server thrives at manipulating data. . After a brief proof of concept, I decided on the following approach:
Use a batch file to process the comparison.
Call Ultra Compare to perform the comparison itself, executing each web root individually and and writes writing the output into individual text files.
Concatenate the output files into a single file for delivery to the database.
Clean up the individual output files.
Use SQL Server to manipulate the data and produce the report.
Import the single file into a staging table.
Manipulate and clean the data.
Populate the data into a single reporting table.
NOTE: Our web servers have internal IP addresses with 25 and 27 in the fourth octet. The In the remaining parst of this article and scripts; , I would will refer to them by these numbers. You could refer to them as whatever else makes sense to you to keep them separate. In this article, I have changed the IP addresses, servers, shares, and sites in the interest of security.
Comparing files Files Using Ultra Compare\
I chose Ultra Compare to carry out the comparison because it did exactly what I needed to do, it had a command line mode, and we own it already. It supports quite a few command line switches and the following command is the one that will compare the files in two directories and write the results to an output file.
– uc -r -dmb -vert -ne
\\ 172.16.123.55 \ publicsites$ \ myxdrive.com
\\ 172.16.123.57 \publicsites$\ myxdrive.com -o z:\temp\test_comp.txt
See below what the different parts of the command mean:
-r = means to search recursively; include sub-directories
-dmb = sets the folder compare to basic (presence, time and size)
-vert = sets the output to vertical
-ne = means to show differences in the output
\\ 172.16.123.55 \publicsites$\ myxdrive.com is the 1st folder to compare
\\ 172.16.123.57 \publicsites$\ myxdrive.com is the 2nd folder to compare
The outcome after comparison is formatted for reading in DOS. The initial two lines determine what directories are being compared. , Thenfollowed by the header rows, detail rows, and finally a footer row. It’s not the friendliest method in the world for importing into a database. Parsing the file will be dealt with once we get it loaded into the database.
Performing the Comparison
I was now posed with the challenge to of keeping each comparison isolated to its own text file because I needed the headers in place to stand as the directories compared. . This also permits me to process a single directory during development. The list is never static, so a loop in DOS would will do the magic. .
Below is the code to accomplish this:
echo Folder Comparison Started
for /f %%g in (‘dir %strServer55% /ad /b ^| findstr /v dir_to_exclude’) do (
<nul (set /p x= Processing %%g…)
uc -r -dmb -vert -ne “%strServer55%\%%g” “%strServer57%\%%g” -o “%strOutput%\%%g.uc.txt”
Folder Comparison Completed
The for /f performs everything in the do loop for each item returned by the dir command. . The %%g variable is the index of our loop.
The dir command returns a bare listing of directories, but skips the “dir_to_exclude” folder completely. . You don’t have to include the findstr command at all if you don’t have a directory to exclude.
The <nul command is a way of displaying progress while executing a batch file. . It displays the text but doesn’t output a line break like the echo command does.
The uc command invokes Ultra Compare to compare the current directory on each server.
The echo command outputs the word “done” and adds a line break afterwards
The <nul and echo commands are solely for aesthetics and aren’t necessary unless you want feedback as you’re developing and running your script.
Setting up Up the Comparison Result for SQL Server
Once all the yield records are made, I expected to move them into SQL Server for investigation. The least demanding approach to do this is to unite the individual file into a solitary file to duplicate to the server and import into a database.
The deletion step guarantees that I’m beginning with an unfilled file and not one I made in a past run.
Then a DOS loop will concatenate the output files into one.
echo Post-ComparisionComparison Started
<nul (set /p x= Deleting Consolidated File…)
if exist “%strOutput%\web_compare.txt” del “%strOutput%\web_compare.txt” /f /q
<nul (set /p x= Consolidating Comparison Files…)
for /f %%g in (‘dir %strOutput%\*.uc.txt /a-d /b’) do (
type “%strOutput%\%%g” >> “%strOutput%\web_compare.txt”
One thing that always disturbs me forever is a procedure that runs and leaves a considerable measure of shrapnel behind by not cleaning up after itself. All things considered, tThis little group document made one yield record for every directory in addition to one united document. The individual ones are no more longer required, so they should be deleted. Utilizing Using the same naming tradition that I utilized as a part of making the yield, they can be deleted en masse.\
<nul (set /p x= Deleting Comparison Files…)
del “%strOutput%\*.uc.txt” > nul
The final step is to copy the single comparison file over to the SQL Server for analysis.
<nul (set /p x= Copying to SQL Server for Processing…)
copy “%strOutput%\web_compare.txt” “\\mysqlserver\myshare$\temp\web_compare.txt” /y > nul
The batch file is now completed. . The only file I’ve left on the web server is the consolidated output file with the all the differences, so the web server is not accumulating a bunch of junk it doesn’t need. . Once the process is in production, I can delete the consolidated file after I copy it to the SQL Server.
Importing the Output File
Let’s take another look at our output file format.
There are numerous blocks like this, one for every directory. . The first two rows are the folders that were compared and the remaining files are broken into columns. . Given the format, I’m going to import into a single column in the staging table and have the parsing is performed later. . Note that an identity column and columns to hold the paths are being created now, but they won’t be populated during the initial load.
The import of the file into our staging table is simple, with just one bulk insert statement.
Here’s the very simple format file:
Now that the data is imported into the staging table, here’s what the table looks like.:
Parsing the Data
Here’s where the challenging and fun part (at least for me, but I like this stuff ;-)) begins. . I need to get the staging table columns populated with the parent paths and clean out the junk rows that aren’t needed. . There are several different steps here, so we’ll tackle them one by one.
The first step is to populate the ParentPath55 column, which is the path on the first server being compared. . Take a look at the first two rows in the staging table above. . The first row of each block of rows is the one that needs to be populated into the ParentPath55 column. . The SQL is simple enough. . I know this is a non-SARGable predicate, but I found the performance to be similar to using the LIKE operator and I don’t have to deal with the regular expression syntax.
Our staging table now looks like this:
The next step is to populate the ParentPath57 column in the same row where we just set ParentPath55. The value will come from the RawLine column in the row following the ParentPath55 row we populated. Not wanting to fall into the temptation to loop unnecessarily, I’ll employ a set-based solution by joining the table to itself based on the primary key. Put another way, for each ParentPath55 row, I’ll update the ParentPath57 column with the value from the row immediately following it.
Here’s our staging table now:
We now have a situation where the ParentPath55 and ParentPath57 columns are populated for the first row starting the set of rows for each root directory. The detail rows are the ones I really need, but they’re not populated yet. . I’ll employ a quirky update (also known as a three-part update; see http://www.sqlservercentral.com/articles/T-SQL/68467/ for more information) to populate the parent path columns with the first populated row in each block. . This again avoids the temptation to employ a loop and is nasty fast. . Remember we’re on SQL Server 2008, so we don’t get to use the LEAD and LAG window functions.\
The path columns in the staging table are now filled in for every row.
Deleting Unwanted Rows
Finally, I have a columnar data set I can work with. . The only thing left is to clean out the unneeded rows. . The empty lines, the header separators, and the lines denoting the parent paths aren’t needed, so a single update statement will handle everything. . Yes, there are non-SARGable predicates in here, but the same situation applies to the SUBSTRING as before and the NULLIF handles the NULLs and empty values. . We could accomplish the same thing with multiple update statements, but that proved to take longer than this single update.
Building a Permanent Table
Next, we have to build and populate a permanent table. . Note that a column to store the file type is included in the table. . I won’t populate it initially, but instead I will do it later.
Using a divide and conquer approach, I’ll use everyone’s reliable friend DelimitedSplit8K (You you would can find this in Jeff Moden’s “Tally OH!” article at http://www.sqlservercentral.com/articles/Tally+Table/72993/ for more information) to split up the single column and a crosstab query to break the data into columns. . I’ll then trim off the leading and trailing spaces that were brought in from the comparison. . Finally, I’ll clean the data in the columns a bit more and convert them to our proper data types. . When a file or directory is on one server but not the other, Ultra Compare populates the word “none” for the date stamp and size. . As I know of no way to convert the word “none” into either a Datetime or an Integer, I’ll NULL out those values.\
The Path55 and Path57 columns still have the date stamp in the column, which I don’t really need. . Getting rid of it is as simple as this.:\
Yes, this could have been done in when the table was populated, but I chose to break it out into a separate step. . If, for some reason, the boss wants to see it along with the root path, the value can be left there or split into separate columns before being cleared out.
The permanent table is getting there. . Here’s what it looks like now.
Note that I didn’t populate file type at all in the permanent table. . I could accomplish this by using a LIKE operator and a cross apply with a massive list of values, but since I’m going to be using this again and it’s probably going to grow, it makes sense to create a physical table with the mask to match and then use our cross apply against the table. . This is, by no means, an exhaustive list, but it covers many of the file types we use. . Yes, the leading % (wildcard) character means that the query will not be able to use an index in subsequent updates, but it’s the cleanest way I know of to match different conditions and determine the file type based on the name.
Now that the list is created, I’ll use it to set the file type based on the filename on 55. . Where it’s missing, I’ll use the filename on 57 in a separate pass. . Where it still isn’t covered, I’ll just set the type to “Uncategorized” and let the individual application owners figure it out. . Because the list is in a table, anything I missed when I created the list (and there will be some) can simply be added to the table and won’t require any code updates to set the file type properly.\
Our permanent table is now complete. . Here’s what it looks like:
Other Data Rules
I now have a single, denormalized table with all the differences. . If there’s more work to do in defining what gets taken out of the set, they can be deleted as appropriate. . For example, if you don’t care about images and backup files, those rows can be deleted by using the FileType column. . If you don’t want to see any difference more than two years old, they can be deleted by using one of the date columns.
If new rules are defined, we can rebuild the data with the new rules without performing a new comparison of the file system; we can simply reload our text file. . Remember, the last one is still on the SQL Server and the web server.
Where we go from here depends on our reporting requirements. . If the boss just wants an Excel file with all the differences, we can run a query, copy/paste, and deliver it. . If he wants a web page with filters and grids, we can normalize the data into an efficient reporting table structure, index it appropriately, write a simple web page and let people use it to their heart’s content.
An obvious summary query answers the question of what file types are mismatched most often. . We can see a summary of our mismatched file counts by type by a simple query:
Another easy one is what frequently-used extensions we don’t have covered in our table of file types. . Those can be queried like this:
Expansion and Adaptability
There are many ways to customize this process to suit your own particular needs. . What you do depends on your exact situation. . This was written to solve a specific problem, but the techniques could be used to solve other unique situations.
Many individual steps in this process could be changed to suit your particular situation. . For example, if you don’t have Ultra Compare and you can’t get approval to buy it, you could use your tool of choice. . You’d have to change the process to parse the incoming text, but it could be done.
Personally, Ultra Edit is my text editor of choice and I think Ultra Compare does a fine job of comparing files and folders. . It isn’t free, but is relatively inexpensive. . I’ve used it for years and find it to be well worth the money, but we all have our favorite tools. . For information on Ultra Edit or Ultra Compare, see the Ultra Edit site at http://www.ultraedit.com/.
Other Scenarios and Next Steps
If the requirement is that this process is run weekly and a new data set populated to our reporting page, we could schedule make the batch file as a scheduled task in the operating system, convert the SQL into a procedure and create a database job to run the procedure. . The only requirements are that the batch file runs from a machine that has Ultra Compare installed and also has visibility to the file systems of the both the web servers and the SQL Server.
If you have to run the process regularly and the SQL Server can see the file system on the web servers, you could have the stored procedure do all the work. . It could run the batch file using xp_cmdshell and then process the results. . The only requirements are that Ultra Compare is installed on the SQL Server and the web server file system visibility exists. . Personally, I don’t like installing software on my SQL Servers, but that’s just my preference. . I’d rather run it elsewhere.
Where you go from here with this ends with the most famous answer of all: It depends.