In an ideal world, languages and packages would be standard enough that one could readily port a stored procedure from one vendor's database and reuse the code without too much trouble in another vendor's database. While most SQL is fairly portable, the stored procedure languages are highly proprietary and do not port at all from one vendor to another (one exception: IBM's support of Oracle's PL/SQL language).
The SQL Server solution takes a different approach from the Oracle PL/SQL solution. In the latter, an Oracle PL/SQL procedure used built-in Oracle packages to open and manage an HTTP connection, to read data from the HTTP socket, and to write the data to the console. In practice, instead of writing the data to the console, we might have processed the data directly or loaded it into a staging table for later processing. Rather than use T-SQL to manage the HTTP connection, the SQL Server solution uses an external program to read the data from the web into a file on a Windows server, then bulk imports that file into a SQL Server staging table. Once the data is in the staging table, we can process it as needed by our application.
Let's look at the SQL Server solution. First, we'll again use the Federal Reserve data from the previous post. Put the URL https://www.federalreserve.gov/datadownload/Output.aspx?rel=H10&series=122e3bcb627e8e53f1bf72a1a09cfb81&lastObs=10&from=&to=&filetype=csv&label=include&layout=seriescolumn&type=package into your browser. When your browser opens the URL, it downloads a CSV file containing the last ten observations of three dollar indexes; we'll see its contents in a moment. Next, create a staging table in SQL Server to hold the data:
use MyDatabase
go

create table stageFedRates (
    timePeriod   date,
    jrxwtfb_nb   dec(15,8),
    jrxwtfn_nb   dec(15,8),
    jrxwtfo_nb   dec(15,8)
);
go
We have a couple of choices of programs to pull data from the web into an external file: one option is wget, another is curl. Curl is already installed on my Windows 10 computer and my Windows Server 2008 machine, so we'll use curl. The syntax is pretty simple: just type curl and the URL enclosed in double quotes. Don't forget the double quotes -- the URL is full of special characters that the command shell will try to interpret if they aren't quoted. The output goes to STDOUT by default. Let's try it in a command shell window:
C:\> curl "https://www.federalreserve.gov/datadownload/Output.aspx?rel=H10&series=122e3bcb627e8e53f1bf72a1a09cfb81&lastObs=10&from=&to=&filetype=csv&label=include&layout=seriescolumn&type=package" "Series Description","Nominal Broad Dollar Index ","Nominal Major Currencies Dollar Index ","Nominal Other Important Trading Partners Dollar Index " "Unit:","Index:_1973_Mar_100","Index:_1973_Mar_100","Index:_1997_Jan_100" "Multiplier:","1","1","1" "Currency:","NA","NA","NA" "Unique Identifier: ","H10/H10/JRXWTFB_N.B","H10/H10/JRXWTFN_N.B","H10/H10/JRXWTFO_N.B" "Time Period","JRXWTFB_N.B","JRXWTFN_N.B","JRXWTFO_N.B" 2018-07-09,123.4946,89.6613,160.2890 2018-07-10,123.4637,89.7541,160.0923 2018-07-11,123.8276,89.8616,160.7776 2018-07-12,123.7232,90.1620,160.1318 2018-07-13,124.0710,90.2872,160.7554 2018-07-16,123.8317,90.0232,160.5674 2018-07-17,124.1553,90.4107,160.7806 2018-07-18,124.3573,90.5535,161.0480 2018-07-19,125.0530,90.9072,162.1562 2018-07-20,125.0484,90.9072,162.1456 c:\>
This is the same data we downloaded through the browser earlier. Two useful options of the curl command are --stderr and --output, which name the files that receive STDERR and the downloaded data, respectively:
C:\>curl --stderr curlErrors.txt --output curlData.csv "https://www.federalreserve.gov/datadownload/Output.aspx?rel=H10&series=122e3bcb627e8e53f1bf72a1a09cfb81&lastObs=10&from=&to=&filetype=csv&label=include&layout=seriescolumn&type=package"

C:\>dir curl*.*
 Volume in drive C has no label.
 Volume Serial Number is 1836-25CF

 Directory of C:\

2018-07-28  03:38 PM               804 curlData.csv
2018-07-28  03:38 PM               399 curlErrors.txt
               2 File(s)          1,203 bytes
               0 Dir(s)  352,925,888,512 bytes free

C:\>type curlErrors.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   804    0   804    0     0    804      0 --:--:-- --:--:-- --:--:--  2233
Now that we can pull data from our source, let's look at loading it into a SQL Server table. In a new SSMS window, use the BULK INSERT command to load the table:
use MyDatabase
go

Bulk Insert stageFedRates
from 'C:\curlData.csv'
with (
    fieldterminator = ',',
    rowTerminator   = '\n',
    firstRow        = 7,
    maxErrors       = 999999999
)
The fieldTerminator and rowTerminator options specify that the input is a CSV whose rows are delimited by newline characters and whose columns are delimited by commas. The download begins with six rows of metadata, so we specify firstRow = 7 to start loading at the first data row. Finally, maxErrors specifies how many errors the import will tolerate before aborting the load. The right value depends on the data: if the data is very clean, specify a small number, but if not, specify a large number so all the loadable rows make it into the table for post-load cleanup.
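If the feed turns out to be messy, BULK INSERT also accepts an ERRORFILE option that writes the rejected rows to a file you can inspect after the load. Here is a minimal variant of the load above; the .rejects path is just an illustration, and note that the file must not already exist or the load will fail:

use MyDatabase
go

Bulk Insert stageFedRates
from 'C:\curlData.csv'
with (
    fieldterminator = ',',
    rowTerminator   = '\n',
    firstRow        = 7,
    maxErrors       = 999999999,
    -- rows that fail to parse are written here for post-load inspection
    errorFile       = 'C:\curlData.rejects'
)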
In an SSMS window, select all the data to see what we imported into the SQL Server table.
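A quick check like this should return the ten observations; the commented rows are taken from the CSV shown above:

use MyDatabase
go

select * from stageFedRates order by timePeriod;

-- timePeriod  jrxwtfb_nb    jrxwtfn_nb   jrxwtfo_nb
-- 2018-07-09  123.49460000  89.66130000  160.28900000
-- 2018-07-10  123.46370000  89.75410000  160.09230000
-- ...
-- 2018-07-20  125.04840000  90.90720000  162.14560000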
That manual two-step process works, but we want the whole thing to run from a single stored procedure, and a procedure running on the server can't assume it may read and write an arbitrary directory such as C:\. To work around this, we will query SQL Server's TMP environment variable and use that directory to write to and read from. The following stored procedure will do the trick; the comments explain how it works along the way. It's worth leaving the PRINT statements in the code for debugging, as getting all the quotes and double quotes correct can be quite tricky.
use MyDatabase
go

create procedure [dbo].[loadFedReserve]
as
begin
    -- Get SQL Server's TMP directory.
    declare @destinationDirectory varchar(256);
    declare @cmd varchar(256) = 'echo %TMP%';

    -- Create a temporary table to hold the output of xp_cmdshell.
    create table #tmpResult ( envVar varchar(256) );

    -- Store the result of executing "echo %TMP%" on the Windows server
    -- in the temporary table.
    insert into #tmpResult
    exec xp_cmdshell @cmd;

    -- Now get the TMP directory from the table.
    set @destinationDirectory = ( select top 1 envVar from #tmpResult );

    -- Build the command line, including the output and stderr file locations.
    -- We will reuse the @destination variable when we import the file into
    -- SQL Server. Be sure to put double quotes around the URL.
    declare @curlCommand varchar(512) = 'curl ';
    declare @destination varchar(128) = @destinationDirectory + '\curlData.csv';
    declare @errors      varchar(128) = @destinationDirectory + '\curlErrors.txt';

    set @curlCommand = @curlCommand + '--stderr "' + @errors + '" ';
    set @curlCommand = @curlCommand + '--output "' + @destination + '" ';
    set @curlCommand = @curlCommand
        + '"https://www.federalreserve.gov/datadownload/Output.aspx?'
        + 'rel=H10&series=122e3bcb627e8e53f1bf72a1a09cfb81&lastObs=10'
        + '&from=&to=&filetype=csv&label=include&layout=seriescolumn&type=package"';

    -- A little debugging, then execute the command.
    print @curlCommand;
    exec xp_cmdshell @curlCommand;

    -- The table is usually a staging table that we reuse, so delete any data in it.
    delete from stageFedRates;

    -- Now build a SQL statement with the bulk import command and the file
    -- location that we determined before.
    declare @sqlStmt nvarchar(512) = 'Bulk Insert stageFedRates from '''
        + @destination
        + ''' with ( fieldterminator = '','', rowTerminator = ''\n'', firstRow = 7, maxErrors = 999999999 )';

    -- A little more debugging, then execute the load. Building the SQL command and
    -- executing it with sp_executeSQL lets us change the imported file programmatically.
    print @sqlStmt;
    exec sp_executeSQL @sqlStmt;

    select count(*) as RowsImported from stageFedRates;
end
go
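One refinement worth considering: xp_cmdshell returns 0 when the command succeeds and 1 when it fails, so the procedure could check the download and bail out before deleting the staging data. A minimal sketch of the idea follows; the path and URL here are placeholders, not the procedure's computed values:

declare @curlCommand varchar(512) =
    'curl --output "C:\Temp\curlData.csv" "https://example.invalid/data.csv"';
declare @rc int;

-- xp_cmdshell returns 0 on success, 1 on failure
exec @rc = xp_cmdshell @curlCommand;

if @rc <> 0
begin
    raiserror('curl download failed; nothing was loaded.', 16, 1);
    return;
end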
Depending on how your SQL Server instance is configured, you may need to enable the use of xp_cmdshell. I can execute the following on the instance of SQL Server on my PC; the help of a DBA is probably required for an enterprise server.
exec sp_configure 'show advanced options', 1;
go
-- Refresh the configuration
reconfigure;
go
-- Enable xp_cmdshell
exec sp_configure 'xp_cmdshell', 1;
go
-- Refresh the configuration
reconfigure;
go
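To confirm the change took effect, query sys.configurations; value_in_use should be 1 for both settings:

select name, value, value_in_use
from sys.configurations
where name in ('show advanced options', 'xp_cmdshell');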
After compiling the stored procedure, open a new SSMS window, execute the stored procedure, and select the results of the load:
exec loadFedReserve

select * from stageFedRates
There should be 10 rows, matching the CSV we saw earlier, and SQL Server will take less than a second to pull the data and load the table.