Thursday, March 24, 2016

SQL Server Listagg and SQL Recursion, Part Three

I've had many occasions to import and export data from one database to another, and I've found SQL recursion to be very useful when I need to string a set of values into one column.   Recently, I had a little hiccup:  a query that I expected to take a few seconds was still running after five minutes.    Clearly, something was wrong.  After convincing myself that the query was correct, I began to examine the data and some of the intermediate query results.

Most of the keys had perhaps five or six values concatenated into a column.  However there were a few keys that had 30 to 40 values to concatenate.    I've reviewed some of the intermediate results before, and I knew the algorithm expanded out about half the possible combinations and then selected the one row with the complete answer.  When there were only three or four values to concatenate to a key, this was not a problem.  But with 30 to 40 values to concatenate to one key, I had a very long running query.

The solution to the performance problem is to reduce the amount of work the query does.   And the key to that is in SQL Server Listagg, Part Two.  In part two, I discussed the problem of concatenating values together when one of the values repeats.  I introduced the use of a surrogate key to control the recursion. In part one, I used the values to control recursion, starting with the smallest value, and repeatedly concatenating larger values to the result.   In part two, I assigned surrogate keys to the values, started the recursion with the smallest surrogate key, and repeatedly concatenated values with larger surrogate keys.   The queries in part one and part two are very similar, and they create a lot of intermediate results that are later discarded.

The new query is much more efficient, and it gains efficiency in two places.  Say we have a set of values that we want to concatenate,  {A, B, G, L}. Now suppose we have a set of surrogates {1, 2, 3, 4} assigned to {A, B, G, L}.   In the first two posts, we use a correlated subquery to find the smallest value to start.  But if we have assigned surrogate values to each of the target values, we do not need the correlated subquery; the starting surrogate value will always be 1.  This makes the query simpler and faster.

In the algorithm in part one, we start with "A", then look for a larger value to concatenate.  That value could be B, G, or L.  Our part one algorithm will iterate thru and create rows with all those combinations: {A},  {A, B}, {A,G}, and {A,L}.   And the iterations continue: {A,B,G}, {A,B,L}, {A,G,L}.  And, finally, {A,B,G,L}.   But we really only want {A,B,G,L}.     Now, if we start with surrogate 1, the next surrogate must be 2, the one after that 3, and  finally 4.    Our intermediate results are {A}, {A,B}, {A,B,G}, and finally {A,B,G,L}.  In part one, we had an algorithm where the work increased exponentially with the number of values to concatenate.  Now, we have an algorithm where the work increases linearly with the number of values to concatenate.   This algorithm is no more complicated that the one we saw in part two, but it's far more efficient.  This is a bigger improvement than eliminating the correlated subquery above.

Here's the code, notice the comments after "from add_surrogate" and "inner join rq":

-- Start with a new common table expression and create a unique 
-- surrogate value for each value in the VALUE column.  The OLAP  
--  row_number() function will do that easily.
with add_surogate (id, value, surogate_value) as (
    select  id, 
            value,       
            row_number() over (partition by id order by value)     
    from sample
    )

-- Add the surrogate_value to the rq CTE
,rq ( id, value, surogate_value,  concat_value, depth) as (

   select a.id,          
          a.value, 
          surrogate_value,
          cast( a.value as varchar(256)), 
          1
   from add_surrogate a
   -- Using surrogate values, we do not need the correlated sub-query
   -- to find the smallest initial value.  The smallest surrogate 
   -- value will always be 1
   where surrogate_value = 1  

   union all 

   select c.id,     
          c.value, 
          surrogate_value, 
          -- No changes, concatenate values as before 
          cast( rq.concat_value + ',' + c.value as varchar(256)), 
          rq.depth + 1
   from add_surrogate c
   inner join rq 
      on c.id = rq.id
         -- Now control the progress of the recursion using the
         -- surrogate value. In part two, the condition was   
         -- the new surrogate > the old surrogate.  That works,
         -- but here's where we can do better:  The next surrogate
         -- will always be the previous surrogate value plus 1.  
        and surrogate_value = surrogate_value + 1
     
)

,all_rows as (
    select id, value, concat_value, depth,
           max(depth) over (partition by id) max_depth
    from rq
)

select id, concat_value
from all_rows
where depth = max_depth -- Select final answers


Moore's Law often works in our favor:  if performance doubles every two years, then we can solve bigger and more difficult problems than we could before. But sometimes the problem we're trying to solve outstrips Moore's Law,  and then we need a better way of solving the problem. In this case, a small tweak in the code delivers a huge performance boost.