Lost Learnings: recursion

Showing posts with label recursion. Show all posts

Thursday, March 24, 2016

SQL Server Listagg and SQL Recursion, Part Three

I've had many occasions to import and export data from one database to another, and I've found SQL recursion to be very useful when I need to string a set of values into one column. Recently, I had a little hiccup: a query that I expected to take a few seconds was still running after five minutes. Clearly, something was wrong. After convincing myself that the query was correct, I began to examine the data and some of the intermediate query results.

Most of the keys had perhaps five or six values concatenated into a column. However there were a few keys that had 30 to 40 values to concatenate. I've reviewed some of the intermediate results before, and I knew the algorithm expanded out about half the possible combinations and then selected the one row with the complete answer. When there were only three or four values to concatenate to a key, this was not a problem. But with 30 to 40 values to concatenate to one key, I had a very long running query.

The solution to the performance problem is to reduce the amount of work the query does. And the key to that is in SQL Server Listagg, Part Two. In part two, I discussed the problem of concatenating values together when one of the values repeats. I introduced the use of a surrogate key to control the recursion. In part one, I used the values to control recursion, starting with the smallest value, and repeatedly concatenating larger values to the result. In part two, I assigned surrogate keys to the values, started the recursion with the smallest surrogate key, and repeatedly concatenated values with larger surrogate keys. The queries in part one and part two are very similar, and they create a lot of intermediate results that are later discarded.

The new query is much more efficient, and it gains efficiency in two places. Say we have a set of values that we want to concatenate, {A, B, G, L}. Now suppose we have a set of surrogates {1, 2, 3, 4} assigned to {A, B, G, L}. In the first two posts, we use a correlated subquery to find the smallest value to start. But if we have assigned surrogate values to each of the target values, we do not need the correlated subquery; the starting surrogate value will always be 1. This makes the query simpler and faster.

In the algorithm in part one, we start with "A", then look for a larger value to concatenate. That value could be B, G, or L. Our part one algorithm will iterate thru and create rows with all those combinations: {A}, {A, B}, {A,G}, and {A,L}. And the iterations continue: {A,B,G}, {A,B,L}, {A,G,L}. And, finally, {A,B,G,L}. But we really only want {A,B,G,L}. Now, if we start with surrogate 1, the next surrogate must be 2, the one after that 3, and finally 4. Our intermediate results are {A}, {A,B}, {A,B,G}, and finally {A,B,G,L}. In part one, we had an algorithm where the work increased exponentially with the number of values to concatenate. Now, we have an algorithm where the work increases linearly with the number of values to concatenate. This algorithm is no more complicated that the one we saw in part two, but it's far more efficient. This is a bigger improvement than eliminating the correlated subquery above.

Here's the code, notice the comments after "from add_surrogate" and "inner join rq":

-- Start with a new common table expression and create a unique 
-- surrogate value for each value in the VALUE column.  The OLAP  
--  row_number() function will do that easily.
with add_surogate (id, value, surogate_value) as (
    select  id, 
            value,       
            row_number() over (partition by id order by value)     
    from sample
    )

-- Add the surrogate_value to the rq CTE
,rq ( id, value, surogate_value,  concat_value, depth) as (

   select a.id,          
          a.value, 
          surrogate_value,
          cast( a.value as varchar(256)), 
          1
   from add_surrogate a
   -- Using surrogate values, we do not need the correlated sub-query
   -- to find the smallest initial value.  The smallest surrogate 
   -- value will always be 1
   where surrogate_value = 1  

   union all 

   select c.id,     
          c.value, 
          surrogate_value, 
          -- No changes, concatenate values as before 
          cast( rq.concat_value + ',' + c.value as varchar(256)), 
          rq.depth + 1
   from add_surrogate c
   inner join rq 
      on c.id = rq.id
         -- Now control the progress of the recursion using the
         -- surrogate value. In part two, the condition was   
         -- the new surrogate > the old surrogate.  That works,
         -- but here's where we can do better:  The next surrogate
         -- will always be the previous surrogate value plus 1.  
        and surrogate_value = surrogate_value + 1
     
)

,all_rows as (
    select id, value, concat_value, depth,
           max(depth) over (partition by id) max_depth
    from rq
)

select id, concat_value
from all_rows
where depth = max_depth -- Select final answers

Moore's Law often works in our favor: if performance doubles every two years, then we can solve bigger and more difficult problems than we could before. But sometimes the problem we're trying to solve outstrips Moore's Law, and then we need a better way of solving the problem. In this case, a small tweak in the code delivers a huge performance boost.

Saturday, March 28, 2015

SQL Server Listagg, Part Two

Last week's post examined using common table expressions and recursive SQL to concatenate strings grouped by an id value in SQL Server. This is an easy operation to perform in DB2 and Oracle; both databases support the LISTAGG function. But SQL Server does not support the LISTAGG function, so we needed to take a different approach to getting the same results.

Our query works well, but sometimes our data may have values that repeat within an id. Because our query controls the recursion with the values that we're concatenating, we might not get the result back that we wanted.

Let's start with last week's sample table, and add a row to it:

    insert into sample values(3,'G');

Run the last week's final query, and our results look this:

Id	Concat_value
1	A,B
2	C,D
3	E,F,G
3	E,F,G
4	H

This is probably not the result we hoped for. So, first question: what did we hope to see? If the correct result is four rows, with the value of "E,F,G" for Id = 3 , then we just need to add the distinct function to the final query.

But maybe the correct answer is four rows, and the concat_value for Id = 3 is "E,F,G,G". In that case, we need to do a little work to our query. As originally written, the query uses the VALUE column to control the recursion. If the values repeat, then we need to introduce a surrogate value to the query. Let's examine the query and let the code comments do the talking:

-- Start with a new common table expression(CTE) that creates a unique surrogate
-- value for each value in the VALUE column.  The OLAP row_number() function will 
-- do that easily.
with add_surogate (id, value, surogate_value) as (
    select  id, 
     value,       
            row_number() over (partition by id order by value)     
    from sample
    )

-- Add the surrogate_value to the rq CTE
,rq ( id, value, surogate_value,  concat_value, depth) as (

   select a.id,          
          a.value, 
   a.surogate_value,
          cast( a.value as varchar(256)), 
          1
   from add_surogate a
   where a.surogate_value = (
         select min(surogate_value)   -- Start the recursion using the surrogate value
         from add_surogate b
         where a.id = b.id
         )

   union all 

   select c.id,     
          c.value, 
          c.surogate_value, 
          -- No changes, concatenate values as before 
          cast( rq.concat_value + ',' + c.value as varchar(256)), 
          rq.depth + 1
   from add_surogate c
   inner join rq 
      on c.id = rq.id
         -- And control the progress of the recursion using the
         -- surrogate value
        and c.surogate_value > rq.surogate_value
        and rq.depth < 10
)

,all_rows as (
    select id, value, concat_value, depth,
           max(depth) over (partition by id) max_depth
    from rq
)

select id, concat_value
from all_rows
where depth = max_depth

The changes from last week's query to this week's query are pretty simple:

Add a new common table expression to create a surrogate value
Add the surrogate value to the recursive common table expression
Use the surrogate value to control the recursion
Continue to build the result string as before

With these simple changes, we get the desired results:

ID	Concat_value
1	A,B
2	C,D
3	E,F,G,G
4	H

More Reading:

The vendors' online documentation have good examples:

SQL Server 2008 Recursive Queries Using Common Table Expressions
DB2 Info Center
Oracle 11gR2 Don't be put off by the term "subquery factoring". It's just recursion.

Friday, March 20, 2015

SQL Server ListAgg Function, Part One

Lately, I've been straddling the Oracle database world and the Microsoft SQL Server world, moving data from Oracle tables to SQL Server tables. One of the things I needed to do was concatenate string values grouped by an id value.

This is a pretty easy thing to do in the Oracle world. Let's create a table and add some data:

create table sample (
    id number,
    value varchar2(16)
);
insert into sample values (1, 'A');
insert into sample values (1, 'B');
insert into sample values (2, 'C');
insert into sample values (2, 'D');
insert into sample values (3, 'E');
insert into sample values (3, 'F');
insert into sample values (3, 'G');
insert into sample values (4, 'H');
commit;

Next, using the LISTAGG function available in DB2 and Oracle, our query will return an id and the concatenation of the values for each id:

select id, 
       listagg(value,',') within group (order by value) concat_value
from sample
group by id

ID	CONCAT_VALUE
1	A,B
2	C,D
3	E,F,G
4	H

That was easy on an Oracle database! Now, search the SQL Server documentation, and you won't find a LISTAGG function. Next, google "SQL Server listagg", and you'll find lots of discussion. Stackoverflow contributors offer many ways of solving this problem, usually using stored procedures or functions. In this post, I will suggest a SQL solution to the problem using SQL recursion and common table expressions.

First, grab the create table and the insert SQL above. Change the varchar2 datatype to varchar (or nvarchar) and create the sample table on your SQL Server database. Next, construct the recursive query.

with rq ( id, value, concat_value, depth) as (

   select a.id, 
          a.value, 
          cast( a.value as varchar(256)), 
          1
   from sample a
   where a.value = (
         select min(value) 
         from sample b
         where a.id = b.id
         )

   union all 

   select c.id, 
          c.value,  
          cast( rq.concat_value + ',' + c.value as varchar(256)), 
          rq.depth + 1
   from sample c
   inner join rq 
      on c.id = rq.id
        and c.value > rq.value
        and rq.depth < 10
)

select rq.* 
from rq
order by id, depth

Before we execute the query, let's examine it:

the recursion returns 4 columns: the id, the current value, the result of concatenating the values, and the recursive depth
we CAST the concatenated value to a varchar large enough to hold the result
we use a correlated query to return the smallest value for each id. This gives us a starting point for the recursion.
Tracking the depth is useful while developing a query. The test for rq.depth < 10 will keep the query from spinning away if there's an error.
The condition c.value > rq.value performs the same function as "order by value" in the Oracle query.

Execute the query, and we get the following result table:

ID	Value	Concat_value	depth
1	A	A	1
1	B	A,B	2
2	C	C	1
2	D	C,D	2
3	E	E	1
3	F	E,F	2
3	G	E,G	2
3	G	E,F,G	3
4	H	H	1

The answer we want is in the result table above, we just need to filter out the intermediate rows. The rows we want have have the maximum depth value by ID. We will use the max function by id to find the maximum depth in a second common table expression, and then filter the results in our final query:

with rq ( id, value, concat_value, depth) as (

   select a.id, 
          a.value, 
          cast( a.value as varchar(256)), 
          1
   from sample a
   where a.value = (
         select min(value) 
         from sample b
         where a.id = b.id
         )

   union all 

   select c.id, 
          c.value,  
          cast( rq.concat_value + ',' + c.value as varchar(256)), 
          rq.depth + 1
   from sample c
   inner join rq 
      on c.id = rq.id
        and c.value > rq.value
        and rq.depth < 10
)

,all_rows as (
    select id, value, concat_value, depth,
  max(depth) over (partition by id) max_depth
    from rq
)

select id, concat_value
from all_rows
where depth = max_depth

When we execute this query, we get the same results as our original query using the LISTAGG function in Oracle:

ID	Concat_value
1	A,B
2	C,D
3	E,F,G
4	H

Using common table expressions and recursion, we now have a way to aggregate strings together in SQL Server. But this is just the start; using the method for string aggregation as a model, we can develop other aggregations as well. Next week, we talk about handling cases where values are repeated within an id.

Tuesday, October 22, 2013

Recursive SQL

Our shop recently upgraded to Oracle 11g Release 2. Having used DB2 for many years, two new 11gR2 features caught my eye. First, Oracle 11gR2 supports aliasing the names of common table expressions. Second, Oracle supports SQL-99 syntax for recursion using common table expressions.

Let's start with some data first. In the Oracle editor of your choice, run the following script. The script creates some test data for us in the system catalog, and we'll use this data to solve a practical problem.

create table t1 ( v1 varchar2(1));

create view v1 as select * from t1;

create view v2 as select * from v1;

create view v3 as select * from v2;

create view v4 as select * from v1;

The script creates five new objects in the system catalog. The DBA_DEPENDENDCIES view shows the dependency of one object on another. In our example, view V1 depends on table T1. View V2 depends on view V1. View V3 depends on view V2. View V4 depends on view V1. Every object, whether it is a table, view, package, or object of any type, has a row in DBA_DEPENDENCIES if the object depends on another object, or if the object has other objects depending on it.

The USER_DEPENDENCIES view is a subset of DBA_DEPENDENCIES; DBA_DEPENDENCIES may be not be available to everyone, so these examples will use USER_DEPENDENCIES. The two columns of interest to us are the NAME and REFERENCED_NAME columns. A row in this view means that object NAME depends on object REFERENCED_NAME. In the following query, we see every dependency that we described in the previous paragraph.

select name, referenced_name
from user_dependencies
where referenced_name in ('V1','T1','V2','V3') 
order by 1;

NAME    REFERENCED_NAME
-----   ---------------
V1      T1
V2      V1
V3      V2
V4      V1

The USER_DEPENDENCIES view shows us all the direct dependents of an object. The USER_DEPENDENCIES view shows that view V4 depends on view V1. USER_DEPENDENCIES does not show that view V4 indirectly depends on table T1. If you don't believe that, drop table T1 and then run a query against view V4. Not so good!

A shop with many tables, views, packages, procedures, functions, and triggers, will have an extensive hierarchy of dependencies. So, while this view shows the direct dependents, it does not show all the dependents. Does view V4 depend on table T1? Before we drop table T1, we need to know the answer to that question!

This problem is easily solved with recursive SQL queries. We'll take a look at two ways of handling this. First, we'll examine how to answer this question using recursion with common table expressions, à la DB2 or Oracle 11gR2. Here's the query:

with recurs (name, rname, lvl ) as (
   select name, referenced_name, 1
   from user_dependencies
   where referenced_name = 'T1'

   union all

   select a.name, a.referenced_name, r.lvl + 1
   from user_dependencies a,
        recurs  r
   where r.name = a.referenced_name
)
select name, rname, lvl
from recurs; 

NAME    RNAME   LVL
----    -----   ---
V1      T1      1
V2      V1      2
V4      V1      2
V3      V2      3

The first new feature in 11gR2 is aliasing the common table expression columns. We have declared RECURS as a common table expression with two columns, NAME and RNAME. Oracle 11gR1 supported common table expressions, but 11gR1 did not support aliasing the column names.
The next new feature is the recursive query. We have two sub-queries unioned together. The first query is our starting point: REFERENCED_NAME = 'T1'. The second query is the recursion. We select rows from a join of the USER_DEPENDENCIES table and the RECURS common table expression. The rows are linked by joining the REFERENCED_NAME of USER_DEPENDENCIES to the NAME of RECURS. In other words, we join the child row from USER_DEPENDENCIES to the parent row from RECURS. The LVL column shows the depth of the hierarchy.

DB2 LUW has supported common table expressions and recursion for at least 10 years. Oracle's pre-11g2 databases supported recursive queries in a limited way using the START WITH ... CONNECT BY PRIOR syntax. Run the following query, and we get the same results, although not in the same order. LEVEL is an Oracle pre-defined column showing the depth of the hierarchy. In the previous query, we found this by keeping track of the level (LVL) ourselves.

select name, referenced_name, level
from user_dependencies
start with referenced_name = 'T1'
connect by prior name = referenced_name;

NAME    REFERENCED_NAME  LEVEL
----    ---------------  -----
V1      T1               1
V2      V1               2
V3      V2               3
V4      V1               2

Like Oracle's outer join operator "(+)" from the last post, START WITH ... CONNECT BY PRIOR is very common in Oracle shops, so it's important to understand it. Also, it's very concise, always a benefit to those among us who have ten thumbs on the keyboard.

Recursive queries are another good tool to have in our toolbox. Trying to sort out the object dependencies is just one use. Another common problem is sorting out the system privileges. Many applications define their own authorization schemes, granting access to roles or groups, then including those roles/groups in other roles/groups. Recursive queries can answer the question "what users have authorization to which tables", and these queries can show the underlying hierarchy.

But before we go, for our example query to be really useful, we need to account for the rest of the columns in DBA_DEPENDENCIES. In addition to the object's name, DBA_DEPENDENCIES includes the object's owner and the object's type. If we don't account for any object's type and owner, we quickly end up with a recursive loop, especially if our objects include package specs and package bodies. Here's the improved query:

with recurs (name, owner, type, rname, rowner, rtype,  lvl ) as (
   select name, owner, type, referenced_name, referenced_owner, referenced_type, 1
   from dba_dependencies
   where referenced_name = 'T1'

   union all

   select a.name, a.owner, a.type, a.referenced_name, a.referenced_owner, a.referenced_type, r.lvl + 1
   from dba_dependencies a,
        recurs  r
   where r.name = a.referenced_name
     and r.owner = a.referenced_owner
     and r.type = a.referenced_type 
) cycle name, owner, type set cycle_detected to 1 default 0
   
select * 
from recurs;

Now when we join a parent row to a child row, we're joining on all the appropriate columns, and a recursive loop is less likely. Another improvement in this query is the use of the CYCLE keyword. CYCLE instructs the database to detect when our query has previously visited the row defined by name, owner, and type. When the database detects this condition -- a recursive loop -- the database sets the CYCLE_DETECTED column to 1 and does not descend further. Now we have a really useful query.

More Reading

For more information, the vendors' publications have some good examples:

DB2 Info Center
Oracle 11gR2 Don't be put off by the term "subquery factoring". It's just recursion.

Also, see my blog entry discussing recursion to implement aggregation in SQL Server.

Monday, August 26, 2013

Translate and Replace and Replace and Replace and ...

One task many programmers face is examining a string of characters and replacing specific characters with different ones. Every COBOL programmer is familiar with the INSPECT verb, and if they've really been programming COBOL for a while, they're familiar with the now-obsolete EXAMINE verb, too. We're not going to step that far back; instead we'll look at a couple ways of handling this problem in Oracle's PL/SQL programming language.

Oracle's PL/SQL function library includes several functions that replace single characters or multiple characters with a new characters or string of characters:

translate translate
replace replace
regexp_replace regexp_replace

Regexp_replace and its siblings regexp_like and regexp_substr are worthy of their own post, so we'll focus on the translate and replace functions.

Translate is the simplest function. Translate performs single characters substitution. There are three arguments: the input string we want to examine, a string of characters that we want to replace, and a new string of characters that we want to use as replacements. Here's a simple example:

select 'abcdefgh' input,
       translate('abcdefgh','bdfh','BDFH') output
from dual;

INPUT       OUTPUT
--------    --------
abcdefgh    aBcDeFgH

We see that every occurrence of b, d, f, or h was replaced by the upper case equivalent. Every character in the second argument is replaced by its equivalent character in the third argument.

This is a really easy way to do one-to-one character translations. Sometimes, we want to replace a single character by a string of characters, or perhaps replace a string of characters by a single character. The replace function handles this nicely. Here's an example; anyone needing to translate multi-byte characters to a single-byte character is familiar with this problem:

select '10 ≥ 5' input,
       replace('10 ≥ 5','≥ ','>=') output
from dual;

INPUT       OUTPUT
--------    --------
10  ≥  5     10  >= 5

This is pretty handy. We translated the single character greater than or equal to symbol into a two-character string that we can represent in a 7-bit ASCII codeset.

The replace function can convert strings into single characters, too. Here's another example:

select '(c)2013' input, 
       replace('(c)2013','(c)','©') output 
from dual;

INPUT       OUTPUT
--------    --------
(c)2013     ©2013

This is pretty handy, too. We can take a string encoded in 7-bit ASCII and replace character strings like (c) and >= with their single-character equivalents in a multi-byte character set.

There's just one thing missing: Translate can change many different characters for us in one call, but it does not handle character string replacements. Replace can change characters strings, but only one string at a time. If we want to change both the © and the ≥, we need to do something like this:

select '© ≥' input,
       replace(replace('© ≥','©','(c)),'≥','>=') output
from dual;

INPUT    OUTPUT
-----    ------
© ≥      (C) >=

This works fine for a couple of translations, but for more than a few translations, this is pretty cumbersome. Why not let the computer do it for us? We'll define a new function, called MREPLACE, that will handle multiple replace strings. The first argument will be the string we want to examine. The following arguments are pairs of strings: first the old string, then the new string. Here's an example of MREPLACE in use:

select '© ≥' input,
       mreplace('© ≥',
                '©','(c)',
                '≥','>=') output
from dual;

INPUT OUTPUT                                                                     
----- ------
© ≥  (c) >=

Perfect! Now, with just one function call, we have an easy way to replace multi-character or single-character strings with multi-character or single-character strings in one function call. Here's the source:

create or replace
function mreplace (
    p_string in varchar2, 
    p_from1 in varchar2 default null, p_to1 in varchar2 default null,
    p_from2 in varchar2 default null, p_to2 in varchar2 default null,
    p_from3 in varchar2 default null, p_to3 in varchar2 default null,
    p_from4 in varchar2 default null, p_to4 in varchar2 default null,
    p_from5 in varchar2 default null, p_to5 in varchar2 default null,
    p_from6 in varchar2 default null, p_to6 in varchar2 default null,
    p_from7 in varchar2 default null, p_to7 in varchar2 default null,
    p_from8 in varchar2 default null, p_to8 in varchar2 default null 
                  ) return varchar2 as

begin 

  if ( p_from1 is null ) then return p_string;
  else
     return mreplace(replace(p_string,p_from1, p_to1),
                     p_from2 , p_to2 ,
                     p_from3 , p_to3 ,
                     p_from4 , p_to4 ,
                     p_from5 , p_to5 ,
                     p_from6 , p_to6 ,
                     p_from7 , p_to7 ,
                     p_from8 , p_to8 );
  end if;           

end mreplace;

How does it work? MREPLACE uses recursion, repeatedly calling itself. MREPLACE keeps calling itself until the p_from1 argument is null. The test to stop the recursion is important if you don't want your DBAs and sysadmins darkening your office door! Also, notice that MREPLACE accepts one string to examine and eight from/to pairs. When we invoke it recursively, we call it with one string to examine that includes our string and the first from/to pair:

          replace(p_string, p_from1, p_to1)

but only seven from/to pairs:

                     p_from2 , p_to2 ,
                     p_from3 , p_to3 ,
                     p_from4 , p_to4 ,
                     p_from5 , p_to5 ,
                     p_from6 , p_to6 ,
                     p_from7 , p_to7 ,
                     p_from8 , p_to8 )

So, each time MREPLACE calls itself, it shifts all the arguments to the left. The eigth from/to pair are not specified and defaults to null. When p_from1 is null, then all the arguments are used, all the substitutions are completed, and MREPLACE returns the answer.

For the purposes of illustration, MREPLACE accepts up to 8 from and to pairs, but in practice we would set it to accept more, maybe up to 32 from/to pairs, or 64 from/to pairs or even more if required. Remember, MREPLACE stops the recursion when it finds the first null p_from1 argument, so you should define it to accept more from/to substitution pairs than you expect to use.

Happy computing!

Lost Learnings