Thursday, November 20, 2014

Directory Lookups

Every organization has an employee directory, and building search functions to find someone in the directory is a common problem.  How do you look someone up?  By their last name? By their first name?  And what do we mean by last and first names?  In some countries (China, Hungary), the family name precedes the given name. Or by their full name, including a middle initial?

If an organization is small, it's easy to load all the names into a select list and let the user pick one.  For large organizations, searching the company directory is more challenging.   Is Marguerite in the directory as Marguerite, or Margaret, or maybe even Gretchen, or maybe just Marge?  And how many John Smiths and Jane Does do we have in the directory?

In fact, searching text is a challenging problem.  Consider product searches.   Go to www.ebay.com, enter a product, and you will likely get tens, perhaps hundreds of good matches.  In my experience, the eBay search engine works very well.  Other vendors do not do as well -- a search that turns up hundreds of poor matches is not very useful. 

In general, we want a search strategy that returns lots of good matches and very few poor matches.   In this post, we'll use the company directory as an example of how to solve this problem.

So, suppose someone sits at a web page, types in a name, and presses Enter.  How do we select a list of likely names to present?   Most organizations have tried a few approaches.  Soundex looks promising, but in practice tends to return many false positives.  Many organizations home-grow a solution:  only look at the first few characters of each name, perhaps drop all the vowels and squish everything together, or maybe swap first and last names.

Most home-grown solutions suffer from not returning enough likely matches and returning too many bad matches  --- the worst of all situations.  Worse, many organizations invest much time and effort home-growing the poor-performing solution. There are a lot of solutions out there, but rather than re-invent the wheel, let's examine a solution using Oracle's text indexes and fuzzy matching.  We can get good results by using the out-of-box tools provided in the Oracle database.

Let's start with a simple table and some data.  First we'll create a table to hold the names, split into a first name, last name, nickname, and finally all the names concatenated together. More about the all_names column later.

create table names (
    first_name varchar2(50),
    last_name varchar2(50),
    nick_name varchar2(50),
    all_names varchar2(160)
);

The first_name (John), last_name (Smith), and nick_name (Johnny) are entered as part of normal table maintenance.  The all_names columns is a concatenation of the three names, and we'll use a trigger to maintain the all_names column:

create or replace trigger names_before_iu
before insert or update on names
for each row
begin
   :new.all_names := :new.first_name || ' ' ||
                     :new.last_name || ' ' ||
                     :new.nick_name ;
end;

Next, let's load some test data. We'll use the system catalog to get some "names", with the object's name acting as a first_name and the object's type acting as a last_name. In addition, we'll use the Apex demo_customers table as a source of names:

insert into names (first_name, last_name)
select object_name, object_type
from user_objects;

insert into names (first_name, last_name)
select cust_first_name, cust_last_name
from demo_customers;

Our next step is creating an index on the table. We will index the all_names column using a ctxsys.context type of index:

create index names_ctx
    on names ( all_names)
    indextype is ctxsys.context
    parameters('sync(on commit)');

We will use this context index to do fuzzy matching. Our queries will use the contains function in the where clause, and we need to construct a fuzzy argument to pass to the contains function.  The fuzzy function expands the search to include similarly spelled words.  The fuzzy argument takes three parameters:
  • a search string that is at least 3 characters long
  • a number between 1 and  80 specifying the lowest similarity score we will accept.  A lower number returns more results, a higher number returns better matches.
  • a number specifying how many variations of the search string to use. The more variations we request, the more results we will see.
  • a string specifying whether to weight or noweight the results
These numbers will become clear after a few examples. Sometimes things become clear when you poke them with a stick and see what happens, and this is one of those times.  The fuzzy match argument is a bit tricky to type, so let's create a function to make our typing easier.

create or replace
function mk_fuzzy(p_name in varchar2,
                  p_similarity in number default 50,
                  p_n_variations in number default 10)
    return varchar2 is
    
begin

   return 'fuzzy(' || p_name ||  ',' ||
           p_similarity || ',' ||
           p_n_variations || ',W)';

end mk_fuzzy;

Now for our query.

select score(1), a.first_name, a.last_name  
from names a
where contains(all_names, mk_fuzzy('viw',1,3), 1)>= 0
order by 1 desc;

Let's start at the top.  First, the score function takes one argument, in this case the number 1.   The argument refers to the third argument in the contains function, the number 1.  These two numbers should match, and the score function returns the score from the contains function for each row returned by the query.  Below we will use the contains function twice, and the score index ties the score to a specific contains query.

Next, the contains function takes three parameters:
  • the column to search on 
  • a fuzzy argument.  In this case, we're looking for the word "view" misspelled as "viw".  The similarity score argument of 1 specifies to return any match, and the variation argument specifies to use three variations of the "viw" argument.
  • an index for the score function
 Let's run the query and examine our results:

SCORE(1) FIRST_NAME LAST_NAME
48V1VIEW
48V2VIEW
48V3VIEW
48V4VIEW
48V_EMPVIEW
8DEMO_ORDERS_BIUTRIGGER
8DEMO_ORDER_ITEMS_BITRIGGER
8DEMO_ORDER_ITEMS_BIU_GET_PRICETRIGGER
8DEMO_PRODUCT_INFO_BIUTRIGGER
8DEMO_CUSTOMERS_BIUTRIGGER
8DEMO_TAGS_BIUTRIGGER
0WEB_CLIENT2PROCEDURE
0WEB_CLIENTPROCEDURE
0TESTTABLE_SEQSEQUENCE


The scores range from 0 (no match) to 48 (poor match). 48 is not a great score, but the query does find all of our views.

The all_names column includes both first and last names.  After some trial and error against a real-life table of several thousand names, the following query produces a good combination of  relevant results:

select score(1), a.first_name,       
       score(2), a.last_name,       
       score(1) + score(2)
from names a
where contains(all_names, mk_fuzzy('viw',50,50), 1) >  0
  and contains(all_names, mk_fuzzy('emp',50,50),2) > 0 
order by 5 desc;

This query returns the each name, the score for the each name, and the sum of both scores.  Notice the score indexes:  score(1) returns the score from the first use of contains, and score(2) returns the score from the second use of contains.  Ordering the results by the sum of the scores gives us the best matches first, and it's agnostic as to whether the family names follow given names or not. And because our all_names column includes nick names, our query works well with an argument consisting of a nick name and a family name, too.

Against our sample table, the above query returns the following results:

SCORE(1)FIRST_NAMESCORE(2)LAST_NAMETOTAL_SCORE
52V_EMP76VIEW128

With both our name arguments misspelled ( "emp" instead of V_EMP, "viw" instead of VIEW), the query still returns the correct answer, and the query doesn't return any bad matches.   This is exactly what we want!

The Oracle fuzzy matching tools deliver high quality search results with a minimum of new development and much less future maintenance.  In addition, tweaking the similarity score and the term expansion in the fuzzy matching function makes the tool very flexible.   If we don't get enough results, or if the results are poor, tweaking the arguments will solve the problem, and it's much easier to tweak the two arguments than re-writing home-grown code.