You are currently on IBM Systems Media’s archival website. Click here to view our new website.


Bookmark and Share
RSS

Recent Posts

i Can … Find a Needle in a Haystack using OmniFind Text Search Server for DB2 for i

April 6, 2010

Almost every application or Web domain that we come in contact with seems to have a search engine associated with it. Whether we’re looking for information about the latest gadget, answers to one of the great mysteries of the universe, or that lost e-mail, search engines are indispensable tools for finding relevant information in a vast sea of data. Did you know that starting in 6.1 this search technology is available for DB2 for i?

OmniFind Text Search Server for DB2 for IBM i is a no-additional-charge licensed program that, once installed, enables indexing and searching text data stored in a DB2 column. Full-text search indexes can be built over columns of many different data types that can contain text (BLOB, CLOB, DBCLOB, CHAR, VARCHAR, GRAPHIC, VARGRAPHIC, BINARY and VARBINARY).

 In addition to plain text, the column may contain HTML, XML, or rich-text documents such as PDF, Microsoft Word, Lotus WordPro or Microsoft PowerPoint.

For example, let’s assume I have a set of short stories (in PDF format) that I want to store in a DB2 BLOB column.

I’ve created a table to store the documents, using SQL, like this:

> CREATE TABLE story_library.story_table(
story_id         INTEGER GENERATED ALWAYS AS IDENTITY
NOT NULL,
            publish_date          DATE,
            story_doc         BLOB(10M),
            PRIMARY KEY (story_id));


Now, let’s assume I’ve loaded the above table with all of my great stories, and I’ve used OmniFind to index the text data in the story_doc column.

(Creating and administrating an index is accomplished through the use of SQL-stored procedures and is explained in detail in the product’s user manual.)

The text search index allows me to use the built-in CONTAINS and SCORE functions in my SQL queries. CONTAINS will tell me if a document matches my query. A zero return code indicates that the document isn’t a match, while a one indicates that the document is a match. The SCORE function will return a score value that indicates how relevant the match is.

If I wanted to retrieve the story_id for stories about a blind mouse, then I could write an SQL query like this:

> SELECT story_id FROM story_library.story_table WHERE CONTAINS(story_doc, 'blind mouse') = 1;

This query result will include matches that we’d expect from a typical search engine.  The search is case insensitive, and linguistic variations on the search words will be matched. In other words, the previous query will indicate a match for documents that contain “Blind Mice.” In a similar manner, a search for “bad wolves” would return documents that contained “the Big Bad Wolf.”

Other common conventions for search syntax are honored. If I really only want exact matches for “three blind mice,” with the search terms in order and no variations on the words considered, I can enclose the search expression in double quotes.

    SELECT story_id FROM story_library.story_table WHERE CONTAINS(story_doc, '"three blind mice"') = 1;

The logical operators AND, OR, and NOT are supported,
    SELECT story_id FROM story_library.story_table WHERE CONTAINS(story_doc, '"three little pigs" AND “big bad wolf”') = 1;

Usually when we use a search engine, we want the most relevant document to be listed first in the results. This is the purpose of the SCORE built-in function. SCORE returns a value between zero and one.  Zero indicates the document isn’t a match at all, and one indicates a theoretical perfect match of the search criteria.

Because the score indicates “how well” a document matches a search query compared to other documents in the index, it can be used to order the documents returned in a result set.

i.e.,
    SELECT story_id FROM story_library.story_table WHERE CONTAINS(story_doc, '"three little pigs" AND “big bad wolf”') = 1 ORDER BY SCORE(story_doc, '"three little pigs" AND “big bad wolf”') DESC;

The addition of “ORDER BY SCORE” to the SQL query ensures that the most relevant documents are returned first.

Let’s say I only want stories about little pigs that were published in 2010. This proves to be very easy to do using SQL and CONTAINS.

> SELECT story_id FROM story_library.story_table WHERE CONTAINS(story_doc, 'little pigs') = 1 AND YEAR(publish_date) = 2010;

In this blog, we’ve seen the new 6.1 built-in CONTAINS and SCORE functions allow an application to locate and rank unstructured text documents by using powerful (yet still familiar) search expressions. When working with text data in a DB2 column, full-text indexing is an invaluable way of efficiently figuring out which documents are of the most interest and importance.

For further reading on how to create and administer a text search index, as well as more sophisticated search examples, check out the OmniFind documentation in the IBM i Information Center.

Another great resource is Kent Milligan’s white paper.

This week's blog article was written by Nick Lawrence. Nick works for IBM in Rochester, Minn., on DB2 for IBM i. His responsibilities include catalog support, DB2 object management and full-text search. Thanks Nick!

Posted April 6, 2010| Permalink

-->