Monday, October 18, 2010

Boolean Search to SQL CONTAINS

Today I rolled out Boolean Search feature for PostJobFree.
PostJobFree uses SQL Server backend, so it seems obvious to use Full-Text search that SQL Server have to process boolean search queries from users.
Unfortunately it's not as easy:
Full-Text search CONTAINS query is pretty strict, and crashes the whole SELECT statement if CONTAINS clause syntax is not correct. For example:
asp sql
query would crash SQL Server's SELECT command if the query would be passed into CONTAINS clause as-is. Correct syntax would be:
asp AND sql
So I need to write a parser that parses user's search input into boolean search query object hierarchy and then renders CONTAINS query for SQL Server from that hierarchy.
Here's how I designed it.
Search object hierarchy consist of the following parts:
1) Word. It consists only of letters, digits, dots, and dashes. Dashes can be only be inside the word, between letters/digits. Dash outside the word is considered as negation.
2) Phrase. Phrase can have any chars wrapped in quotes.
3) OrList. OrList consists from two or more elements, separated by "or". For example:
("sql server" or oracle or DB2)
4) AndList. AndList consists from two or more elements, separated by "and" or "and not" operators. For example:
sql and asp and not oracle
This query would be rendered into AndList that consists of 3 elements.
Both OrList and AndList could consist not only from atomic pieces (such as Word and Phrase), but also from OrList and AndList subelements.

Parsing logic:
1) Split input query by quote char ('"'). Basically even elements in this split list are phrases. Everything else does not contain phrases.
2) After phrases are extracted, partially processed list consists of strings and Phrase objects.
3) Next step is to extract parenthesis that are not wrapped by other parentheses. Consider example:
(one and (two or three)) or four
"one and (two or three)" would be extracted from original query, and then recursively processed again.
The recursion ends when there are no more parentheses.
4) When recursion is so deep, that there are no parentheses left in a subquery, then that subquery is split by "or" word (or "|" char).
5) Last major step is to split remaining subquery into AndList. The trick here would be to appropriately assign negation if subquery was preceded by negation orerator "not", "!", or leading dash ("-").

After query is fully parsed, it's time to get final CONTAINS query for SQL Server.
It can be accomplished by overriding ToString() method for Word, Phrase, OrList, and AndList:
1) Word just converting containing text to UpperCase.
2) Phrase's ToString() overload converts InnerText to UpperCase and wraps it by quotes.
3) AndList joins all elements into single string using "&" separator or "&!" separator if one of AndList's subqueries was negated.
4) OrList joins all elements into single string using "|" separator. Then it wraps the result by parenthesis.

2 comments:

Anonymous said...

Hi

I like your logic and am searching for a C# code implementation that does exactly this. Any clues where I might find an example of this?

Thanks! Paul

Dennis Gorelik said...

I spent about 3 days writing that code.
You may do that too.

I don't know about any published solutions for that.

Followers

About Me

My photo
Email me: blog@postjobfree.com