By Leon van der Veen on Sunday February 2, 2014 10:48 pm

To make my site more complete I wanted to add a search engine. Because I store my data in a MongoDB database and because I wanted to practice with MongoDB, I decided to create my own search engine.

My search engine requires some preparation on the data. The algorithm leans on generated keywords and their weights. Every item stored in the database is analyzed for keywords. The keywords are sorted using the appearance count. After this step some often used words are filtered out. You should think of words like "the", "to", "me", "you", "am", etc. These words, because they appear so often, will ruin the search results. The counts of the filtered keywords will now be normalized, giving every keyword a percentage of occurrence in the (filtered) text.

The search algorithm works on every collection (table) in the database with these analyzed keywords. If you search on for example "MongoDB sql" a query like this will occur: give all items whose keywords contain "mongodb" or "sql". You are searching on two words, but the search engine will also return items who just (partial) match one of these words.

The above algorithm is not really a "good" search engine yet. If you search on "MongoDB sql" you want to find items who match both words and probably not who match just one of the two. To solve this I created a ranking system. The ranking system takes three things in account: 1) the number of keywords who match the search query, 2) the weights of the keyword found in the item and 3) the completeness of the matching keywords. For each matched keyword in an item a score will be calculated using the three points stated earlier. The score is determined the following way:

item.score = item.score + keyword_weight × (searchWord.length / keyword.length)

So for each match on a keyword the score will be made higher. The (searchWord.length / keyword.length) part makes sure that matches who completely match a keyword are ranked higher as words who partly matches a keyword. For example when you search on "sql" the keyword "nosql" will only match 3 characters out of 5 giving it a lower ranking than when the keyword was "sql", because then the ration would have been 3 out of 3.

This gives the following algorithm:

  • Keyword generation:
    • Analyze text for words
      • Count each word on appearance
      • Sort words on count (big to small)
    • Filter most common words out (for better search performance)
    • Normalize the word counts to percentages
      • For each word store a weight number calculated as following:

        weight = word_count / word_count_sum
         
    • Store all found words and their weights
      • For example:

        db.collection.insert({
          "keywords": {"words": ["keyword1", "keyword2", ..., "keywordN"],
          "weight": [0.02, 0.0113, ..., 0.00013]},
          ...
        })

         
  • Search algorithm:
    • Search words in MongoDB:

      db.collection.find({
        "keywords.words": ["word1", "word2", ..., "wordN"]
      }).sort({"$natural": 1})

       
    • For each found item a ranking score is determined
      • Initialize the score on zero: item.score = 0
      • For each matched keyword do:

        item.score = item.score + (
          keyword_weight
          × (searchWord.length / keyword.length)
        )

         
    • The items are sorted using this ranking score (big to small)

This algorithm gives interesting results. For example when you search on "MongoDB sql" it is possible that an item who only matches on the word "mongodb" is ranked higher, because this word appeared quite frequently in the text, than an item who matches both "mongodb" and "sql", because these words maybe only occurred once in the text. If this is preferred behavior is open for discussion. An up-side is that texts who really have "mongodb" as a topic are preferred above pages who maybe loosely mentioned "mongodb" and "sql".

There are a few down-sides to my algorithm. Namely that for every item in the database a (big) list of keywords need to be stored, taking up precious storage space. Another thing is that I programmed the ranking system in pure PHP code, meaning that the entire unlimited search result needs to be loaded in PHP's memory. As far as I know there are no possible ways in MongoDB to add smart sorting expressions in the query itself.

Images:


Comments

 Last 5 songs I played (Last.fm)

Loading...