matching webpages keywords to a set of keywords in database

in my database i’ve list of say 5000 rows (as keywords). now given an article (may be even 1000 words), i want to match any words matching to keywords in database. for example, i’ve these records in db (rows are mentioned using comma):
tv, tv and videos, movie, horror movie, camera, digital camera, canon digital camera

Each row has another column named ‘URL’. And my input string may be like:

Read More

i hate horror movies. Canon releases a new digital camera. So far I own 3 digital cameras…’

from the above string, i need to match:

  • horror movies against my stored keyword horror movie. but i don’t want only movies to be matched against movie as horror movie keyword is more appropriate here.
  • digital camera against stored keyword digital camera
  • digital cameras… against stored keyword digital camera

(bold phrases exists in the article, italic phrases are in database)

Iteration through each keywords in database may be impossible and unrealistic. so far i’ve learned Solr may be a fit. but i’m not sure how will I index & query Solr. For querying Solr, I’ve to provide the keywords. But i do not know what are my keywords. I just know the whole article. A keyword may consists 1 or many words. Totally random, but i can say maximum 5 words.

After matching, i need to replace the keywords in the article with the next column’s (URL) value. for example, the keywords in the article horror movies need to be replaced with the URL column of the horror movie.

can anyone enlighten me with the correct path? any help is appreciated.

thanks in advance

Related posts

Leave a Reply

1 comment

  1. I think there are actually a few parts to your question but I am to assume that you basically want to submit an article to your website and then your website backend will the process this article and replace all keywords with a bold font and url?

    Given that you want to manipulate your article and inject data from your db perhaps a raw php solution would fit the bill (but I must admit I’m not familiar enough with Solr / Lucene to offer advice). You would obviously have to do some benchmarking but 5,000 rows of data is not unreasonable amounts of data to deal with in this type of system as I assume you won’t be submitting articles every few seconds or even minutes?

    Therefore you would need to:

    1. Select all your ‘keywords’ from your database at once, including your url’s (to be stored in memory for processing the article).
    2. Convert (if required) your keyword data into a multi-dimensional associate array (as it will be easier to deal with if your db query hasn’t already done this).
    3. Establish your plural variations either in an additional array or perhaps, if they vary from keyword to keyword in an additional column in your db).
    4. When you submit your article you have to (one way or other) loop through all 5000 keywords (including additional loops for all plural variations) searching the whole article for said keywords.
    5. If any keywords are found in your article you would then obviously need to replace these matches with the bold font and matching url.
    6. Then once all keywords in the article have been processed you obviously need to then either save or show this article to the user?

    Two other points regarding your keywords are:

    • Establishing if your plural keywords extend beyond a ‘s’ postfix
      (i.e. are you also trying to match plural words which actually
      change the spelling like, ‘buddy’ => ‘buddies’)?
    • Grammatically, how far are you prepared / need to go, would you need
      to or want to match things like digital camera’s to digital camera?

    Sorry for the rather long winded / cryptic answer but I think it’s important to understand the full scope of your project before looking for a solution. You may already have these details down but from your question I think there are too many variables to offer a complete answer. I would gladly offer some code samples / further explanation on any of the above points if you require?

    Edit: Bit confused as you make reference to identifying your keywords, aren’t the keywords the words already in your db? Or are you trying to organically create a db of keywords from submitted articles, so your system will harvest your articles for keywords whilst also searching and replacing for currently stored keywords from your db?