Non-weighted keyword-based URL classification

Non-weighted keyword-based URL classification can be a good fit for environments where zero tolerance is needed.

A quick recap: keyword-based classification means that when a keyword is encountered, the document or web page is classified according to that keyword. For example, if the word ‘sex’ appears, we can assume an adult document (you can read the entire URL Classification summary).

The problem with this approach is that some words can be either good or bad. Take the keyword ‘sucks’: it can be an indication of adult content, but it can also appear in legitimate phrases such as “man, that test sucks”, or in a phrase that may or may not be adult, such as “that woman was sucking a lollipop”.
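
To make the ambiguity concrete, here is a minimal sketch of a non-weighted keyword classifier. The keyword list and the names ADULT_KEYWORDS, tokenize and classify are hypothetical and chosen only for illustration. A single keyword match decides the category, which is exactly why both example phrases above end up flagged.

```python
import re

# Hypothetical keyword list, for illustration only.
ADULT_KEYWORDS = {"sex", "porn", "sucks", "sucking"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text):
    """Return 'adult' if any token is a listed keyword, else 'unknown'.

    Non-weighted: the first match decides the result. No scores are
    summed and the surrounding words are never considered.
    """
    for token in tokenize(text):
        if token in ADULT_KEYWORDS:
            return "adult"
    return "unknown"


# Both phrases are flagged, although the first one is clearly harmless.
print(classify("man, that test sucks"))
print(classify("that woman was sucking a lollipop"))
print(classify("weather forecast for tomorrow"))
```

Matching on whole tokens rather than substrings is a design choice in this sketch: it avoids flagging a word like ‘Essex’ because it contains ‘sex’, but it does nothing about the ambiguity of the keyword itself.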

The non weighted keyword approach can work well in two scenarios:

  1. You need to block a search phrase and don’t have enough information to know whether the search is of an adult nature (that is, when you don’t take the search results into account).
  2. You don’t care about false positives and prefer to be overcautious.

We can see an example of such blocking in Google: if SafeSearch is enabled, some keywords will not return any results. For example, the keyword ‘porn’ would be blocked, even though it has some legitimate uses, as in ‘porn blocker’.
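
The search-phrase scenario can be sketched in the same non-weighted style. This is not how Google implements its filter. It is only an illustration, and it assumes the search phrase arrives in a query-string parameter named ‘q’ on a made-up host. The names BLOCKED_KEYWORDS and is_blocked_search are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical block list, for illustration only.
BLOCKED_KEYWORDS = {"porn"}


def is_blocked_search(url, param="q"):
    """Return True if any word of the search phrase is a blocked keyword.

    Assumes the phrase is carried in the query-string parameter `param`.
    Only the phrase is inspected; search results are not taken into account.
    """
    query = parse_qs(urlparse(url).query)
    for phrase in query.get(param, []):
        tokens = re.findall(r"[a-z]+", phrase.lower())
        if any(token in BLOCKED_KEYWORDS for token in tokens):
            return True
    return False


# 'porn blocker' is a legitimate search, yet it is blocked all the same.
print(is_blocked_search("https://search.example.com/search?q=porn+blocker"))
print(is_blocked_search("https://search.example.com/search?q=cake+recipes"))
```

The first call shows the trade-off from the list above: with no weighting and no look at the results, the filter cannot tell ‘porn blocker’ apart from an adult search, so it errs on the side of blocking.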