High level overview of URL classification (URL categorization)

URL categorization, also called URL classification (or data categorization and data classification when you are classifying the content of a web page), can look magical. In a way it is: even if you use one of the known ways to classify a web page, you still need to be creative to get the best result, which means low rates of false positives and false negatives.

The common ways to perform classification are listed below. This site discusses each of them in greater depth in other posts, including the pros and cons of every approach:

Non weighted keyword based

You have a list of keywords for each category, and if a document contains a specific keyword you assign it that keyword's category. For example, you can say that every document containing the word ‘fuck’ should be blocked, even though some documents that contain that word would otherwise be family friendly. You can read more about: Non weighted keyword based URL classification.
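
As a rough illustration of the idea, here is a minimal Python sketch. The keyword lists, the category names and the `classify` function are hypothetical and exist only for this example; a real deployment would need much larger lists.

```python
import re

# Hypothetical keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    "adult": {"porn", "sex"},
    "gambling": {"casino", "poker"},
}

def classify(text):
    """Return every category that has at least one keyword in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if words & keywords  # a single matching keyword is enough
    )

print(classify("Play poker at our online casino"))  # ['gambling']
print(classify("A recipe for sourdough bread"))     # []
```

One hit is enough to assign the category, which is exactly why this method is prone to false positives.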

Weighted keyword based

This is the same approach as the previous section, but each keyword has its own score, some low and some high, and a document is placed in a category only if its total score passes a certain threshold. This allows greater accuracy than the previous method. You can read more about: Weighted keyword based URL classification.
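
A small Python sketch of the weighted variant might look like this. The weights, the threshold and the function names are assumptions picked for the example, not values from a real classifier.

```python
import re

# Hypothetical weights and threshold, chosen only for illustration.
ADULT_WEIGHTS = {"porn": 10, "sex": 5, "nude": 3, "sucks": 1}
THRESHOLD = 8

def score(text, weights):
    """Sum the weight of every keyword occurrence in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(weights.get(word, 0) for word in words)

def is_adult(text):
    """A document is adult only when its total score reaches the threshold."""
    return score(text, ADULT_WEIGHTS) >= THRESHOLD

print(is_adult("man, that test sucks"))         # False (score 1)
print(is_adult("sex education for teenagers"))  # False (score 5)
print(is_adult("free porn and sex videos"))     # True  (score 15)
```

A weak keyword on its own no longer triggers the category; several weak keywords or one strong keyword are needed.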

SVM

Support Vector Machine (SVM) is a machine learning method that took its modern form in the 1990s. It lets the software learn what a certain category looks like from a training set. First you give it N documents of a certain type (you can also provide N documents of a different type for further training), then you tune the algorithm by calibrating it against another set of N documents of the same type. From that point on, the algorithm can try to determine the classification of a document.

You can read more about: SVM based URL classification – Part 1, SVM based URL classification – Part 2.
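
To make the training idea concrete, here is a toy linear SVM in plain Python, trained with subgradient descent on the hinge loss over a bag of words. It is a simplified sketch and not the implementation from the linked posts: it has no bias term, it skips the calibration step on a second document set, and the training texts, the labels and the parameters `epochs`, `lr` and `lam` are all assumptions. In practice you would most likely use an existing SVM library.

```python
def tokenize(text):
    return text.lower().split()

def train_linear_svm(samples, epochs=20, lr=0.1, lam=0.01):
    """samples: list of (text, label) pairs, label is +1 or -1.
    Returns a dict that maps each word to its learned weight."""
    weights = {}
    for _ in range(epochs):
        for text, label in samples:
            words = tokenize(text)
            margin = label * sum(weights.get(w, 0.0) for w in words)
            # L2 regularisation: shrink every weight slightly on each step.
            for w in weights:
                weights[w] *= 1.0 - lr * lam
            # Hinge loss: only samples inside the margin move the weights.
            if margin < 1.0:
                for w in words:
                    weights[w] = weights.get(w, 0.0) + lr * label
    return weights

def predict(weights, text):
    score = sum(weights.get(w, 0.0) for w in tokenize(text))
    return "gambling" if score > 0 else "other"

# Hypothetical training set: +1 is gambling, -1 is everything else.
TRAINING = [
    ("poker casino bet", 1),
    ("casino jackpot bet", 1),
    ("recipe flour oven", -1),
    ("oven bake recipe", -1),
]

model = train_linear_svm(TRAINING)
print(predict(model, "casino bet tips"))    # gambling
print(predict(model, "bake bread recipe"))  # other
```

Unlike the keyword methods, nobody chose the word weights by hand here; they were learned from the labelled examples.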

Manual classification

Every page is reviewed by a human, who sets the correct category. A strict set of rules is needed because each person thinks differently (and can be affected by culture and religion), and it’s not uncommon for two people to disagree over the category of a specific document. For example, is a nude Renaissance portrait art, or is it nudity?

Manual classification with crowdsourcing

This is the same approach as manual classification, but instead of having a number of trained people do the classification, you leverage the power of the crowd through services like Amazon Mechanical Turk or Microworkers to classify a document.
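
Because crowd workers will disagree with each other, you usually ask several of them about the same document and combine the answers. The sketch below uses a simple majority vote. The `majority_label` function and its `min_agreement` cut-off are assumptions for illustration and not part of any crowdsourcing service's API.

```python
from collections import Counter

def majority_label(votes, min_agreement=0.6):
    """Return the winning label, or None when workers disagree too much."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

print(majority_label(["art", "art", "nudity"]))  # art
print(majority_label(["art", "nudity"]))         # None
```

A `None` result can be sent to more workers or escalated to a trained reviewer.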

Link based

This method can be used as a secondary helper to the previous methods. For example, we can assume that a web page links to similar pages, so if we have a list of popular gambling sites, we can fairly safely assume that a web page with outgoing links to those sites is related to gambling. This approach will not work with portals and statistics sites, which link to sites of every kind.
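
A minimal sketch of such a link signal, using only the Python standard library, could look like this. The host names in `KNOWN_GAMBLING_HOSTS` and the `gambling_link_ratio` function are hypothetical. The ratio is meant as one input to another classifier and not as a verdict on its own.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical list of already classified gambling sites.
KNOWN_GAMBLING_HOSTS = {"casino.example", "poker.example"}

class LinkCollector(HTMLParser):
    """Collects the host name of every absolute link on a page."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        host = urlparse(href).hostname if href else None
        if host:  # relative links have no host and are skipped
            self.hosts.append(host)

def gambling_link_ratio(html):
    """Fraction of outgoing links that point to known gambling hosts."""
    parser = LinkCollector()
    parser.feed(html)
    if not parser.hosts:
        return 0.0
    hits = sum(1 for host in parser.hosts if host in KNOWN_GAMBLING_HOSTS)
    return hits / len(parser.hosts)

page = (
    '<a href="http://casino.example/slots">Slots</a>'
    '<a href="http://poker.example/">Poker</a>'
    '<a href="http://news.example/">News</a>'
    '<a href="/about">About</a>'
)
print(gambling_link_ratio(page))  # 2 of 3 external links are gambling hosts
```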

Computer vision

At the time of writing there isn’t any credible service that uses this approach, but it can become a legitimate option once computer vision matures. It works by trying to determine the classification of a page from the type of images on it.

Non weighted keyword based URL classification

Keyword based URL classification (non weighted) can be good for environments where zero tolerance is needed.

A quick recap: keyword based classification means that when a keyword is encountered, the document or web page is classified based on that keyword. For example, if the word ‘sex’ appears, we can assume an adult document (you can read the entire: URL Classification summary).

The problem with such an approach is that some words can be either good or bad. Take the keyword ‘sucks’: it can be an indication of adult content, but it can also appear in legitimate phrases such as “man, that test sucks”, or in a phrase that may or may not be adult, such as “that woman was sucking a lollipop”.

The non weighted keyword approach can work well in two scenarios:

  1. You try to block a search phrase and don’t have enough information to know whether the search is of an adult nature (if you don’t take the search results into account).
  2. You don’t care about false positives and prefer to be overcautious.

We can see an example of such blocks in Google: if safe search is enabled, some keywords will not return any results. For example, the keyword ‘porn’ is blocked, although it has some legitimate uses, as in ‘porn blocker’.
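
The search phrase scenario can be sketched in a few lines of Python. This is only an illustration of the zero tolerance idea and makes no claim about how any real search engine implements it. The keyword list, `is_blocked_query`, `search` and the `backend` callable are all hypothetical.

```python
import re

# Hypothetical block list, for illustration only.
BLOCKED_KEYWORDS = {"porn", "sex"}

def is_blocked_query(query):
    """Block the whole query as soon as one word is on the list."""
    words = re.findall(r"[a-z]+", query.lower())
    return any(word in BLOCKED_KEYWORDS for word in words)

def search(query, backend):
    """Return no results for a blocked query, otherwise ask the backend."""
    return [] if is_blocked_query(query) else backend(query)

def fake_backend(query):
    # Stand-in for a real search engine call.
    return ["result for " + query]

print(search("weather tomorrow", fake_backend))  # ['result for weather tomorrow']
print(search("porn blocker", fake_backend))      # [] (a false positive)
```

The second query is a legitimate one, and blocking it is the price of being overcautious.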