URL categorization, also called URL classification (or data categorization / data classification when the goal is to classify the content of a web page), can look magical, and in a way it is: even if you use one of the well-known ways to classify a web page, you still need to be creative to get the best result, meaning few false positives and few false negatives.
The common ways to perform classification are listed below. Other posts on this site discuss each of them in greater depth and go into the pros and cons of every approach:
Non weighted keyword based
You keep a list of keywords for each category, and if a document contains one of those keywords, it is assigned that keyword's category. For example, you can decide that every document containing the word ‘fuck’ should be blocked, even though some documents that contain the word are otherwise family friendly. You can read more about: Non weighted keyword based URL classification.
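As a rough illustration of the idea above, here is a minimal sketch in Python. The category names, the keyword lists and the function names (`tokenize`, `classify`) are all made up for the example; a real list would be far larger.

```python
import re

# Hypothetical keyword lists, one set per category (illustration only).
CATEGORY_KEYWORDS = {
    "gambling": {"casino", "poker", "jackpot"},
    "sports": {"football", "league", "goalkeeper"},
}


def tokenize(text):
    """Lower-case the text and return the set of distinct words in it."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def classify(text):
    """Return every category that has at least one keyword in the text."""
    words = tokenize(text)
    return sorted(
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if words & keywords  # a single matching keyword is enough
    )


print(classify("Play poker online"))
print(classify("The poker league final"))
```

Note that a single hit is enough to assign a category, which is exactly why this approach tends to produce false positives.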
Weighted keyword based
This is the same approach as the previous one, but each keyword has its own score, some low and some high, and a document is placed in a category only if its total score passes a certain threshold. This allows greater accuracy than the previous method. You can read more about: Weighted keyword based URL classification.
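The weighted variant can be sketched in the same way. The weights, the threshold of 6 and the names `score` and `classify_weighted` are assumptions chosen for the example, not values taken from any real product. In this sketch every occurrence of a keyword adds its weight, which is one possible design among several.

```python
import re

# Hypothetical per-category keyword weights (illustration only).
KEYWORD_WEIGHTS = {
    "gambling": {"casino": 5, "poker": 4, "jackpot": 3, "bet": 2, "win": 1},
}

# Hypothetical threshold a document must reach to be put in a category.
THRESHOLD = 6


def score(text, weights):
    """Sum the weight of every keyword occurrence found in the text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return sum(weights.get(token, 0) for token in tokens)


def classify_weighted(text, threshold=THRESHOLD):
    """Return the categories whose total score reaches the threshold."""
    return sorted(
        category
        for category, weights in KEYWORD_WEIGHTS.items()
        if score(text, weights) >= threshold
    )


print(classify_weighted("You win some, you lose some"))
print(classify_weighted("Casino poker night: win the jackpot"))
```

A weak keyword such as "win" no longer triggers the category by itself, while several strong keywords together do.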
Support Vector Machine (SVM)
The Support Vector Machine was invented about 20 years ago (at the time of writing) and lets the software learn what a certain category looks like from a training set. First you give it N documents of a certain type (you can also provide N documents of a different type for further training). Then you tune the algorithm by calibrating it with another set of N documents of the same type. From that point on, the algorithm can try to determine the classification of a new document.
You can read more about: SVM based URL classification – Part 1, SVM based URL classification – Part 2.
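To make the training step a little more concrete, here is a toy sketch of a linear SVM written in plain Python. It minimizes the L2-regularized hinge loss with simple sub-gradient steps over bag-of-words counts. Everything in it is an assumption made for illustration: the four-document training set, the function names, the learning rate and the missing bias term. It also skips the calibration step on a second document set. In practice you would normally rely on a library implementation rather than code like this.

```python
import re
from collections import Counter


def features(text):
    """Bag-of-words counts for a document."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def decision_score(weights, text):
    """Positive score: looks like the category. Negative: it does not."""
    return sum(weights.get(word, 0.0) * count
               for word, count in features(text).items())


def train_linear_svm(examples, epochs=50, lr=0.1, lam=0.01):
    """Train word weights on (text, label) pairs, label being +1 or -1.

    Uses sub-gradient descent on the L2-regularized hinge loss.
    There is no bias term, to keep the sketch short.
    """
    weights = {}
    data = [(features(text), label) for text, label in examples]
    for _ in range(epochs):
        for counts, label in data:
            margin = label * sum(weights.get(word, 0.0) * count
                                 for word, count in counts.items())
            # Regularization step: shrink every weight slightly.
            for word in weights:
                weights[word] *= 1.0 - lr * lam
            # Hinge-loss step: only examples inside the margin move weights.
            if margin < 1.0:
                for word, count in counts.items():
                    weights[word] = weights.get(word, 0.0) + lr * label * count
    return weights


# Hypothetical training set: +1 means "gambling", -1 means "not gambling".
TRAINING_SET = [
    ("casino poker jackpot bets", +1),
    ("poker casino bonus slots", +1),
    ("recipe flour oven bread", -1),
    ("bread recipe butter oven", -1),
]

model = train_linear_svm(TRAINING_SET)
print(decision_score(model, "online poker and casino slots") > 0)
print(decision_score(model, "bread oven tips") < 0)
```

The sign of the score is the predicted class. A document made only of words the model has never seen scores exactly zero, so a real system would need a rule for that case.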
Manual classification
Every page is reviewed by a human, who sets the correct category. A strict set of rules is needed because each person thinks differently (and can be influenced by culture and religion), and it’s not uncommon for two people to disagree over the category of a specific document. For example, is a nude Renaissance portrait art, or is it nudity?
Manual classification with crowdsourcing
This is the same approach as manual classification, but instead of having a number of trained people do the classification, you leverage the power of the crowd with services like Amazon Mechanical Turk or Microworkers to classify a document.
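Because crowd workers can disagree just like trained reviewers, one common way to combine their answers is a majority vote with a minimum level of agreement. The sketch below is only an assumption about how that could look; the function name `aggregate_votes` and the 60% agreement level are made up for the example.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.6):
    """Return the majority label, or None if agreement is too low.

    votes: list of category labels, one per crowd worker.
    min_agreement: hypothetical share of workers that must agree.
    """
    if not votes:
        return None
    label, top_count = Counter(votes).most_common(1)[0]
    if top_count / len(votes) >= min_agreement:
        return label
    return None  # no clear winner: send the page to a trained reviewer


print(aggregate_votes(["art", "art", "nudity"]))
print(aggregate_votes(["art", "nudity"]))
```

Pages that come back as `None` are the disputed ones, such as the portrait example, and are good candidates for review under the strict rule set.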
Link based
This method can be used as a secondary helper to the previous methods. For example, we can assume that a web page links out to similar pages, so if we have a list of popular gambling sites, we can fairly safely assume that a web page with outgoing links to those sites is related to gambling (this approach will not work with portals and statistics sites).
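A minimal sketch of the outgoing-link idea, using only the Python standard library. The host names in `KNOWN_GAMBLING_HOSTS` are placeholders, and the class and function names are hypothetical. The sketch only measures what share of a page's absolute outgoing links point at the known list; how to turn that share into a decision is left open.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Placeholder list of known gambling hosts (illustration only).
KNOWN_GAMBLING_HOSTS = {"casino.example", "poker.example"}


class LinkCollector(HTMLParser):
    """Collects the host name of every absolute <a href="..."> link."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            host = urlparse(href).hostname  # None for relative links
            if host:
                self.hosts.append(host)


def gambling_link_ratio(html):
    """Share of outgoing links that point at a known gambling host."""
    collector = LinkCollector()
    collector.feed(html)
    if not collector.hosts:
        return 0.0
    hits = sum(1 for host in collector.hosts if host in KNOWN_GAMBLING_HOSTS)
    return hits / len(collector.hosts)


page = (
    '<a href="https://casino.example/play">Play</a> '
    '<a href="https://news.example/">News</a> '
    '<a href="/about">About</a>'
)
print(gambling_link_ratio(page))
```

A portal that links to everything would show a low ratio even when it does link to gambling sites, which matches the limitation mentioned above.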
Image based
At the time of writing this article there isn’t any credible service that uses this approach, but it could become a legitimate option once computer vision matures. It works by trying to determine the classification of the page from the types of images that appear on it.