SVM based URL classification – Part 1

How it works

SVM is a method used to determine the type of an object, and object can be anything like: web pages, text, images, hand writing.

The way it works (without getting into the math, if you do want to look at the math and go deep you can look at this: SVM guide) is that you give a classifier N amount of training sets (objects of the type you train the classifier to detect), and then you give the classifier another N of the same type of objects and you tweak the classifier to be more accurate.

What you would do is train a classifier for each category and then run a multi category classification using the SVM to try and detect the document.

How it works visual explanation

SVM looks at the objects in space (it’s called feature space and it can be from 2 dimensions to n-dimensions), for our example we will look at 2d space:

SVM

In the image you can see white circles and the algorithm needs to detect what is a white circle and what is black circle based on position in feature space, training the algorithm is needed to it can determined where does the boundary between the white and black circles (the solid black line in the image).

In the right image the algorithm uses is linear and the space between the dotted lines is determined when tweaking the classifier on the second run.

In the left image the algorithm is a kernel machine and is curve, again the space between the dotted line is determined by tweaking the algorithm.

Challenges

The first paragraph is overly simplistic; in reality SVMs are much more complex then magically training the classifier.

Challenge 1 – Algorithm

SVM can use number of detection algorithms: Linear algorithms, curved algorithms (Kernel machines) and each one has number of types, each category will benefit from a different algorithm. Once approach is to classify each category with number of algorithms and when trying to classify an object do a vote between classifiers of different algorithms.

Challenge 2 – Training set

The training set and tweaking set must be accurate because if for example on ‘porn’ training set you would put a news site by mistake, it will contaminate the sample and will cause the classifier to fail.

Another problem is the number of sites you need to provide, let say you have 100 categories and you need 100 sites for the first run and 100 sites for the tweak, you need to provide a total of 20,000 sites just for one language.

Challenge 3 – Training set coverage

Because there are so many types of sites in a single category you need to make sure the training set is broad as possible, for example let take the category ‘porn’, if you provided 100 sites of the same look and feel (for example a regular porn site) and then you tried to classify a different look and feel site (a forum with porn links) the classifier may not be able to correctly classify the site.

Challenge 4 – Representing a document

The example on the second paragraph with the 2D circles is pretty straight forward but with URL classification we use documents which can’t be represented in 2D, there are number of ways to convert a document, this will be covered in the next post.

Post continues in: SVM based URL classification – Part 2