SVM based URL classification – Part 2

In our previous article we discussed how SVM works; now it’s time to move from theory to practice.

Processing the document

This post ignores metadata inside a document (for example, the title of an HTML page); metadata will be covered in later posts. First we need to normalize the document.

Stop words

The first step is to remove stop words: words that will not help our classification attempt. Here is a short example list: a, the, on, of, in. Many sites provide stop-word lists in various languages.
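A minimal sketch of stop-word removal; the five-word stop list below is just the illustrative list from above, while in practice you would load a full list for your language:

```python
# Illustrative stop list -- a real system loads a full per-language list.
STOP_WORDS = {"a", "the", "on", "of", "in"}

def remove_stop_words(text):
    """Lowercase the text and drop any stop word."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The father likes a car"))  # ['father', 'likes', 'car']
```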

Zipf’s law

Zipf’s law establishes a relationship between word frequency and rank in natural languages. The intuition behind it is that both the speaker and the listener try to minimize effort: the speaker prefers the easiest way to express himself, which can make sentences ambiguous and force the listener to work “harder” to understand them, while the listener would prefer the speaker to work “harder” and be detailed and unambiguous.

Zipf’s law explains why stop words exist and why they have minimal effect on the classification process.

Stemming

Stemming is the process of converting keywords to their root stem. For example, the words running, ran, and run are all converted to the base stem run; plural forms are also converted to singular, for example berries becomes berry. Some sites can provide you with a stemming database in various languages.
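An illustrative stemming sketch using a small lookup table built from the examples above; a real system would use a full stemming database or an algorithmic stemmer (e.g. Porter’s algorithm):

```python
# Tiny stand-in for a stemming database -- real tables cover a whole language.
STEM_TABLE = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "berries": "berry",
}

def stem(word):
    """Return the root stem if known, otherwise the word unchanged."""
    return STEM_TABLE.get(word.lower(), word.lower())

print([stem(w) for w in ["running", "ran", "berries", "car"]])
# ['run', 'run', 'berry', 'car']
```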

Converting the document to feature space

Once the document is normalized we need to convert it to feature space. In our previous post we showed a 2D feature space; with documents the feature space can have an unbounded number of dimensions.

To convert the document we compile an index of keywords, for example:

Father – 1
Mother – 2
Car – 3
Truck – 4
Likes – 5

The keywords are taken from the data set of the current category only, and the index is relevant only to that category, so the category ‘porn’ will have a different index than the category ‘news’. Once we have compiled the index from the keywords we have a feature space with N dimensions (N being the number of keywords).

Practical example

To convert a document we create a vector in our N-dimensional feature space; the value of each dimension is the count of the corresponding word.

For example, this is an empty document in our 5-dimensional feature space: (0,0,0,0,0) (a 5-dimensional zero vector).

The sentence: “Father likes car” will be represented as (1,0,1,0,1)
The sentence: “Father likes car father likes mother” will be represented as (2,1,1,0,2)
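The conversion above can be sketched as follows, using the keyword index from the article (Father=1 through Likes=5); the function and variable names are my own:

```python
# Keyword index from the article; positions are 1-based as in the text.
INDEX = {"father": 1, "mother": 2, "car": 3, "truck": 4, "likes": 5}

def to_vector(text, index=INDEX):
    """Convert a (normalized) document to a word-count vector."""
    vec = [0] * len(index)
    for word in text.lower().split():
        if word in index:
            vec[index[word] - 1] += 1  # shift the 1-based index to 0-based
    return vec

print(to_vector("Father likes car"))                      # [1, 0, 1, 0, 1]
print(to_vector("Father likes car father likes mother"))  # [2, 1, 1, 0, 2]
```

Note that the dictionary lookup also quietly skips any word that is not in the index, which matches the handling of missing words described below.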

Phrases

Another approach, which can be used in conjunction with single keywords, is to use two- or three-word phrases in the classification index. Phrases are more specific than single keywords; for example, based on the sentences above, we can add to the index:

Father likes – 6
Likes car – 7
Car father – 8
Likes mother – 9
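Extracting two-word phrases (bigrams) for the index can be sketched like this; the function name is my own:

```python
def bigrams(text):
    """Return all consecutive two-word phrases in the text."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

print(bigrams("Father likes car father likes mother"))
# ['father likes', 'likes car', 'car father', 'father likes', 'likes mother']
```

Three-word phrases follow the same pattern with `zip(words, words[1:], words[2:])`.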

Missing words

Since the index is built from the training set, all of the training set’s words are indexed, but a new document being classified may contain words that are not in the index because the original training set did not contain them. You should not add them to the feature space, unless you decide to add the document to the training set and retrain the algorithm.

SVM based URL classification – Part 1

How it works

SVM is a method used to determine the type of an object; an object can be anything: web pages, text, images, handwriting.

The way it works (without getting into the math; if you do want to look at the math and go deep you can look at this: SVM guide) is that you give the classifier N training examples (objects of the type you are training the classifier to detect), then you give it another N objects of the same type and tweak the classifier to be more accurate.

What you would do is train a classifier for each category and then run a multi-category classification using the SVMs to try to classify the document.
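The one-classifier-per-category scheme can be sketched as below. The scoring functions here are toy keyword-counting stand-ins for trained SVMs, and all names are my own; the point is only the structure: score the document with every category’s classifier and pick the best:

```python
def classify(document, classifiers):
    """Score the document with each category's classifier; return the winner."""
    scores = {cat: clf(document) for cat, clf in classifiers.items()}
    return max(scores, key=scores.get)

# Toy stand-in classifiers (keyword hit counts), NOT real trained SVMs.
classifiers = {
    "news": lambda d: d.count("election"),
    "sports": lambda d: d.count("goal"),
}
print(classify("the team scored a late goal", classifiers))  # sports
```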

How it works visual explanation

SVM looks at the objects in a space (called the feature space, which can have from 2 to n dimensions); for our example we will look at 2D space:

SVM

In the image you can see white and black circles, and the algorithm needs to detect what is a white circle and what is a black circle based on position in the feature space. Training the algorithm is needed so it can determine where the boundary between the white and black circles lies (the solid black line in the image).

In the right image the algorithm is linear, and the margin between the dotted lines is determined when tweaking the classifier on the second run.

In the left image the algorithm is a kernel machine and the boundary is curved; again, the margin between the dotted lines is determined by tweaking the algorithm.

Challenges

The first paragraph is overly simplistic; in reality SVMs are much more complex than magically training the classifier.

Challenge 1 – Algorithm

SVM can use a number of detection algorithms: linear algorithms and curved algorithms (kernel machines), and each comes in a number of variants; each category will benefit from a different algorithm. One approach is to classify each category with a number of algorithms and, when trying to classify an object, hold a vote between the classifiers of the different algorithms.
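The voting idea can be sketched as a simple majority vote; the classifiers below are trivial stand-ins (not real linear or kernel SVMs) and all names are my own:

```python
from collections import Counter

def majority_vote(document, classifiers):
    """Each classifier votes True/False for category membership; majority wins."""
    votes = Counter(clf(document) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy stand-ins for classifiers trained with different algorithms.
clfs = [
    lambda d: "porn" in d,   # pretend linear classifier
    lambda d: len(d) > 10,   # pretend kernel classifier
    lambda d: True,          # pretend always-positive classifier
]
print(majority_vote("short", clfs))  # False (2 votes against, 1 for)
```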

Challenge 2 – Training set

The training set and tweaking set must be accurate: if, for example, you put a news site into the ‘porn’ training set by mistake, it will contaminate the sample and cause the classifier to fail.

Another problem is the number of sites you need to provide. Say you have 100 categories and you need 100 sites for the first run and 100 sites for the tweak: that is a total of 20,000 sites just for one language.

Challenge 3 – Training set coverage

Because there are so many types of sites in a single category, you need to make sure the training set is as broad as possible. Take the category ‘porn’: if you provided 100 sites with the same look and feel (for example, a regular porn site) and then tried to classify a site with a different look and feel (a forum with porn links), the classifier may not be able to classify it correctly.

Challenge 4 – Representing a document

The example in the second paragraph with the 2D circles is pretty straightforward, but with URL classification we deal with documents, which can’t be represented in 2D. There are a number of ways to convert a document; this will be covered in the next post.

Post continues in: SVM based URL classification – Part 2