SVM based URL classification – Part 2

In our previous article we discussed how SVM works; now it’s time to move from theory to practice.

Processing the document

This post will ignore the metadata inside a document (for example, the title of an HTML page); metadata will be covered in later posts. First we need to normalize the document.

Stop words

The first step is to remove stop words. Stop words are words that will not help our classification attempt; for example, here is a short list: a, the, on, of, in. You can find many sites that provide stop word lists in various languages.
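
As a minimal sketch (the stop word list below is just the short sample from above; real lists contain hundreds of entries per language), the filtering step might look like this:

```python
# A tiny sample stop word list for illustration; production lists
# are much longer and language specific.
STOP_WORDS = {"a", "the", "on", "of", "in"}

def remove_stop_words(words):
    """Keep only the words that are not stop words."""
    return [w for w in words if w.lower() not in STOP_WORDS]

print(remove_stop_words("the father likes a car".split()))
# ['father', 'likes', 'car']
```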

Zipf’s law

Zipf’s law establishes a relationship between word frequency and rank in natural languages: the most frequent word occurs roughly twice as often as the second most frequent word, three times as often as the third, and so on. One common explanation is the principle of least effort. The speaker wants to communicate as easily as possible, so he favors short, common words even when the result is ambiguous and forces the listener to process the conversation “harder” in order to understand; the listener, on the other hand, wants the speaker to work “harder” and be detailed and unambiguous. The balance between these two forces produces a small set of very frequent words and a long tail of rare ones.

Zipf’s law explains why stop words exist and why removing them has a minimal effect on the classification process: the most frequent words appear in documents of every category, so they carry almost no discriminating information.

Stemming

Stemming is the process of reducing keywords to their root stem. For example, the words running, ran, and run will all be converted to the base stem run. Stemming also converts plural forms to singular: berries and berry will both be converted to berry (the singular form). Some sites can provide you with a stemming database in various languages.
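
A real system would use a proper stemmer (such as the Porter stemmer) or a stemming database as mentioned above; the toy rule set below is only a sketch that covers the examples in this section:

```python
# A toy suffix-stripping stemmer for illustration only; real stemmers
# (e.g. the Porter stemmer) use far more elaborate rule sets.
def stem(word):
    word = word.lower()
    if word.endswith("ies"):    # berries -> berry
        return word[:-3] + "y"
    if word.endswith("ning"):   # running -> run
        return word[:-4]
    if word == "ran":           # irregular past tense
        return "run"
    if word.endswith("s"):      # cars -> car
        return word[:-1]
    return word

print([stem(w) for w in ["running", "ran", "run", "berries", "berry"]])
# ['run', 'run', 'run', 'berry', 'berry']
```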

Converting the document to feature space

Once the document is normalized we need to convert it to feature space. In our previous post we showed a 2D feature space; with documents the feature space has one dimension per distinct keyword, so in practice it can grow very large.

To convert the document we compile an index of keywords, for example:

Father – 1
Mother – 2
Car – 3
Truck – 4
Likes – 5

The keywords are taken from the data set of the current category only, and the index is relevant only to that category, so for the category ‘porn’ we will have a different index than for the category ‘news’. Once we have compiled the index from the keywords, we have a feature space with N dimensions (N being the number of keywords).
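
As a rough sketch, building such an index from a category’s training documents could look like this (the two document strings here are made-up examples):

```python
def build_index(documents):
    """Assign a 1-based dimension number to each unique keyword,
    in order of first appearance."""
    index = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index

# Hypothetical (already normalized) training documents for one category.
training_docs = ["father mother", "car truck likes"]
print(build_index(training_docs))
# {'father': 1, 'mother': 2, 'car': 3, 'truck': 4, 'likes': 5}
```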

Practical example

To convert a document we create a vector in our N-dimensional feature space; the value of each dimension is the number of times the corresponding word appears in the document.

For example, this is an empty document in our 5-dimensional feature space: (0,0,0,0,0) (a 5-dimensional zero vector)

The sentence: “Father likes car” will be represented as (1,0,1,0,1)
The sentence: “Father likes car father likes mother” will be represented as (2,1,1,0,2)
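
A minimal sketch of this conversion, using the hypothetical five-keyword index from above:

```python
# The 1-based keyword index from the example above.
index = {"father": 1, "mother": 2, "car": 3, "truck": 4, "likes": 5}

def vectorize(document, index):
    """Return an N-dimensional count vector for the document,
    where N is the number of keywords in the index."""
    vector = [0] * len(index)
    for word in document.lower().split():
        if word in index:                  # unknown words are skipped
            vector[index[word] - 1] += 1   # index values are 1-based
    return vector

print(vectorize("Father likes car", index))
# [1, 0, 1, 0, 1]
print(vectorize("Father likes car father likes mother", index))
# [2, 1, 1, 0, 2]
```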

Phrases

Another approach, which can be used in conjunction with single keywords, is to use two- or three-word phrases in the classification index. Phrases are more specific than individual keywords; for example, based on the sentences above, we can add the following phrases to the index (a sketch of extracting them follows the list):

Father likes – 6
Likes car – 7
Car father – 8
Likes mother – 9
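
A sketch of extracting two-word phrases (bigrams) and appending them to the existing keyword index, continuing the numbering:

```python
def extract_bigrams(document):
    """Return the consecutive two-word phrases in the document."""
    words = document.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def extend_index(index, phrases):
    """Add new phrases to the index, continuing the numbering."""
    for phrase in phrases:
        if phrase not in index:
            index[phrase] = len(index) + 1
    return index

index = {"father": 1, "mother": 2, "car": 3, "truck": 4, "likes": 5}
bigrams = extract_bigrams("father likes car father likes mother")
print(extend_index(index, bigrams))
# adds: 'father likes' -> 6, 'likes car' -> 7,
#       'car father' -> 8, 'likes mother' -> 9
```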

Missing words

Since the index is built from the training set, all the words in the training set will be indexed. When classifying a new document, however, there may be words that are not in the index, because the original training set does not contain them. You should not add them to the feature space, unless you decide to add the document to the training set and retrain the algorithm.
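
Reusing the hypothetical vectorize() sketch from the practical example above, the membership check is what quietly drops such out-of-vocabulary words:

```python
index = {"father": 1, "mother": 2, "car": 3, "truck": 4, "likes": 5}

# 'bicycle' never appeared in the training set, so it has no dimension;
# vectorize() skips it instead of adding it to the feature space.
print(vectorize("father likes bicycle", index))
# [1, 0, 0, 0, 1]
```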