Weighted keyword-based URL classification

Overview

The weighted keyword approach allows finer tuning of web site classification than the non-weighted approach.

The weighted keyword method works by assigning a value to each keyword. A keyword that is a strong indicator of a category gets a high value, while a more common word that only indicates the category when repeated often gets a lower value.

Theoretical example

For example, for the category ‘porn’, the keyword ‘blow job’ would get a higher value than the keyword ‘sensual’. The more exact the keyword, the higher the value. Another example of a high-value keyword would be ‘Arizona escorts’, which is very precise.

Practical example

Let’s take a few keywords under porn:

Sex – 50
Porn – 40
Adult – 10

And under news:

News – 20
Reporter – 20
Breaking news – 40

If we analyze this sentence: ‘our reporter just has breaking news about a sex ring that was arrested, further in the news’

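We can see that the score for the news category would be 80 (reporter 20 + breaking news 40 + news 20, with the ‘news’ inside ‘breaking news’ counted only as part of the phrase) and for porn it would be 50 (sex 50), so we can say this document belongs to the news category.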
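The scoring above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names WEIGHTS and score_text are made up for this post, and the rule that a longer phrase is matched first (so ‘breaking news’ is not also counted as ‘news’) is inferred from the score of 80 rather than stated anywhere.

```python
import re

# Hypothetical weights copied from the example above; keys are lowercase.
WEIGHTS = {
    "porn": {"sex": 50, "porn": 40, "adult": 10},
    "news": {"news": 20, "reporter": 20, "breaking news": 40},
}


def score_text(text, weights=WEIGHTS):
    """Return {category: score}, adding a keyword's weight per occurrence.

    Assumption: longer keywords are matched first and then blanked out,
    so 'breaking news' does not also count as a separate 'news' hit.
    """
    scores = {}
    for category, keywords in weights.items():
        remaining = text.lower()
        total = 0
        for keyword in sorted(keywords, key=len, reverse=True):
            pattern = r"\b" + re.escape(keyword) + r"\b"
            # Replace each hit with a separator so it cannot match again.
            remaining, hits = re.subn(pattern, " | ", remaining)
            total += hits * keywords[keyword]
        scores[category] = total
    return scores


sentence = ("our reporter just has breaking news about a sex ring "
            "that was arrested, further in the news")
print(score_text(sentence))
```

A real engine would also need stemming, plural handling and so on; the sketch matches whole words only.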

Category relationships

Once you have analyzed the document, you get a list of categories and scores. (For web sites, not all keywords are treated the same; for example, you might give the title more weight than the body. This will be covered in a different post.) The highest-scoring category is usually the category of the document, and the second may indicate a secondary category. Based on your weights, you will need to decide when to allow the second category, for example when its score is over 50% of the main category’s score.

If you look at the previous example, we could say the main category is news and the secondary is porn, since 50 is more than 50% of 80.
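As a rough sketch, picking the main and secondary category from a score table could look like this. The function name pick_categories and the secondary_ratio parameter are hypothetical, and the 50% cut-off is just the example value from above, not a recommended setting.

```python
def pick_categories(scores, secondary_ratio=0.5):
    """Return (main, secondary) from a {category: score} mapping.

    secondary is None unless the runner-up scores more than
    secondary_ratio times the main category's score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] <= 0:
        return None, None  # nothing matched at all
    main, main_score = ranked[0]
    secondary = None
    if len(ranked) > 1 and ranked[1][1] > secondary_ratio * main_score:
        secondary = ranked[1][0]
    return main, secondary


print(pick_categories({"news": 80, "porn": 50}))
```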

Cross-over threshold

You may want to define a cross-over threshold: if a category’s score passes that number, it is considered the main category. This is usually done with porn/adult, which means that even if that category is not first, the document is still considered porn/adult once the score crosses the boundary.
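One possible way to express the threshold in code is shown below. The function name apply_crossover and the per-category thresholds mapping are assumptions for illustration, and so is the choice to pick the highest-scoring category when several cross their thresholds at once.

```python
def apply_crossover(scores, thresholds):
    """Return the main category, honoring cross-over thresholds.

    thresholds maps a category to the score it must pass to win
    outright, even when another category scored higher.
    """
    crossed = [c for c, limit in thresholds.items()
               if scores.get(c, 0) > limit]
    if crossed:
        # Assumption: if several categories cross, take the strongest one.
        return max(crossed, key=lambda c: scores[c])
    if not scores:
        return None
    return max(scores, key=scores.get)


# With a threshold of 45 for porn, the earlier example flips to porn.
print(apply_crossover({"news": 80, "porn": 50}, {"porn": 45}))
```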

Advanced weighted options

Another use for categories is to decide which categories are ‘bad’, for example all the non-family categories. If the sum of all the ‘bad’ category scores is more than the sum of the ‘good’ category scores, you deem the document ‘bad’ and choose the highest-scoring ‘bad’ category, even if it is not first.
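A minimal sketch of that rule, assuming a hand-maintained set of ‘bad’ category names (the function name and the example categories below are hypothetical):

```python
def classify_with_bad_set(scores, bad_categories):
    """Return (category, is_bad) for a {category: score} mapping.

    is_bad is True only when the summed 'bad' scores exceed the summed
    'good' scores; in that case the highest 'bad' category is returned
    even if a 'good' category scored higher on its own.
    """
    bad = {c: s for c, s in scores.items() if c in bad_categories}
    good = {c: s for c, s in scores.items() if c not in bad_categories}
    if bad and sum(bad.values()) > sum(good.values()):
        return max(bad, key=bad.get), True
    if not scores:
        return None, False
    return max(scores, key=scores.get), False


# news is first on its own, but porn + gambling together outweigh it.
print(classify_with_bad_set({"news": 40, "porn": 30, "gambling": 25},
                            {"porn", "gambling"}))
```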

Playing with the categories

Once you run your engine on a number of sites, patterns will start to emerge. Sometimes two main categories appearing together in a certain ratio indicate a third category. For example, a site that has porn/adult and dating as its two main categories is usually an adult dating site (dating with sex), and entertainment and adult together can indicate a gossip site.
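Such patterns could be kept in a small lookup table, as in the sketch below. The table contents just restate the two examples above, the names COMBINATIONS and derive_category are hypothetical, and for simplicity the sketch ignores the ratio between the two scores.

```python
# Hypothetical combination rules based on the two examples above.
COMBINATIONS = {
    frozenset({"adult", "dating"}): "adult dating",
    frozenset({"entertainment", "adult"}): "gossip",
}


def derive_category(main, secondary, combinations=COMBINATIONS):
    """Map a (main, secondary) pair to a derived category if one is known.

    The pair is looked up regardless of order; when no rule applies,
    the main category is returned unchanged.
    """
    if secondary is None:
        return main
    return combinations.get(frozenset({main, secondary}), main)


print(derive_category("dating", "adult"))
```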
