Text & Data Mining by practical means: classifier

Showing posts with label classifier. Show all posts

Tuesday, March 18, 2014

Waiting Time Polynomials Document Classifier - Part III

Abstract
In the post is presented an innovative definition of polynomials associated to waiting time processes (analyzed in the former posts). Such polynomials are here successfully used as document classifier. Comparative tests with SVM show significant accuracy improvements.

Boolean Classification tests based on test set of 8k randomly generated documents composed using an alphabet of three words.

Introduction
To encourage who found quite intricate the formula I presented a couple of posts ago, I'm going to present you an its practical application that might be a good incentive to analyze the formal aspects with more attention :)
What I show to you today is one of several application of such approach: a document classifier having higher accuracy than traditional methods as SVM (trained with gaussian kernel) and Neural Networks back propagated.

Characteristics of the classifier

It's a supervised learning algorithm
It's completely non parametric
It can be used natively to classify multiple classes datasets.

The Algorithm
Let's assume to have a training set composed by two classes of documents: Cl_1, Cl_2.

Learning Phase: Estimation of Geometrically distributed Random Variables.

Define an alphabet of three words {w1,w2,w3} using frequencies criteria or more sophisticated techniques.
For each class of training set:

estimate parameters {p1, p2, p3} of the respective {X1(w1),X2(w2),X3(w3)} geometrically distributed random variables.

Calculate the polynomials associated to {X1(w1),X2(w2),X3(w3)} using:

Testing Phase: document classification

for each document Di of the test set:

Identify the number of occurrences of {w1,w2,w3}: {O_w1,O_w2,O_w3}
Select the polynomial for which:
{O_w1,O_w2,O_w3} =p1^O_w1 p2^O_w2 p3^O_w3.
Calculate the value of the polynomials P_Cl_1, P_Cl_2 using:

{p1, p2, p3} estimated for Cl_1
{p1, p2, p3} estimated for Cl_2

Classify the document:

If (P_Cl_1>P_Cl_2) Di belongs to Cl_1
Else Di belongs to Cl_2

Examples
1. How to select the polynomial

Let's consider the polynomials calculated using the aforementioned formula and assume that in document Di the word w1 is repeated 2 times, the word w2 is repeated 2 times and w3 is repeated 1 time.
Then, In the step 1 of the testing phase we have to choose the polynomial boxed in the below list:

Polynomials generated for O_w1+ O_w2+O_w3=5

2. Same polynomial, different values of p1, p2, p3.

- How many polynomials are generated for O_w1+ O_w2+O_w3 = 25?

The answer is straightforward: 276 that is, all the possible configurations of 3 addends for which the sum =25. In the Encyclopedia of Integer Sequences, there are exciting explanation on such series.

- How does the polynomial change for different values of p1, p2, p3?

It's quite impressive how two polynomials can be different despite the setting of {p1, p2, p3} is almost identical.

Look at this example for which:

On the left chart I plotted a polynomial with p1= 0.33, p2= 0.33, p3=0.34.
On the right chart I plotted a polynomial with p1= 0.4, p2= 0.3, p3=0.3.

In the first case the polynomial takes the maximum value for O_w1=8, O_w2= 8, O_w3 =9 ...not a big surprise!
In the second case the polynomial takes the maximum value for O_w1=15, O_w2= 9, O_w3 =1. In this case the result is more complicated to be explained!

Document classification comparative test

To test the accuracy of my method I performed a comparative test using a randomly generated document set.

The training set was composed by 100 documents (used to train a Gaussian Kernel SVM and to estimate params p_i for my method).
The test set was composed by:

Cl_1: 4000 documents randomly generated using a configuration of {p1, p2, p3}.
Cl_2: 4000 documents randomly generated using a different configuration of {p1, p2, p3}.

The accuracy has been tested using different configurations of {p1, p2, p3}, and considering different size of documents.

Results

First experiment:

just the first 25 words have been considered to estimate the params {p1, p2, p3}, to train the SVM and to test the the accuracy.

Accuracy results considering just the first 25 words.

Second experiment:

Same test as above using the first 35 words of the documents.

Accuracy results considering just the first 35 words.

All the results showed there are referred to accuracy achieved on average using 10 different randomly generated test set and trying to configure SVM params to maximize the accuracy.

Considerations

As you can see the method based on definition of "Waiting Time Polynomials" that I'm proposing, performs significantly better than SVM.

More comparative tests will be shown in the publication I'm writing on such topic.

Further notes

Processes based on waiting time or geometrically distributed random variables are extremely important for risk assessment and risk evaluation.

I'll show you in another post some application of such polynomials in this field.

As usual, stay tuned

cristian

Thursday, September 8, 2011

Classification Step: C 5.0 vs SVM

Finally, after many preambles we have all conditions to classify the documents. So far we have seen:

How to translate a document in a vectors understandable by algorithms;
How to analyze the data set with density analysis;
How to reduce the dimension of the problem (features reduction)

Well, now we have the all ingredients to classify the documents.

To show a couple of approach, I have extracted 4 classes from our REUTERS data set (choosing the criteria “TOPICS”, and considering the easier condition of documents belonging only to one class): “GOLD”,“GRAIN”,”OIL”,”TRADE”.

In this stage I put my self in a “comfortable zone” so, I considered the easiest way to classify the documents, that is, I built the vectors considering just ONE “bag of words”.

Basically I extracted the features set from the entire training set without any distinction of the specific class of the document.

In this approach we are delegating the entire complexity of the task to the classifier!

As we can see below, be lazy never pay!! (…I hope that it makes sense in English… ).

Let’s move on the results.

As I always recommend, never play with one toy! Be always curios and try different approach to solve the problem.

In this spirit for the time being I have done some tests using SVM (in LIBSVM implementation) and C.5 (the free version).

I embedded in Wolfram Mathematica the scripts to run both LIBSVM and C5 (contact me at cristian.mesiano@gmail.com if you desire the source); in this way all tasks (data preparing, features extraction, vector building and so on) can be done using only one interface.

Here you are the accuracy for SVM:

Notice that the training set has been rescaled and weighted with different scores.

Accuracy For SVM Classifier: Best results with Linear Kernel

As you can see, the accuracy measured on the training set is not brilliant! The OIL class is in the 50% of the case mislead. Moreover we have to consider that:

· we are working with an easy data set and with few classes.

· The accuracy has been measured only on the training set (…remember that the first error of the newbie is to focus much more on the training env.). The training set contains only points already observed by the classifier. And analysis done only on it can hide overfitting problems!!

Even trying with different tuning, and with different kernels, the accuracy doesn’t improve.

The situation doesn’t improve with C5: as you can see below, the accuracy is pretty much the same achieved via SVM.

The Decision Tree returned by C5 (plotted with Mathematica).

Accuracy Matrix Provided by C5: The results are pretty similar to the SVM computations

So Do we give up?

NEVER.

In the next posts we will see how improve the accuracy introducing boosting techniques.

…There are thousands of details to consider and to explain like:

linear kernels for SVM, how to measure the accuracy, how to improve the training set and avoid the over fitting…

We will see that in the next posts.

Stay Tuned

Cristian

Thursday, July 7, 2011

SVM classification, an intuitive explanation and some test with LIBSVM

In a previous post I focused on linear classification and on the kernel trick to make linear separable points not linear separable.
...In this post I would explain why this trick is the base to build one of the most powerful classifier: the support vectors machines (SVM).
Consider the following set of point red and blue:

As you can see, you cannot find a straight line to divide blue points from the red points.
...But as mentioned before using the kernel trick we can find a space having greater dimension where these points can be separate by straight line.
Formally,
we are looking for a straight line

xi ·w+b≥+1 for yi =+1 and xi ·w+b≤−1 for yi =−1

where xi represents the points (blue or red), yi =+1 to represent red points and yi =−1for blue points.

w and b represent the unknown parameters of the straight lines.

The above equations can be combined into one set of inequalities: yi(xi ·w+b)−1≥0 ∀i

The SVM works out the classification bringing the points in a "kernel space" where the points can be divided finding a simple hyperplane.

I won't enter in technical details (the theory is based on Lagrange multipliers, kernel space, and Karush-Kuhn-Tucker Conditions) but if you are interested you can find an easy but exhaustive description on "A Tutorial on Support Vector Machines for Pattern Recognition".

I implemented in Mathematica an easy routine to call the library LIBSVM to classify the above points (contact me to obtain the notebook):

To obtain this results (overall accuracy of classification: 99,2%) i trained the SVM using a gaussian kernel.

Another 2D sample (just to highlight the hyperplane I removed the color gradations):

2D points

Hyperplanes founds (in blue) for the above points

One of the interesting aspects of SVM is its vector notation: it allows a complete generalization of the problem: you can use the same algorithm to solve problem in any dimension!

For example in 3D scenario, you have:

3d Points

Hyperplanes found

...another example:

Now we are ready to jump into the real world and attempt to classify text documents!

In the next post I will explain a real case application of document classification using SVM.

As usual: STAY TUNED!!