Simple (naive) document clustering using tf-idf and k-means

When I developed this blog (using my own client-server platform such as web server, back-end, front-end, etc., built from ash/scratch :) ), I simply designed it as a simple "notebook" where I put my ideas or some stuff that I have done. So, initially, there were no categories and no advanced features like post suggestions based on the current post, etc. It was just a bunch of posts sorted by date. The thing is, I usually work on many different domains (robotics, IoT, back-end, front-end platform design, etc.), so my posts are mixed up between different categories. That is fine for me, but it is a real inconvenience for readers who want to follow the category they are interested in. Of course, I could redesign the blog and add the missing features by messing around with the relational database design (I'm using SQLite btw), manually classifying the posts in the back-end, etc. But I'm kind of a lazy person, so I've been thinking of a more automatic solution. How about an automatic document clustering feature based on a data mining approach? Here we go!

Document clustering in a nutshell

Basically, document clustering is the task of automatic document organization, topic extraction, and fast information retrieval or filtering. In our case, we are interested in clustering methods based on document content (aka content-based clustering methods). Such a method typically follows these common steps:

  1. Tokenization: the process of parsing the input document into small units such as words or phrases. Bag-of-words is a commonly used tokenization method, which will be detailed later.
  2. **Stemming** and lemmatization: replaces similar tokens by their base form using stemming and lemmatization dictionaries. For example, the tokens "work", "works", and "working" can be represented by the single token "work". This process depends on the natural language used in the document. I will skip this step in my implementation since it involves complex linguistic processing methods.
  3. Removing stop words (using a dictionary) and punctuation: stop words and punctuation appear many times in a document but are not helpful in identifying its topic. They need to be removed from the token list.
  4. Calculating the weight of each token/term in the document. These weights provide some clue about the topic of each document, and they are helpful in measuring the similarity between documents. TF-IDF, aka **term frequency-inverse document frequency**, is the commonly used method; we will detail it later. This step generates a feature vector for each document.
  5. Clustering: automatically group similar documents into a topic/category based on the features generated in the previous step. Here we will use the k-means clustering method.
  6. Evaluation (optional)

Document tokenization using bag-of-words

A bag-of-words model, aka a vector space model, as its name suggests, represents a document as a bag of its words, disregarding grammar. Usually a bag of words is represented as key-value pairs, where each key is a word and each value is the number of occurrences of that word in the document. For example, given the text:

Basically document clustering is the task of automatic document organization, topic extraction and fast information retrieval or filtering

This text can be represented as a bag-of-words:

{
    "information": 1,
    "task": 1,
    "extraction": 1,
    "organization": 1,
    "retrieval": 1,
    "clustering": 1,
    "document": 2,
    "filtering": 1,
    "fast": 1,
    "basically": 1,
    "automatic": 1,
    "topic": 1
}

Note that, in this example, stop words such as **is**, **of**, **and**, and punctuation such as **","** are removed from the bag.
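
For illustration, here is a minimal Python sketch of this tokenization step; the tiny stop-word list and the regular expression are simplified assumptions, not the exact ones used by the blog back-end:

import re
from collections import Counter

# A tiny illustrative stop-word list; a real implementation would use a full dictionary.
STOP_WORDS = {"is", "of", "and", "the", "or", "a", "an", "in", "to"}

def bag_of_words(text):
    """Tokenize a document into a bag-of-words: {word: occurrences}."""
    # Lowercase and keep only alphabetic tokens (punctuation is dropped here).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words and count the remaining tokens.
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(bag_of_words("Basically document clustering is the task of automatic document organization"))
# Counter({'document': 2, 'basically': 1, 'clustering': 1, 'task': 1, 'automatic': 1, 'organization': 1})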

From now on, when we talk about a document \(d\), we actually refer to its bag-of-words representation.

Extracting feature vectors from documents using TF-IDF

A feature vector of a document represents the characteristics of that document; it provides some clue about the topic of the document. TF-IDF is a commonly used method to extract a feature vector from a document's content; it is based on the term frequency and the inverse document frequency of a term.

Term frequency TF

The term frequency, or term weight, indicates how often a term occurs in each document. There are many ways to calculate the term weight; in the scope of this post, we choose the **logarithmically scaled frequency** method:
$$tf(t,d)=log(1+cnt(t,d))$$

Where \(cnt(t,d)\) is the number of occurrences of the term \(t\) in document \(d\). Theoretically, the bigger \(tf(t,d)\) is, the more important the term is in the document. High-weight terms can be used to identify the topic of a document. However, given a set of documents \(D\), there may be the case where a high-weight term appears across all documents in \(D\), that is, the term is common to all documents and cannot be used as a good keyword to distinguish relevant and non-relevant documents. Therefore, the inverse document frequency is incorporated with the term frequency to deal with this problem.
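
A minimal sketch of this logarithmically scaled term frequency, building on the bag_of_words helper sketched above:

import math

def tf(term, bag):
    """Logarithmically scaled term frequency: log(1 + count of the term in the document)."""
    return math.log(1 + bag.get(term, 0))

doc = bag_of_words("Basically document clustering is the task of automatic document organization")
print(tf("document", doc))  # log(1 + 2) ≈ 1.0986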

Inverse document frequency IDF

While \(tf(t,d)\) measures whether a term is important in an individual document, \(idf(t,D)\), on the other hand, measures whether the term \(t\) is common or rare across all documents in \(D\). It is obtained by:
$$idf(t,D) = log\frac{N}{|\{d\in D:t\in d\}|}$$

Where \(N = |D|\) is the number of documents, and \(|\{d\in D:t\in d\}|\) is the number of documents in \(D\) that contain the term \(t\). The ratio inside the idf's log function is always greater than or equal to 1, thus the idf value is greater than or equal to 0. The closer the \(idf(t,D)\) value is to 0, the more common the term \(t\) is across all documents.
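
A corresponding sketch of the inverse document frequency, where docs is assumed to be the list of bags-of-words of all documents in \(D\):

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    n_containing = sum(1 for bag in docs if term in bag)
    # Guard against terms that appear in no document at all.
    return math.log(len(docs) / n_containing) if n_containing else 0.0

docs = [bag_of_words(t) for t in ("robot arm control", "robot sensor fusion", "web front-end design")]
print(idf("robot", docs))  # log(3/2) ≈ 0.405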

TF-IDF

To reduce the weight of the terms that occur very frequently in the document set \(D\) and increase the weight of terms that occur rarely across all \(d\in D\), we combine the TF and IDF together, as:
$$tfidf(t,d,D)=tf(t,d)*idf(t,D)$$

A high value of \(tfidf(t,d,D)\) indicates that (1) the term \(t\) is important to the document \(d\) and (2) it is rare across all documents in \(D\). Therefore, it can be used as a good keyword to distinguish relevant and non-relevant documents.
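
Combining the two helpers above gives the TF-IDF weight (a sketch under the same assumptions):

def tfidf(term, bag, docs):
    """TF-IDF weight of a term in one document, relative to the whole document set."""
    return tf(term, bag) * idf(term, docs)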

Feature vectors

Let \(T=\{t \mid t\in d, d\in D\}\) be the set of all terms in \(D\); a term \(t\in T\) is unique, that is, \(t\) appears in \(T\) only once. The feature vector \(V_d\) of a document \(d\in D\) is obtained by:
$$V_d=\{tfidf(t,d,D)|t\in T\}$$

Note that all feature vectors have the same size \(|T|\), and for a term \(t\in T\) that occurs only in documents other than \(d\), the corresponding entry \(V_d(t)\) is simply set to 0.
\(V_d\) is the numeric representation of the characteristics of the document \(d\).
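
Putting the pieces together, here is a sketch that builds one TF-IDF feature vector per document over the shared vocabulary \(T\):

def feature_vectors(docs):
    """Build a TF-IDF feature vector (one entry per term in T) for each document."""
    vocabulary = sorted({term for bag in docs for term in bag})  # T, with a fixed order
    return [[tfidf(term, bag, docs) for term in vocabulary] for bag in docs]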

Similarity between documents

Until now, documents can be vectorized as feature vectors. So, how can we measure the similarity of documents using these vectors? Given two documents \(d_1\) and \(d_2\) represented by two feature vectors \(V_1\) and \(V_2\), the similarity between the two documents can be quantified by the distance between the two vectors. Normally, this distance could be the Euclidean distance, but since our vectors live in a high-dimensional positive space, the Euclidean distance is not very accurate in this case. Instead, the cosine similarity is better adapted. This metric measures the cosine of the angle between the two vectors, that is, it compares the orientation of the two vectors instead of their magnitude, as the Euclidean distance does. Two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independently of their magnitude. This metric is often used in document classification problems where each document is characterized by a high-dimensional positive vector. It gives a useful measure of how similar two documents are likely to be in terms of their subject matter:

$$similarity(d_1,d_2)=cos(V_1,V_2)=\frac{\sum_{i=1}^{n}a_ib_i}{\sqrt{\sum_{i=1}^{n}a_i^2}\sqrt{\sum_{i=1}^{n}b_i^2}}$$

With \(n=|T|=|V_1|=|V_2|\), where \(a_i\) and \(b_i\) are the \(i\)-th components of \(V_1\) and \(V_2\) respectively. The \(similarity(d_1,d_2)\) value lies in \([-1,1]\), where -1 indicates complete dissimilarity while 1 indicates that the two documents are the same.
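
A minimal sketch of the cosine similarity between two feature vectors:

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors; 1 means the same orientation."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # Guard against zero vectors (e.g. an empty document).
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0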

Implementation note: based on this metric, I've implemented the post suggestion feature. The idea is that when a user reads a post, the blog back-end calculates the similarity of the current post with all the posts available in the database. It then selects the top \(k\) posts with the highest similarity and displays them to the user as related posts at the end of the current post. Note that for this post, since it is the first post of its topic (data mining) on my blog, the suggestion score may not be good (0.06, close to 0, since there is no other similar post), but if you refer to this post, you can see at the end of the post that the suggestion is pretty accurate.
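
A rough sketch of that suggestion logic, assuming post_ids is the list of post primary keys and vectors the parallel list of their feature vectors (both names are hypothetical, not the actual back-end API):

def suggest_related(current_index, post_ids, vectors, k=3):
    """Return the ids of the k posts most similar to the current post."""
    scores = [(cosine_similarity(vectors[current_index], v), pid)
              for i, (pid, v) in enumerate(zip(post_ids, vectors)) if i != current_index]
    return [pid for _, pid in sorted(scores, reverse=True)[:k]]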

Document clustering using k-means

The document clustering here is very straightforward: we run k-means on the list of feature vectors. The algorithm takes the number of topics/categories as input and tries to assign each document to the nearest topic/category while maximizing the similarity of all documents in each topic/category (instead of minimizing the error as in the original k-means). The algorithm can be described as follows:

Input: (D: feature vectors, n: number of classes)
C: the set of n current centroids
C_old: the set of n previous centroids
Pick an initial centroid for each class: C[i] = init_centroid(), C_old[i] = zero vector, i = 1:n
s = similarity(C, C_old)
clusters: empty set
while (s != 1) do
    foreach D[i] in D do
        Calculate the similarity between the document and all current centroids: L = similarity(D[i], C)
        Get the centroid k with the highest similarity in L: k = argmax(L)
        Assign document i to centroid k: clusters[i] = k
    end
    C_old = C
    foreach i = 1:n do
        Recalculate centroid i: C[i] = mean vector of all feature vectors assigned to class i
    end
    s = similarity(C, C_old)
end
return clusters
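
For reference, here is a minimal runnable Python sketch of this cosine-based k-means loop, built on the helpers above; the random initialization stands in for init_centroid() and is only an assumption (see the discussion below):

import random

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]

def kmeans(vectors, n, max_iter=100):
    """Cluster feature vectors into n classes using cosine similarity."""
    centroids = random.sample(vectors, n)  # stands in for init_centroid()
    clusters = [0] * len(vectors)
    for _ in range(max_iter):
        # Assign each document to the most similar centroid.
        clusters = [max(range(n), key=lambda k: cosine_similarity(v, centroids[k]))
                    for v in vectors]
        # Recompute each centroid as the mean of its assigned feature vectors.
        new_centroids = []
        for k in range(n):
            members = [v for v, c in zip(vectors, clusters) if c == k]
            new_centroids.append(mean_vector(members) if members else centroids[k])
        # Stop when the centroids no longer move (similarity ~1 with the previous ones).
        if all(cosine_similarity(a, b) > 0.9999 for a, b in zip(centroids, new_centroids)):
            break
        centroids = new_centroids
    return clusters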

At each step, the algorithm assigns each document to the nearest cluster (centroid). It converges when there is no change in any cluster, hence the similarity between C and C_old is 1. Since the underlying problem is computationally difficult (NP-hard), the algorithm is a heuristic: there is no guarantee that it will converge to the global optimum, and the result may depend on several factors:

  1. The input number of classes: a wrong number of classes may lead to the case where similar documents are assigned to different classes.
  2. The initial centroids: good initial centroids (chosen via the function init_centroid()) make the algorithm converge faster and the result more accurate.

As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. To demonstrate the impact of the initial conditions, I've run two tests on all posts of my blog using the proposed method, with three classes.

In the first test, the centroids are initialized as random vectors. The algorithm converges after two steps and the result is as follows:

[ [17,9,14,3], [2,12,11,5], [13,8,16,4] ]

Where each list is a cluster and each number is the id (primary key) of a post on my blog. The result, in this case, is not accurate, since the posts 3,5, the posts 14,16, and the posts 11,12,13 should be in three distinct categories/groups.

In the second test, the initial centroids are manually set to the feature vectors of three posts belonging to three different categories. The algorithm converges after two steps with the result:

[ [9,8,3,2,5,4] , [13,12,11] , [14,16] ]

Which is pretty accurate with regard to the content of each post in each group. So, the entire blog can be classified into three categories with the posts clustered as in this result. Since there are not so many posts on my blog yet, the first test may not have enough input data to converge to a good solution. I will rerun that test as the number of posts on the blog increases. For now, I will just stick to the second solution (i.e. manually selecting the initial centroids).
