When I developed this blog (using my own client-server platform, i.e. the web server, back-end, front-end, etc., built from scratch :) ), I simply designed it as a plain "notebook" where I put my ideas or some stuff that I have done. So, initially, there were no categories and no advanced features like post suggestions based on the current post. It was just a bunch of posts sorted by date. The thing is, I usually work in many different domains (robotics, IoT, back-end and front-end platform design, etc.), so my posts are mixed up between different categories. That is fine for me, but it is a real inconvenience for readers who want to follow the category they are interested in on the blog. Of course, I could redesign the blog and add the missing features by messing around with the relational database design (I'm using SQLite, btw), manually classifying the posts in the back-end, etc. But I'm kind of a lazy person, so I've been thinking of a more automatic solution. How about an automatic document clustering feature based on a data mining approach? Here we go!
Basically, document clustering is the task of automatic document organization, topic extraction, and fast information retrieval or filtering. In our case, we are interested in a clustering method based on document content (aka content-based clustering). Such a method typically follows these common steps:
1. Preprocess each document into a bag-of-words representation.
2. Extract a numeric feature vector from each document (here, using TF-IDF).
3. Define a similarity measure between feature vectors (here, the cosine similarity).
4. Group the documents with a clustering algorithm (here, k-means).
A bag-of-words model, aka a vector space model, as its name suggests, represents a document as a bag of its words, disregarding grammar. Usually, a bag of words is represented as key-value pairs, where each key is a word and each value is the number of occurrences of that word in the document. For example, given the text:
Basically document clustering is the task of automatic document organization, topic extraction and fast information retrieval or filtering
This text can be represented as a bag-of-words:
{
"information": 1,
"task": 1,
"extraction": 1,
"organization": 1,
"retrieval": 1,
"clustering": 1,
"document": 2,
"filtering": 1,
"fast": 1,
"basically": 1,
"automatic": 1,
"topic": 1
}
Note that, in this example, the stop words such as **is, of, and** and the punctuation such as **","** are removed from the bag.
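To make this concrete, here is a minimal Python sketch of how such a bag of words could be built. The tokenizer and the tiny `STOP_WORDS` list below are simplifications for illustration, not the exact preprocessing the blog back-end uses:

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real implementation would use a much larger one.
STOP_WORDS = {"is", "the", "of", "and", "or", "a", "an", "in", "to"}

def bag_of_words(text):
    """Tokenize a document and count term occurrences, ignoring stop words."""
    # Lowercase and keep only alphabetic tokens, so punctuation is dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

text = ("Basically document clustering is the task of automatic document "
        "organization, topic extraction and fast information retrieval or filtering")
print(bag_of_words(text))  # Counter({'document': 2, 'basically': 1, ...})
```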
From now on, when we talk about a document \(d\), we actually refer to its bag-of-words representation.
A feature vector of a document represents the characteristics of that document; it provides some clues about the topic of the document. TF-IDF is a commonly used method to extract a feature vector from a document's content; it is based on the term frequency and the inverse document frequency of a term.
The term frequency, or term weight, indicates how often a term occurs in a document. There are many ways to calculate the term weight; in the scope of this post, we choose the **logarithmically scaled frequency** method:
$$tf(t,d)=\log(1+cnt(t,d))$$
Where \(cnt(t,d)\) is the number of occurrences of the term \(t\) in document \(d\). Theoretically, the bigger \(tf(t,d)\) is, the more important the term is in the document. High-weight terms can be used to identify the topic of a document. However, given a set of documents \(D\), there may be cases where a high-weight term appears across all documents in \(D\); that is, the term is common to all documents and cannot be used as a good keyword to distinguish relevant from non-relevant documents. Therefore, the inverse document frequency is combined with the term frequency to deal with this problem.
While \(tf(t,d)\) measures whether a term is important in an individual document, \(idf(t,D)\), on the other hand, measures whether the term \(t\) is common or rare across all documents in \(D\). It is obtained by:
$$idf(t,D) = \log\frac{N}{|\{d\in D:t\in d\}|}$$
Where \(N = |D|\) is the number of documents and \(|\{d\in D:t\in d\}|\) is the number of documents in \(D\) that contain the term \(t\). The ratio inside the idf's log function is always greater than or equal to 1, so the idf value is greater than or equal to 0. The closer the \(idf(t,D)\) value is to 0, the more common the term \(t\) is across all documents.
To reduce the weight of terms that occur very frequently in the document set \(D\) and to increase the weight of terms that occur rarely across all \(d\in D\), we combine the TF and the IDF as:
$$tfidf(t,d,D)=tf(t,d)\cdot idf(t,D)$$
A high value of \(tfidf(t,d,D)\) indicates that (1) the term \(t\) is important to the document \(d\) and (2) it is rare across all documents in \(D\). Therefore, it can be used as a good keyword to distinguish relevant from non-relevant documents.
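As a quick worked example with made-up numbers: the term *document* occurs twice in the example text above, so, using a base-10 logarithm, \(tf = \log(1+2) \approx 0.48\). If we further assume a set of \(N = 20\) documents, 10 of which contain the term, then \(idf = \log(20/10) \approx 0.30\), and:
$$tfidf(\text{document},d,D) \approx 0.48 \times 0.30 \approx 0.14$$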
Let \(T=\{t\mid t\in d, d\in D\}\) be the set of all terms in \(D\); each term \(t\in T\) is unique, that is, \(t\) appears in \(T\) only once. A feature vector \(V_d\) of a document \(d\in D\) is obtained by:
$$V_d=\{tfidf(t,d,D)|t\in T\}$$
Note that all feature vectors have the same size \(|T|\), and for a term \(t\in T\) that does not occur in \(d\), the corresponding entry \(V_d(t)\) is simply set to 0.
\(V_d\) is the numeric representation of the characteristics of the document \(d\).
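Putting the pieces together, here is a minimal sketch of the feature-vector extraction; it reuses the illustrative `bag_of_words` helper above and a base-10 logarithm, and it is only an approximation of what runs behind this blog:

```python
from math import log10

def tfidf_vectors(bags):
    """Compute a TF-IDF feature vector for every document in the set D.

    `bags` is a list of bag-of-words mappings (term -> occurrence count).
    Returns the list of feature vectors (all of size |T|) and the term list T.
    """
    N = len(bags)
    # T: all distinct terms appearing in the document set, in a fixed order.
    terms = sorted({t for bag in bags for t in bag})
    # Document frequency: |{d in D : t in d}| for each term t.
    df = {t: sum(1 for bag in bags if t in bag) for t in terms}
    vectors = []
    for bag in bags:
        # tf(t,d) = log(1 + cnt(t,d)), idf(t,D) = log(N / df(t));
        # a term absent from the document gives log(1 + 0) = 0, so its entry is 0.
        vectors.append([log10(1 + bag.get(t, 0)) * log10(N / df[t]) for t in terms])
    return vectors, terms
```

Here, `bags` would simply be the bag-of-words representations of all the posts pulled from the database.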
Now that documents can be vectorized as feature vectors, how can we measure the similarity of documents using these vectors? Given two documents \(d_1\) and \(d_2\) represented by two feature vectors \(V_1\) and \(V_2\), the similarity between the two documents can be quantified by the distance between the two vectors. Normally, this distance could be the Euclidean distance, but since our vectors live in a high-dimensional positive space, the Euclidean distance is not well suited here. Instead, the cosine similarity is more adapted. This metric measures the cosine of the angle between the two vectors; that is, it compares the orientation of the two vectors rather than their magnitude, as the Euclidean distance does. Two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two diametrically opposed vectors have a similarity of -1, independent of their magnitude. This metric is often used in document classification problems, where each document is characterized by a high-dimensional positive vector. It gives a useful measure of how similar two documents are likely to be in terms of their subject matter:
$$similarity(d_1,d_2)=cos(V_1,V_2)=\frac{\sum_{i=1}^{n}a_ib_i}{\sqrt{\sum_{i=1}^{n}a_i^2}\sqrt{\sum_{i=1}^{n}b_i^2}}$$
With \(n=|T|=|V_1|=|V_2|\), \(a_i\in V_1\) and \(b_i\in V_2\). The \(similarity(d_1,d_2)\) value lies in \([-1,1]\), where -1 indicates complete dissimilarity and 1 indicates that the two documents are essentially the same; in fact, since TF-IDF vectors are non-negative, the value stays in \([0,1]\) in our case.
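In code, the cosine similarity is just a few sums over the two vectors (again a sketch, operating on the plain Python lists produced above):

```python
from math import sqrt

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an all-zero vector has no meaningful orientation
    return dot / (norm1 * norm2)
```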
Implementation note: based on this metric, I've implemented the post suggestion feature. The idea is that, when a user reads a post, the blog back-end calculates the similarity between the current post and all the posts available in the database. It then selects the top \(k\) posts with the highest similarity and displays them to the user as related posts at the end of the current post. Note that for this post, since it is the first post on its topic (data mining) on my blog, the suggestion score may not be good (around 0.06, close to 0, since there is no other similar post yet), but if you refer to this post, you can see at the end of the post that the suggestions are pretty accurate.
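The suggestion logic roughly amounts to the sketch below; the real back-end works against the SQLite database, so the function and its names here are only illustrative:

```python
def suggest_posts(current, vectors, k=3):
    """Return the indices of the k posts most similar to the current one."""
    scores = [(cosine_similarity(vectors[current], v), i)
              for i, v in enumerate(vectors) if i != current]
    scores.sort(reverse=True)  # highest similarity first
    return [i for _, i in scores[:k]]
```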
The document clustering here is very straightforward: we run k-means on the list of feature vectors. The algorithm takes the number of topics/categories as input and tries to assign each document to the nearest topic/category while maximizing the similarity of the documents within each topic/category (instead of minimizing the error as in the original k-means). The algorithm can be described as follows:
Input: (D: feature vectors, n: number of classes)
C: the set of n current centroids
C_old: the set of n previous centroids
clusters: empty map (document index -> class index)
Pick an initial centroid for each class: C[i] = init_centroid(), C_old[i] = zero vector, i = 1:n
s = similarity(C, C_old)
While (s != 1) do
  foreach D[i] in D do
    Calculate the similarity between the document and every current centroid: L = similarity(D[i], C)
    Get the centroid k with the highest similarity: k = argmax(L)
    Assign document i to centroid k: clusters[i] = k
  end
  C_old = C
  foreach i = 1:n do
    Recalculate centroid i: C[i] = mean vector of all feature vectors assigned to class i
  end
  s = similarity(C, C_old)
end
return clusters
The algorithm tries to assign each document to the nearest cluster (centroid) at each step. It converges when there is no change in any cluster, hence the similarity between C and C_old is 1. Since the underlying problem is computationally difficult (NP-hard), the algorithm is a heuristic: there is no guarantee that it will converge to the global optimum, and the result may depend on several factors, notably the choice of the initial centroids and the number of classes.
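For completeness, a minimal Python sketch of this clustering step could look like the following. It reuses the `cosine_similarity` helper above, takes the initial centroids as caller-chosen document indices (one possible `init_centroid()` strategy; random vectors, as in the first test below, would work too), and treats the clustering as converged when every centroid is unchanged:

```python
def kmeans_cosine(vectors, init_indices, max_iter=100):
    """Cluster feature vectors by cosine similarity (a spherical k-means variant).

    `init_indices` selects the documents whose vectors serve as initial centroids.
    Returns clusters, where clusters[i] is the class assigned to document i.
    """
    n = len(init_indices)
    centroids = [list(vectors[i]) for i in init_indices]
    clusters = [0] * len(vectors)
    for _ in range(max_iter):
        # Assign every document to its most similar centroid.
        for i, v in enumerate(vectors):
            sims = [cosine_similarity(v, c) for c in centroids]
            clusters[i] = max(range(n), key=lambda k: sims[k])
        # Recompute each centroid as the mean of its assigned feature vectors.
        old = centroids
        centroids = []
        for k in range(n):
            members = [vectors[i] for i, c in enumerate(clusters) if c == k]
            if members:
                centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                centroids.append(old[k])  # keep the centroid of an empty cluster
        # Converged when each centroid's similarity to its previous value is 1.
        if all(cosine_similarity(c, o) > 1 - 1e-9 for c, o in zip(centroids, old)):
            break
    return clusters
```

Called with three initial indices pointing at posts from three different topics, this corresponds to the second test described below.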
As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. To demonstrate the impact of the initial conditions, I've run two tests on all the posts of my blog using the proposed method, with three classes.
In the first test, the centroids are initialized as random vectors; the algorithm converges after 2 steps and the result is as follows:
[ [17,9,14,3], [2,12,11,5], [13,8,16,4] ]
Where each list is a cluster and each number is the id (primary key) of a post on my blog. The result in this case is not accurate, since posts 3 and 5, posts 14 and 16, and posts 11, 12 and 13 should form three distinct categories/groups.
In the second test, the initial centroids are manually set to the feature vectors of three posts belonging to three different categories; the algorithm converges after two steps with the result:
[ [9,8,3,2,5,4] , [13,12,11] , [14,16] ]
This is pretty accurate regarding the content of each post in each group. So, the entire blog can be classified into three categories, with the posts clustered as in this result. Since there are not that many posts on my blog yet, the first test may simply not have enough input data to converge to a good solution; I will rerun it when the number of posts on the blog increases. For now, I'll just stick with the second approach (i.e. manually selecting the initial centroids).