17/02/2021

# Data visualization: global view of blog posts relationship

As mentioned in an earlier post, where I used tf-idf to detect the similarity between two blog posts, my blog is just a bunch of posts sorted by date: no categories and no fancy features such as user interest tracking or post ranking. I usually work in many different domains (robotics, IoT, backend, frontend platform design, etc.), so my posts are a mix of these domains. This can make it difficult for readers who want to follow a particular topic on my blog.

So what is a good strategy for navigating between posts on a blog?

Currently, when reading a post, a reader can follow up on related content via the suggestions at the end of the post, which are based on the tf-idf method mentioned above. Basically, this feature measures the similarity between the current post and all other posts on the blog. The 5 posts with the highest similarity scores are selected as relevant content, as shown in the figure below.
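The selection step can be sketched in a few lines of JavaScript. This is only an illustration, not the blog's actual code: the `topRelated` helper, the shape of the `scores` object and the post ids are all hypothetical.

```javascript
// Hypothetical helper: pick the k posts most similar to the current one.
// `scores` maps a post id to its similarity with the current post.
function topRelated(scores, currentId, k = 5) {
  return Object.entries(scores)
    .filter(([id]) => id !== currentId)   // never suggest the post itself
    .sort(([, a], [, b]) => b - a)        // highest similarity first
    .slice(0, k)
    .map(([id, score]) => ({ id, score }));
}

// Made-up scores for a post with the id "antos".
const related = topRelated(
  { antos: 1.0, ros: 0.12, jarvis: 0.41, lua: 0.05, wvnc: 0.33 },
  "antos",
  3
);
console.log(related.map((p) => p.id)); // [ 'jarvis', 'wvnc', 'ros' ]
```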

Another option is to browse posts by tag, which is a sort of category. Tags are poorly handled on my blog, so this method is not very efficient.

Both options offer only a strictly local view of a blog topic. How about a global view of the blog?

When I'm interested in someone's blog, I tend to build a global view of its content:

• What are the main domains/categories of the blog?
• What is the relationship between these domains/categories?
• Which domains/categories dominate the blog compared to the others?
• How can I find an interesting post by navigating between posts?

The blog's organization should make it easy for readers to find the answers to these questions. On a dedicated blogging system or Content Management System (CMS) such as WordPress, there are tons of plugins that make this obvious to readers. Sadly, on a DIY handmade blog like mine, these fancy features are not available.

## Theory

So, to offer a global view of my blog, I've been thinking of a graph-based approach using data visualization, where nodes are blog posts and each edge represents the relationship between the two nodes it connects. Basically, such a graph can be defined as:

$$G=(V,E)$$

Where $$V=\{v_i|i=1:n\}$$ is a collection of posts, and $$E=\{e_i|i=1:m\}$$ is a collection of edges where each edge is defined as:

$$e_{jk}=(v_j,v_k,s_{jk}|s_{jk}>\alpha)$$

This can be interpreted as: an edge $$e_{jk}$$ connects two nodes $$v_j$$ and $$v_k$$ if and only if the similarity score $$s_{jk}$$ of these nodes is greater than a predefined constant $$\alpha$$. In short, two posts are considered related if their similarity score is greater than the threshold $$\alpha$$.
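As a rough sketch of this definition, the edge list can be built by thresholding a similarity matrix. The `buildEdges` name, the `{ source, target, score }` edge shape and the threshold value 0.2 are assumptions made for the example, not values taken from the blog.

```javascript
// Build the edge list E from a symmetric similarity matrix `sim`,
// where sim[j][k] is the score s_jk between posts j and k.
// `alpha` is the threshold; the 0.2 used below is an illustrative value only.
function buildEdges(sim, alpha) {
  const edges = [];
  for (let j = 0; j < sim.length; j++) {
    // k starts at j + 1: the graph is undirected and has no self-loops
    for (let k = j + 1; k < sim.length; k++) {
      if (sim[j][k] > alpha) {
        edges.push({ source: j, target: k, score: sim[j][k] });
      }
    }
  }
  return edges;
}

const sim = [
  [1.0, 0.6, 0.1],
  [0.6, 1.0, 0.3],
  [0.1, 0.3, 1.0],
];
console.log(buildEdges(sim, 0.2));
// two edges survive: (0, 1) with score 0.6 and (1, 2) with score 0.3
```

Raising the threshold gives a sparser graph with fewer, stronger links; lowering it connects more posts but makes the clusters harder to read.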

The similarity scores $$s_{jk}$$ are calculated using the cosine similarity metric presented in the previous tf-idf post.
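For reference, here is a minimal sketch of cosine similarity between two tf-idf vectors. It assumes each post is stored as a sparse `{ term: weight }` object, which may differ from the representation used in the tf-idf post.

```javascript
// Cosine similarity between two sparse tf-idf vectors,
// each represented as a plain object { term: weight }.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const term of Object.keys(a)) {
    // only terms present in both posts contribute to the dot product
    if (Object.prototype.hasOwnProperty.call(b, term)) {
      dot += a[term] * b[term];
    }
  }
  const norm = (v) =>
    Math.sqrt(Object.values(v).reduce((sum, w) => sum + w * w, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom; // an empty vector is similar to nothing
}

// Two toy posts that share only the term "robot".
console.log(cosineSimilarity({ robot: 1, ros: 1 }, { robot: 1 })); // about 0.707
```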

The two connected nodes $$v_j$$ and $$v_k$$ attract each other via a virtual force $$f_{jk}=fn(s_{jk})$$ which is a function of $$s_{jk}$$. The greater the similarity $$s_{jk}$$, the stronger the force $$f_{jk}$$.

With the force $$f_{jk}$$, nodes that share a similar topic tend to group themselves into clusters of strongly related posts.
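The function $$fn$$ is left open above. One possible choice, shown here purely as an assumption, is a linear ramp between the threshold and the maximum score. The `attraction` name and the `alpha`, `minForce` and `maxForce` defaults are made up for the example.

```javascript
// One possible choice for f_jk = fn(s_jk): a linear ramp from the
// threshold alpha (weakest kept edge) up to 1 (identical posts).
// alpha, minForce and maxForce are illustrative, not the blog's values.
function attraction(score, alpha = 0.2, minForce = 0.1, maxForce = 1.0) {
  const t = (score - alpha) / (1 - alpha);      // 0 at the threshold, 1 at s = 1
  const clamped = Math.min(1, Math.max(0, t));  // guard against out-of-range scores
  return minForce + clamped * (maxForce - minForce);
}

console.log(attraction(0.4) < attraction(0.8)); // true: more similar, stronger pull
```

Any monotonically increasing function would satisfy "the greater the similarity, the stronger the force"; the linear ramp is just the simplest one to reason about.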

Lastly, the radius $$r_i$$ of a node $$v_i$$ is defined as a function of the number of edges connected to that node:

$$r_i=g(|\{e_{jk}\in E \mid j=i \lor k=i\}|)$$

This means that a node with more connected edges than other nodes is drawn bigger. The bigger the node, the more important the topic it represents. The topic of a big node tends to be the key topic of the group of nodes it belongs to.
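A small sketch of this sizing rule is shown below. The function $$g$$ is not specified, so the square-root form and the `base` and `scale` constants are assumptions, and edges are assumed to be `{ source, target }` objects holding node indices.

```javascript
// Count how many edges touch each node (its degree).
function nodeDegrees(nodeCount, edges) {
  const degree = new Array(nodeCount).fill(0);
  for (const { source, target } of edges) {
    degree[source] += 1;
    degree[target] += 1;
  }
  return degree;
}

// Assumed form of g: a square root, so heavily connected nodes grow
// more slowly and do not dwarf the rest of the graph.
function radius(degree, base = 4, scale = 3) {
  return base + scale * Math.sqrt(degree);
}

const degrees = nodeDegrees(3, [
  { source: 0, target: 1 },
  { source: 1, target: 2 },
]);
console.log(degrees);             // [ 1, 2, 1 ]
console.log(degrees.map((d) => radius(d)));
```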

## Implementation

This graph design can be implemented and visualized as a force-directed graph. The entire implementation is done in JavaScript with the help of the D3.js visualization library.
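To give an intuition of what a force-directed layout does, here is a deliberately simplified, dependency-free sketch. It keeps only the link (spring) part and leaves out the node repulsion and centering forces that a full simulation normally adds, so it is not what D3.js does internally. The `relax` function and its constants (30, 120, 0.5) are assumptions.

```javascript
// Spring-only relaxation: each edge pulls or pushes its two endpoints
// toward a rest length that shrinks as the similarity score grows.
// The constants (30, 120, 0.5) are illustrative assumptions.
function relax(nodes, edges, iterations = 300) {
  for (let it = 0; it < iterations; it++) {
    for (const { source, target, score } of edges) {
      const a = nodes[source];
      const b = nodes[target];
      const dx = b.x - a.x;
      const dy = b.y - a.y;
      const dist = Math.hypot(dx, dy) || 1e-6;  // avoid dividing by zero
      const rest = 30 + 120 * (1 - score);      // similar posts sit closer
      // Fraction of the gap to close this step; larger for higher scores.
      const shift = ((dist - rest) / dist) * 0.5 * score;
      // Split the correction evenly between both endpoints.
      a.x += dx * shift * 0.5;
      a.y += dy * shift * 0.5;
      b.x -= dx * shift * 0.5;
      b.y -= dy * shift * 0.5;
    }
  }
  return nodes;
}

const layout = relax(
  [{ x: 0, y: 0 }, { x: 200, y: 0 }, { x: 200, y: 200 }],
  [
    { source: 0, target: 1, score: 0.9 },
    { source: 1, target: 2, score: 0.3 },
  ]
);
const gap = (p, q) => Math.hypot(p.x - q.x, p.y - q.y);
console.log(gap(layout[0], layout[1]) < gap(layout[1], layout[2])); // true
```

After relaxation, the two posts with the higher score end up closer together than the weakly related pair, which is the clustering effect described in the theory section.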

The image below shows the global view of my blog as a force-directed graph:

In the graph, the similarity of two connected nodes is represented by the edge thickness and the edge length. A thick and/or short edge indicates a strong similarity between the two connected nodes. Posts can be explored by hovering the mouse over a node and following its relationships (edges) to find a topic of interest.
Nodes that have the same color share the same number of connected edges.
The short description of a post can be read by clicking on the node representing that post.
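These visual encodings can be expressed as small mapping functions. The sketch below is an assumption about how such mappings might look. The function names, pixel constants and hue step are made up and are not the values used on the blog.

```javascript
// Illustrative visual encodings; all constants are assumptions.
const edgeWidth = (score) => 1 + 6 * score;            // thicker line for a higher score
const edgeLength = (score) => 30 + 120 * (1 - score);  // shorter link for a higher score

// Nodes with the same degree get the same color. The hue advances by
// 47 degrees per extra edge, so colors only repeat for degrees that
// differ by a multiple of 360.
const nodeColor = (degree) => `hsl(${(degree * 47) % 360}, 70%, 50%)`;

console.log(edgeWidth(0.9) > edgeWidth(0.3));   // true
console.log(edgeLength(0.9) < edgeLength(0.3)); // true
console.log(nodeColor(2) === nodeColor(2));     // true
```

With D3.js, a function like `edgeLength` could be passed to the link force through `forceLink().distance(...)`, and `edgeWidth` and `nodeColor` could drive the `stroke-width` and `fill` attributes of the drawn elements.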

As stated above, with the force $$f_{jk}$$, we can clearly see similar posts grouping themselves into small clusters, as shown in the following figure:

Basically, according to this visualization, I'm frequently working on the following topics, which I find very accurate:

• AntOS, my virtual web desktop
• ROS, system middleware and algorithms
• Embedded systems
• Jarvis, my DIY robot