TechVista-360: Introduction to Clustering using R

We have all heard a lot about Big Data! Want to try some basic clustering techniques on it?

Here we go. Before getting into the code, let's familiarize ourselves with some basic terms and the platforms we will use.

1. What data are we studying?
For this purpose we will study Twitter. There are well-equipped API clients for studying tweets in a variety of languages, e.g. twitteR in R (a statistical tool) and tweepy in Python.

2. Tools we will be using: R
Get yourself quickly familiarized with R by following these links:

http://www.r-bloggers.com/how-to-learn-r-a-flow-chart/
http://tryr.codeschool.com/

While searching for relevant links, I also found this course, which you may want to attend:

https://www.coursera.org/course/rprog

3. What do we aim to do?
Collect the list of friends and followers for any one person's Twitter account.
We will make use of the user class of twitteR for this.
It has the following fields:

name: Name of the user
screenName: Screen name of the user
id: ID value for this user
lastStatus: Last status update for the user
description: User’s description
statusesCount: Number of status updates this user has had
followersCount: Number of followers for this user
favoritesCount: Number of favorites for this user
friendsCount: Number of followees for this user
url: A URL associated with this user
created: When this user was created
protected: Whether or not this user is protected
verified: Whether or not this user is verified
location: Location of the user
listedCount: The number of times this user appears in public lists
followRequestSent: If authenticated via OAuth, will be TRUE if you’ve sent a friend request to this user
profileImageUrl: URL of the user’s profile image, if one exists

and several methods. To see everything the twitteR package offers, check out: http://cran.r-project.org/web/packages/twitteR/twitteR.pdf
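To give a feel for how the user class is read, here is a minimal sketch. It assumes the twitteR package is installed and that you have your own Twitter API credentials; `fetch_user_summary` is a hypothetical helper name made up for this post, not something twitteR provides.

```r
# Sketch only: needs the twitteR package and your own Twitter API credentials.
# fetch_user_summary is a hypothetical helper, not part of twitteR.
fetch_user_summary <- function(screen_name) {
  if (!requireNamespace("twitteR", quietly = TRUE)) {
    stop("Please install the twitteR package first")
  }
  user <- twitteR::getUser(screen_name)
  # Fields of the user class are read with `$`
  data.frame(
    screenName     = user$screenName,
    friendsCount   = user$friendsCount,
    followersCount = user$followersCount,
    statusesCount  = user$statusesCount,
    stringsAsFactors = FALSE
  )
}

# Usage, after authenticating (the four arguments are placeholders for your own keys):
# twitteR::setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
# fetch_user_summary("some_screen_name")
```

The helper only pulls out the three counts used later as features; any of the other fields listed above can be read the same way.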

So, what we intend to do with this information is quite simple (and not that useful :P ), but it may serve as a good introduction to clustering.

We accumulate a set of Twitter users by fetching one user's friends and followers as user class objects, and we take the union of the two sets. Then we plot friends against followers, where each dot is a 2D/3D point made up of a user's friends count, followers count and/or statuses count.
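The union step can be sketched with two small toy data frames. They stand in for what you would get by converting the fetched friends and followers to data frames (for example with twitteR's twListToDF); every id and count below is invented for illustration.

```r
# Toy stand-ins for the friends and followers of one account;
# all ids and counts are made up.
friends <- data.frame(
  id             = c("1", "2", "3"),
  friendsCount   = c(120, 300, 45),
  followersCount = c(80, 1500, 60),
  statusesCount  = c(900, 4200, 150),
  stringsAsFactors = FALSE
)
followers <- data.frame(
  id             = c("3", "4"),
  friendsCount   = c(45, 2000),
  followersCount = c(60, 35),
  statusesCount  = c(150, 12),
  stringsAsFactors = FALSE
)

# Union: stack both sets and keep each user id only once
users <- rbind(friends, followers)
users <- users[!duplicated(users$id), ]

# One dot per user: friends count against followers count
if (interactive()) {
  plot(users$friendsCount, users$followersCount,
       xlab = "Friends count", ylab = "Followers count")
}
```

User "3" appears in both sets here, which is the case the de-duplication is meant to handle: someone who both follows and is followed by the account should be counted once.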

We use R's built-in kmeans function (part of the stats package, so it works in RStudio out of the box) to cluster this data.
Wondering what K-means is? Check: https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm
or any of the several explanatory videos available online.

So basically, here we have a feature vector of 3 features per user:
1. friends count
2. followers count
3. statuses count

We cluster the data points together based on how similar they are with respect to these 3 features.

***Check out which similarity measure kmeans uses internally in R and answer in the comments. Hint: it is not cosine similarity; kmeans minimizes the within-cluster sum of squared Euclidean distances. ***
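One way to check this for yourself is a small experiment on simulated data (the numbers below are random, not real Twitter counts): compute the squared Euclidean distance of every point to the centre of its own cluster by hand and compare the total with the objective value that kmeans reports.

```r
set.seed(42)
# Simulated feature matrix: 200 users x 3 features (values are made up)
x <- cbind(
  friendsCount   = as.numeric(rpois(200, 300)),
  followersCount = as.numeric(rpois(200, 1000)),
  statusesCount  = as.numeric(rpois(200, 5000))
)

km <- kmeans(x, centers = 5, nstart = 10)

# Squared Euclidean distance from every point to the centre of its own cluster
manual_wss <- sum((x - km$centers[km$cluster, ])^2)

# Compare with the total within-cluster sum of squares reported by kmeans
isTRUE(all.equal(manual_wss, km$tot.withinss))
```

If the two numbers agree, the quantity kmeans minimizes is the squared Euclidean distance to the cluster centres. A side effect worth knowing: a feature with a much larger range (followers count, typically) will dominate the distance unless the features are rescaled first, for example with scale().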

These clusters can be further colored to mark density.

Follow the code given in this link to try it yourself:

http://rstudio-pubs-static.s3.amazonaws.com/5983_af66eca6775f4528a72b8e243a6ecf2d.html

I tweaked that code a bit to run kmeans clustering with "modi" as the user. The first run used 2 features (friends count and followers count); the second used 3 features (friends count, followers count and statuses count). This is what we got:

[Figure: clusters using 2 features]

[Figure: clusters using 3 features]

PS: The 3-feature graph is still 2D, but the clusters have been redefined by the extra feature. I set k=5 here, where k is the number of clusters we want the kmeans algorithm to generate.
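For anyone who wants to reproduce the comparison without Twitter access, here is a rough sketch on simulated counts. The data is randomly generated and this is not the exact code from the link above; only k=5 and the choice of features are taken from this post.

```r
set.seed(1)
# Simulated counts for 300 users (made-up values, not real Twitter data)
users <- data.frame(
  friendsCount   = round(rlnorm(300, meanlog = 5, sdlog = 1)),
  followersCount = round(rlnorm(300, meanlog = 6, sdlog = 1.5)),
  statusesCount  = round(rlnorm(300, meanlog = 7, sdlog = 1))
)

k <- 5  # number of clusters, as in the post

# Cluster on 2 features, then on all 3
km2 <- kmeans(users[, c("friendsCount", "followersCount")], centers = k, nstart = 10)
km3 <- kmeans(users, centers = k, nstart = 10)

# Cross-tabulate the two cluster assignments to see how users move
table(two_features = km2$cluster, three_features = km3$cluster)

# Same 2D axes in both plots; only the colouring (cluster membership) changes
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  plot(users$friendsCount, users$followersCount, col = km2$cluster, pch = 19,
       xlab = "Friends count", ylab = "Followers count", main = "2 features")
  plot(users$friendsCount, users$followersCount, col = km3$cluster, pch = 19,
       xlab = "Friends count", ylab = "Followers count", main = "3 features")
  par(op)
}
```

The cluster numbers themselves are arbitrary labels, so cluster 1 in the first run need not correspond to cluster 1 in the second; the cross-table is there to show which groups actually line up.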

So that's it for now. I hope you found it interesting and useful.
There is lots of exciting, relevant stuff online, so keep checking!

See you in the next post. Till then,
Keep Hacking! :)

Tuesday, 17 March 2015