We have all heard a lot about Big Data! Want to try some basic clustering techniques on it?
Here we go. Before diving into the code, let's familiarize ourselves with some basic terms and platforms.
1. What data are we studying?
For this purpose we will study Twitter. There are well-equipped API wrappers for working with tweets in a variety of languages, e.g. twitteR in R (a statistical tool) and Tweepy in Python.
2. Tools we will be using: R
Get yourself quickly familiarized with R by following these links:
http://www.r-bloggers.com/how-to-learn-r-a-flow-chart/
http://tryr.codeschool.com/
While searching for relevant links, I also found this course, which you may want to take:
https://www.coursera.org/course/rprog
3. What do we aim to do?
Collect the list of friends and followers for any one person's Twitter account.
We will use the user class of twitteR for this.
It has the following fields:
- name: Name of the user
- screenName: Screen name of the user
- id: ID value for this user
- lastStatus: Last status update for the user
- description: User’s description
- statusesCount: Number of status updates this user has had
- followersCount: Number of followers for this user
- favoritesCount: Number of favorites for this user
- friendsCount: Number of followees for this user
- url: A URL associated with this user
- created: When this user was created
- protected: Whether or not this user is protected
- verified: Whether or not this user is verified
- location: Location of the user
- listedCount: The number of times this user appears in public lists
- followRequestSent: If authenticated via OAuth, will be TRUE if you’ve sent a friend request to this user
- profileImageUrl: URL of the user’s profile image, if one exists
So, what we intend to do with this information is quite simple (and not that useful :P) but may serve as a good introduction to clustering.
We accumulate a set of Twitter users by fetching one user's friends and followers as user class objects, and we take the union of the two sets. We then plot friends vs followers, where each dot is a 2D or 3D point made of a user's friends count, followers count and, optionally, statuses count.
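The union step can be sketched in base R. This is only an illustration under assumptions: the two data frames below are made-up stand-ins with hypothetical users, and the column names simply mirror the user class fields. In a real run I would expect these data frames to come from twitteR (for example via getUser, the user object's getFriends and getFollowers methods, and twListToDF), but that part needs API credentials and is left out here.

```r
# Toy stand-ins for the data frames a real fetch would return (all values are made up).
friends <- data.frame(
  screenName     = c("alice", "bob", "carol"),
  friendsCount   = c(120, 45, 300),
  followersCount = c(80, 1500, 310),
  statusesCount  = c(900, 12000, 75),
  stringsAsFactors = FALSE
)
followers <- data.frame(
  screenName     = c("carol", "dave"),
  friendsCount   = c(300, 10),
  followersCount = c(310, 2),
  statusesCount  = c(75, 5),
  stringsAsFactors = FALSE
)

# Union: stack both sets and keep each account only once.
users <- rbind(friends, followers)
users <- users[!duplicated(users$screenName), ]

# One row per user, one column per feature.
features <- as.matrix(users[, c("friendsCount", "followersCount", "statusesCount")])
rownames(features) <- users$screenName
```

Deduplicating on screenName is a choice made for this sketch; the id field would work just as well if it is available.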
We use the built-in K-means clustering algorithm in R (the kmeans function, run here from RStudio) to cluster this data.
Wondering what K-means is? Check: https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm
or any of the several explanatory videos available online.
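To get a feel for the kmeans call itself, here is a minimal sketch on synthetic data. The log-normal counts are made up to stand in for real friends and followers counts, and the log transform is my own assumption (it keeps a few huge accounts from dominating the distances); the linked code may well do this differently.

```r
set.seed(42)  # kmeans starts from random centers, so fix the seed for repeatability

# Synthetic stand-in for the real counts (hypothetical values).
n <- 300
users <- data.frame(
  friendsCount   = round(rlnorm(n, meanlog = 5, sdlog = 1)),
  followersCount = round(rlnorm(n, meanlog = 6, sdlog = 1.5))
)

# Work on a log scale so that very large accounts do not dominate.
features <- log1p(users)

# k = 5 clusters; nstart reruns from several random starts and keeps the best fit.
fit <- kmeans(features, centers = 5, nstart = 10, iter.max = 50)

fit$size     # number of users in each cluster
fit$centers  # cluster centers, on the log scale

# Friends vs followers, one color per cluster (drawn only in an interactive session).
if (interactive()) {
  plot(features$friendsCount, features$followersCount,
       col = fit$cluster, pch = 19,
       xlab = "log(1 + friends count)", ylab = "log(1 + followers count)")
  points(fit$centers, pch = 4, cex = 2, lwd = 2)
}
```

The cluster numbers themselves carry no meaning; rerunning with a different seed can relabel or reshape the clusters.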
So basically, here we have a feature vector of 3 features per user:
1. friends count
2. followers count
3. statuses count
Using a similarity measure, we group the data points according to how similar they are with respect to these 3 features.
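***Check out what similarity measure kmeans uses internally in R and answer in the comments. My first guess was cosine similarity, but kmeans in R actually minimizes the within-cluster sum of squares, i.e. it works with squared Euclidean distance.***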
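One way to check the similarity question empirically is to recompute Euclidean distances by hand and compare them with what kmeans reports. This is a sketch on synthetic blobs (the data and the column names are stand-ins, not real Twitter counts), and I pick the Lloyd algorithm explicitly because its stopping rule makes the check easy to reason about.

```r
set.seed(1)

# Three well-separated synthetic blobs in 3 dimensions (stand-ins for the 3 features).
x <- rbind(
  matrix(rnorm(150, mean = 0),  ncol = 3),
  matrix(rnorm(150, mean = 6),  ncol = 3),
  matrix(rnorm(150, mean = 12), ncol = 3)
)
colnames(x) <- c("friendsCount", "followersCount", "statusesCount")

fit <- kmeans(x, centers = 3, nstart = 5, iter.max = 100, algorithm = "Lloyd")

# Squared Euclidean distance from every point to every cluster center.
d2 <- sapply(seq_len(nrow(fit$centers)), function(j) {
  rowSums(sweep(x, 2, fit$centers[j, ])^2)
})

# Each point should sit in the cluster whose center is nearest in Euclidean terms ...
nearest <- max.col(-d2, ties.method = "first")
assigned_to_nearest <- all(nearest == fit$cluster)

# ... and the objective kmeans reports should be the sum of those squared distances.
own_d2 <- d2[cbind(seq_len(nrow(x)), fit$cluster)]
objective_matches <- isTRUE(all.equal(sum(own_d2), fit$tot.withinss))
```

If both flags come out TRUE, the clustering is consistent with Euclidean distance rather than cosine similarity.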
These clusters can be further colored to mark density.
Follow the code given in this link to do it yourself:
http://rstudio-pubs-static.s3.amazonaws.com/5983_af66eca6775f4528a72b8e243a6ecf2d.html
I tweaked the code a bit to run kmeans clustering with "modi" as the user. The first run used 2 features (friends count and followers count) and the second used
3 features (friends count, followers count and statuses count). This is what we got:
[Plot: 2 features]
[Plot: 3 features]
PS: The 3-feature plot is still a 2D graph, but the clusters have been redefined. I set k=5 for this, where k is the number of clusters we want the kmeans algorithm to generate.
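The effect described in the PS can be sketched on synthetic data: cluster once on 2 features and once on 3, keep the same 2D axes, and compare the labels. Everything below (the made-up counts, the log scale, the variable names) is an assumption for illustration and not the code used for the plots in this post.

```r
set.seed(7)

# Synthetic counts standing in for the real users (hypothetical values).
n <- 300
users <- data.frame(
  friendsCount   = round(rlnorm(n, meanlog = 5, sdlog = 1)),
  followersCount = round(rlnorm(n, meanlog = 6, sdlog = 1.5)),
  statusesCount  = round(rlnorm(n, meanlog = 7, sdlog = 2))
)
features <- log1p(users)  # log scale is my own choice here

k <- 5
fit2 <- kmeans(features[, c("friendsCount", "followersCount")],
               centers = k, nstart = 10, iter.max = 50)
fit3 <- kmeans(features, centers = k, nstart = 10, iter.max = 50)

# Cross-tabulate the two labelings: a row spread over several columns is a
# 2-feature cluster that the third feature has split up.
comparison <- table(two_features = fit2$cluster, three_features = fit3$cluster)

# Same friends vs followers axes, colored by the 3-feature clusters
# (drawn only in an interactive session).
if (interactive()) {
  plot(features$friendsCount, features$followersCount,
       col = fit3$cluster, pch = 19,
       xlab = "log(1 + friends count)", ylab = "log(1 + followers count)")
}
```

Because the third feature is not on either axis, clusters in the 3-feature plot can appear to overlap even though they are separated along statuses count.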
So that's it for now. I hope you found it interesting and useful.
There is lots of exciting, relevant stuff online, so keep checking!
See you in the next post. Till then,
Keep Hacking! :)