Skip to main content

Introduction to Clustering using R

We all have heard of Big Data a lot! Want to try basic clustering techniques on it?

Here we go. Before beginning into the coding , lets familiarize ourselves with some basic terms and platforms to do this thing.

1. What data are we studying?
For this purpose we would be studying Twitter, there are well equipped APIs to study the tweets that too in varied languages, eg, twitteR in R (stastical tool), tweetPy in python.

2. Tools we would be using: R
Get yourself quickly familiarized with R by following these links:

http://www.r-bloggers.com/how-to-learn-r-a-flow-chart/
http://tryr.codeschool.com/

And while searching for the relevant links, I have found this course for you, you may want to attend:

https://www.coursera.org/course/rprog


3. What we aim to do?
Collect the list of friends and followers for any one person's twitter account.
We would make use of user class of twitteR for the same.
It has following fields:
  • name:Name of the user
  • screenName:Screen name of the user
  • id:ID value for this user
  • lastStatus:Last status update for the user
  • description:User’s description
  • statusesCount:Number of status updates this user has had
  • followersCount:Number of followers for this user
  • favoritesCount:Number of favorites for this user
  • friendsCount:Number of followees for this user
  • url:A URL associated with this user
  • created:When this user was created
  • protected:Whether or not this user is protected
  • verified:Whether or not this user is verified
  • location:Location of the user
  • listedCount:The number of times this user appears in public lists
  • followRequestSent:If authenticated via OAuth, will be TRUE if you’ve sent a friend request to this user
  • profileImageUrl:URL of the user’s profile image, if one exists
and several methods. To look into what all things twitteR API offers, check out this: http://cran.r-project.org/web/packages/twitteR/twitteR.pdf 

So, what we intend to do on the basis of this information is quite simple (and not that useful :P ) but may serve as a good introduction to clustering.

We accumulate a set of users on twitter, by fetching a twitter user's friends and followers as User class objects . We take the union of these users. Now we plot a graph friends vs followers, where each dot is a 2 D / 3 D entity containing friends' count, followers' count and/ or statuses' count.

We use inbuilt K-means clustering algorithm in RStudio to do the clustering on this data.
Pondering what is K-means? Check : https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm
or several explanatory videos available online.

So basically, here we have a feature vector of 3 features per user:
1. friend's count
2. follower's count
3. statuses'  count

On the basis of similarity measure, we cluster the data points together on the basis of similarity wrt to these 3 features.

***Check out what similarity kmeans internally uses in R and answer in comments if you find one, I guess it is cosine similarity. ***

These clusters can be further colored to mark density.

Follow the code given in this link to do it by yourself:

http://rstudio-pubs-static.s3.amazonaws.com/5983_af66eca6775f4528a72b8e243a6ecf2d.html


I used and tweaked the code a bit to do kmeans clustering on "modi" as user and used 2 features : friends count and followers count for first and
3 features: friends count , followers count and statuses' count for the next, this is what we attained:

  
 2 Features

                                                 3 Features


PS: It is a 2D graph still but the clusters have been redefined, I put k=5 for this, where k stands as the number of clusters we wish to generate using the kmeans algorithm.


So that's it for now. I hope it would have been interesting and useful.
There exists lots of exciting relevant stuff online, keep checking!

See you in next post, till then
Keep Hacking! :)

Comments

Popular posts from this blog

Duh - Saves you the trouble to correct your command

Duh.

This is no more an expression for me but a command now. Thanks to the hack I have been doing for past couple of days.

What's it about? Well, here it goes.

How many times it happens that we screw up commands on terminal?
A typo, a syntax mistake or jumbled up arguments. The command doesn't run and then we spend time retyping it ensuring everything is in place this time.
Quite time consuming, eh?




My laziness simply denied me such a behaviour. So I coded up a powershell cmdlet which can do this for me.
Now if I mess up a command, I just have to type 'Duh' and the right command will be displayed on the prompt for you to check and execute (press ENTER).

How Duh operates internally?
Well, guess what. Answer lies in the "tries".
We have a trie and we do closest match using Leveinshtein Distance.
In short, how to figure out how close two strings are?  Find the no of letters you need to remove/insert/replace in order to attain string 2 from string 1.
This is wha…

How would you make an HTML Parser?

Hola folks,

Here we will be walking through how html is being parsed by a widely popular parser: Angle Sharp
You can find it here !
This is just a walkthrough and gives an idea on the breadth of issues  one has to deal with while designing a parser.

Let's sneak peek into what kind of data structures are used and how is the code structured.


It all begins with this line:
var parser = new HtmlParser();
We have folowing variations in constructing the HtmlParser object:

  public HtmlParser()
            : this(Configuration.Default)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options.
        /// </summary>
        /// <param name="options">The options to use.</param>
        public HtmlParser(HtmlParserOptions options)
            : this(options, Configuration.Default)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom configuration.
        /// </summary&…

What emotions run through your playlist?

Hola !!

Long time. So I was upto creating this application one night , I named: "Playlist Emotions" , "What does the song says" :P , "PlayWithEmotions" and what not, each project of a failed experimentation.

Duh.

The project took way too long. Thanks to the noble idea of making and deploying it as a GWT application and then to improve upon the GUI of the app.
FYI , I tried but done neither of the things above. My Google Console developers trial account  supposedly has some issues with this app, yet to be resolved. So before the entire idea behind the app and the excitement of results it displays fades out , I thought, lemme write a blog post on the same.


So what is this all about?

The idea originated from a candid discussion with my friend on how songs influence our moods and also how our emotions affect the type of songs we listen to.

Being fascinated about knowing what kinds of songs I listen to, I thought of creating an application where I would just e…