Spring 2012 Week 1

Examples and notes are partially adapted from "Mining the Social Web" by Matthew A. Russell.

There is a tremendous amount of valuable data on Facebook, Twitter, and LinkedIn. How can you find what you're looking for in the social haystack?

= What if I don't know Python? =

Python is already installed on the virtual machine. We'll walk you through most of the Python code that you'll need in the class, but it's always good to know more than necessary. Towards that end, here is a link to a nice tutorial: http://docs.python.org/tutorial/.

= Hacking Twitter Data =

Throughout this course I'll rely heavily on the command-line interface that Linux provides. The main reason is that it lets me show you easily and clearly how to get things up and running. As an added bonus, you'll be introduced to the great world of Linux!

First, get pip installed; it is a utility we will use to install additional software for Python:

sudo apt-get install python-pip

Now we can install some Python code that we'll need:

sudo pip install networkx

How do we use this functionality now? We can import it into Python and then access it:

disc@disc-VirtualBox:~$ python
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import networkx
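As a quick sanity check that the install worked, you can build a tiny graph. This is a made-up two-person example (the names are hypothetical, and there is nothing Twitter-specific yet):

```python
import networkx as nx

# Build a small undirected graph with two people and one connection.
g = nx.Graph()
g.add_edge("alice", "bob")   # hypothetical names

g.number_of_nodes()   # 2
g.number_of_edges()   # 1
```

If the import and the counts work, networkx is ready to use.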

= What is Twitter? =
It is a highly social microblogging service that allows you to post short messages, called tweets, of 140 characters or fewer. Twitter has an asymmetric network structure of "friends" and "followers": you can follow someone without them following you back. It offers an extensive number of APIs.
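That asymmetry means "follows" is a one-way relationship, which is naturally modeled as a directed graph. A sketch with the networkx package we just installed (the usernames are made up):

```python
import networkx as nx

# A directed edge u -> v means "u follows v".
follows = nx.DiGraph()
follows.add_edge("alice", "bob")   # alice follows bob
follows.add_edge("carol", "bob")   # carol follows bob too
# Note: bob follows no one back -- the relationship is asymmetric.

list(follows.successors("alice"))    # who alice follows
list(follows.predecessors("bob"))    # bob's followers
list(follows.successors("bob"))      # empty: bob follows nobody
```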

= Python IDE =
sudo apt-get install spe

= Twitter's API =

Installing the Twitter API:

sudo pip install twitter

Let's find out what people are saying about Apple:

import twitter, json

twitter_search = twitter.Twitter(domain="api.twitter.com")
search_results = []
for page in range(1, 6):
    search_results.append(twitter_search.search(q="apple", rpp=100, page=page))

print json.dumps(search_results, sort_keys=True, indent=1)

We are asking the Twitter API to search through the tweets with the query "apple", at 100 records per page. What the heck was that output? It is in a format called JSON (JavaScript Object Notation). We can find the 50 most frequent words and the 50 least frequent words. Before we do, we need to install the Natural Language Toolkit:

sudo pip install --upgrade distribute
sudo pip install nltk

import twitter, json, nltk

twitter_search = twitter.Twitter(domain="api.twitter.com")
search_results = []
for page in range(1, 6):
    search_results.append(twitter_search.search(q="apple", rpp=100, page=page))

tweets = [ r['text']
           for result in search_results
           for r in result['results'] ]
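The double comprehension flattens the paged results into a single list of tweet texts. With a hypothetical two-page result set (made-up data shaped like the search API's output), it works like this:

```python
# Made-up data mimicking the shape of paged search results.
search_results = [
    {"results": [{"text": "tweet one"}, {"text": "tweet two"}]},
    {"results": [{"text": "tweet three"}]},
]

# Outer loop walks the pages; inner loop walks each page's records.
tweets = [r['text']
          for result in search_results
          for r in result['results']]

# tweets == ["tweet one", "tweet two", "tweet three"]
```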

words = []
for t in tweets:
    words += [ w for w in t.split() ]

len(words)                       # total words
len(set(words))                  # unique words
1.0*len(set(words))/len(words)   # lexical diversity
1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)   # average words per tweet
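With a couple of made-up tweets (not real API output), the same statistics look like this:

```python
tweets = ["i love my apple", "my apple is great"]   # hypothetical examples

# Split every tweet into words and pool them together.
words = []
for t in tweets:
    words += [w for w in t.split()]

total = len(words)                # 8 words in all
unique = len(set(words))          # 6 distinct words
diversity = 1.0 * unique / total  # 0.75: fraction of words that are distinct
avg = 1.0 * sum(len(t.split()) for t in tweets) / len(tweets)   # 4.0 words per tweet
```

Lexical diversity closer to 1.0 means people are using a wider vocabulary; closer to 0.0 means the same words repeat a lot.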

freq_dist = nltk.FreqDist(words)
print freq_dist.items()[:50]    # 50 most frequent tokens
print freq_dist.items()[-50:]   # 50 least frequent tokens
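A frequency distribution is essentially just a word counter. If you want to experiment before nltk is installed, the standard library's collections.Counter does the same core job (the tokens below are made up):

```python
from collections import Counter

words = ["apple", "my", "apple", "i", "apple", "my"]   # hypothetical tokens
freq = Counter(words)

freq.most_common(2)   # [("apple", 3), ("my", 2)]
```

nltk's FreqDist adds NLP-oriented conveniences on top of this basic counting idea.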

How should we store this data over an extended period of time so we can query and look for patterns?

One answer: Relational Databases

= JSON =
http://en.wikipedia.org/wiki/JSON
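To see what JSON looks like on something small, here is a made-up, minimal tweet-like record (not real API output) round-tripped with the standard json module:

```python
import json

# A hypothetical record with the same field names the search API uses.
record = {"text": "I love my apple", "from_user": "alice"}

encoded = json.dumps(record, sort_keys=True, indent=1)   # dict -> JSON string
decoded = json.loads(encoded)                            # JSON string -> dict

decoded == record   # True: the round trip is lossless
```

json.dumps with sort_keys and indent is exactly what made the search output above readable.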

= Linux Command Line =

For those of you who want to learn more about the Linux command line, this is a good tutorial.