Spring 2012 DISC 210 Week 2

= Flat File Database = Definition: To encode a database model (usually a table) as a single file (such as .txt or .ini).

A flat file is a text file which usually contains one record per line. Within each record, the fields are separated by delimiters (e.g., commas or tabs). They may also be fields of fixed width.

There are no structural relationships between the records.

What are some examples of flat files?

 * Name and address list
 * passwd in the linux directory etc
 * group in the linux directory etc

How are flat file databases created?
If they are being created interactively, then they are usually created with a spreadsheet program or text editor. They are also commonly created and edited with script files or programs.

Data Transfer
They have in the past been used as data transfer tools to remote servers; however, in recent years they are replaced with XML files, which also describe the data. The criticism against XML's use has been the increase in file size. This is being solved today by XML file compression (Efficient XML Interchange - used by mobile devices).

Many believe that XML is gradually being replaced by JSON and YAML.

Example
The following example shows a basic flat file database, which consists of data in a series of columns and rows organized into a tabular format. If we wanted to create a flat file of teams and their members: id   name    team 1    Amy     Blues 2    Bob     Reds 3    Chuck   Blues 4    Dick    Blues 5    Ethel   Reds 6    Fred    Blues 7    Gilly   Blues 8    Hank    Reds

How would this file change if it was comma delimited?

Data types
Each column in a database table is usually restricted to a specific data type. These restrictions are usually established by convention and not formally indicated (or possibly enforced).

What are some different data types?

 * integers
 * booleans
 * characters
 * floating-point numbers
 * alphanumeric strings

The details of data types are specified by the system you are using (i.e., a MySQL database is going to have different data types than SQL server).

Our Twitter Flat File Database
We'll be downloading tweets and information about the users who tweeted them using four scripts. Before we can run the first script we must create a directory called "tweets" to store our flat file database.

First you must 'cd' to the directory containing the scripts. Then: mkdir tweets

Script to Download Tweets Every Hour
from __future__ import print_function import twitter, json, sys import datetime,time

if len(sys.argv) < 2: print("Usage: python get_tweets.py recent|popular") exit

q = sys.argv[1] result_type = sys.argv[2]

twitter_search = twitter.Twitter(domain="api.twitter.com") search_results = [] search_results.append(twitter_search.search(q=q,page=1,rpp=1000,lang='en',result_type=result_type))
 * 1) for page in range(1,6):


 * 1) print(json.dumps(search_results, sort_keys=True, indent=1))

while True: now = datetime.datetime.now outfile = open("tweets/" + q + "_" + result_type + "_" + str(now), 'w') print('to_user','id','created_at','from_user','text',sep='\t',file=outfile) for search_result in search_results: for tweet in search_result['results']: print(tweet['to_user'],tweet['id'],tweet['created_at'],tweet['from_user'],tweet['text'].encode('ascii','ignore').replace("\n",""),sep='\t',file=outfile) time.sleep(60*60)

Script to download user information for the users that tweeted
from __future__ import print_function import twitter, json, sys twitter_search = twitter.Twitter(domain="api.twitter.com")

if len(sys.argv) < 2: print("Usage: python get_user_info.py ")

file = open(sys.argv[1],'r') headers = file.readline.strip.split("\t") inx = headers.index("from_user") for line in file: fields = line.strip.split("\t") user = fields[inx] results = twitter_search.users.show(screen_name=user) print(results['created_at'],results['description'].encode('ascii','ignore').replace("\n",""),results['friends_count'],results['id'],results['screen_name'],results['time_zone'],results['statuses_count'],results['location'],sep='\t')

Script to download follower information for the users that tweeted
from __future__ import print_function import twitter, json, sys twitter_search = twitter.Twitter(domain="api.twitter.com")

if len(sys.argv) < 2: print("Usage: python get_user_followers.py ")

file = open(sys.argv[1],'r') headers = file.readline.strip.split("\t") inx = headers.index("from_user") print('user','follower_id') for line in file: fields = line.strip.split("\t") user = fields[inx] results = twitter_search.followers.ids(screen_name=user) for id in results['ids']: print(user,id,sep='\t')

Script to download friend information for the users that tweeted
from __future__ import print_function import twitter, json, sys twitter_search = twitter.Twitter(domain="api.twitter.com")

if len(sys.argv) < 2: print("Usage: python get_user_friends.py ")

file = open(sys.argv[1],'r') headers = file.readline.strip.split("\t") inx = headers.index("from_user") print('user','friend_id') for line in file: fields = line.strip.split("\t") user = fields[inx] results = twitter_search.friends.ids(screen_name=user) for id in results['ids']: print(user,id,sep='\t')

What are some problems that you see with our scripts?

 * Redundant information
 * How do we query it?
 * How can we link the different flat files?
 * Should we be storing entire tweets or individual words as tweets? How would you decide?