Spring 2012 Dataset Organization/Management

= Course Description =
A course to introduce the structure of databases and the management of datasets for information extraction. Concepts include the relational and entity relationship models, and local and distributed storage and access. The preparation and management of datasets for analysis is covered, and includes data cleaning, reorganization and security.

= Facebook Group =
http://www.facebook.com/groups/306618662716557/

= Virtual Machine(s) =
http://dl.dropbox.com/u/2920856/ubuntu%20disc%20210.vdi

= Running your virtual machine =
https://www.virtualbox.org/

The password is informatics.

= Extra Credit =
 * 1) Replicate what was done here http://www.ibm.com/developerworks/opensource/library/os-dataminingrubytwitter/index.html?ca=drs-
 * 2) Extend what was done in a new direction. Be creative!

Hard Deadline of April 25th

= Schedule =
This schedule is tentative and subject to change as the course progresses. Please check back often to stay current on due dates and topics.

We will have several guest lectures throughout the semester. Dr. Gavin Naylor will join us to discuss bioinformatics and Dr. Jim Bowring will give us an introduction into geology datasets.

Travel Information
I will be attending two conferences during the semester. During these times, we will have guest lectures, exams, and activities. The dates are:
 * 02/29/2012 - 03/03/2012
 * 03/29/2012 - 03/31/2012

Week 1 - Jan 9, 11, 13

 * Introduction to class and review of syllabus
 * Introduction to the Twitter API and JSON
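As a rough illustration of the JSON topic, the sketch below parses a small tweet-like document with Python's standard json module. The sample document is made up and heavily trimmed; it is not real API output, and only a few field names (id, text, user, entities) are assumed here.

```python
import json

# A made-up, heavily trimmed tweet-like document (not real API output).
raw = '''
{
  "id": 1,
  "text": "Learning about JSON in class today",
  "user": {"screen_name": "example_student"},
  "entities": {"hashtags": [{"text": "databases"}]}
}
'''

tweet = json.loads(raw)  # parse JSON text into Python dicts and lists
author = tweet["user"]["screen_name"]
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
word_count = len(tweet["text"].split())

print(author, hashtags, word_count)
```

Nested JSON objects become nested dictionaries, and JSON arrays become lists, so fields are reached by chaining keys and indexes.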

Videos
1-11-2012 (download)

Week 2 - Jan 16 (Holiday), 18, 20

 * Flat file storage

[[Media:Spring_2012_DISC_210_HW_1.pdf|Homework #1 - Due Friday 16, Monday, or Tuesday]]

Videos
1-18-2012 (download)

1-20-2012 (download)

Week 3 - Jan 23, 25, 27

 * Entity relationship diagrams

[[Media:Spring_2012_DISC_210_HW_1.pdf|Homework #1 - Due Friday 16, Monday, or Tuesday]]

Read the Wikipedia entry on Entity Relationship Models before class on Jan. 23rd

Exam #1 on Friday, What should I study?

Videos
1-25-2012 (download)

1-25-2012 (download)

Week 4 - Jan 30, Feb 1, 3

 * Local relational databases & SQL

Homework #2 due Friday, February 3rd through Tuesday, February 7th

Read the following by February 3rd:
 * http://en.wikipedia.org/wiki/Database_management_system
 * http://en.wikipedia.org/wiki/Relational_database

New software to install
sudo apt-get install mysql-server

Videos
1-30-2012 (download)

2-1-2012 (download)

2-3-2012 (download)

Week 5 - Feb 6, 8, 10

 * Local relational databases & SQL
 * [[Media:Spring_2012_DISC_210_Queries.pdf|Queries]]
 * [[Media:Spring_2012_DISC_210_Example_Queries.pdf|Example Queries]]

Exam #2, What should I know?

Videos
2-6-2012 (download)

2-8-2012 (download)

Week 6 - Feb 13, 15, 17

 * [[Media:Spring_2012_DISC_210_Queries.pdf|Queries]]
 * [[Media:Spring_2012_DISC_210_Example_Queries.pdf|Example Queries]]
 * In class queries

Homework #3 Due by Friday, Feb 17

Tips
sudo apt-get install mysql-query-browser
sudo apt-get install gtk2-engines-pixbuf

I recommend using either mysql-query-browser or the command-line mysql client. To start mysql-query-browser, open a terminal and enter mysql-query-browser. To start the mysql client, run mysql -u root -p, then type use mydb to select the database before running your SQL statements. To execute a script, use the source command (e.g., source create_twitter.txt).
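The workflow described above (open a database, then run a file of SQL statements) can be sketched outside the mysql client as well. The example below is a minimal sketch that uses Python's built-in sqlite3 module in place of MySQL so that it runs without a server. The script contents are hypothetical; the actual create_twitter.txt is not reproduced here.

```python
import sqlite3

# Hypothetical script contents; the course's create_twitter.txt is not shown here.
script = """
CREATE TABLE Tweet (id INTEGER PRIMARY KEY, text TEXT);
INSERT INTO Tweet (id, text) VALUES (1, 'hello world');
INSERT INTO Tweet (id, text) VALUES (2, 'databases are fun');
"""

conn = sqlite3.connect(":memory:")  # in-memory database, plays the role of "use mydb"
conn.executescript(script)          # runs every statement in order, like "source <file>"
rows = conn.execute("SELECT id, text FROM Tweet ORDER BY id").fetchall()
print(rows)
conn.close()
```

In the mysql client the equivalent steps would be use mydb followed by source with the name of the script file.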

Videos
2-13-2012 (download)

2-15-2012 (download)

Week 7 - Feb 20, 22, 24

 * [[Media:Spring_2012_DISC_210_Queries.pdf|Queries]]
 * [[Media:Spring_2012_DISC_210_Example_Queries.pdf|Example Queries]]
 * In class queries

Exam #3, [[Media:Spring_2012_DISC_210_Practice_Exam_3.pdf|Practice Exam 3]]

select Tweet.text
from (
    select count(*) as 'num_words', tweet_id from Word group by tweet_id
  ) as S,
  Tweet
where S.num_words = (
    select min(num_words)
    from ( select count(*) as 'num_words', tweet_id from Word group by tweet_id ) as S2
  )
  and S.tweet_id = Tweet.id;
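The query above returns the text of the tweet (or tweets) with the fewest words: the subquery S counts words per tweet, and the inner subquery finds the minimum of those counts. The sketch below runs the same query on a tiny made-up dataset using Python's sqlite3 module as a stand-in for MySQL. The table layouts Tweet(id, text) and Word(tweet_id, word) are assumptions inferred from the query, and the alias is written without quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tweet (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE Word (tweet_id INTEGER, word TEXT);
""")

# Made-up sample data.
tweets = [(1, "hello world"), (2, "databases are fun"), (3, "ok")]
conn.executemany("INSERT INTO Tweet VALUES (?, ?)", tweets)
# One Word row per whitespace-separated word of each tweet.
conn.executemany(
    "INSERT INTO Word VALUES (?, ?)",
    [(tid, w) for tid, body in tweets for w in body.split()],
)

query = """
SELECT Tweet.text
FROM (SELECT count(*) AS num_words, tweet_id FROM Word GROUP BY tweet_id) AS S,
     Tweet
WHERE S.num_words = (
        SELECT min(num_words)
        FROM (SELECT count(*) AS num_words, tweet_id
              FROM Word GROUP BY tweet_id) AS S2)
  AND S.tweet_id = Tweet.id
"""
shortest = [row[0] for row in conn.execute(query)]
print(shortest)
conn.close()
```

If several tweets tie for the minimum word count, the query returns all of them, because the comparison is against the minimum value rather than a single row.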

Solution to Practice Exam, Problem #11

select min(GPA)
from STUDENT, CLASS
where STUDENT.SID = CLASS.StudentID
  and CLASS.SectID = 8

Solution to Practice Exam, Problem #13

select STUDENT.LastName
from (select avg(Grade) as 'AvgGrade', SectID from CLASS group by SectID) as S,
  STUDENT, CLASS
where STUDENT.SID = CLASS.StudentID
  and CLASS.SectID = S.SectID
  and CLASS.Grade > S.AvgGrade

S:

AvgGrade  SectID
83        1
81        2
75        3

CLASS:

Grade  SectID  StudentID
81     1       123
79     1       456
74     2       123
65     3       789

STUDENT:

SID  LastName
123  Anderson
456  Olmsted
789  Starr

After join and correct where statement:

AvgGrade  SectID  Grade  SectID  StudentID  SID  LastName
83        1       81     1       123        123  Anderson
83        1       79     1       456        456  Olmsted
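To try the Problem #13 query end to end, the sketch below loads the CLASS and STUDENT rows listed above into an in-memory database and runs the query. It uses Python's sqlite3 module as a stand-in for MySQL, and the column types are assumptions. Note that the per-section averages computed from these four CLASS rows are 80, 74, and 65, which differ from the illustrative values in the S table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STUDENT (SID INTEGER PRIMARY KEY, LastName TEXT);
CREATE TABLE CLASS (Grade INTEGER, SectID INTEGER, StudentID INTEGER);
INSERT INTO STUDENT VALUES (123, 'Anderson'), (456, 'Olmsted'), (789, 'Starr');
INSERT INTO CLASS VALUES (81, 1, 123), (79, 1, 456), (74, 2, 123), (65, 3, 789);
""")

# Step 1: the derived table S, one average grade per section.
averages = conn.execute(
    "SELECT SectID, avg(Grade) FROM CLASS GROUP BY SectID ORDER BY SectID"
).fetchall()

# Step 2: join S with STUDENT and CLASS, keeping grades above the section average.
query = """
SELECT STUDENT.LastName
FROM (SELECT avg(Grade) AS AvgGrade, SectID FROM CLASS GROUP BY SectID) AS S,
     STUDENT, CLASS
WHERE STUDENT.SID = CLASS.StudentID
  AND CLASS.SectID = S.SectID
  AND CLASS.Grade > S.AvgGrade
"""
above_average = [row[0] for row in conn.execute(query)]

print(averages)
print(above_average)
conn.close()
```

With this data only section 1 has a grade strictly above its average (81 against 80), so the query returns a single name. Sections with one student can never match, because a lone grade equals its own average.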

Week 8 - Feb 27, 29, Mar 2

 * Data reorganization strategies

Week 9 - (Spring break) Mar 5, 7, 9

 * No class

Week 10 - Mar 12, 14, 16

 * Semantic Web
 * RDF Primer Primer
 * DBpedia
 * RDF Demo
 * This might also be useful
 * Protege
 * Protege/OWL Tutorial

Homework 4, Due on 3/12 - 3/13

Week 11 - Mar 19, 21, 23

 * Ontologies - Notes on Protege
 * Scavenger Hunt

Homework 5 Due on Monday in class

Exam #4 on Friday [[Media:Spring_2012_DISC_210_Practice_Exam_4.pdf|Practice Exam]]

Week 12 - Mar 26, 28, 30

 * SPARQL


 * Homework #6, due on Monday or Tuesday

Videos
3-26-2012 (download)

3-28-2012 (download)

Week 13 - Apr 2, 4, 6

 * Distributed databases
 * MySQL Cluster
 * [[Media:Spring_2012_DISC_210_Distributed_Databases.ppt|Distributed Databases]]
 * Cloud data storage
 * Google Cloud Storage

Videos
4-2-2012 (download)

Week 14 - Apr 9, 11, 13

 * Google Cloud Storage and NoSQL

SPARQL Homework (#7) due on Monday or Tuesday

Exam #5, [[Media:Spring_2012_DISC_210_Practice_Exam_5.pdf|Practice Exam]]

Week 15 - Apr 16, 18, 20

 * Data security
 * Data cleaning methodologies
 * Data warehouses

Week 16 - Apr 23
Review

Final Exam
Wed, April 25: 8 AM - 11 AM

[[Media:Spring_2012_DI_210_Take_Home_Final_Portion.pdf|Final Exam (Take home portion)]]

[[Media:Spring_2012_DISC_210_Practice_Final_Exam_v2.pdf|Practice Final Exam]]