Fall 2011 Introduction to Discovery Informatics

= Course Description =
Introduction to the use of computer-based tools for the analysis of large data sets for the purpose of knowledge discovery. Students will learn the Discovery Informatics process and the difference between deductive, hypothesis-driven modeling and inductive, data-driven modeling. Students will gain hands-on experience with various online analytical processing (OLAP) and data mining software and complete a project using real data.

= Facebook Group =
http://www.facebook.com/groups/267760973233950/

= A sample of data repositories =
 * http://www.nlm.nih.gov/hsrinfo/datasites.html
 * https://sctr.musc.edu/index.php/research-tools/clinical-data-warehouse
 * http://isc.sans.edu/diary.html?storyid=5728
 * http://www.ncbi.nlm.nih.gov/pubmed/11752295
 * http://www.treebase.org/treebase-web/home.html
 * http://www.ucmp.berkeley.edu/subway/phylo/phylodat.html
 * http://www.earth-policy.org/data_center/C26

= Tentative Schedule =

Week 1
Overview of Emerging Field

History, background, and motivations

Knowledge discovery overview

Dates: Aug 23, 25

Assessment: Nothing due

Assignment: Read the foreword and the chapter entitled "Jim Gray on eScience: A Transformed Scientific Method" in The Fourth Paradigm. The book is large, so I recommend printing only the specific chapters.

Week 2
Databases and Data Aggregation

Data models, databases and data warehouses

Lecture 1: Databases

Sample Access Database: MONDIAL.accdb

Supplemental to Lecture 1: [[Media:Data_Organization.ppt|Data Organization]]

Lecture 2: [[Media:Data_Warehouse.ppt|Data Warehouses]]

Supplemental to Lecture 2: Open Source Software: Infobright

Dates: Aug 30, Sept 1

Assessment: Quiz #1 on Thursday over prior reading assignment. [[Media:Quiz_1_Solutions.pdf|Solutions]]

Assignment: [[Media:HW_1.pdf|HW #1]]

Week 3
Storage and Retrieval

Files, spreadsheets, and databases

Querying, SQL, Search, and Filtering

[[Media:SQL_and_MapReduce.PDF|Lecture 1: SQL and MapReduce]]

Lecture 2: Information Theory

Dates: Sept 6, Sept 8

Assessment: Homework #1 Due

Assignment: Read the introduction and the chapter titled "The Healthcare Singularity and the Age of Semantic Medicine" from the Health and Wellbeing section, which begins on page 55 of The Fourth Paradigm.

Week 4
Data, Information, and Knowledge

Information theory introduction

Link to animation

Introduction to Reasoning: Inductive and Deductive

[[Media:Reasoning.pptx|Reasoning.pptx]]

Dates: Sept 13, Sept 15

Assessment: Quiz on Thursday over the previous reading assignment

Assignment: [[Media:HW_2.pdf|HW #2]]

Week 5
Thomas Goetz: It's time to redesign medical data

[[Media:Introduction_Bayesian_Networks.ppt|Introduction to Bayesian Networks]]

[[Media:Bayesian_Networks.pdf|Bayesian Networks]]

Dates: Sept 20, Sept 22

Assessment: HW #2 due on Thursday at the beginning of class. Hand in a hard copy; do not submit through OAKS.

Assignment: [[Media:HW_3.pdf|HW #3]]

Week 6
More notes on Bayesian Networks

Dates: Sept 27, Sept 29

Assessment: HW #3 Due AND Exam #1 on Thursday Sept. 29

Assignment: None

Week 7
Data mining

Clustering and pattern exploration

Unsupervised methods: K-means and hierarchical clustering

[[Media:Introduction to unsupervised data mining.ppt|Introduction to unsupervised data mining]]

K-means and hierarchical clustering

[[Media:k-means-example.pdf|k-means example]]
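
As a rough illustration of the k-means idea, here is a minimal sketch in plain Python. It is not taken from the course materials or from Weka; the function name <code>kmeans</code>, its parameters, and the 2-D tuple representation of points are assumptions made for this example. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups (illustrative sketch only).

    points: list of (x, y) tuples; k: number of clusters.
    Returns (centroids, clusters), where clusters[i] holds the
    points assigned to centroids[i] once the loop has converged.
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct points chosen at random.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, clusters
```

For example, calling <code>kmeans(points, 2)</code> on two well-separated groups of points should place one centroid in each group. Results can depend on the random starting centroids, which is why a seed is exposed here.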

[[Media:grades.arff|Grades data set (ARFF)]]

Dates: Oct 4, Oct 6

Assessment: None

Assignment: None

Videos
Weka k-means (download)

Week 8
Data Mining

Unsupervised methods: Hierarchical Clustering and [[Media:Principal_Component_Analysis.ppt|Principal Component Analysis]]

Dates: Oct 11, Oct 13

Assessment: None

Assignment: [[Media:HW_4.docx|HW #4]] - [[Media:DISC_101_Fall_2011_Homework_4_Solutions.pdf|HW_4_Solutions.pdf]] [[Media:nutrition.arff|nutrition.arff]] [[Media:diabetes.arff|diabetes.arff]]

Videos
Weka PCA (download)

Week 9
Data Mining (continued)

Decision Trees

Dates: Holiday, Oct 20

Assessment:

Assignment:

Week 10
Data Mining (continued)

Decision Trees Continued

Training Set versus Test Set

Overfitting

[[Media:Fall_2011_DISC_101_Decision_Tree_Worksheet.pdf|Decision Tree Worksheet]]

[[Media:Fall_2011_DISC_101_Notes_Decision_Tree_Worksheet.pdf|Notes from Decision Tree Worksheet]]

Dates: Oct 25, Oct 27

Assignment: Read http://radar.oreilly.com/2011/09/building-data-science-teams.html and complete [[Media:DISC_101_Fall_2011_Homework_5_v2.pdf|Homework 5]] (due Thursday, November 3, at the beginning of class).

Videos
RapidMiner - Getting Started (download)

Week 11
Supervised methods: [[Media:Fall_2011_DISC_101_Bayes_Classifier.ppt|Bayes Classifier, k-nearest neighbor]]

Text and image mining, data dredging approaches

[[Media:Fall_2011_CSCI_220_Homework_5_Solutions.pdf|Homework Solutions - Decision Trees]]

Dates: Nov 1, Nov 3

Assignment: [[Media:Fall_2011_CSCI_220_Homework_6.pdf|Homework_6.pdf]] - Due November 15th. [[Media:Fall_2011_CSCI_220_Homework_6_Solutions.pdf|Homework 6 Solutions]]

Week 12
Evolutionary Algorithms

[[Media:Fall_2011_DISC_101_GAs.ppt|Genetic algorithms, genetic programming, and swarm intelligence]]

Dates: Nov 8, Nov 10

Assessment: [[Media:Fall_2011_DISC_101_Exam_2.pdf|Exam #2]] on Tuesday Nov. 8

Assignment: Email me your topic for the final project write-up before November 17.

Week 13
Cross-validation

Dates: Nov 15, Nov 17

Assessment: Homework 6 Due

Assignment: Nothing

Videos
Bayes vs. Decision (download)

Week 14
Optimization
 * Hill climbers
 * Simulated Annealing
 * Genetic Algorithms
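
As a rough illustration of the simplest method in the list above, here is a minimal hill-climber sketch in Python. It is not taken from the course materials; the function name <code>hill_climb</code> and its parameters (<code>score</code>, <code>start</code>, <code>neighbors</code>, <code>max_steps</code>) are assumptions made for this example. The climber repeatedly moves to the best-scoring neighbor and stops when no neighbor improves on the current solution.

```python
def hill_climb(score, start, neighbors, max_steps=1000):
    """Greedy local search that maximizes score (illustrative sketch).

    score: function mapping a candidate to a number (higher is better).
    start: initial candidate.
    neighbors: function returning a list of candidates near a given one.
    """
    current = start
    for _ in range(max_steps):
        # Look at every neighbor and pick the best-scoring one.
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            # No neighbor is strictly better: a local optimum.
            return current
        current = best
    return current
```

For example, with <code>score = lambda x: -(x - 3) ** 2</code> and <code>neighbors = lambda x: [x - 1, x + 1]</code>, starting from 0 the climber walks up to 3 and stops. A hill climber can get stuck at a local optimum, which is the usual motivation for simulated annealing and genetic algorithms.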

Dates: Nov 22, Holiday

Assessment: None

Assignment: TBA

Week 15
Optimization
 * Hill climbers
 * Simulated Annealing
 * Genetic Algorithms

[[Media:DISC_101_Fall_2011_Ethical_Social.ppt|Social, Ethical, and Legal Issues & The End Game: Why DI is not CS, Math, or Statistics.]]

Dates: Nov 29, Dec 1

Assignment: [[Media:DISC_101_Fall_2011_Homework_7_v2.pdf|Homework 7]] - Due at the beginning of the final exam

[[Media:Fall_2011_DISC_101_Final_Project.pdf|Final Project Write-up]] - Due Monday, December 12th by 5 PM in my mailbox outside my door.

Week 16
Final Exam

Dates: Final exam is 12 - 3 PM on Thursday, December 8th

Assessment: Final Project Write-up Due

[[Media:Fall_2011_DISC_101_Final_Exam.pdf|Final Exam.pdf]]

Additional Topics
Information Validation

Validity, precision, missing values

Data scrubbing

Artificial Neural Networks

Feed-forward neural networks and self-organizing maps

Knowledge Representation

Semantic Web and RDF

Computers and Networks

Computation and computability

Distributed computation