Fall 2012 Introduction to Discovery Informatics

= Course Description =
Introduction to the use of computer-based tools for the analysis of large data sets for the purpose of knowledge discovery. Students will learn to understand the Discovery Informatics process and the difference between deductive hypothesis-driven and inductive data-driven modeling. Students will have hands-on experience with various on-line analytical processing and data mining software and complete a project using real data.

= Syllabus =
[[Media:DI_101_Syllabus__Fall_2012.pdf|Download PDF version here]]

= Facebook Group =
https://www.facebook.com/groups/156752447781753/

= Department of Computer Science =
[[Media:Computer_Science_Guide.pdf|Guide to the Computer Science Department]]

= Project Groups =

Section 01

 * 1) Brendan Webb, Ziv Agasi, Max Lynn
 * 2) Thomas Brady, Nick Levitt, Amy Tevelowitz
 * 3) Dyanne Vaught, Mark Thornton, Andrew Trice
 * 4) Diana Manaker, Robby Hambrick, Michael Feliciano
 * 5) Adam Zhu, Steven Pilkenton, Parker Wise, Morgan
 * 6) Ronak Raithatha, DJ Duyer, Corey Bullock
 * 7) Nathan Bobart, Jon Simpson, Connor Carroll

Section 02

 * 1) David Schirduan, Catherine Claro, Tom Nash
 * 2) Daniel Sieger, Griffon Scott, Brandi Grebenc
 * 3) Chloe Fletcher, Patrick Collis, Jimmy Dendrinos, Lixin Yao
 * 4) Kelsey Yetsko, Rebecca Wiseman, Conor Templeten
 * 5) Nick Alonso, Joshua Voltin, Joseph Brinkley
 * 6) Tyrieke Morton, Peter Galagher, Alex Jacobs
 * 7) Gary Webb, Austin Bello, Trevor Shortt
 * 8) Victoria Newton, Raquel Jones, Darcy Alcorn
 * 9) Daniel Mulligan, Bennett Mackay, Stephen Rainey

= MATLAB Tutorial =
http://www.mathworks.com/academia/student_center/tutorials/

= MATLAB Sample Code =
Browse and download sample code from class

= DI 101 Course Notes =
While there isn't a book for this class, I am in the process of compiling a set of notes that are meant to be viewed online.

Notes

= Tentative Course Schedule =

Lecture Videos
Browse here

Topics
Overview of Emerging Field
 * Historical background and motivations
 * Knowledge discovery overview

Assignments
Project agenda: Form groups, examine kaggle.com, and share a Dropbox folder among the group (www.dropbox.com).

Topics
Introduction to Kaggle.com

[[Media:01 Pattern Recognition.ppt|Introduction to Supervised Learning and Data Mining]]
 * k-nearest neighbor classifier
 * Naive Bayes classifier

Introduction to MATLAB
 * MATLAB Tutorials and Learning Resources

Assignments
Project agenda: Explore kaggle.com. Identify a project of interest to your group. Download the necessary data files and get them loaded into MATLAB.

[[Media:Fall_2012_DISC_101_Homework_1.pdf|Homework #1 Assigned]]

Announcements
Aug 27 - Last day to drop/add full semester class

Topics
Introduction to Unsupervised Learning
 * [[Media:Introduction to unsupervised data mining.ppt|Introduction to unsupervised data mining]]
 * K-means
 * [[Media:k-means-example.pdf|k-means example]]
 * [[Media:grades.arff|Grades data set (ARFF)]]

Assignments
Wednesday the 5th and Friday the 7th - Present project idea to the class (5 - 10 minute presentation)
 * You are required to submit a PDF or PowerPoint version of your presentation to me by midnight Tuesday the 4th. Prezi links are fine as well.

2-page project proposal due Friday the 7th by midnight.
 * Single Spaced
 * Minimum of 3 journal article citations (at least one from each group member)
 * The proposal should have the following clearly specified:
 * Dataset/challenge from www.kaggle.com
 * Background on the challenge. Why is it important? What are the specific challenges of this problem?
 * Each person is required to write a section describing what they plan on trying/implementing. There can be some overlap, but I want to make sure that everyone has something unique to try out. It is completely expected that what you propose here will probably not work as well as you would have hoped. That's fine. This proposal isn't a contract. It's just a plan. I fully expect that you'll adapt your approach as you learn more throughout this course.

Homework #1 due in class on Friday, September 7th

[[Media:Fall_2012_DISC_101_HW1_Solutions.pdf|Homework #1 Solutions]]

Topics
[[Media:Principal_Component_Analysis.ppt|Principal Component Analysis]]

Assignments
[[Media:Fall_2012_DISC_101_HW2.pdf|Homework #2 Assigned]]

Project agenda: Attempt to solve your problem

Topics
Improved Supervised and Unsupervised Learning
 * Decision Trees

Assignments
Project agenda: Continue working on your project

Homework #2 Due on Friday in class

Announcements
Quiz #1 on Friday. The quiz will cover up to and including PCA. A good portion of the quiz will resemble the homework. The remaining portion will consist of short-answer questions related to the presentations. The algorithms that will be covered include:
 * k-nearest neighbor
 * Bayes Classification
 * k-means
 * PCA

[[Media:Fall_2012_DISC_101_Quiz_1_Solutions.pdf|Quiz #1 with Solutions]]

Topics
Improved Supervised and Unsupervised Learning
 * Neural Networks

Assignments
[[Media:Fall_2012_DISC_101_Homework_3.pdf|Homework #3 Assigned]]

Project agenda: Continue working on your project

Topics
More on Decision Trees

Announcements
Homework #3 due in class on Friday

[[Media:Fall_2012_Homework_3_Solutions.pdf|Homework 3 Solutions]]

Project agenda: Present preliminary results. A 3 – 4 page write-up is also due (single spaced, 3 or more journal article citations).
 * Update the introduction of your write-up
 * Write up your results up to this point. It is important that you have results to present. It is not valid to present problems with MATLAB as results. It is fine to present poor, preliminary results. They will give you a place to start towards improvement.
 * You do not need to include your code.
 * The presentation should include 2 introductory slides: one to describe the problem and one to describe the dataset.
 * Then you should describe your preliminary results. Begin by talking about your methods and then summarize your findings up to this point.

Assignments
Project agenda: In-class workgroups

Announcements
Midterm

[[Media:Fall_2012_DISC_101_Practice_Midterm.pdf|Practice Midterm]]

Topics
Regression

Announcements
Oct 15 - Fall break

Topics
Linear Regression

Partial Least Squares Regression

Assignments
Project agenda: Continue working on projects

Week 11 (Oct 29 - Nov 2)
Image Processing (see notes)

Association Rule Mining

Assignments
[[Media:Fall_2012_DISC_101_Homework_4.pdf|Homework #4 Due Friday]]

Project agenda: Continue working on projects

Topics
Optimization
 * Linear Programming
 * Hill climbing
 * Simulated annealing
 * Evolutionary Algorithms, MATLAB GA

Assignments
[[Media:05 Association Rules and Optimization.pdf|Homework #5 Assigned]]

Project agenda: Continue working on projects

Announcements
Nov 6 - Election day (no classes)

Possible Topics
Databases and Data Aggregation
 * Data models, databases and data warehouses
 * Storage and Retrieval
 * Files, spreadsheets, and databases
 * Querying, SQL, Search, and Filtering
 * How does this relate to Big Data?

Computation and computability

Distributed computation
 * Map-Reduce
 * Cloud computing
 * How does this relate to Big Data?

Knowledge Representation
 * Semantic Web and RDF
 * How does this relate to Big Data?

Assignments
Homework #5 Due

Project agenda: Continue working on projects

Topics
[[Media:Fall_2012_DISC_101_Quiz_2.pdf|Quiz #2 on Monday]]

Assignments
Homework #5 Due

Project agenda: Continue working on projects

Topics
Social, ethical, and legal issues

Assignments
Present final results on Wednesday and Friday. Write-up due on Friday. (4 – 6 page write-up, single spaced, 3 or more journal article citations)

Week 16 - 17
Final Exam

[[Media:Fall_2012_DISC_101_Practice_Final_Exam.pdf|Practice Final Exam]]

Dec 4 - Reading day

Section 02 - Exam on December 12 from 8 - 11 AM

Section 01 - Exam on December 5 from 12 - 3 PM