Spring 2014 CSCI 334

= Data Mining Syllabus =
Course Description: A course covering data mining concepts, methodologies, and programming. Topics include decision tables and trees, classification and association rules, clustering, pattern analysis, and linear and statistical modeling. Additional topics may include data cleaning and warehousing and techniques for text and web mining.

Required Text: Machine Learning: An Algorithmic Perspective by Stephen Marsland.

Prerequisite: CSCI 221, MATH 207, MATH 250

Contact Information

 * Professor: Dr. Paul Anderson
 * Office: 212 J.C. Long
 * Office Hours: My door is always open. Even if it isn't, please knock. I always love to hear from students. Tuesday and Thursday from 4 - 5 PM are my posted hours.
 * E-mail: andersonpe2@cofc.edu
 * Office Phone: 953-8151
 * Facebook: andersonpe2@cofc.edu
 * Facebook group: https://www.facebook.com/groups/239189286254333/
 * Web: http://birg.cs.cofc.edu/index.php/Spring_2014_CSCI_334

Course (learning) outcomes

 * 1) 	Know the meaning of data mining, some of the application areas and disciplines that use data mining, and understand some of the current major challenges in data mining.
 * 2) 	Recognize that data mining is part of a larger process and be able to describe the various stages of that process.
 * 3) 	Understand the need for and techniques for carrying out data cleaning and other data pre-processing activities and to apply them to real-world data sets.
 * 4) 	Understand and apply a wide range of the fundamental classification and prediction algorithms, including algorithms for decision trees and rule-based classifiers, Bayes classification methods, and other classification approaches such as logistic regression, k-nearest neighbor, and neural networks.
 * 5) 	Examine and apply metrics for classifier performance and selection.
 * 6) 	Examine and apply metrics for association pattern evaluation.
 * 7) 	Understand and apply several clustering algorithms including k-means clustering and BIRCH clustering.
 * 8) 	Examine and apply metrics for cluster evaluation such as clustering tendency, number of clusters, and clustering quality.
 * 9) 	Examine and apply metrics for attribute selection.
 * 10) 	Recognize some of the current data mining trends and research frontiers.
 * 11) 	Explore the use of data mining techniques on different datasets using software packages.

Grading Policy

 * 1) Exam 1 - 20%
 * 2) Exam 2 - 20%
 * 3) Exam 3 - 20%
 * 4) Homework - 10%
 * 5) Programming Assignments - 20%
 * 6) Final Code Review and Presentation - 10%

Grading Scale: A: 90-100; B: 80-89; C: 70-79; D: 65-69; F: <65. Plusses and minuses will be used at the discretion of the instructor.
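As a rough illustration of how the weights and scale above combine, here is a minimal Python sketch. The function names, the sample scores, and the treatment of fractional scores (each letter starting at its lower bound, with no rounding) are assumptions made for illustration and are not part of the official policy. Plus and minus grades are left out because they are at the instructor's discretion.

```python
# Weights taken from the Grading Policy list, as fractions of the final grade.
WEIGHTS = {
    "exam1": 0.20,
    "exam2": 0.20,
    "exam3": 0.20,
    "homework": 0.10,
    "programming": 0.20,
    "final_review": 0.10,
}


def weighted_score(scores):
    """Combine per-component percentages (0-100) into one course percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a course percentage to a base letter grade.

    Assumption: a letter starts at its lower bound (e.g. 80 or above is a B),
    and scores are not rounded before the comparison.
    """
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 65:
        return "D"
    return "F"


# Hypothetical scores for one student, used only to show the arithmetic.
example = {
    "exam1": 85,
    "exam2": 90,
    "exam3": 80,
    "homework": 95,
    "programming": 88,
    "final_review": 92,
}

total = weighted_score(example)
print(f"{total:.1f} -> {letter_grade(total)}")
```

With these made-up scores, the three exams contribute 17, 18, and 16 points, homework 9.5, programming 17.6, and the final review 9.2, which falls in the B range.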

Grading Guidelines: Submitted work requires Analysis, Evaluation, and Creation of ideas, concepts, and materials into various deliverables (e.g., see revised Bloom's Taxonomy and reference below).
 * The grade of A is for work that involves high-quality achievement in all three Bloom areas.
 * The grade of B is for work that involves high-quality achievement in at least two Bloom areas, and medium-level achievement in the other.
 * The grade of C is for work that involves high-quality achievement in at least one Bloom area, and medium-level achievement in the others.
 * The grade of F is for work that does not meet the above criteria.

Reference: Errol Thompson, Andrew Luxton-Reilly, Jacqueline L. Whalley, Minjie Hu, and Phil Robbins. 2008. Bloom's taxonomy for CS assessment. In Proceedings of the tenth conference on Australasian computing education - Volume 78 (ACE '08), Simon Hamilton and Margaret Hamilton (Eds.), Vol. 78. Australian Computer Society, Inc., Darlinghurst, Australia, 155-161.

Homework Policy
Homework will be assigned each week and turned in every Friday. Written homework must be placed under my office door by 5 PM on the due date. Cheating/sharing will result in a zero on the assignment and a report to the judicial board.

Programming Assignments
Programming assignments will be submitted through the Learn2Mine environment.

Exam Policy
Student performance will be assessed through three examinations spaced throughout the semester.

Honor Code

 * You must do your work alone (or with your teammates, for group assignments).
 * You must identify your sources of material and inspiration. It is a violation of the honor code to present someone else's work or ideas as your own.
 * In any course deliverable, you must always identify the person(s) that helped you (directly or indirectly), if any, and explain their contribution to your work.
 * Also see the College of Charleston Student Handbook, especially sections on The Honor Code (p. 11), and Student Code of Conduct (p. 12). There is other useful information there.

Classroom Policies

 * You are expected to take good notes during class.
 * You are expected to participate in class with questions and invited discussion.
 * You are expected to attend all classes. The grade 'WA' will be given for excessive (>= 3) absences. If you miss class, you must get an absence memo from the Associate Dean of Students Office; also, you are responsible for announcements made in class, assignment due dates, etc.
 * You should turn off all electronic devices (e.g., cell phones and pagers).
 * In summary, you should contribute positively to the classroom learning experience, and respect your classmates' right to learn (see College of Charleston Student Handbook, section on Classroom Code of Conduct (p. 58)).

Late Policy
No late days will be allowed without an excuse. This is an upper-level course, and it will move very fast. Falling behind on assignments will make it difficult to achieve the learning outcomes of this course.

= Homework =
 * Homework 1 - Due January 24th by 5 PM
 * Solutions
 * Homework 2 - Due January 28th by 5 PM
 * Solutions
 * Homework 3 - Due February 10th by 5 PM
 * Solutions
 * AI solutions, which contains solutions to decision tree problems
 * Homework 4 - Due February 28th
 * Homework 5 - Due March 13th at 5 PM

= Schedule =
You are responsible for coming prepared to class. This includes reading through the material before attending class. You will get a lot more out of the lectures and discussions in this manner. It is cliche, but true. The first part of the schedule is built around teaching you the foundations and theory of data mining and machine learning. This will be an intense treatment of difficult material. We will then move on to the application of this material through a lab and programming environment.

Class Time and Location
TR 1:40 - 2:55 PM in 221 JC Long

Week 1 (Jan 9)

 * Chapter 1 - Introduction

Week 2 (Jan 14, 16)

 * Chapter 2 - Linear Discriminants
 * Introduction to Data Mining
 * Vectors and an Introduction to Artificial Neural Networks

Required Reading

 * Chapter 2 - Linear Discriminants

Week 3 (Jan 21, 23)

 * Chapter 3 - Multi-Layer Perceptron
 * Lecture 01-21-2014
 * Lecture 01-23-2014

Required Reading

 * Chapter 3 - Multi-Layer Perceptron

Week 4 (Jan 28, 30)

 * Ice

Week 5 (Feb 4, 6)

 * Chapter 6 - Learning with Trees


 * Lecture 02-04-2014
 * Lecture 02-06-2014

Required Reading

 * Chapter 6 - Learning with Trees

Week 6 (Feb 11, 13)

 * Chapter 7 - Decision by Committee: Ensemble Learning (Ice)
 * Exam on Thursday
 * [[Media:CSCI 334 Spring 2014 Practice Exam.pdf|Practice Exam 1]]

Required Reading

 * Chapter 7 - Decision by Committee: Ensemble Learning

Week 7 (Feb 18, 20)

 * Chapter 8 - Probability and Learning

Required Reading

 * Chapter 8 - Probability and Learning
 * Bayesian Classification

Week 8 (Feb 25, 27)

 * Chapter 9 - Unsupervised Learning
 * k-means, hierarchical clustering
 * Chapter 10 - Dimensionality Reduction
 * PCA
 * Lecture from 02-25-2014

Required Reading

 * Chapter 9 - Unsupervised Learning
 * Chapter 10 - Dimensionality Reduction
 * Principal Component Analysis
 * Hierarchical Clustering
 * Dendrogram

Week 9 (Mar 4, 6)
Spring break

Week 10 (Mar 11, 13)

 * Exhaustive Search
 * Greedy Search
 * Hill Climbing
 * Simulated Annealing
 * Evolutionary Algorithms

Required Reading

 * Chapter 11 - Optimization and Search
 * Chapter 12 - Evolutionary Learning

Week 11 (Mar 18, 20)

 * Introduction to R

Tuesday

 * In-class assignment
 ** Complete this third-party introduction to the R programming language: http://tryr.codeschool.com/
 ** Start on the following lessons
 *** Basic R
 *** R File I/O
 *** R Functions
 *** R Conditionals
 *** R Loops
 ** Complete the end-of-lab assignment
 * Out-of-class assignment (due before class on Thursday)
 ** Finish the following lessons
 *** Basic R
 *** R File I/O
 *** R Functions
 *** R Conditionals
 *** R Loops

Thursday

 * In-class assignment
 ** Start on the following lesson
 *** Empirical Naive Bayes
 ** Complete the end-of-lab assignment
 * Out-of-class assignment (due before class the following Tuesday)
 ** Finish the following lesson
 *** Empirical Naive Bayes

Week 13 (Apr 1, 3)

 * Implementation of Naive Bayes

Week 14 (Apr 8, 10)

 * Implementation of Unsupervised Learning

Week 15 (Apr 15, 17)

 * Implementation of Search and Genetic Algorithm

Week 16 (Apr 22, 24)

 * Last day of class on April 22nd
 * Reading day on April 24th
 * Final exams begin on April 25th

Final Exam
Thursday, May 1st from 12 - 3 PM