data mining studies the algorithms and computational paradigms that allow computers to discover knowledge and perform decision automatically using large and complex datasets. in this course, we will explain the fundamental principles, technical details, and real life applications of data mining techniques through lectures, case studies, and course projects. the core topics to be covered include data preproprocessing, classification, cluster analysis, association analysis, anomaly detection, neural networks, model evaluation, and applications like recommernder systems. the learning goals are to think sysematically of how data mining can solve analytical problems, to make better infromed decisions using various real-world data. as well as understand data mining process, algorithm development, and system deign to build a pathway to the career of a data scientist.
august 21 2023 introduction to data mining
covers what is the challenge and why should we do data mining. what is data mining: definition, process, and examples. data mining tasks: classification, clustering, association rules, anomaly detection
august 25 2023 data
- reading chapter 2.1 types data
reading chapter 2.2 data quality
covers continue on introduction of association rules, anomaly detection
august 28 2023 data
reading
chapter 2.3 data preprocessing
chapter 2.4 measures of similarity and dissimilarity
covers
- data preprocessing - aggregation, samping, dimensionality reduction, feature selection, feature creature, discretization and binarization, attribute transformation
september 01 2023 data
reading
chapter 2.4 measures of similarity and dissimilarity
covers
- general distance and similarity - euclidean and monkowski distance. smc / jaccard and cosine similarity, pearson's correlation. information based measures - mutual information, kl-divergence. ranking distance. distance of sets of data point
september 04 2023 decision tree classifier
reading
chapter 3.1 basic concepts
chapter 3.2 general framework for classification
chapter 3.3 decision tree classifier
covers
- supervised learning setup - regression, classification. decision tree - goal, algorithm, split, criterion, node impurity (gini index)
september 08 2023 decision tree classifier
- chapter 3.4 olap and multidimensional data analysis
- chapter 3.5 model selection
- decision tree - node impurity (entrypy, misclassification error)
- split for continuous attributes
- model overfitting - definition, example, generalization error, model pruning
september 11 2023 evaluation metrics
- python and pandas tutorial code
- chapter 3.6 model evaluation
- chapter 3.7 presence of hyper parameters
- chapter 3.8
- evaluation and metrics for classification models - confusion matrix, accuracy, precision, recall, F-measure, roc, auc, corss-validation
- tutorial of developing data mining project and setting up python enviroment using conda
september 15 2023 logistic regression
- scikit-learn tutorial: code
- chapter 4.6
- linear regression review - model, sum of squared errors loss function, gradient descent optimization
- logistic regression classifier - model, cross-entropy loss function, regularization
september 18 2023 naive bayes classifier
september 22 2023 bayesian network, k-nearest neighbor classifiers
september 25 2023 support vector machine (SVM)
september 29 2023 neural networks
october 02 2023 neural networks
october 06 2023 midterm review
october 09 2023 midterm exam (part 1)
october 13 2023 midterm exam (part 2)
october 16 2023 no class due to fall break
october 20 2023 neural networks
october 23 2023 clustering & k-means
october 27 2023 hierarchical clustering
october 30 2023 DBSCAN & cluster evaluation
november 03 2023 association rule mining
november 06 2023 association rule mining
november 10 2023 association rule mining
november 13 2023 association rule mining
november 17 2023 anomaly detection
november 20 2023 deep learning framework
november 24 2023 no class due to thanksgiving break
november 27 2023 ensemble methods and boosting
december 01 2023 final review
december 06 2023 data mining application: recommender systems
december 08 2023 stop day
- why big data? what is big data mining? why data mining? data mining processes, relation to budiness intelligence techniques
- introduction to data mining tasks (classification, clustering, association analysis, anomaly detection). what is a model? basic terminologies, predictive modeling
- real world data mining applications
- understanding of data, what is data? type of attributes, properties of attribute values, types of data, data quality
- sampling, data normalization, data cleaning, similarity measures
- feature selection / instance selection, the importance of feature selection / instance selectin in various big data scenarios
- decision-tree based approach (e.g. C4.5)
- rule-based approach (e.g. Ripper)
- instance based classifiers (e.g. k-nearest neighbor)
- support vector machines (svms)
- ensemble learning
- classification model selection and evaluation
- applications: b2b customer buying stage prediction, recommender systems
- apriori algorithm and its extensions
- association pattern evaluation
- sequantial patterns and frequent subgraph mining
- applications: b2b customer buying path analysis, medical informatics, telecommunication alarm diagnosis
- partitional and hierarchical clustering methods
- graph based methods
- density based methods
- cluster validation
- applications: customer profiling, market segmentation
- statistical based and density based methods
- neorons and network topology
- multi layer feed forward network
- big data analytics in mobile enviroments
- fraud detection and prevention with data mining techniques
- big data analystics in real world business
| unit | weight | 
|---|---|
| exam 1 | 20% | 
| exam 2 | 20% | 
| programming project | 20% | 
| assignments | 30% | 
| attendance | 10% |