Big Data Analytics and Database Design
COMP-4250 Big Data Analytics and Database Design
Objective
The objective of this project is to gain hands-on experience using Weka (a popular data mining
software) to build models from real world datasets. You will also evaluate different data mining
algorithms in terms of accuracy and run time. This project is composed of five different tasks. Each
task is explained in detail below. Note that task 5 is optional and you can get up to an extra 2% for
performing it.
Software
Weka is free software that you can download and install on your machine. A guide on how to use
Weka's Explorer can be found in the attached ExplorerGuide.pdf. Another document about the ARFF
data format can be found in the attached Arff.pdf. Note that Weka is able to handle other data formats
as well (such as csv). However, ARFF is the default format for Weka.
After installing Weka, you can launch it with "java -Xmx256m -jar weka.jar" to set the maximum heap
size for the program. Increase the value 256m if it is not enough. Unless stated otherwise,
use the default parameters of Weka for each classification algorithm.
Dataset: The datasets are provided along with this file on Blackboard.
Submission
• A pdf file that contains your answers to the tasks.
• The programs, if any, that you write for solving the optional task 5, and a readme file showing
how to use these programs.
• Make a zip file from all of the above files, and name the file as LLL_FFF_DDD.zip, in which
LLL, FFF, and DDD are your last name, first name, and student ID, respectively. One
submission per team is enough, and you should name the team members at the top of your
submitted pdf file.
• Submit the zip file on Blackboard before the deadline.
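The naming rule above can be sketched in a few lines of Python. This is only an illustrative helper, not part of the required submission: the function names `submission_name` and `make_submission` and the sample student details are hypothetical.

```python
import zipfile
from pathlib import Path


def submission_name(last: str, first: str, student_id: str) -> str:
    """Build the archive name following the LLL_FFF_DDD.zip pattern."""
    return f"{last}_{first}_{student_id}.zip"


def make_submission(files, last: str, first: str, student_id: str, out_dir: Path) -> Path:
    """Zip the given files (stored flat, without directories) into out_dir."""
    archive = Path(out_dir) / submission_name(last, first, student_id)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            f = Path(f)
            zf.write(f, arcname=f.name)  # keep only the file name inside the zip
    return archive


# Hypothetical student details, for illustration only.
print(submission_name("Doe", "Jane", "123456789"))  # prints Doe_Jane_123456789.zip
```

Calling `make_submission([Path("report.pdf")], "Doe", "Jane", "123456789", Path("."))` would then write the archive into the current directory, assuming a file named report.pdf exists there.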
Task 1 (2%)
Consider the attached lymphography dataset (lymph.arff) that describes 148 patients with 19
attributes. The last attribute is the class attribute, which classifies a patient into one of four categories
(normal, metastases, malign_lymph, and fibrosis). Detailed information about the attributes is given
in lymph_info.txt. The data set is in the ARFF format used by Weka.
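Before loading a file into Weka, it can help to check that its size matches the description. The following is a minimal sketch that counts `@attribute` declarations and data rows in dense ARFF text. It is not a full ARFF parser (it ignores the sparse format, for example), the function name `arff_shape` is hypothetical, and the inline sample is a made-up toy file, not lymph.arff.

```python
def arff_shape(text: str):
    """Return (number of attributes, number of instances) for dense ARFF text."""
    n_attributes = 0
    n_instances = 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lowered = line.lower()  # ARFF keywords are case-insensitive
        if in_data:
            n_instances += 1  # every remaining line is one instance
        elif lowered.startswith("@attribute"):
            n_attributes += 1
        elif lowered.startswith("@data"):
            in_data = True
    return n_attributes, n_instances


# Hypothetical toy file used only to exercise the function.
SAMPLE = """\
% toy example, NOT lymph.arff
@relation toy
@attribute size {small, large}
@attribute weight numeric
@attribute class {yes, no}
@data
small, 1.5, yes
large, ?, no
"""

print(arff_shape(SAMPLE))  # prints (3, 2)
```

If lymph.arff matches the description above, `arff_shape(open("lymph.arff").read())` should report 19 attributes and 148 instances.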
Use the following learning methods (classification algorithms) that are provided in Weka to learn a
classification model from the dataset with all the attributes:
• C4.5 (weka.classifiers.trees.J48)
• RIPPER (weka.classifiers.rules.JRip)
For each learning method, report only the classification model learned from the dataset. That is,
copy and paste the “Classifier model (full training set)” from the Weka output to your report. For C4.5,
it would be “J48 pruned tree”. For RIPPER, it would be “JRIP rules:”.
Task 2 (5%)
You are given a training dataset (monks-train.arff) and a test dataset (monks-test.arff) in which each
training example is represented by seven nominal (categorical) attributes. The last attribute is the
class attribute that classifies each data point into one of the two classes (0 and 1). The attribute
information is given below:
Attribute    Possible Values
A1           1, 2, 3
A2           1, 2, 3
A3           1, 2
A4           1, 2, 3
A5           1, 2, 3, 4
A6           1, 2
class        0, 1
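As a quick sanity check on the table above, the number of distinct attribute combinations (ignoring the class) is the product of the domain sizes. The dictionary name `ATTRIBUTE_VALUES` is only an illustrative choice; the values are copied from the table.

```python
from math import prod

# Possible values of each non-class attribute, taken from the table.
ATTRIBUTE_VALUES = {
    "A1": [1, 2, 3],
    "A2": [1, 2, 3],
    "A3": [1, 2],
    "A4": [1, 2, 3],
    "A5": [1, 2, 3, 4],
    "A6": [1, 2],
}

# 3 * 3 * 2 * 3 * 4 * 2 distinct attribute combinations
n_combinations = prod(len(values) for values in ATTRIBUTE_VALUES.values())
print(n_combinations)  # prints 432
```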
Use the following learning methods provided in Weka to learn a classification model from the
training dataset and test the model on the test dataset:
• C4.5 (weka.classifiers.trees.J48)
• RIPPER (weka.classifiers.rules.JRip)
• k-Nearest Neighbor (weka.classifiers.lazy.IBk)
• Naive Bayesian Classification (weka.classifiers.bayes.NaiveBayes)
• Neural Networks (weka.classifiers.functions.MultilayerPerceptron)
Note that you have to use the “Supplied test set” option in the “Test options” box of Weka and pass
the test data file (monks-test.arff) to Weka.
Report the classification summary, classification accuracy, and confusion matrix of each algorithm
on the test dataset. In other words, copy and paste the “Summary”, “Detailed Accuracy By Class”, and
“Confusion Matrix” from Weka output to your report. Also, briefly discuss your results in terms
of accuracy.
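When discussing the results, it may help to see how the reported figures relate to the confusion matrix. The sketch below assumes the layout Weka prints, with rows as actual classes and columns as predicted classes. The counts in `EXAMPLE` are hypothetical and are not results for the monks data; the function names are illustrative.

```python
def accuracy(matrix):
    """Fraction of correctly classified instances (the diagonal over the total)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, k):
    """True positive rate of class k: correct predictions over actual members of k.

    Assumes at least one instance actually belongs to class k.
    """
    return matrix[k][k] / sum(matrix[k])


def precision(matrix, k):
    """Correct predictions of class k over all instances predicted as k.

    Assumes at least one instance was predicted as class k.
    """
    return matrix[k][k] / sum(row[k] for row in matrix)


# Hypothetical counts: rows = actual class (0, 1), columns = predicted class (0, 1).
EXAMPLE = [
    [40, 10],
    [5, 45],
]

print(accuracy(EXAMPLE))      # 85 of 100 correct
print(recall(EXAMPLE, 0))     # 40 of the 50 actual class-0 instances
print(precision(EXAMPLE, 0))  # 40 of the 45 instances predicted as class 0
```

The accuracy corresponds to the “Correctly Classified Instances” percentage in the Summary, and the per-class recall corresponds to the TP Rate column of “Detailed Accuracy By Class”.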
Task 3 (3%)
You are given a dataset on credit card application approval (credit.arff) in the ARFF format. The
dataset describes 690 customers with 16 attributes. The last attribute is the class attribute describing
whether the customer's application was approved or not. The dataset contains both symbolic and
continuous attributes. Some of the continuous attributes contain missing values (which are marked
by "?"). All attribute names and values have been changed to meaningless symbols to protect
confidentiality of the data.
Randomly split the dataset into a training set (70%) and a test set (30%). This can be done using the
"Percentage split" option in the “Test options” box of Weka's "Classify" section (set the number to 70).
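To make the idea of a random 70/30 split concrete, here is a minimal sketch in plain Python. It illustrates the concept only and does not reproduce Weka's own shuffling, so the instances it selects will differ from Weka's. The function name `percentage_split` and the seed value are assumptions made for the example.

```python
import random


def percentage_split(instances, train_percent=70.0, seed=1):
    """Shuffle the instances and split them into (training set, test set)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 690 customers: instance indices 0..689.
train, test = percentage_split(range(690))
print(len(train), len(test))  # prints 483 207
```

With 690 instances, 70% gives 483 training examples and leaves 207 for testing, so the accuracy reported for this task is measured on those 207 held-out examples.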
Apply each of the following classification algorithms to learn a classification model from the
training set and classify the examples in the test set.
• C4.5 (weka.classifiers.trees.J48)
• Naive Bayesian Classification (weka.classifiers.bayes.NaiveBayes)
• Neural Networks (weka.classifiers.functions.MultilayerPerceptron)
Report the classification accuracy of each learning algorithm on the test dataset. In other words,
copy and paste the “Summary”, “Detailed Accuracy By Class”, and “Confusion Matrix” from Weka
output to your report.
Note that C4.5, Naive Bayesian Classification, and Neural Networks can automatically handle both
symbolic and continuous attributes as well as missing values of continuous attributes. Therefore,
you do not need to do any extra preprocessing on the data and can directly run the above learning
algorithms on the input dataset (credit.arff).