COSC 285: Introduction to Data Mining
Task: Neural Network Classification
Assignment Description:
Your assignment is to build a multilayer neural network classifier using feedforward
backpropagation, as described in the class lectures and in your textbook. Because this is a supervised
algorithm, you need a training data set to build your model. Use the provided data set, adult.arff,
for this assignment. This file contains all of the data samples that you will train and test on, using
5-fold cross-validation (you may use 10-fold cross-validation instead, but state this in your report).
Your classifier should classify the records based on the Income attribute. (For the implementation
you may use various libraries; consult the FAQ file and/or ask on the discussion board to confirm
what you can and cannot use.)
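The 5-fold (or 10-fold) cross-validation described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name k_fold_indices and the fixed seed are assumptions made for the example, not part of the assignment.

```python
import random

def k_fold_indices(n_records, k=5, seed=0):
    """Return k (train_idx, test_idx) pairs; every record is a test record exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    # Fold i takes every k-th shuffled index starting at i, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits

# Hypothetical example: 23 records split into 5 folds.
splits = k_fold_indices(n_records=23, k=5)
for run, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"run {run}: {len(train_idx)} training records, {len(test_idx)} test records")
```

Each run trains on four folds and tests on the remaining one, which gives one row of results per run.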
Pre-processing:
• Replace missing values with the per-class mean (numeric attributes) or mode (categorical
attributes), as in the previous project.
• Normalize the numeric data using Z-score normalization (you can use any library for
this).
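The two pre-processing steps above might look like the sketch below. It assumes records are held as dictionaries and that missing values are marked with "?"; the helper names, the sample rows, and the choice of the population standard deviation are all assumptions for illustration.

```python
from collections import Counter
from statistics import mean, pstdev

MISSING = "?"  # assumption: marker used for missing values in the data file

def impute_per_class(rows, col, label_col, numeric):
    """Fill missing entries in `col` with the mean (numeric) or mode (categorical)
    of the records that share the same class label. Modifies rows in place."""
    observed = {}
    for row in rows:
        if row[col] != MISSING:
            value = float(row[col]) if numeric else row[col]
            row[col] = value
            observed.setdefault(row[label_col], []).append(value)
    # Note: a class whose values are all missing would need a fallback (not handled here).
    fill = {label: (mean(vals) if numeric else Counter(vals).most_common(1)[0][0])
            for label, vals in observed.items()}
    for row in rows:
        if row[col] == MISSING:
            row[col] = fill[row[label_col]]
    return rows

def z_score(values):
    """Z-score normalization: (v - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (a design choice)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical records, not taken from adult.arff.
rows = [
    {"age": "30", "workclass": "Private", "income": ">50K"},
    {"age": "?", "workclass": "?", "income": ">50K"},
    {"age": "50", "workclass": "Private", "income": ">50K"},
    {"age": "20", "workclass": "State-gov", "income": "<=50K"},
]
impute_per_class(rows, "age", "income", numeric=True)
impute_per_class(rows, "workclass", "income", numeric=False)
normalized_age = z_score([row["age"] for row in rows])
```

Under cross-validation it is usually safer to compute the means, modes, and standard deviations on the training folds only and then apply them to the test fold.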
Here are the architecture and requirements of your system:
• Network Topology – create a fully connected network by connecting each
input node to each hidden layer node, and each hidden layer node to each output
node.
• Input Layer: For categorical attributes, you may either create one node per distinct value
or group the domain values (using domain knowledge or entropy) and then create one node per
group. For each numeric attribute, create only one input node. (For
the age attribute, you may choose to handle it as numeric or to discretize it into groups.)
• Hidden Layer: Create one hidden layer; the number of nodes in this layer is a parameter
to your program, so that you can run experiments and observe how the classification
results change with this parameter (see the table for parameters!).
• Output Layer: Your choice of using either one or two nodes – specify and explain your
choice in your design document.
• Learning Rate: This is also a parameter to your program (see table for parameters!)
• Epochs: The number of iterations or Epochs is also a parameter to your program (see table!).
• Arc weights and biases: Initialize each to a random number in the range [-0.5, 0.5].
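The requirements above (one input node per categorical value, one hidden layer, weights and biases drawn from [-0.5, 0.5], learning rate and epochs as parameters) can be put together in a small sketch. It uses only the Python standard library and makes several assumptions that the assignment does not fix: sigmoid activations, squared-error deltas, per-record weight updates, and mean absolute error as the per-node error figure. All names and the toy data are hypothetical.

```python
import math
import random

def one_hot(value, domain):
    """One input node per distinct categorical value."""
    return [1.0 if value == v else 0.0 for v in domain]

def sigmoid(x):
    # Numerically stable form: never calls exp() on a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

class OneHiddenLayerNet:
    def __init__(self, n_inputs, n_hidden, n_outputs=1, seed=0):
        rng = random.Random(seed)
        def init():
            return rng.uniform(-0.5, 0.5)  # range required by the assignment
        self.w_hidden = [[init() for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.b_hidden = [init() for _ in range(n_hidden)]
        self.w_out = [[init() for _ in range(n_hidden)] for _ in range(n_outputs)]
        self.b_out = [init() for _ in range(n_outputs)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(self.w_hidden, self.b_hidden)]
        output = [sigmoid(sum(w * h for w, h in zip(ws, hidden)) + b)
                  for ws, b in zip(self.w_out, self.b_out)]
        return hidden, output

    def train_one(self, x, target, lr):
        """One backpropagation step for a single record; returns per-node absolute error."""
        hidden, output = self.forward(x)
        # Output deltas for squared error with sigmoid units.
        d_out = [(t - o) * o * (1 - o) for t, o in zip(target, output)]
        # Hidden deltas use the output weights as they were before this update.
        d_hid = [h * (1 - h) * sum(d_out[k] * self.w_out[k][j] for k in range(len(d_out)))
                 for j, h in enumerate(hidden)]
        for k, dk in enumerate(d_out):
            for j, h in enumerate(hidden):
                self.w_out[k][j] += lr * dk * h
            self.b_out[k] += lr * dk
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(x):
                self.w_hidden[j][i] += lr * dj * xi
            self.b_hidden[j] += lr * dj
        return [abs(t - o) for t, o in zip(target, output)]

def train(net, data, epochs, lr):
    """Return, for every epoch, the mean absolute error at each output node."""
    history = []
    for _ in range(epochs):
        totals = [0.0] * len(data[0][1])
        for x, target in data:
            errors = net.train_one(x, target, lr)
            totals = [a + e for a, e in zip(totals, errors)]
        history.append([t / len(data) for t in totals])
    return history

# Toy usage: one numeric input plus a two-valued categorical attribute.
domain = ["Private", "State-gov"]
data = [([0.5] + one_hot("Private", domain), [1.0]),
        ([-0.5] + one_hot("State-gov", domain), [0.0])]
net = OneHiddenLayerNet(n_inputs=3, n_hidden=4)
history = train(net, data, epochs=200, lr=0.5)
print("error after first epoch:", history[0], "after last epoch:", history[-1])
```

With n_outputs=2 the targets would be encoded as [1.0, 0.0] and [0.0, 1.0], one node per income class; with a single node, one class maps to 1.0 and the other to 0.0. Whichever you choose is the decision to explain in the design document.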
Runs and Results:
You may design your own set of runs and experiments. Experiment with
various learning rates and numbers of hidden layer nodes to evaluate their effect. Try widely
differing values for the parameters so that you can make observations and analyze them. Collect the
statistics to report and analyze. Here is a suggested sample of what you can report. Make sure you
provide a good analysis of the statistics you collect:
Run (fold) #: _____
Results table columns (one row per run):
• # of Epochs (report whether the number was fixed or, if training stopped on a termination
condition, the number of iterations)
• Learning Rate
• # of Hidden Layer Nodes
• Error @ output node 1 (after first epoch)
• Error @ output node 1 (after last epoch)
• Error @ output node 2 (after first epoch)
• Error @ output node 2 (after last epoch)
• Quality evaluation (Precision, Recall, F1)
• Training Time
Sample entry: 1/t (t is the iteration number)
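If the 1/t entry in the table is read as a learning rate that decays with the iteration number (an assumption; the table does not say which column it belongs to), it could be computed as in this short sketch. The function name and the initial parameter are hypothetical.

```python
def decaying_learning_rate(t, initial=1.0):
    """Learning rate for iteration t (1-based): initial / t."""
    if t < 1:
        raise ValueError("t must be 1 or greater")
    return initial / t

# Rates for the first four iterations.
rates = [decaying_learning_rate(t) for t in range(1, 5)]
```

A schedule like this takes large steps early and progressively smaller ones later, which can be compared against fixed learning rates in the experiments.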
REQUIRED: Add additional rows to this table to report results when varying the number of nodes, learning rate, and number of
epochs. Plan your experiments carefully so that they support a meaningful analysis.
When writing your analysis, also compare your results with those of your project-1 (Naïve
Bayes) and with the majority class baseline.
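The quality measures in the table (precision, recall, F1) and the majority class baseline can be computed as in the sketch below. The function names and the sample labels are illustrative assumptions; which income class counts as "positive" is a choice you should state in your report.

```python
from collections import Counter

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def majority_baseline(train_labels, n_test):
    """Predict the most frequent training label for every test record."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

# Hypothetical labels, not real results.
y_true = [">50K", "<=50K", ">50K", "<=50K", "<=50K"]
y_pred = [">50K", ">50K", "<=50K", "<=50K", "<=50K"]
scores = precision_recall_f1(y_true, y_pred, positive=">50K")
baseline_pred = majority_baseline(y_true, n_test=len(y_true))
baseline_scores = precision_recall_f1(y_true, baseline_pred, positive=">50K")
```

Reporting the baseline with the same measures shows whether the network does more than predict the most frequent class.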
Deliverables:
Cover page (1 pt): should contain the following, in the exact order specified:
a. Status of this assignment: Complete or Incomplete. If incomplete, state clearly what is
incomplete.
b. Time spent on this assignment (approximate number of hours).
c. Things you wish you had been told prior to being given the assignment.
Design (10 pts): No code should be included in the design document. No specific template is
provided. You may draw diagram(s) to show the architecture and the flow of your software
components, and/or provide a write-up of your design decisions.
Your working system, Results & Analysis (89 pts): A working system that satisfies the
requirements is expected. Your results should support a good analysis of your classifier, so
provide that analysis along with the results. Your results are based on 5-fold (or 10-fold)
cross-validation, which means you need to report the results for each of the 5 (or 10) runs.
Include a text output file named adult.out that contains all of the runtime information for each fold;
this is mandatory. For each of the runs, provide the specified information in the provided
table. You may be asked to give a demo showing that all the requirements are
implemented and functional, and to answer related questions.
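One possible way to measure training time and write the per-fold runtime information to adult.out is sketched below. The assignment does not prescribe a file layout, so the field names and the format here are hypothetical.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

REPORT_FIELDS = ("epochs", "learning_rate", "hidden_nodes",
                 "precision", "recall", "f1", "training_time_s")

def write_report(path, fold_records):
    """Write one block of runtime information per fold (hypothetical layout)."""
    with open(path, "w", encoding="utf-8") as out:
        for rec in fold_records:
            out.write(f"Fold {rec['fold']}\n")
            for key in REPORT_FIELDS:
                out.write(f"  {key}: {rec[key]}\n")
```

Calling write_report("adult.out", records) with one dictionary per fold would produce the required file; the training call for each fold can be wrapped in timed(...) to fill in the training time.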