Datafile: Download the SCADI dataset (.csv).
Data Description: This dataset contains 206 attributes for 70 children with physical and motor disabilities,
based on ICF-CY. For more information, click this link.
1. Determine the number of subgroups in the dataset using attributes 3 to 205, i.e., exclude attributes 1,
2 and 206. Is this number the same as the number of classes given by attribute 206? Explain and justify
your findings. 4 marks
2. Does this dataset suffer from the curse of dimensionality? If so, how can this problem be solved? Explain
with a two-dimensional plot and report the associated loss of information. 4 marks
3. After applying principal component analysis (PCA) to a given dataset, it was found that the percentage
of variance explained by the first N components is X%. How is this percentage of variance computed? 2 marks
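As a rough illustration of the quantity asked about in question 3, the sketch below computes the percentage of variance captured by the first N principal components as the sum of the N largest covariance-matrix eigenvalues divided by the sum of all eigenvalues. The function name and the eigenvalues are hypothetical and chosen only for illustration; they are not taken from the SCADI data.

```python
def explained_variance_percentage(eigenvalues, n_components):
    """Percentage of total variance captured by the first n_components.

    `eigenvalues` are the eigenvalues of the data's covariance matrix,
    i.e. the variance along each principal component.
    """
    # PCA orders components by decreasing variance.
    ordered = sorted(eigenvalues, reverse=True)
    if not 0 <= n_components <= len(ordered):
        raise ValueError("n_components must be between 0 and len(eigenvalues)")
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return 100.0 * sum(ordered[:n_components]) / total


# Hypothetical eigenvalues, chosen only for illustration.
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
first_two = explained_variance_percentage(example_eigenvalues, 2)
print(f"First 2 components explain {first_two:.1f}% of the variance")
```

Under the same assumptions, the information lost by keeping only two components (as in a two-dimensional plot) would be 100 minus this percentage.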
Background
Obesity has become a global epidemic whose prevalence has doubled since 1980, with serious consequences for
health in children, teenagers, and adults. Obesity levels in individuals may relate to their eating habits and
physical condition. In this assessment, you will analyse a given dataset that contains attributes of individuals
related to obesity levels, and create ML models based on it.
Dataset filename: obesity_levels.csv
Dataset description: This dataset includes data for the estimation of obesity levels in individuals based on their
eating habits and physical condition. The data contains 17 attributes and 2111 records.
Features and labels: The attribute names are listed below. The description of the attributes can be found in this
article (web-link).
I. Gender
II. Age
III. Height
IV. Weight
V. family_history_with_overweight (family history of overweight)
VI. FAVC (frequent high caloric food)
VII. FCVC (vegetables per meal)
VIII. NCP (number of main meals per day)
IX. CAEC (any food between meals)
X. SMOKE (smoking)
XI. CH2O (daily water intake)
XII. SCC (calorie consumption monitoring)
XIII. FAF (frequency of physical activity)
XIV. TUE (technology usage)
XV. CALC (consumption of alcohol)
XVI. MTRANS (means of transport)
XVII. NObeyesdad (obesity levels, i.e. Insufficient Weight, Normal Weight, Overweight Level I, Overweight
Level II, Obesity Type I, Obesity Type II and Obesity Type III)
4. Create a machine learning (ML) model for predicting “Weight” using all features except “NObeyesdad”
and report the observed performance. Explain your results based on the following criteria:
a. What model have you selected for solving this problem and why?
b. Have you made any assumptions about the target variable? If so, why?
c. What have you done with text variables? Explain.
d. Have you optimised any model parameters? What is the benefit of this action?
e. Have you applied any steps to handle overfitting or underfitting? If so, what are they?
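For part (c), one common way to handle text variables is one-hot encoding, sketched below in plain Python so that the transformation is visible. This is only one option (ordered categories may be better served by ordinal encoding). The helper name and the sample values are hypothetical; only the column names come from the attribute list above.

```python
def one_hot_encode(records, categorical_columns):
    """Replace each categorical column with one 0/1 column per category."""
    # Collect the sorted set of categories seen in each column.
    categories = {
        col: sorted({row[col] for row in records})
        for col in categorical_columns
    }
    encoded = []
    for row in records:
        # Keep the non-categorical (numeric) columns unchanged.
        new_row = {k: v for k, v in row.items() if k not in categorical_columns}
        for col, values in categories.items():
            for value in values:
                new_row[f"{col}_{value}"] = 1 if row[col] == value else 0
        encoded.append(new_row)
    return encoded


# Tiny hypothetical sample; the values are invented for illustration.
sample = [
    {"Gender": "Female", "Age": 21.0, "MTRANS": "Walking"},
    {"Gender": "Male", "Age": 23.0, "MTRANS": "Automobile"},
    {"Gender": "Male", "Age": 27.0, "MTRANS": "Walking"},
]
encoded_sample = one_hot_encode(sample, ["Gender", "MTRANS"])
print(encoded_sample[0])
```

In practice, the categories would be learned from the training split only and then reused for the test split, so that both share the same columns.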
5. Create an ML model for classifying subjects into two classes, applying the following constraints to the
above dataset. 12 marks
• Use “NObeyesdad” as the target variable and the remaining attributes as predictor variables.
• Drop samples with the value “Insufficient Weight” for “NObeyesdad”.
• Group Normal Weight, Overweight Level I, and Overweight Level II into one class, and the other three
labels (Obesity Type I, II, III) into the other class.
a. Report classification performance scores. Select the scores that you think best describe the model's
performance, with appropriate justification.
b. Have you taken any steps to check the generalisability of the model? What are they, and how do they
ensure generalisability?
c. Can you design and develop any other model for solving this problem? If so, why did you use
the reported one? Justify your answer.
N.B. Using multiple models to compare results will increase your chances of getting higher
marks. This part is for students who are targeting HD – Higher Distinction.
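The constraints in question 5 and the scores in part (a) can be sketched as follows, assuming the label spellings used in the attribute list above (the CSV file may spell them differently, for example with underscores). The function names, the sample labels, and the predictions are hypothetical; no classifier is trained here.

```python
# Label spellings follow the attribute list; adjust them to match the CSV.
NON_OBESE = {"Normal Weight", "Overweight Level I", "Overweight Level II"}
OBESE = {"Obesity Type I", "Obesity Type II", "Obesity Type III"}
DROPPED = "Insufficient Weight"


def to_binary_targets(labels):
    """Drop 'Insufficient Weight' and map the rest to 0 (non-obese) or 1 (obese).

    Returns (kept_indices, binary_labels) so feature rows can be filtered too.
    """
    kept, binary = [], []
    for i, label in enumerate(labels):
        if label in NON_OBESE:
            kept.append(i)
            binary.append(0)
        elif label in OBESE:
            kept.append(i)
            binary.append(1)
        elif label != DROPPED:
            raise ValueError(f"unexpected label: {label!r}")
    return kept, binary


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels and predictions, invented for illustration.
labels = ["Normal Weight", "Insufficient Weight", "Obesity Type I",
          "Overweight Level II", "Obesity Type III"]
kept, y_true = to_binary_targets(labels)
y_pred = [0, 1, 1, 1]  # stand-in for a classifier's output on the kept rows
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting precision, recall and F1 alongside accuracy is one reasonable choice when the two grouped classes are not equally sized; computing them on held-out folds rather than on the training data is one way to address part (b).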