HW4Anom
Question 1 (40 points)
In this question, you will model traffic counts in Pittsburgh using Gaussian process (GP) regression.
The included dataset, "PittsburghTrafficCounts.csv", represents the average daily traffic counts
computed by traffic sensors at over 1,100 locations in Allegheny County, PA. The data was collected
from years 2012-2014 and compiled by Carnegie Mellon University’s Traffic21 Institute; we have the
longitude, latitude, and average daily count for each sensor.
Given this dataset, your goal is to learn a model of traffic count as a function of spatial location. To
do so, fit a Gaussian Process regression model to the observed data. While you can decide on the
precise kernel specification, you should try to achieve a good model fit, as quantified by a log
marginal likelihood value greater than (i.e., less negative than) -1400. Here are some hints for
getting a good model fit (a sketch illustrating these hints appears after the data preview below):
We recommend that you take the logarithm of the traffic counts, and then subtract the mean
of this vector, before fitting the model.
Since the data is noisy, don't forget to include a noise term (WhiteKernel) in your model.
When fitting a GP with RBF kernel on multidimensional data, you can learn a separate length
scale for each dimension, e.g., length_scale=(length_scale_x, length_scale_y).
Your Python code should provide the following five outputs:
1) The kernel after parameter optimization and tting to the observed data. (10 pts)
2) The log marginal likelihood of the training data. (5 pts)
3) Show a 2-D plot of the model's predictions over a mesh grid of longitude/latitude (with color
corresponding to the model's predictions) and overlay a 2-D scatter plot of sensor locations (with
color corresponding to the observed values). (10 pts)
4) What percentage of sensors have average traffic counts more than two standard deviations
higher or lower than the model predicts given their spatial location? (5 pts)
5) Show a 2-D scatter plot of the sensor locations, with three colors corresponding to observed
values a) more than two standard deviations higher than predicted, b) more than two standard
deviations lower than predicted, and c) within two standard deviations of the predicted values. (10
pts)
MLC HW 4
import pandas as pd
import numpy as np

# Mount Google Drive so the traffic-count CSV can be loaded in the Colab environment
from google.colab import drive
drive.mount('/content/gdrive')
Longitude Latitude AvgDailyTrafficCount log
0 -80.278366 40.468606 84.0 4.430817
1 -80.162117 40.384598 95.0 4.553877
2 -80.221205 40.366778 97.0 4.574711
3 -80.142455 40.622084 111.0 4.709530
4 -80.131975 40.544915 125.0 4.828314
... ... ... ... ...
1110 -79.843684 40.498619 13428.0 9.505097
1111 -79.926842 40.425383 13713.0 9.526100
1112 -80.065730 40.397582 13822.0 9.534017
1113 -79.863848 40.429878 14172.0 9.559023
1114 -79.848609 40.479233 14891.0 9.608512
1115 rows × 4 columns
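The following is a minimal sketch of one way to follow the hints above, building on the imports in the cell above and assuming the dataframe shown is named df with the columns Longitude, Latitude, and AvgDailyTrafficCount; the initial kernel hyperparameters, bounds, and number of optimizer restarts are illustrative assumptions, not values required by the assignment.

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

# Inputs: sensor coordinates; target: mean-centered log counts (per the hints above)
X = df[['Longitude', 'Latitude']].values
y_log = np.log(df['AvgDailyTrafficCount'].values)
y = y_log - y_log.mean()

# Anisotropic RBF (one length scale per dimension) plus a WhiteKernel noise term
kernel = (ConstantKernel(1.0) *
          RBF(length_scale=(0.1, 0.1), length_scale_bounds=(1e-3, 1e2)) +
          WhiteKernel(noise_level=0.5))

gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gp.fit(X, y)

# Outputs 1 and 2: optimized kernel and log marginal likelihood of the training data
print(gp.kernel_)
print(gp.log_marginal_likelihood_value_)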
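For outputs 3-5, one possible continuation of the same sketch (reusing X, y, and gp from above) evaluates the model on a longitude/latitude mesh grid and flags sensors whose observations fall more than two predictive standard deviations from the model's mean; the grid resolution and figure styling here are assumptions.

import matplotlib.pyplot as plt

# Output 3: model predictions over a mesh grid, with sensor observations overlaid
lon = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
lat = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
LON, LAT = np.meshgrid(lon, lat)
pred = gp.predict(np.column_stack([LON.ravel(), LAT.ravel()])).reshape(LON.shape)

plt.pcolormesh(LON, LAT, pred, shading='auto')
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', s=15)
plt.colorbar(label='mean-centered log traffic count')
plt.xlabel('Longitude'); plt.ylabel('Latitude')
plt.show()

# Outputs 4 and 5: sensors more than two predictive standard deviations from the model
mu, sigma = gp.predict(X, return_std=True)
z = (y - mu) / sigma
print('Percent beyond two standard deviations:', 100.0 * np.mean(np.abs(z) > 2))

colors = np.where(z > 2, 'red', np.where(z < -2, 'blue', 'gray'))
plt.scatter(X[:, 0], X[:, 1], c=colors, s=15)
plt.xlabel('Longitude'); plt.ylabel('Latitude')
plt.show()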
Question 2: Cluster-based anomaly detection (10 points)
Given an unlabeled dataset with two real-valued attributes, we perform cluster-based anomaly
detection by running k-means, choosing the number of clusters k automatically using the Schwarz
criterion. Four clusters are formed:
A: 100 points, center (0, 0), standard deviation 0.1
B: 150 points, center (35, 5), standard deviation 5
C: 2 points, center (15, 20), standard deviation 1
D: 200 points, center (10, 10), standard deviation 1
Given the four points below, which of these points are, and are not, likely to be anomalies? Choose
“Anomaly” or “Not Anomaly”, and provide a brief explanation, for each point. (Hint: your answers
should take into account the size and standard deviation of each cluster as well as the distances to
cluster centers.)
(1, 0) Anomaly / Not Anomaly
(35, 2) Anomaly / Not Anomaly
(15, 19) Anomaly / Not Anomaly
(10, 11) Anomaly / Not Anomaly
Your answer here
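As scaffolding for the explanations (not the answers themselves), the sketch below computes each point's distance to every cluster center in units of that cluster's standard deviation; the centers and standard deviations are copied from the question, and note that this number alone ignores cluster size (e.g., cluster C has only 2 points), which the hint says should also be considered.

import numpy as np

# Cluster centers and standard deviations for clusters A-D (copied from the question)
centers = np.array([[0, 0], [35, 5], [15, 20], [10, 10]])
stds = np.array([0.1, 5.0, 1.0, 1.0])
points = np.array([[1, 0], [35, 2], [15, 19], [10, 11]])

# Euclidean distance from each query point to each center, divided by that cluster's std
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
print((dists / stds).round(1))   # rows = query points, columns = clusters A-D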
Question 3: Anomaly detection (50 points)
For this question, use the "County Health Indicators" dataset provided to identify the most
anomalous counties. Please list the top 5 most anomalous counties computed using each of the
following models. (We recommend that, as a pre-processing step, you drop NA values and make
sure all numeric values are treated as floats, not strings.)
Part 1: Learn a Bayesian network structure using only the six features ["'% Smokers'","'%
Obese'","'Violent Crime Rate'","'80/20 Income Ratio'","'% Children in Poverty'","'Average Daily PM2.5'"].
Use pd.cut() to discretize each feature into 5 categories: 0,1,2,3,4.
(a) Use HillClimbSearch and BicScore to learn the Bayesian network structure (5 pts)
(b) Which 5 counties have the lowest (most negative) log-likelihood values? Please show a ranked
list of the top counties' names and log-likelihood values. (10 pts)
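A minimal sketch of Part 1(a), assuming the cleaned County Health Indicators dataframe is named health (that name, and dropping NAs beforehand, are assumptions); depending on the pgmpy version, the scoring method may need to be passed to the HillClimbSearch constructor instead of estimate().

import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

features = ["'% Smokers'", "'% Obese'", "'Violent Crime Rate'",
            "'80/20 Income Ratio'", "'% Children in Poverty'", "'Average Daily PM2.5'"]

# Discretize each feature into 5 categories labeled 0-4 with pd.cut()
disc = health[features].copy()
for col in features:
    disc[col] = pd.cut(disc[col], bins=5, labels=[0, 1, 2, 3, 4]).astype(int)

# Part 1(a): hill-climbing structure search scored by BIC
hc = HillClimbSearch(disc)
dag = hc.estimate(scoring_method=BicScore(disc))
print(sorted(dag.edges()))

For Part 1(b), one approach (an assumption, not the only option) is to fit a Bayesian network on the learned edges, score each county by summing the log of each feature's CPD probability given its parents, and rank the most negative totals.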
Part 2: Cluster based anomaly detection. Use all numeric features for this part, and do not
discretize.
(a) Clustering with k-means. Please use k=3 clusters. Compute each record's distance to the
nearest cluster center and report the five counties with the largest distances. (10 pts)
(b) Clustering with a Gaussian Mixture. Please repeat 2(a), but use the log-likelihood of each record (rather
than distance) as the measure of anomalousness. (10 pts)
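A sketch of Part 2 under the same assumptions (dataframe named health, a County name column, and standardizing the numeric features, which the prompt does not require but often helps distance-based methods):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# All numeric features, standardized (scaling is an assumption, not required by the prompt)
X = StandardScaler().fit_transform(health.select_dtypes(include=[np.number]))

# Part 2(a): distance of each county to the nearest of k=3 k-means centers
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = km.transform(X).min(axis=1)
print(health['County'].iloc[np.argsort(dist)[::-1][:5]])   # five largest distances; 'County' column name assumed

# Part 2(b): per-record log-likelihood under a Gaussian mixture (lower = more anomalous)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
loglik = gmm.score_samples(X)
print(health['County'].iloc[np.argsort(loglik)[:5]])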
Part 3: Choose one additional anomaly detection model you prefer and report the top 5 most anomalous
counties according to that model. (10 pts)
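Part 3 leaves the model choice open; purely as one illustration (not the required or recommended answer), an Isolation Forest on the same standardized matrix X from Part 2 could be scored as follows, where lower score_samples values indicate more anomalous records.

from sklearn.ensemble import IsolationForest

iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
scores = iso.score_samples(X)                      # lower score = more anomalous
print(health['County'].iloc[np.argsort(scores)[:5]])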