DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 1
DATA2001 – Data Science, Big Data, and Data Diversity
Assignment Announcement
Presented by
School of Computer Science
Practical Assignment: Sydney Liveability Analysis
– Assignment specification available in Canvas (Canvas: Modules –> Assignment)
– Worth 20% of the final grade in DATA2001/DATA2901
– Due on Friday of Week 12
• Python/SQL notebook; brief report; team demo in tutorials of Week 12/13
– Main idea:
– Calculate a ‘liveability score’ per SA2 suburb in Greater Sydney
• Also extend score with own data specifically for City of Sydney
– Visualise and correlate with income data
Practical Assignment: Sydney Liveability Analysis
– Goal: Practical experience with data variety, data analysis, and presentation
– Technologies as covered in this course: Python, Jupyter notebooks, Web APIs, and SQL
– Three tasks:
– Data import, integration and database generation
• We provide census data and spatial data from NSW government and BOCSAR
• Needs to be loaded into a database and combined, e.g. via spatial join
• Extend with own datasets from “City of Sydney Open Data Hub”
• Milestone 1: Propose stakeholder and extra datasets by Week 11 tutes
– Liveability Analysis (Jupyter Notebook)
• Computation of liveability score per neighbourhood; example formula is provided
• When adding other datasets, feel free to adjust the formula
• Correlation analysis against the affluence of neighbourhoods
– Documentation and (brief) report, including stakeholder pitch
– Additional ML task for teams in advanced DATA2901
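The example formula itself is in the assignment handout and is not reproduced on these slides. As a minimal sketch of how task 2 could be structured, the snippet below assumes a score of the form sigmoid(weighted sum of z-scores), with crime counted negatively, and then computes a Pearson correlation against income. The formula shape, the measure names (`schools_per_1000`, `break_ins_per_1000`) and all numbers are illustrative assumptions, not the handout's definition and not real data.

```python
import math
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values to mean 0 and population standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def liveability_scores(measures):
    """measures: list of (values, weight) pairs, one value per area.

    Assumed formula (not from the handout): sigmoid(sum of weight * z-score).
    """
    n = len(measures[0][0])
    totals = [0.0] * n
    for values, weight in measures:
        for i, z in enumerate(z_scores(values)):
            totals[i] += weight * z
    return [sigmoid(t) for t in totals]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for four hypothetical areas (not real data).
schools_per_1000 = [1.0, 2.0, 3.0, 4.0]
break_ins_per_1000 = [4.0, 3.0, 2.0, 1.0]
median_income = [40000, 50000, 60000, 70000]

scores = liveability_scores([(schools_per_1000, +1.0), (break_ins_per_1000, -1.0)])
r = pearson(scores, median_income)
print(scores, r)
```

In the actual assignment the per-area measures would come out of the database (one row per SA2 area) rather than hard-coded lists; adding an own dataset then amounts to appending another `(values, weight)` pair.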
Provided Datasets (cf. Canvas)
– ABS Data
– Census data on neighbourhoods (SA2-level areas) in Greater Sydney
such as population, land area, number of dwellings
– Business statistics per SA2-area
– Income and rent statistics to check for correlation with the liveability score
– school_catchment Data
– shape data for primary, secondary and future government school catchments
– break_and_enter Data
– shape data of theft ‘hotspots’ in NSW as determined by BOCSAR
– Note that SA2-level data from the ABS does not always match suburbs;
neither the ABS neighbourhoods nor the BPFL data contain actual shapes
– cf. tutorial this week on how to retrieve boundary data for neighbourhoods
– Adding more datasets from your side is explicitly encouraged (and gives points).
– Try different types and forms, not just CSV…
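Once boundary shapes are available, combining point or shape datasets with the SA2 areas is a spatial join. In the assignment this would normally be done inside the database; to show the idea without one, here is a pure-Python sketch that assigns points to polygons by ray casting. The area names and coordinates are made up, and polygons are assumed to be simple rings of (x, y) pairs without holes.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count crossings of a ray going right from (x, y)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points, areas):
    """Map each point id to the first area containing it, or None."""
    result = {}
    for pid, (x, y) in points.items():
        result[pid] = next(
            (name for name, ring in areas.items() if point_in_polygon(x, y, ring)),
            None,
        )
    return result


# Hypothetical areas (unit squares) and points; not real SA2 boundaries.
areas = {
    "area_A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "area_B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = {"p1": (0.5, 0.5), "p2": (1.5, 0.25), "p3": (3.0, 3.0)}

joined = spatial_join(points, areas)
print(joined)
```

This loop checks every point against every polygon, which is fine for a toy example; a spatial database avoids that cost with a spatial index.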
Assignment Rules
– Groupwork
– teams of 2 or 3 (unless odd-size class or other good reasons)
– All team members should be in the same tutorial
– Deliverables: Jupyter notebook with source code and a short report (PDF)
– See page 4 of the assignment handout
– Due on Friday of Week 12
– Submission page and marking rubric will be published in Canvas
– Only one member per team needs to submit for the whole group; they should submit
both a ZIP archive under "Sydney Liveability Analysis Assignment" and the PDF of
the report in the separate "TurnItIn Dropbox – Sydney Liveability Analysis"
– Late submissions: -5% of available marks per day late; 0 after more than 5 days
– Demo in Weeks 12 and 13
– Each team gives a short demo to the tutors during the tutorials of the last two weeks
– Individual grades can be scaled based on participation in project or demo
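The late-submission rule above is simple arithmetic. The sketch below assumes that lateness is counted in whole days and that "available marks" means the maximum mark of the assignment; how partial days are counted is not stated on this slide, so check the handout. The function name is hypothetical.

```python
def apply_late_penalty(raw_mark, available_marks, days_late):
    """Deduct 5% of the available marks per day late; 0 after more than 5 days."""
    if days_late <= 0:
        return raw_mark
    if days_late > 5:
        return 0.0
    penalty = 0.05 * available_marks * days_late
    # A mark cannot drop below zero.
    return max(0.0, raw_mark - penalty)


# Example: a raw mark of 16 out of 20, submitted 2 days late.
print(apply_late_penalty(16, 20, 2))
```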
Tip: PostGIS
– Spatial database extension for PostgreSQL supporting geographic objects (OGC)
– Geometry types for Points, LineStrings, Polygons, MultiPoints, etc.
• including import/export from standard formats such as GeoJSON or KML
– Support for spatial reference systems and transformations between them
– Spatial predicates on geometries using the 3x3 nine-intersection model
– Spatial operators for determining geospatial measurements like area, distance, length
and perimeter, and geospatial set operations, like union, difference etc.
– R-Tree indexing (over GiST)
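The nine-intersection model describes how two geometries relate through a 3x3 matrix of intersections between their interiors, boundaries and exteriors, usually written row by row as a 9-character string (the form returned by PostGIS `ST_Relate`). Named predicates then correspond to patterns over that string. The sketch below matches matrix strings against a few of the standard patterns in pure Python; the helper names `matches` and `relations` are my own, and the `overlaps` pattern shown is the one for two areas.

```python
# Cell order: II, IB, IE, BI, BB, BE, EI, EB, EE
# (I = interior, B = boundary, E = exterior; first letter is geometry a).
# Pattern symbols: T = non-empty intersection (dimension 0, 1 or 2),
# F = empty intersection, * = don't care, 0/1/2 = exact dimension.
PATTERNS = {
    "equals": "T*F**FFF*",
    "disjoint": "FF*FF****",
    "within": "T*F**F***",
    "contains": "T*****FF*",
    "overlaps": "T*T***T**",  # area/area case
}


def matches(matrix, pattern):
    """Check a 9-character intersection matrix against a pattern."""
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("matrix and pattern must both have 9 characters")
    for cell, want in zip(matrix, pattern):
        if want == "*":
            continue
        if want == "T":
            if cell not in ("0", "1", "2"):
                return False
        elif cell != want:
            return False
    return True


def relations(matrix):
    """Return the names of all patterns in PATTERNS that the matrix satisfies."""
    return sorted(name for name, pat in PATTERNS.items() if matches(matrix, pat))


# Matrices for three polygon pairs:
print(relations("212101212"))  # two partially overlapping polygons
print(relations("2FF1FF212"))  # a polygon strictly inside another
print(relations("FF2FF1212"))  # two polygons that do not touch
```

In PostGIS the same checks are available directly as predicates such as `ST_Within` or `ST_Overlaps`, so pattern matching by hand is only needed for unusual relationships.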