DATA2001: Data Science
Ratio Data:
- A variable with a true zero point, where zero represents the total absence of the
quantity being measured. For example, it makes sense to say a Kelvin temperature of
100 is twice as hot as a Kelvin temperature of 50 because it represents twice as much
thermal energy. (This doesn't work with Fahrenheit temperatures of 100 and 50.)
- Values encode differences
- Zero is defined
- Ratio meaningful
- Example: Length, weight, pressure, income
Text data or images require additional interpretation.
Data Acquisition:
- File Access
- You or your organization may already have a data set or you can download it
from the web from a data server.
- Typical formats include CSV, Excel, XML
- Programmatically
- Scraping the web (HTML)
- Or using APIs of web services (XML/JSON)
- Database Access
Real data is often dirty, so it's important to do some data cleaning and transforming first.
Typical cleaning steps involve:
- Type and name conversion
- Filtering of missing or inconsistent data
- Unifying semantic data representations
- Potentially also: Scaling, Normalization, Dimensionality Reduction
Approach 1: Specific Data Cleaning Tools
- Tools developed by third parties that make it easy to inspect and clean data.
- Very helpful, especially for smaller data sets.
Approach 2: Jupyter Notebooks and Python
- Write code with Python and its libraries to check for and deal with dirty data
- Before running cleaning code, always manually inspect the data first.
- Always keep a copy of the original data just in case.
Pandas - Python Data Analysis Library
- Open-source library providing data import and analysis functionality to Python.
- Optimized data structures for data analysis: DataFrames, Time series data, Matrices.
- Provides loads of handling functions, as well as reader functions for the file formats.
Pandas - Data Structures:
- Two main data structures
- Series (1-dimensional, labeled, homogeneously typed)
- DataFrame (2-dimensional, labeled, potentially heterogeneous columns)
- CSV reader imports a dataset as a DataFrame and most functions produce a DataFrame
as output.
- If a dataset is huge, it may be impractical to process it all in a Python project, and
you may need to work with just a subset of the data.
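As a short sketch of the two structures above (the suburb and temperature data are made up for illustration):

```python
import pandas as pd

# A Series: 1-dimensional, labeled, homogeneously typed
s = pd.Series([20.1, 19.5, 22.3], index=["Mon", "Tue", "Wed"])

# A DataFrame: 2-dimensional, labeled, columns may differ in type
df = pd.DataFrame({
    "suburb": ["Newtown", "Glebe", "Redfern"],
    "population": [14000, 11000, 13000],
})

# For huge files, read only a subset (file name is hypothetical):
# first_rows = pd.read_csv("big.csv", nrows=1000)
# chunks = pd.read_csv("big.csv", chunksize=100_000)

print(df.shape)        # number of rows and columns
print(str(s.dtype))    # the Series' single element type
```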
Pandas - Missing Data Handling
- Pandas functions for handling missing/wrong data
- read_csv(), where missing values are automatically replaced with NaN
- DataFrame.dropna(), DataFrame.fillna(), DataFrame.replace()
Best to replace missing value placeholders during import to avoid later problems. Always deal
with null or None values right at the start.
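A minimal sketch of handling placeholders at import time (the CSV content and the "n/a"/-1 sentinels are invented for the example):

```python
import io
import pandas as pd

csv_text = """name,height
Alice,165
Bob,n/a
Carol,
Dave,-1
"""

# Treat the placeholder "n/a" as missing right at import time
df = pd.read_csv(io.StringIO(csv_text), na_values=["n/a"])

# replace() can map further sentinel values (here -1) to NaN as well
df["height"] = df["height"].replace(-1, float("nan"))

dropped = df.dropna()                                   # drop rows with any missing value
filled = df.fillna({"height": df["height"].mean()})     # or impute instead
```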
The standard Python csv module reads everything as string types. Pandas is a bit better but will
still fall back to string if it can't deduce the type from all values in a column. If so, we need to
convert to the appropriate type.
Converting Approach 1: Use Pandas
- astype() function on data series to convert to new types
- This fails if any entry in the series violates the new type
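A sketch of both cases (the values are made up): astype() succeeds when every entry parses, and raises when any entry violates the target type.

```python
import pandas as pd

s = pd.Series(["1", "2", "3"])
nums = s.astype(int)          # succeeds: every entry is a valid integer

bad = pd.Series(["1", "two", "3"])
try:
    bad.astype(int)           # fails: "two" violates the new type
except ValueError:
    print("astype failed on a non-numeric entry")
```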
Converting Approach 2: Function to convert values in a given column
- We gain more flexibility by introducing our own clean() function in Python.
Descriptive Statistics with Pandas
DataFrame supports a wide variety of data analysis functions.
You can filter entries in a DataFrame using loc[]
- Allows you to specify a boolean predicate; only those entries for which the
predicate is true are selected from the DataFrame
- Optionally also allows you to select specific columns to keep in the result
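Both uses of loc[] in a sketch (the suburb data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "suburb": ["Newtown", "Glebe", "Redfern"],
    "population": [14000, 11000, 13000],
})

# Boolean predicate: keep only the rows where the predicate is True
big = df.loc[df["population"] > 12000]

# Optionally also select which columns to keep in the result
names = df.loc[df["population"] > 12000, "suburb"]
```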
Frequency distributions can also be computed with groupby() and size().
- Entries in a pandas dataframe can be grouped by a column.
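For example (the grade labels are made up), grouping by a column and counting group sizes gives a frequency distribution:

```python
import pandas as pd

df = pd.DataFrame({"grade": ["HD", "D", "HD", "P", "D", "HD"]})

# Group entries by a column, then count how many rows fall in each group
freq = df.groupby("grade").size()
print(freq)

# df["grade"].value_counts() is a common shortcut for the same idea
```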
Pandas is built on NumPy, which is another useful Python library.
- NumPy offers various statistics for numerical data.
Matplotlib provides functionality for creating various plots, and Pandas offers some
easy-to-use shortcut functions. Super important to be able to use these plots.
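A sketch of the pandas plotting shortcut (the rainfall data and file name are invented; the Agg backend is used so this runs without a display):

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, suitable for scripts
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "rainfall": [1200, 950, 1400, 1100],
})

# Pandas shortcut: plot directly from the DataFrame, returns a matplotlib Axes
ax = df.plot(x="year", y="rainfall", kind="bar", title="Annual rainfall")
ax.figure.savefig("rainfall.png")
```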
We can use boxplots to compare distributions:
- Mean and stdev are not informative when data is skewed.
- Boxplots summarise data based on 5 numbers
- Lower inner fence: Q1 - 1.5*IQR
- First Quartile (Q1) - Equivalent to 25th percentile
- Median (Q2) - Equivalent to 50th percentile
- Third Quartile (Q3) - Equivalent to 75th percentile
- Upper inner fence: Q3 + 1.5*IQR
- Values outside fences are outliers
- Sometimes outer fences at 3*IQR from the quartiles are also included
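The five-number summary and the inner fences can be computed directly (the sample values are made up; 30 is deliberately a likely outlier):

```python
import pandas as pd

s = pd.Series([3, 4, 5, 5, 6, 7, 8, 9, 10, 30])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = s[(s < lower_fence) | (s > upper_fence)]
```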
Accessing Data in RDBMS, SQL
Approaches
File Based:
- Just put the data in files (CSV or spreadsheet files)
- Analysis might be done by writing formulas or creating charts in a spreadsheet or by
writing python code (example: with pandas)
Quality issues with file based approaches:
- Data quality is left to the users, who are trusted to do the right thing.
- Example: keep metadata somewhere and keep it up to date
- Example: keep backups (redundancy), manage sharing
- Example: prevent changes that introduce inconsistency or violate integrity properties
Alternative: The Database Approach
- Central repository of shared data
- Data is managed by software: a database management system (DBMS)
- Data is accessed by applications or users, always through the DBMS
A DBMS manages data like an operating system manages hardware resources. (Examples of DBMSs: PostgreSQL, MySQL, SQLite. Django is a web framework that talks to a DBMS, not a DBMS itself.)
A database is a shared collection of logically related data and its description. The database
represents entities (real world things), the attributes (their relevant properties), and the logical
relationships between the entities.
Advantages of database approach:
- Data is managed so quality can be enforced by the DBMS.
- Improved Data Sharing
- Different users get different views of the data
- Efficient concurrent access
- Enforcement of standards
- All data access is done in the same way
- Integrity constraints, data validation rules
- Better data accessibility/responsiveness
- Use of SQL
- Security, backup/recovery, concurrency
- Disaster recovery is easier
- Program data independence
- Metadata stored in DBMS, so applications don't need to worry about data
formats.
- Data queries/ updates managed by DBMS so programs don't need to process
data access routines
- Results in:
- Reduced application development time
- Increased maintenance productivity
- Efficient access.
Key Database Concepts:
- Table is an arrangement of related information stored in columns and rows.
- Field/Attribute is a column in a table, containing a homogeneous set of data.
- Field data type is the kind of data that can be stored in a field. For example, a field whose
data type is text can store data consisting of either text or number characters, but a
number field can store only numerical data.
- Primary Key is a field in a table whose value uniquely identifies each record in the table.
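These concepts can be seen in a tiny example using Python's built-in sqlite3 module (the student table and its rows are invented for illustration):

```python
import sqlite3

# An in-memory database; each column demonstrates a field with a data type
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        sid  INTEGER PRIMARY KEY,  -- primary key: uniquely identifies each record
        name TEXT NOT NULL,        -- field with a text data type
        year INTEGER               -- field with a numeric data type
    )
""")
conn.execute("INSERT INTO student VALUES (1, 'Alice', 2)")
conn.execute("INSERT INTO student VALUES (2, 'Bob', 1)")

rows = conn.execute("SELECT name FROM student ORDER BY sid").fetchall()
print(rows)   # [('Alice',), ('Bob',)]

# The DBMS enforces integrity: a duplicate primary key value is rejected
try:
    conn.execute("INSERT INTO student VALUES (1, 'Carol', 3)")
except sqlite3.IntegrityError:
    print("duplicate primary key rejected")
```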