Data Science Basics

Multidisciplinary field that involves extracting insights and knowledge from data through various processes, techniques, and tools

Umbrella term for a more comprehensive set of fields that are focused on mining big data sets and discovering innovative new insights, trends, methods, and processes

  • Data Science ...

Data Science process

1. Problem Definition

Essential step for setting a clear direction for the data science project

By thoroughly understanding the business context, defining specific goals, identifying key questions, and assessing feasibility, you lay a solid foundation for the subsequent stages of data collection, analysis, and modeling

A well-defined problem ensures that the project stays focused, resources are used efficiently, and the final outcomes are aligned with the business objectives

  • Objective: Understand the business problem or research question
  • Goals: Define clear objectives and the desired outcomes
  • Stakeholders: Identify stakeholders and their needs

Detailed breakdown of the Problem Definition step:

  • 1. Understanding the Business Context - Objective: Gain a comprehensive understanding of the business environment and the specific issues that need to be addressed
    • Business Goals: Identify the overarching goals of the business or organization. This might include increasing revenue, reducing costs, improving customer satisfaction, etc.
    • Current Challenges: Understand the specific challenges or pain points that the business is facing. This could be high customer churn, inefficient operations, low product sales, etc.
    • Stakeholder Needs: Engage with key stakeholders to understand their expectations, requirements, and how they define success
  • 2. Defining the Problem Statement - Objective: Articulate the problem in a clear and concise manner
    • Specificity: The problem statement should be specific and focused. Avoid vague or broad statements
    • Measurability: Define how success will be measured. What metrics or KPIs will indicate that the problem has been solved or improved?
    • Constraints: Identify any constraints or limitations that need to be considered, such as budget, time, data availability, and technical limitations
  • 3. Formulating Objectives and Goals - Objective: Break down the problem into manageable objectives and goals
    • Short-term and Long-term Goals: Differentiate between immediate objectives and long-term goals
    • SMART Goals: Ensure that the goals are Specific, Measurable, Achievable, Relevant, and Time-bound
  • 4. Identifying Key Questions and Hypotheses - Objective: Outline the key questions that need to be answered and any hypotheses to be tested
    • Research Questions: What are the specific questions that need to be addressed to solve the problem? For example, “What factors are contributing to customer churn?”
    • Hypotheses: Formulate hypotheses based on domain knowledge and preliminary data analysis. For instance, “Customers are more likely to churn if they experience issues with the product in the first month.”
  • 5. Determining the Scope of the Project - Objective: Clearly define the boundaries of the project
    • In-Scope vs. Out-of-Scope: Identify what will be included and what will be excluded from the project. This helps in managing expectations and resources
    • Data Requirements: Determine the types of data needed, sources of data, and the data collection methods. Ensure that the necessary data is accessible and of sufficient quality
  • 6. Understanding the Audience - Objective: Identify the target audience for the results and how the insights will be communicated
    • Stakeholders: Understand who will be the end users of the insights (e.g., executives, marketing teams, product managers)
    • Communication Plan: Plan how the results will be presented. This includes deciding on the format (e.g., reports, dashboards, presentations) and the level of technical detail
  • 7. Assessing Feasibility - Objective: Evaluate the feasibility of the project from a technical and operational perspective
    • Technical Feasibility: Assess the technical challenges and the resources required. Do you have the necessary tools, technologies, and skills?
    • Operational Feasibility: Consider the operational aspects such as timelines, budget, and team capacity. Ensure that the project is practical and can be completed within the given constraints
  • 8. Establishing a Timeline - Objective: Develop a realistic timeline for the project
    • Milestones: Break down the project into key milestones and deliverables. Define what needs to be achieved at each stage
    • Deadlines: Set clear deadlines for each milestone to ensure the project stays on track
  • 9. Risk Assessment - Objective: Identify potential risks and develop mitigation strategies
    • Risk Identification: List possible risks that could impact the project, such as data issues, technical challenges, or changes in business priorities
    • Mitigation Plans: Develop strategies to mitigate these risks. This could include having backup data sources, additional training for team members, or contingency plans for delays
  • 10. Documenting the Plan - Objective: Create a comprehensive project plan that documents all aspects of the Problem Definition step
    • Project Charter: Develop a project charter or proposal that outlines the problem statement, objectives, scope, stakeholders, timeline, and risks
    • Stakeholder Approval: Ensure that all stakeholders review and approve the project plan. This alignment is crucial for project success

2. Data Collection

Foundational step for the success of a data science project

By meticulously identifying data requirements, leveraging appropriate sources, ensuring data quality, and securely storing the data, you set the stage for accurate and reliable analysis

  • Sources: Gather data from various sources such as databases, APIs, web scraping, surveys, etc.
  • Data Types: Collect different types of data, including structured, semi-structured, and unstructured data

Detailed breakdown of the Data Collection step:

  • 1. Identifying Data Requirements - Objective: Determine what data is needed to address the problem and meet the project objectives
    • Types of Data: Identify the types of data required (e.g., numerical, categorical, text, images)
    • Data Sources: Determine the sources of data, such as internal databases, external datasets, APIs, web scraping, surveys, and more
    • Data Attributes: List the specific attributes or features needed (e.g., customer demographics, transaction history)
  • 2. Data Sources - Objective: Explore various sources from where data can be collected
    • Internal Data Sources: Databases, data warehouses, CRM systems, ERP systems, logs, and any other internal repositories
    • External Data Sources: Public datasets, third-party APIs, government databases, social media, research publications, and web scraping
    • Primary Data Collection: Surveys, questionnaires, experiments, and direct observations
  • 3. Data Collection Methods - Objective: Choose appropriate methods to collect the data
    • Automated Data Collection: Using scripts and tools to fetch data from APIs, web scraping, and logging systems (a minimal API-fetch sketch follows this list)
    • Manual Data Collection: Entering data manually through forms, surveys, and other manual processes
    • Real-time Data Collection: Streaming data from IoT devices, sensors, and live feeds
  • 4. Ensuring Data Quality - Objective: Ensure the data collected is of high quality
    • Accuracy: Verify that the data correctly represents the real-world conditions or events
    • Completeness: Ensure all necessary data points are collected, with minimal missing values
    • Consistency: Ensure the data is consistent across different sources and formats
    • Timeliness: Ensure the data is up-to-date and relevant for the analysis
    • Validity: Ensure the data conforms to the defined business rules and constraints
  • 5. Data Storage - Objective: Store the collected data securely and efficiently
    • Databases: Use relational databases (e.g., MySQL, PostgreSQL) or NoSQL databases (e.g., MongoDB) depending on the data structure
    • Data Lakes: Store large volumes of raw data in data lakes using technologies like Hadoop or AWS S3
    • Cloud Storage: Utilize cloud services (e.g., AWS, Google Cloud, Azure) for scalable and secure data storage
    • Data Warehouses: Use data warehouses (e.g., Amazon Redshift, Google BigQuery) for structured and processed data
  • 6. Data Integration - Objective: Combine data from multiple sources into a cohesive dataset
    • ETL (Extract, Transform, Load): Implement ETL processes to extract data from various sources, transform it into a suitable format, and load it into a storage system
    • Data Cleaning: Address any inconsistencies, duplicates, and errors during the integration process
    • Schema Alignment: Ensure that data from different sources is compatible by aligning schemas, data types, and formats
  • 7. Documenting the Data Collection Process - Objective: Keep detailed documentation of the data collection process
    • Data Sources and Methods: Document where and how the data was collected
    • Data Quality Checks: Record any quality checks performed and issues encountered
    • Data Collection Dates: Keep track of when the data was collected to ensure its relevance and timeliness
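
    As a minimal illustration of the automated data collection mentioned in bullet 3 above, the sketch below pulls records from a hypothetical JSON API; the URL, query parameters, and field layout are placeholders, not a real service

      # AUTOMATED DATA COLLECTION FROM A JSON API (hypothetical endpoint):
      import requests
      import pandas as pd

      url = "https://api.example.com/v1/transactions"        # Placeholder endpoint
      response = requests.get(url, params={"limit": 1000}, timeout=30)
      response.raise_for_status()                            # Stop early on HTTP errors

      records = response.json()                              # Assumes the API returns a JSON list of records
      api_df = pd.DataFrame(records)                         # Load the records into a DataFrame
      api_df.to_csv("raw_transactions.csv", index=False)     # Persist the raw extract for later steps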

    double_arrow Import packages

      
      # IMPORT PYTHON PACKAGES
      # Data Science and Analysis
      import numpy as np                # Numerical operations
      import pandas as pd               # Data manipulation and analysis
      import scipy                      # Scientific computing
      import matplotlib.pyplot as plt   # Plotting and visualization
      import seaborn as sns             # Statistical data visualization
      import plotly.express as px       # Interactive data visualization
      import statsmodels.api as sm      # Statistical modeling
      import dask.dataframe as dd       # Parallel computing with Pandas-like DataFrames
      import vaex                       # Out-of-core DataFrames for big data
      
      # Machine Learning and AI
      from sklearn import datasets, model_selection, preprocessing, metrics   # Machine learning tools
      from sklearn import ensemble, decomposition, compose                    # Machine learning tools
      import tensorflow as tf           # Deep learning framework
      from tensorflow import keras      # High-level neural networks API
      import torch                      # Deep learning framework
      import xgboost as xgb             # Gradient boosting library
      import lightgbm as lgb            # Light Gradient Boosting Machine
      import catboost                   # Categorical features gradient boosting library
      
      # Natural Language Processing (NLP)
      import nltk                                                  # Natural language processing toolkit
      import spacy                                                 # Advanced natural language processing
      import gensim                                                # Topic modeling and document similarity
      import transformers                                          # State-of-the-art NLP models
      from sklearn.feature_extraction.text import TfidfVectorizer  # Text feature extraction
      import textblob                                              # Text processing and NLP
      import nltk.sentiment.vader as vader                         # Sentiment analysis
      
      # Data Visualization
      from bokeh.plotting import figure, output_file, show         # Interactive visualization
      import altair as alt                                         # Declarative statistical visualization
      import folium                                                # Interactive maps
      import geopandas as gpd                                      # Geospatial data processing
      
      # Web Development
      from django.http import HttpResponse         # Django web framework
      from flask import Flask, request             # Flask web framework
      import fastapi                               # FastAPI web framework
      
      # General-Purpose Libraries
      import requests                   # HTTP requests
      from bs4 import BeautifulSoup     # Web scraping
      from PIL import Image             # Image processing
      import json                       # JSON data manipulation
      import os                         # Operating system interfaces
      import sys                        # System-specific parameters and functions
      import re                         # Regular expressions
      import datetime                   # Date and time manipulation
      import logging                    # Logging facility for Python
      import itertools                  # Functions creating iterators for efficient looping
      import collections                # Container datatypes
      
      # Data Storage and Databases
      import sqlite3               # SQLite database
      import pymongo               # MongoDB
      import sqlalchemy            # SQL toolkit and Object Relational Mapper
      import psycopg2              # PostgreSQL database adapter
      
      # Networking
      import socket                # Low-level networking interface
      import paramiko              # SSH protocol
      import smtplib               # Simple Mail Transfer Protocol
                      

    double_arrow Read Data

    pandas.read_csv() - function to read the csv file

    • In the brackets, we put the file path inside quotation marks so that pandas reads the file from that address into a dataframe
    • The file path can be either a URL or your local file address
    • If the data does not include headers, we can add the argument header = None inside the read_csv() method so that pandas will not automatically set the first row as a header
    • dtype - data types to apply to either the whole dataset or individual columns
    • coerce_float - attempt to force numbers into floats
    • parse_dates - list of columns to parse as dates
    • chunksize - number of rows to include in each chunk
    • pd.read_excel() - read excel file
    • pd.read_json() - read json file
    • 
      # READING CSV FILES:
      pd.set_option('display.max_columns', None)              # Show all available columns
      
      path = "../1 datasets/iris_data.csv"                    # Set the path to your file
      data_df = pd.read_csv(path)                             # Read the file (import the data)
      
      data_df = pd.read_csv(path, sep="\t")                   # Different delimiters: tab-separated file (.tsv)
      data_df = pd.read_csv(path, delim_whitespace=True)      # Different delimiters: whitespace-separated file
      data_df = pd.read_csv(path, header=None)                # Don't use first row for column names
      data_df = pd.read_csv(path, names=["Name1", "Name2"])   # Specify column names
      data_df = pd.read_csv(path, na_values=["NA", 99])       # Custom missing values
      
      
      # READING SQL DATA:
      import sqlite3 as sq3                               # Import sqlite3
      path = "database path"                              # Set the path to your database
      con = sq3.Connection(path)                          # Create connection SQL database with sqlite3
      query = '''SELECT * FROM table_name;'''
      data_df = pd.read_sql(query, con)                   # Execute query
      
      observations_generator = pd.read_sql(query, con,
                      coerce_float=True,
                      parse_dates=['Release_Year'],       # Parse `Release_Year` as a date
                      chunksize=5)                        # Stream the results as a series of shorter tables
      
      # READING noSQL DATA:
      from pymongo import MongoClient
      con = MongoClient()                             # Create a Mongo connection
      db = con.database_name                          # Choose database (con.list_database_names() will display available databases)
      #cursor = db.collection_name.find(query)        # Create a cursor object using a query (replace query with {} to select all)
      #data_df = pd.DataFrame(list(cursor))           # Expand cursor and construct DataFrame
                              

    double_arrow Add Headers

    • Pandas automatically sets the headers to integers starting from 0 when none are provided
    • If we have to add headers manually, we create a list "headers" that includes all column names in order
    • We use dataframe.columns = headers to replace the headers with the list we created
    • 
      # CREATE  HEADERS:
      headers = ["header_name1","header_name2"..."header_namen"]
      data_df.columns = headers     # Replace headers 
                              

    double_arrow Rename Columns or Row Indexes

      
      # RENAME COLUMNS OR INDEXES USING A MAPPING:
      data_df.rename(columns={"REF_DATE": "DATE", "Type of fuel": "TYPE"})     
      data_df.rename(index={0: "x", 1: "y", 2: "z"})  
      
      data_df.rename(str.lower, axis='columns')           # Lowercase the columns using axis-style parameters
      data_df.rename({1: 2, 2: 4}, axis='index')          # Rename indexes using axis-style parameters
                              

    double_arrow Save Dataframe

    • df.to_csv() - save the dataset to csv
    • df.to_json() - save the dataset to json
    • df.to_excel() - save the dataset to excel
    • df.to_hdf() - save the dataset to hdf
    • df.to_sql() - save the dataset to sql
    • 
      # SAVE DATAFRAME:
      data_df.to_csv("data_df.csv", index=False)      # index=False means the row index will not be written
                              

    double_arrow Basic Insight of Data

    • pandas.df.head() - return the first n rows
    • pandas.df.shape - attribute that returns the number of rows and columns in our dataset (index 0 for rows, index 1 for columns)
    • pandas.df.info() - provides a concise summary of your DataFrame (prints information about a DataFrame including the index dtype and columns, non-null values and memory usage)
    • pandas.df.dtypes - attribute that returns a Series with the data type of each column
    • pandas.df.value_counts() - returns a Series containing counts of unique values (resulting object will be in descending order)
      • normalize - If True then the object returned will contain the relative frequencies of the unique values
      • sort - sort by frequencies when True
      • ascending - sort in ascending order (default False)
    • pandas.Series.tolist() - returns a list of the values (e.g. data_df.columns.tolist() for the column names)
    • pandas.df.describe() - generates descriptive statistics for each numeric-typed (int, float) column, excluding NaN (Not a Number) values
      • the count of that variable
      • the mean
      • the standard deviation (std)
      • the minimum value
      • the quartiles: 25%, 50% (median) and 75%, from which the interquartile range (IQR) can be computed
      • the maximum value
      • include = "all" argument provides the statistical summary of all the columns, including object-typed attributes
      • You can select the columns of a dataframe by indicating the name of each column and you can apply the method to get the statistics of those columns
      • The median is included, labelled as the 50% quantile
      • The range is not included; to get it, add a new row to the describe table equal to max - min (see the code below)
      
      # SHOW FEW ROWS
      data_df.head(5)                             # Show the first 5 rows of the dataframe
      print(data_df.iloc[:5])                     # Print a few rows
      
      # SHOW NUMBER OF ROWS AND COLUMNS 
      print("There are: ", data_df.shape[0], " rows; ", data_df.shape[1], " columns") 
      
      # INFO
      data_df.info()               
      
      # DATA TYPES OF THE COLUMNS
      print(data_df.dtypes)           # Data types; included in info()
      data_df.dtypes.value_counts()   # Counts columns according to data types
      
      # UNIQUE VALUES
      data_df['column'].value_counts()                                    # Counts unique values in a column
      percentages = data_df['column'].value_counts(normalize=True) * 100  # Relative frequencies as percentages
      
      # COLUMN NAMES
      print(data_df.columns.tolist()) # Column names; included in info()
      
      # DESCRIBE
      data_df.describe()
      data_df.describe(include = "all")                   # All the columns
      data_df[['length','width','height']].describe()     # For selected columns
      data_df.describe(include=['object'])                # To include type object
      
      stats_df = data_df.describe()                                          # Get statistical summary
      stats_df.loc['range'] = stats_df.loc['max'] - stats_df.loc['min']   # Calculate range (max-min)
      out_fields = ['mean','25%','50%','75%','range']                     # Select just the rows desired from the 'describe' method
      stats_df = stats_df.loc[out_fields]
      stats_df.rename({'50%': 'median'}, inplace=True)                    # Add name median instead of 50%
                                  

3. Data Preparation

An iterative and interactive process that requires careful attention to detail

Sets the foundation for all subsequent steps in the data science process, as the quality and structure of the data directly impact the accuracy and reliability of the insights derived from it

Effective data preparation ensures that the data is clean, relevant, and ready for analysis, ultimately leading to better decision-making and more robust models

  • Cleaning: Handle missing values, remove duplicates, and correct inconsistencies
  • Transformation: Normalize, scale, and encode data to prepare it for analysis
  • Integration: Combine data from multiple sources and ensure consistency

Detailed breakdown of the Data Preparation step:

  • 1. Data Cleaning - Objective: Improve the quality of the data by handling errors, inconsistencies, and missing values
    • Missing Values: Identify and handle missing data using techniques such as imputation (filling missing values with mean, median, mode, or using algorithms), deletion (removing rows or columns with missing values), or flagging (marking missing values for further analysis)
    • Duplicates: Detect and remove duplicate records to ensure each entry is unique
    • Outliers: Identify and handle outliers that may skew the analysis. Methods include removing them, transforming them, or using robust statistical methods
    • Data Types: Ensure all data is in the correct format (e.g., converting strings to dates, ensuring numerical data is in numerical format)
  • 2. Data Transformation - Objective: Convert data into a format suitable for analysis
    • Normalization: Scale numerical features to a common range (e.g., 0 to 1) to ensure that no single feature dominates the model due to its scale
    • Standardization: Transform data to have a mean of 0 and a standard deviation of 1. This is often necessary for algorithms that assume data is normally distributed
    • Encoding Categorical Data: Convert categorical data into numerical format using techniques like one-hot encoding, label encoding, or target encoding
    • Feature Creation: Create new features from existing data to better capture the underlying patterns. This can include creating interaction terms, polynomial features, or aggregating data over time
    • Discretization: Convert continuous data into discrete bins or categories. This can be useful for certain types of analysis or when dealing with non-linear relationships
  • 3. Data Integration - Objective: Combine data from multiple sources into a cohesive dataset (a short merge/concat sketch follows this list)
    • Merging: Combine datasets based on common keys (e.g., customer ID, product ID). This involves understanding the relationships between different data sources and ensuring consistent keys
    • Concatenation: Stack datasets vertically or horizontally when they have the same structure
    • Handling Conflicts: Resolve discrepancies and conflicts in data from different sources, such as different formats or definitions for the same attribute
    • Schema Alignment: Ensure that the data structures (schemas) from different sources are compatible. This may involve renaming columns, reordering columns, or aligning data types
  • 4. Data Reduction - Objective: Reduce the volume of data while maintaining its integrity
    • Dimensionality Reduction: Reduce the number of features using techniques like Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or t-SNE. This helps in dealing with the "curse of dimensionality" and improves computational efficiency
    • Sampling: Reduce the size of the dataset by selecting a representative subset. This can be done through random sampling, stratified sampling, or cluster sampling
    • Aggregation: Summarize data to a higher level, such as aggregating daily data to monthly data. This helps in reducing the granularity of the data while retaining important patterns
  • 5. Data Enrichment - Objective: Enhance the dataset by adding additional information
    • External Data: Incorporate external data sources to provide more context or additional features. Examples include adding weather data, economic indicators, or social media sentiment data
    • Feature Engineering: Create new features that capture domain knowledge or improve model performance. This involves domain expertise and creativity to derive meaningful features from the raw data
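
    As a minimal sketch of the data integration described in bullet 3 above, the snippet below merges two small, made-up DataFrames on a shared key and concatenates two datasets with the same structure; the DataFrames and column names are illustrative only

      # DATA INTEGRATION: MERGE AND CONCATENATE (illustrative DataFrames)
      import pandas as pd

      customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["West", "East", "East"]})
      orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [120.0, 80.5, 42.0]})

      merged = customers.merge(orders, on="customer_id", how="left")   # Left join keeps customers without orders
      all_orders = pd.concat([orders, orders], ignore_index=True)      # Stack datasets with the same columns vertically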

    double_arrow Identify and Handle Missing Values

    • Identify missing data
      • pandas.df.isnull() - detects missing values (None or numpy.NaN); empty strings "" are not considered NA values
    • Deal with missing data
    • Correct data format. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'
    • 
      # IDENTIFY AND HANDLE MISSING VALUES:
      data_df.isnull().sum().sort_values(ascending=False)         # Count missing values per column, sorted in descending order
      
      missing_data = data_df.isnull()                             # False means there is no missing data in column 
      for column in missing_data.columns.values.tolist():
          print(column)
          print (missing_data[column].value_counts())
          print("") 
      
      data_df.dropna(inplace = True)              # Drop all rows that contain any NaN values
      data_df.dropna(how='all')                   # Drop the rows where all elements are missing
      data_df.dropna(axis='columns')              # Drop the columns where at least one element is missing
      data_df.dropna(subset=["Lot Frontage"])     # Define in which columns to look for missing values
      data_df.drop("Lot Frontage", axis=1)        # Drop the whole attribute (column), e.g. when it contains too many missing values
      
      data_df.price = data_df.price.replace('?',np.nan)                               # Replace ? sign with NaN
      data_df['brand'] = data_df['brand'].replace(['vw', 'vokswagen'], 'volkswagen')  # Fixing typos in the names of the cars
      
      median = data_df["Lot Frontage"].median()
      data_df["Lot Frontage"].fillna(median, inplace = True)  # Replace the missing values with the median value of that column
      mean = data_df["Mas Vnr Area"].mean()
      data_df["Mas Vnr Area"].fillna(mean, inplace = True)    # Replace the missing values with the mean value of that column
      
      data_df.price = data_df.price.astype('int64')           # Convert into the type int
                              

    double_arrow Removing Duplicates

    • pandas.df.duplicated() - check whether there are any duplicates in our data
    • pandas.df.drop_duplicates() - removes all duplicate rows based on all the columns by default
    • pandas.Index.is_unique - property that offers an alternative way to check if there are any duplicated indexes in our dataset
    • 
      # HANDLING DUPLICATES:
      duplicate = data_df[data_df.duplicated(['PID'])]        # To check whether there are any duplicates in the column
      sum(data_df.duplicated(subset = 'car_ID')) == 0         # To find and sum duplicates on specific column
      
      dup_removed = data_df.drop_duplicates()                 # To remove the duplicates
      removed_sub = data_df.drop_duplicates(subset=['Order']) # Remove duplicates on a specific column
      
      data_df.index.is_unique                                 # To check if there are any duplicated Indexes in our dataset
      data_df.brand.nunique()                                 # Count number of distinct elements in specified axis
                              

    double_arrow Splitting the columns

    • pandas.str.split() - splits the string records
      • pat - parameter can be used to split by other characters (if not specified, split on whitespace)
      • n - limit number of splits in output (0 and -1 will be interpreted as return all splits)
      • expand=True - returns a dataframe
      
      # SPLITTING THE COLUMNS:
      data_df[['City', 'Province']] = data_df['GEO'].str.split(',', n=1, expand=True) # Split GEO column to City and Province columns
      data_df['brand'] = data_df.CarName.str.split(' ').str.get(0).str.lower()        # Get all first words of car names in lowercase   
                              

    double_arrow Changing to datetime format

    • pandas.to_datetime() - transforms to date time format
      • We need to specify the format of datetime that we need
      • format='%b-%y' tells pandas to parse an abbreviated month name and a two-digit year
      • str.slice(stop=3) returns the first 3 letters of the month name
      
      # CHANGING TO DATETIME FORMAT:
      data_df['DATE'] = pd.to_datetime(data_df['DATE'], format='%b-%y')
      data_df['Month'] = data_df['DATE'].dt.month_name().str.slice(stop=3)
      data_df['Year'] = data_df['DATE'].dt.year
      
      # Parse 'Duration' strings such as '2h 50m' into separate hour and minute values
      duration = list(data_df['Duration'])
      for i in range(len(duration)):
          if len(duration[i].split()) != 2:
              if 'h' in duration[i]:
                  duration[i] = duration[i].strip() + ' 0m'
              elif 'm' in duration[i] :
                  duration[i] = '0h {}'.format(duration[i].strip())
      dur_hours = []
      dur_minutes = []  
       
      for i in range(len(duration)) :
          dur_hours.append(int(duration[i].split()[0][:-1]))
          dur_minutes.append(int(duration[i].split()[1][:-1]))
           
      data_df['Duration_hours'] = dur_hours
      data_df['Duration_minutes'] = dur_minutes
      data_df['Duration_Total_mins'] = data_df['Duration_hours']*60 + data_df['Duration_minutes']   # Total duration in minutes
      
      data_df["Dep_Hour"]= pd.to_datetime(data_df['Dep_Time']).dt.hour
      data_df["Dep_Min"]= pd.to_datetime(data_df['Dep_Time']).dt.minute
      data_df["Arrival_Hour"]= pd.to_datetime(data_df['Arrival_Time']).dt.hour
      data_df["Arrival_Min"]= pd.to_datetime(data_df['Arrival_Time']).dt.minute
      
      data_df['Month']= pd.to_datetime(data_df["Date_of_Journey"], format="%d/%m/%Y").dt.month
      data_df['Day']= pd.to_datetime(data_df["Date_of_Journey"], format="%d/%m/%Y").dt.day
      data_df['Year']= pd.to_datetime(data_df["Date_of_Journey"], format="%d/%m/%Y").dt.year
      data_df['day_of_week'] = pd.to_datetime(data_df['Date_of_Journey']).dt.day_name()
                              

    double_arrow Identify and handle outliers

    In statistics, an outlier is an observation point that is distant from other observations (can be due to some mistakes in data collection or recording, or due to natural high variability of data points)

    Outliers can markedly affect our models and can be a valuable source of information, providing us insights about specific behaviours

    • Visual methods for identifying outliers:
      • Box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles. Outliers may be plotted as individual points.
      • Scatter plot shows the distribution of data points, helping to spot outliers visually
    • Statistical methods for identifying outliers:
      • Z-score Analysis (how far a data point is from the mean in terms of standard deviations)
        • Z-score - signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured
        • Value that quantifies relationship between a data point and a standard deviation and mean values of a group of points
        • Data points which are too far from zero will be treated as the outliers. In most of the cases, a threshold of 3 or -3 is used
        
        from scipy import stats                                           # scipy submodules must be imported explicitly
        data_df['LQFSF_Stats'] = stats.zscore(data_df['Low Qual Fin SF'])
        data_df[['Low Qual Fin SF','LQFSF_Stats']].describe().round(3)
                                
      • Identify and handle outliers using the Interquartile Range (IQR) method
        • Data points below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are considered outliers
        
        # Detecting outliers: Statistics
        
        # Calculate the interquartile range (Q1: 25th percentile and Q3: 75th percentile)
        Q1, Q50, Q3 = np.percentile(data_df.SalePrice, [25, 50, 75])
        IQR = Q3 - Q1
        
        # Define lower and upper bounds
        lower_bound = Q1 - 1.5*(IQR)
        upper_bound = Q3 + 1.5*(IQR)
        
        print("min:", lower_bound, "Q1:", Q1, "Q50:", Q50, "Q3:", Q3, "max:", upper_bound)
        
        # Identify outliers
        outliers = data_df[(data_df['SalePrice'] < lower_bound) | (data_df['SalePrice'] > upper_bound)]
        print("Outliers:\n", outliers)
        
        # Identify the points
        [x for x in data_df["SalePrice"] if x > upper_bound]
        
        # Remove outliers
        df_cleaned = data_df[(data_df['SalePrice'] >= lower_bound) & (data_df['SalePrice'] <= upper_bound)]
        print("Data after removing outliers:\n", df_cleaned)
                                
    • Uni-variate (using one variable) Analysis
    • 
      sns.boxplot(x=data_df['Lot Area'])   
                              
    • Bi-variate (using two or more variables) Analysis
    • 
      price_area = data_df.plot.scatter(x='Gr Liv Area', y='SalePrice')
                              
    • Deleting the Outliers
    • 
      data_df.sort_values(by = 'Gr Liv Area', ascending = False)[:2]      # Sort values to find last 2 records (index: 1499,2181)
      outliers_dropped = data_df.drop(data_df.index[[1499,2181]])         # If we want to delete some of the outliers
                              

    double_arrow Data Filtering

    We can use logical operators on column values to filter rows. First, we specify the name of the dataframe, then, in square brackets, the column name, a double equals sign ==, and the value to match in single or double quotation marks

    If we want to exclude some entries (e.g. some locations), we use the 'not equal' operator !=. We can also use <, >, <=, >= to filter numeric values, and combine multiple conditions with | (or) and & (and)

      
      calgary = data_df[data_df['GEO']=='Calgary, Alberta']           # select the Calgary, Alberta data
      sel_years = data_df[data_df['Year']==2000]                      # select 2000 year
      
      mult_loc = data_df[(data_df['GEO']=="Toronto, Ontario") | (data_df['GEO']=="Edmonton, Alberta")]                                                       # Select Toronto and Edmonton locations
      cities = ['Calgary', 'Toronto', 'Edmonton']
      CTE = data_df[data_df.City.isin(cities)]                                                                                                               # isin method to select multiple locations
      mult_sel = data_df[(data_df['Year']==1990) & (data_df['TYPE']=="Household heating fuel") & (data_df['City']=='Vancouver')]                             # Select the data that shows the price of the 'household heating fuel', in Vancouver, in 1990
      mult_sel = data_df[(data_df['Year']<=1979) | ((data_df['Year']==2021) & (data_df['TYPE']=="Household heating fuel") & (data_df['City']=='Vancouver'))]  # Select data up to 1979, plus the 2021 'household heating fuel' prices in Vancouver (& binds tighter than |)
                              

4. Exploratory Data Analysis (EDA)

Involves examining and visualizing the dataset to uncover patterns, spot anomalies, test hypotheses, and check assumptions

Helps in understanding the underlying structure of the data, guiding the next steps in the analysis or modeling process

Allows us to get an initial feel for the data, lets us determine if the data makes sense (or if further cleaning or more data is needed), and helps to identify patterns and trends in the data

  • Descriptive Statistics: Calculate measures such as mean, median, mode, standard deviation, etc.
  • Visualization: Use charts, graphs, and plots to understand data distribution, patterns, and anomalies
  • Hypothesis Testing: Formulate and test hypotheses about the data

Detailed breakdown of the EDA step:

  • 1. Understanding the Dataset - Objective: Gain an initial understanding of the dataset and its structure
    • Data Types: Identify the types of data (numerical, categorical, text, dates, etc.)
    • Dimensions: Understand the size and shape of the dataset (number of rows and columns)
    • Summary Statistics: Calculate basic statistics such as mean, median, standard deviation, min, max, and quartiles for numerical features
  • 2. Handling Missing Data - Objective: Identify and handle missing data
    • Missing Value Detection: Identify columns with missing values and the proportion of missing values
    • Imputation: Fill missing values using methods like mean, median, mode, or more sophisticated techniques like KNN imputation or regression
    • Removal: Drop rows or columns with missing values if they are not significant or cannot be imputed reliably
  • 3. Analyzing Distributions - Objective: Understand the distribution of individual features
    • Histograms: Visualize the distribution of numerical features
    • Box Plots: Identify outliers and understand the spread and skewness of the data
    • Density Plots: Visualize the probability density function of a continuous variable
  • 4. Analyzing Relationships Between Variables - Objective: Explore relationships between different features
    • Scatter Plots: Visualize the relationship between two numerical variables
    • Correlation Matrix: Compute and visualize the correlation between numerical features
    • Pair Plots: Visualize pairwise relationships in a dataset
    • Heatmaps: Visualize the correlation matrix
  • 5. Analyzing Categorical Data - Objective: Understand the distribution and relationship of categorical variables
    • Bar Plots: Visualize the frequency distribution of categorical features
    • Count Plots: Count the occurrences of each category in a categorical feature
    • Chi-Square Test: Test for independence between categorical features (see the sketch after this list)
  • 6. Feature Engineering and Transformation - Objective: Create new features or transform existing ones to better capture the information in the dataset
    • Log Transformation: Apply log transformation to reduce skewness
    • Binning: Convert continuous variables into categorical bins
    • Interaction Features: Create new features by combining existing ones (e.g., product or ratio of two features)
    • Encoding Categorical Variables: Convert categorical variables to numerical using one-hot encoding, label encoding, or target encoding
  • 7. Dimensionality Reduction - Objective: Reduce the number of features while preserving the important information
    • Principal Component Analysis (PCA): Reduce dimensionality by projecting data onto principal components
    • t-Distributed Stochastic Neighbor Embedding (t-SNE): Reduce dimensionality for visualization, especially in high-dimensional space
  • 8. Summarizing and Documenting Findings - Objective: Document insights and findings from the EDA process
    • Insights Report: Create a report or presentation summarizing key insights, visualizations, and potential issues identified during EDA
    • Data Quality Issues: Document any data quality issues encountered and how they were addressed
    • Hypothesis Testing: Summarize results of hypothesis tests and their implications
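
    As a minimal sketch of the chi-square test of independence mentioned in bullet 5 above, the snippet below builds a contingency table from two hypothetical categorical columns ('gender' and 'churned' are placeholders) and tests whether they are independent

      # CHI-SQUARE TEST OF INDEPENDENCE (hypothetical categorical columns):
      from scipy.stats import chi2_contingency

      contingency = pd.crosstab(data_df['gender'], data_df['churned'])   # Contingency table of the two variables
      chi2, p_value, dof, expected = chi2_contingency(contingency)       # Test statistic, p-value, degrees of freedom
      print("chi2:", round(chi2, 2), "p-value:", round(p_value, 4))      # A small p-value suggests the variables are related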

    double_arrow Correlation and Causation

    Correlation is a measure of the extent of interdependence between variables; causation is a cause-and-effect relationship between two variables

    Correlation does not imply causation. Determining correlation is much simpler than determining causation, as establishing causation may require independent experimentation

    • Pearson Correlation - measures the linear dependence between two variables X and Y
      • Pearson Correlation is the default method of the function pandas.corr()
      • Correlation coefficient can only be calculated on the numerical attributes (floats and integers)
      • The resulting coefficient is a value between -1 and 1 inclusive, where:
        • 1: Perfect positive linear correlation
        • 0: No linear correlation, the two variables most likely do not affect each other
        • -1: Perfect negative linear correlation
    • P-value
      • Probability of obtaining a correlation at least as extreme as the one observed, assuming the null hypothesis of no true correlation
      • Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant
      • Smallest significance level at which the null hypothesis would be rejected
      • By convention, when the
        • p-value is < 0.001: we say there is strong evidence that the correlation is significant
        • the p-value is < 0.05: there is moderate evidence that the correlation is significant
        • the p-value is < 0.1: there is weak evidence that the correlation is significant
        • the p-value is > 0.1: there is no evidence that the correlation is significant
      
      corr_matrix = data_df.corr(numeric_only=True)
      corr_matrix['SalePrice'].sort_values(ascending=False)                                   # List of top features that have high correlation coefficient
      
      hous_num = data_df.select_dtypes(include = ['float64', 'int64'])                                            # Select only float and int data types
      hous_num_corr = hous_num.corr()['SalePrice'][:-1]                                                           # [:-1] drops the last entry, SalePrice's correlation with itself
      top_features = hous_num_corr[abs(hous_num_corr) > 0.5].sort_values(ascending=False)                         # Pearson correlation coefficients with absolute value greater than 0.5
      print("There is {} strongly correlated values with SalePrice:\n{}".format(len(top_features), top_features))
      for i in range(0, len(hous_num.columns), 5):
          sns.pairplot(data=hous_num, x_vars=hous_num.columns[i:i+5], y_vars=['SalePrice'])
      
      sns.set_context('talk')
      sns.pairplot(data_df, hue='species');
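      
      # The corr() matrix above gives coefficients only; a minimal sketch of getting the
      # Pearson coefficient together with its p-value (reusing the 'Gr Liv Area' and
      # 'SalePrice' columns from the examples above; assumes no missing values in them)
      from scipy.stats import pearsonr
      pearson_coef, p_value = pearsonr(data_df['Gr Liv Area'], data_df['SalePrice'])
      print("Pearson coefficient:", round(pearson_coef, 3), "p-value:", p_value)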
                              

    double_arrow ANOVA - Analysis of Variance

    Statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

    • F-test score - ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means
    • P-value - tells how statistically significant our calculated score value is
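
    As a minimal sketch (assuming a hypothetical 'Neighborhood' grouping column and the 'SalePrice' column used in earlier examples), scipy's f_oneway returns both values:

      # ONE-WAY ANOVA: does mean SalePrice differ across neighborhoods?
      from scipy.stats import f_oneway

      groups = [g['SalePrice'].values for _, g in data_df.groupby('Neighborhood')]   # One array per group
      f_score, p_value = f_oneway(*groups)                                           # F-test score and p-value
      print("F-test score:", round(f_score, 2), "p-value:", p_value)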

    double_arrow Basics of Grouping

    • pandas.df.groupby - groups data by different categories
      • The data is grouped based on one or several variables, and analysis is performed on the individual groups
      • Most commonly, we use groupby to split the data into groups, this will apply some function to each of the groups (e.g. mean, median, min, max, count), then combine the results into a data structure
    • pandas.df.pivot - convert the dataframe to a pivot table
      • The grouped data is much easier to visualize when it is made into a pivot table
      • A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row
    • pandas.df.reset_index - resets the index
    • 
      # Group by single column
      grouped_by_product = data_df.groupby('product')['sales'].sum()
      
      # Group by multiple columns
      group_year = data_df.groupby(['Year'])['VALUE'].mean()                                                  # Calculate the mean of the prices per year
      group_month = data_df.groupby(['Month'])['VALUE'].max()                                                 # Group by the maximum value of prices, for each month
      group_city = data_df.groupby(['Year', 'City'])['VALUE'].median().reset_index(name ='Value').round(2)    # Group by the median value of prices, for each year and each city
      
      # Using multiple aggregation functions
      data_df.groupby('species').agg(['mean', 'median'])                      # Passing a list of recognized strings
      data_df.groupby('species').agg([np.mean, np.median])                    # Passing a list of explicit aggregation functions
      
      grouped_with_agg = data_df.groupby('product')['sales'].agg(['sum', 'mean', 'count'])
      
      # Using pivot
      data_df.pivot(index='col1', columns='col2', values='col3')
                              

    double_arrow Analyzing Individual Feature Patterns Using Visualization

      
      # A simple scatter plot of sepal_length vs sepal_width
      ax = plt.axes()
      ax.scatter(data_df.sepal_length, data_df.sepal_width)
      # Label the axes
      ax.set(xlabel='Sepal Length (cm)', ylabel='Sepal Width (cm)', title='Sepal Length vs Width');
      
      # A histogram of petal length
      ax = plt.axes()
      ax.hist(data_df.petal_length, bins=25);
      ax.set(xlabel='Petal Length (cm)', ylabel='Frequency', title='Distribution of Petal Lengths');
      ax = data_df.petal_length.plot.hist(bins=25)                                                    # Alternatively using Pandas plotting functionality
      
      sns.set_context('notebook') 
      ax = data_df.plot.hist(bins=25, alpha=0.5)                                                      # Single plot with histograms for each feature overlaid
      ax.set_xlabel('Size (cm)');
      
      # To create four separate plots, use Pandas `.hist` method
      axList = data_df.hist(bins=25)
      # Add some x- and y- labels to first column and last row
      for ax in axList.flatten():
          if ax.get_subplotspec().is_last_row():
              ax.set_xlabel('Size (cm)')
              
          if ax.get_subplotspec().is_first_col():
              ax.set_ylabel('Frequency')
      
      data_df.boxplot(by='species')                               # Boxplot of each petal and sepal measurement
      
      # Single boxplot where the features are separated in the x-axis and species are colored with different hues
      plot_data = (data_df
                   .set_index('species')
                   .stack()
                   .to_frame()
                   .reset_index()
                   .rename(columns={0:'size', 'level_1':'measurement'}))
      sns.set_style('white')
      sns.set_context('notebook')
      sns.set_palette('dark')
      f = plt.figure(figsize=(6,4))
      sns.boxplot(x='measurement', y='size', hue='species', data=plot_data);
                              

5. Feature Engineering

Vital step in the data science process that significantly impacts the performance and accuracy of machine learning models

By creating new features, transforming existing ones, selecting the most relevant features, and applying domain knowledge, data scientists can enhance the predictive power of their models

  • Selection: identify and select the most relevant features (variables) for the model
  • Creation: create new features from existing ones to improve model performance
  • Reduction: use techniques like PCA (Principal Component Analysis) to reduce dimensionality

Detailed breakdown of the Feature Engineering step:

  • 1. Understanding the Importance of Features - Objective: Recognize the role of features in influencing model performance
    • Predictive Power: Features are the input variables that models use to make predictions
    • Data Representation: The way features represent the underlying data can significantly affect the accuracy and robustness of the model
  • 2. Types of Feature Engineering - Objective: Identify different types of feature engineering techniques
    • Feature Creation - Objective: Create new features that capture additional information from the data
      • Mathematical Transformations: Apply mathematical operations to create new features (e.g., log, square root, power)
      • Aggregation: Summarize information across groups (e.g., mean, sum, count of transactions per customer)
      • Date and Time Features: Extract information from date and time (e.g., day of the week, month, hour)
    • Feature Transformation - Objective: Transform existing features to improve their representation
      • Normalization: Scale numerical features to a common range (e.g., 0 to 1)
      • Standardization: Center features around zero with a standard deviation of one
      • Encoding Categorical Variables: Convert categorical features into numerical format using techniques like one-hot encoding, label encoding, or target encoding
    • Feature Selection - Objective: Select the most relevant features for modeling
      • Filter Methods: Use statistical techniques to select features (e.g., chi-square test, correlation threshold)
      • Wrapper Methods: Use machine learning models to evaluate the importance of features (e.g., recursive feature elimination)
      • Embedded Methods: Feature selection integrated into model training (e.g., Lasso, decision tree feature importance)
  • 3. Domain Knowledge - Objective: Utilize domain expertise to create meaningful features
    • Industry-Specific Features: Create features based on industry knowledge (e.g., financial ratios in finance, health indicators in healthcare)
    • Business Logic: Incorporate business rules and logic into feature creation (e.g., customer segmentation based on purchasing behavior)
  • 4. Interaction Features - Objective: Create features that capture interactions between existing features
    • Polynomial Features: Create polynomial terms to capture non-linear relationships
    • Product and Ratio Features: Combine features through multiplication, division, etc.
  • 5. Handling Missing Values - Objective: Address missing data in features
    • Imputation: Fill missing values with mean, median, mode, or more advanced techniques
    • Indicator Variables: Create binary indicators for missing values
  • 6. Feature Reduction - Objective: Reduce the number of features while preserving the important information (a PCA sketch follows this list)
    • Principal Component Analysis (PCA): Reduce dimensionality by projecting data onto principal components
    • t-Distributed Stochastic Neighbor Embedding (t-SNE): Reduce dimensionality for visualization, especially in high-dimensional space
  • 7. Feature Encoding - Objective: Encode categorical features to numerical values
    • One-Hot Encoding: Create binary columns for each category
    • Label Encoding: Convert categories to numerical labels
    • Target Encoding: Encode categories based on their target mean
  • 8. Feature Scaling - Objective: Scale features to ensure they contribute equally to the model
    • Normalization: Scale features to a range (e.g., 0 to 1)
    • Standardization: Scale features to have mean 0 and standard deviation 1
  • 9. Feature Selection - Objective: Select the most relevant features to improve model performance and reduce overfitting
    • Filter Methods: Select features based on statistical measures (e.g., correlation, chi-square)
    • Wrapper Methods: Use model performance to select features (e.g., recursive feature elimination)
    • Embedded Methods: Select features during model training (e.g., Lasso, decision trees)
  • 10. Validation and Testing - Objective: Validate the impact of engineered features on model performance
    • Cross-Validation: Evaluate the model using cross-validation techniques
    • Model Comparison: Compare the performance of models with and without engineered features
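
    As a minimal sketch of feature reduction with PCA (bullet 6 above), the snippet below standardizes the numeric columns and keeps the first two principal components; it assumes the numeric features have already been cleaned

      # FEATURE REDUCTION WITH PCA (illustrative sketch):
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import StandardScaler

      numeric_df = data_df.select_dtypes(include=['float64', 'int64']).dropna()
      X_scaled = StandardScaler().fit_transform(numeric_df)       # PCA is sensitive to feature scale
      pca = PCA(n_components=2)                                   # Keep the first two principal components
      components = pca.fit_transform(X_scaled)
      print("Explained variance ratio:", pca.explained_variance_ratio_)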

    double_arrow Feature Creation

    • Log Transformation
    • Aggregation
    • Date and time features
    • 
      # Log transformation
      log_transformed = np.log(data_df['SalePrice'])
      
      # Aggregation
      data_df['total_purchases'] = data_df.groupby('customer_id')['purchase_amount'].transform('sum')
      
      # Date and time features
      data_df['purchase_date'] = pd.to_datetime(data_df['purchase_date'])
      data_df['day_of_week'] = data_df['purchase_date'].dt.dayofweek
      data_df['month'] = data_df['purchase_date'].dt.month
                              

    double_arrow Feature Transformation

    • Normalization or Min-max scaling is the process of transforming values of several variables into a similar range
      • MinMaxScaler from scikit-learn transform features by scaling each feature to a given range
      • Values are shifted and rescaled so they end up ranging from 0 to 1
      • This is done by subtracting the min value and dividing by the max minus min
    • Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons
      • StandardScaler from scikit-learn standardize features by removing the mean and scaling to unit variance
      • First it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation, so that the resulting distribution has unit variance
    • Encoding Categorical Variables
      • One-hot encoding (Indicator variable) - create binary columns for each category
        • pandas.get_dummies convert categorical variable into dummy/indicator variables
        • They are called dummies because the numbers themselves don't have inherent meaning
        • We use indicator variables so we can use categorical variables for regression analysis
      • Label encoding - convert categories to numerical labels
      • Target encoding - encode categories based on their target mean
    • Binning is the process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis
      
      from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, LabelEncoder
      
      # Normalization
      scaler = MinMaxScaler()
      data_df['normalized_feature'] = scaler.fit_transform(data_df[['numerical_feature']])
      
      # Equivalent manual computation, where (min, max) below is the desired feature range (e.g. 0 and 1):
      X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
      X_scaled = X_std * (max - min) + min
      
      # Standardization
      scaler = StandardScaler()
      data_df['standardized_feature'] = scaler.fit_transform(data_df[['numerical_feature']])
      
      # z = (x - u) / s 
      
      # One-hot encoding
      data_df = pd.get_dummies(data=data_df, columns = ['Airline', 'Source', 'Destination'])
      
      # Label encoding
      label_encoder = LabelEncoder()
      data_df['encoded_feature'] = label_encoder.fit_transform(data_df['categorical_feature'])
      
      # Binning
      data_df["Arrival_Hour"]= pd.to_datetime(data_df['Arrival_Time']).dt.hour
      data_df['arr_timezone'] = pd.cut(data_df.Arrival_Hour, [0,6,12,18,24], labels=['Night','Morning','Afternoon','Evening'])
      
      data_df['binned_feature'] = pd.cut(data_df['numerical_feature'], bins=5, labels=False)
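      
      # Target encoding (not shown above): encode each category by the mean of the target
      # for that category. A minimal sketch using the 'Airline' column and a hypothetical
      # 'Price' target; in practice, fit the category means on the training split only
      target_means = data_df.groupby('Airline')['Price'].mean()             # Mean target value per category
      data_df['Airline_target_enc'] = data_df['Airline'].map(target_means)  # Map each category to its mean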
                              

    double_arrow Feature Selection

    • Filter Methods Use statistical techniques to select features (e.g., chi-square test, correlation threshold)
    • Wrapper Methods Use machine learning models to evaluate the importance of features (e.g., recursive feature elimination)
    • Embedded Methods Feature selection integrated into model training (e.g., Lasso, decision tree feature importance)
    • 
      from sklearn.feature_selection import SelectKBest, chi2, RFE
      from sklearn.linear_model import LogisticRegression
      
      # Filter method
      selector = SelectKBest(chi2, k=10)
      selected_features = selector.fit_transform(data_df.drop(columns=['target']), data_df['target'])
      
      # Wrapper method
      model = LogisticRegression()
      rfe = RFE(model, n_features_to_select=10)
      selected_features = rfe.fit_transform(data_df.drop(columns=['target']), data_df['target'])
                                  
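    A minimal sketch of an embedded method, using L1-regularized logistic regression with SelectFromModel (assuming the same data_df with a 'target' column as above):

      from sklearn.feature_selection import SelectFromModel
      from sklearn.linear_model import LogisticRegression
      
      # Embedded method: the L1 penalty drives uninformative coefficients to zero,
      # and SelectFromModel keeps only the features with non-zero weights
      l1_model = LogisticRegression(penalty='l1', solver='liblinear')
      embedded_selector = SelectFromModel(l1_model)
      selected_features = embedded_selector.fit_transform(data_df.drop(columns=['target']), data_df['target'])
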

6. Modeling

This is where the core of the data science work happens: it involves selecting the right algorithms, training models, tuning hyperparameters, and evaluating performance to ensure the model meets the defined objectives

By carefully executing each sub-step and iterating based on feedback, data scientists can develop robust models that provide valuable insights and predictions

  • Algorithms: choose appropriate algorithms based on the problem (e.g., regression, classification, clustering)
  • Training: train the model using the prepared data
  • Validation: evaluate the model's performance using techniques like cross-validation

Detailed breakdown of the Modeling step:

  • 1. Selecting the Modeling Approach - Objective: Choose the appropriate type of model based on the problem and data
    • Types of Machine Learning Models
      • Supervised Learning: Models are trained on labeled data to make predictions or classifications
        • Regression: Predict continuous outcomes (e.g., Linear Regression, Ridge Regression)
        • Classification: Predict discrete classes (e.g., Logistic Regression, Decision Trees, Random Forests, SVMs)
      • Unsupervised Learning: Models are used on unlabeled data to find patterns or groupings
        • Clustering: Group similar data points (e.g., K-Means, Hierarchical Clustering)
        • Dimensionality Reduction: Reduce the number of features while preserving essential information (e.g., PCA, t-SNE)
      • Reinforcement Learning: Models learn by interacting with an environment to maximize cumulative rewards (e.g., Q-Learning, Deep Q-Networks)
  • 2. Splitting the Data - Objective: Divide the data into training and testing sets to evaluate model performance
    • Training Set: Used to train the model
    • Testing Set: Used to evaluate the model’s performance on unseen data
    • Validation Set: Sometimes used to tune hyperparameters and prevent overfitting (often through cross-validation)
  • 3. Training the Model - Objective: Fit the model to the training data
    • Fitting: Adjust the model parameters to minimize the error on the training data
    • Hyperparameters: Parameters that are not learned from the data but are set before training (e.g., number of trees in Random Forest, learning rate in Gradient Boosting)
  • 4. Hyperparameter Tuning - Objective: Optimize model performance by finding the best hyperparameters
    • Grid Search: Exhaustively search over a specified parameter grid
    • Random Search: Randomly sample hyperparameter settings from a specified parameter space (a RandomizedSearchCV sketch follows the grid search code below)
    • Bayesian Optimization: Use probabilistic models to find optimal hyperparameters
    • Cross-Validation: Use cross-validation to evaluate different hyperparameter settings
  • 5. Model Evaluation - Objective: Assess the model’s performance using metrics and validate its generalization ability
    • Evaluation Metrics: Choose metrics based on the problem type (e.g., accuracy, precision, recall, F1-score for classification; mean squared error, R² for regression)
    • Confusion Matrix: For classification problems, show the performance across different classes
    • ROC Curve and AUC: Evaluate classification models’ ability to discriminate between classes
  • 6. Model Interpretation - Objective: Understand and interpret the model’s behavior and predictions
    • Feature Importance: Evaluate which features are most influential in the model
    • Coefficients: Examine the coefficients in linear models to understand feature impacts
    • SHAP and LIME: Use techniques like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) to explain complex models
  • 7. Model Validation - Objective: Validate the model’s performance and robustness
    • Cross-Validation: Perform k-fold cross-validation to assess the model’s stability and performance across different data splits
    • Holdout Set: Use a separate holdout set if available to validate the model’s performance on truly unseen data
  • 8. Model Deployment - Objective: Deploy the model into a production environment for real-world use
    • Model Serialization: Save the trained model using serialization libraries like pickle or joblib
    • Integration: Integrate the model with existing systems or applications
    • Monitoring: Monitor the model’s performance over time and update it as necessary
  • 9. Iteration and Improvement - Objective: Refine and improve the model based on feedback and new data
    • Model Refinement: Adjust model parameters or try different algorithms to improve performance
    • Feature Engineering: Revisit feature engineering to enhance model input
    • Data Augmentation: Collect more data or use techniques to augment existing data

    double_arrow Selecting the Modeling Approach

      
      from sklearn.linear_model import LogisticRegression
      from sklearn.cluster import KMeans
      
      # Classification model
      model = LogisticRegression()
      
      # Clustering model
      kmeans = KMeans(n_clusters=3)
                              

    double_arrow Splitting the Data

      
      from sklearn.model_selection import train_test_split
      
      # Split the data
      X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['target']), df['target'], test_size=0.2, random_state=42)
                              

    double_arrow Training the Model

      
      # Train the model
      model.fit(X_train, y_train)
                              

    double_arrow Hyperparameter Tuning

      
      from sklearn.model_selection import GridSearchCV
      
      # Define parameter grid
      param_grid = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'saga']}
      
      # Grid search
      grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
      grid_search.fit(X_train, y_train)
      
      # Best parameters
      print(grid_search.best_params_)
                              
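    Random search can be sketched with RandomizedSearchCV, which samples a fixed number of settings from the parameter space instead of trying every combination; the parameter values below are illustrative:

      from sklearn.model_selection import RandomizedSearchCV
      
      # Sample n_iter parameter settings at random rather than exhaustively
      param_distributions = {'C': [0.01, 0.1, 1, 10, 100], 'solver': ['liblinear', 'saga']}
      random_search = RandomizedSearchCV(LogisticRegression(), param_distributions, n_iter=5, cv=5, random_state=42)
      random_search.fit(X_train, y_train)
      print(random_search.best_params_)
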

    double_arrow Model Evaluation

      
      from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
      
      # Predict on test set
      y_pred = model.predict(X_test)
      
      # Evaluate performance
      accuracy = accuracy_score(y_test, y_pred)
      print(f'Accuracy: {accuracy}')
      
      # Classification report
      print(classification_report(y_test, y_pred))
      
      # Confusion matrix
      print(confusion_matrix(y_test, y_pred))
                              

    double_arrow Model Interpretation

      
      import shap
      
      # SHAP values
      explainer = shap.Explainer(model, X_train)
      shap_values = explainer(X_test)
      shap.summary_plot(shap_values, X_test)
                              

    double_arrow Model Validation

      
      from sklearn.model_selection import cross_val_score
      
      # Cross-validation
      scores = cross_val_score(model, df.drop(columns=['target']), df['target'], cv=5)
      print(f'Cross-Validation Scores: {scores}')
      print(f'Mean Score: {scores.mean()}')
                              

7. Evaluation

Essential for validating the performance and effectiveness of your machine learning models; proper evaluation helps in making informed decisions and improving model reliability and robustness

By using appropriate metrics, performing cross-validation, comparing different models, analyzing errors, and interpreting results, you ensure that the model not only performs well on training data but also generalizes effectively to new, unseen data

  • Metrics: use evaluation metrics such as accuracy, precision, recall, F1 score, RMSE (Root Mean Square Error), etc.
  • Comparison: compare different models and select the best-performing one
  • Interpretation: interpret the model results in the context of the business problem

Detailed breakdown of the Evaluation step:

  • 1. Define Evaluation Metrics - Objective: Select appropriate metrics based on the type of problem (e.g., classification, regression)
    • Classification Metrics:
      • Accuracy: The proportion of correctly classified instances out of all instances
      • Precision: The proportion of true positive predictions out of all positive predictions (true positives + false positives)
      • Recall: The proportion of true positive predictions out of all actual positive instances (true positives + false negatives)
      • F1 Score: The harmonic mean of precision and recall, balancing both metrics
      • ROC Curve: A graphical plot showing the true positive rate against the false positive rate
      • AUC (Area Under the Curve): The area under the ROC curve, indicating the model’s ability to discriminate between classes
    • Regression Metrics:
      • Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values
      • Mean Squared Error (MSE): The average of the squared differences between predicted and actual values
      • Root Mean Squared Error (RMSE): The square root of MSE, providing error magnitude in the same units as the target variable
      • R-squared (R²): The proportion of variance in the target variable that is predictable from the features
  • 2. Cross-Validation - Objective: Assess the model’s performance on different subsets of the data to ensure it generalizes well
    • K-Fold Cross-Validation: Split the data into k subsets (folds) and train/test the model k times, each time using a different fold as the test set and the remaining k-1 folds as the training set
    • Stratified K-Fold: Ensure that each fold has a representative proportion of each class (for classification problems); a sketch follows the cross-validation code below
  • 3. Model Comparison - Objective: Compare the performance of different models to select the best one
    • Benchmarking: Compare models against each other using the same evaluation metrics
    • Statistical Tests: Use statistical tests to determine if differences in model performance are significant (a paired t-test sketch follows the comparison code below)
  • 4. Model Validation - Objective: Validate the model’s performance to ensure it meets the project’s goals and requirements
    • Holdout Validation: Use a separate holdout set (if available) to test the model’s performance on data it has never seen before
    • Real-World Testing: Test the model in a real-world setting or a production-like environment to ensure it performs as expected
  • 5. Error Analysis - Objective: Analyze model errors to understand where the model is making mistakes and how it can be improved
    • Residual Analysis: For regression, analyze residuals (differences between predicted and actual values) to check for patterns; a plotting sketch follows the error analysis code below
    • Confusion Matrix: For classification, examine the confusion matrix to understand misclassifications
  • 6. Model Interpretation - Objective: Interpret and understand the model’s predictions and decision-making process
    • Feature Importance: Identify which features are most important for the model’s predictions
    • Partial Dependence Plots: Visualize the effect of individual features on the model’s predictions (a sketch follows the interpretation code below)
    • SHAP and LIME: Use tools like SHAP and LIME to provide local explanations for individual predictions
  • 7. Document and Report Findings - Objective: Document the evaluation results, insights, and decisions
    • Evaluation Report: Create a comprehensive report summarizing the model’s performance, strengths, weaknesses, and recommendations
    • Visualizations: Include charts and plots to illustrate the model’s performance and error analysis

    double_arrow Define Evaluation Metrics

      
      # CLASSIFICATION
      # Model predictions
      from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
      y_pred = model.predict(X_test)
      y_prob = model.predict_proba(X_test)[:, 1]  # Probabilities for ROC
      
      # Accuracy
      accuracy = accuracy_score(y_test, y_pred)
      
      # Precision, Recall, F1 Score
      precision = precision_score(y_test, y_pred)
      recall = recall_score(y_test, y_pred)
      f1 = f1_score(y_test, y_pred)
      
      # ROC Curve
      fpr, tpr, _ = roc_curve(y_test, y_prob)
      roc_auc = auc(fpr, tpr)
      print(f'Accuracy: {accuracy}')
      print(f'Precision: {precision}')
      print(f'Recall: {recall}')
      print(f'F1 Score: {f1}')
      print(f'ROC AUC: {roc_auc}')
      
      # REGRESSION
      # Model predictions
      y_pred = model.predict(X_test)
      
      # MAE, MSE, RMSE
      from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
      import numpy as np  # for the RMSE square root below
      mae = mean_absolute_error(y_test, y_pred)
      mse = mean_squared_error(y_test, y_pred)
      rmse = np.sqrt(mse)
      
      # R-squared
      r2 = r2_score(y_test, y_pred)
      print(f'MAE: {mae}')
      print(f'MSE: {mse}')
      print(f'RMSE: {rmse}')
      print(f'R-squared: {r2}')
                              

    double_arrow Cross-Validation

      
      # Cross-validation
      from sklearn.model_selection import cross_val_score
      scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # for classification
      print(f'Cross-Validation Scores: {scores}')
      print(f'Mean Score: {scores.mean()}')
                              
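      Stratified K-Fold preserves class proportions in every fold; a minimal sketch reusing the same model, X, and y:

      from sklearn.model_selection import StratifiedKFold, cross_val_score
      
      # Each fold keeps roughly the same class distribution as the full dataset
      skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
      scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
      print(f'Stratified CV Scores: {scores}')
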

    double_arrow Model Comparison

      
      # Compare multiple models
      from sklearn.model_selection import cross_val_score
      from sklearn.linear_model import LogisticRegression
      from sklearn.ensemble import RandomForestClassifier
      model1 = LogisticRegression()
      model2 = RandomForestClassifier()
      scores1 = cross_val_score(model1, X, y, cv=5)
      scores2 = cross_val_score(model2, X, y, cv=5)
      print(f'Model 1 Scores: {scores1.mean()}')
      print(f'Model 2 Scores: {scores2.mean()}')
                              
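      One simple (if approximate) statistical check is a paired t-test on the per-fold scores; a sketch reusing scores1 and scores2 from the comparison above:

      from scipy.stats import ttest_rel
      
      # Paired t-test on per-fold scores; a small p-value suggests the difference is unlikely to be chance
      # (fold scores are not fully independent, so treat this as a rough check)
      t_stat, p_value = ttest_rel(scores1, scores2)
      print(f'Paired t-test p-value: {p_value}')
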

    double_arrow Error Analysis

      
      # Confusion matrix
      from sklearn.metrics import confusion_matrix
      conf_matrix = confusion_matrix(y_test, y_pred)
      print(f'Confusion Matrix:\n{conf_matrix}')
                              
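      For regression models, residual analysis can be sketched with matplotlib; a pattern-free cloud of residuals around zero is the hoped-for outcome:

      import matplotlib.pyplot as plt
      
      # Residuals = actual - predicted values
      residuals = y_test - y_pred
      plt.scatter(y_pred, residuals)
      plt.axhline(0, color='red', linestyle='--')
      plt.xlabel('Predicted values')
      plt.ylabel('Residuals')
      plt.title('Residual Plot')
      plt.show()
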

    double_arrow Model Interpretation

      
      # SHAP values
      import shap
      explainer = shap.Explainer(model, X_train)
      shap_values = explainer(X_test)
      shap.summary_plot(shap_values, X_test)
                              
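      Partial dependence plots can be sketched with scikit-learn's PartialDependenceDisplay (available in recent scikit-learn versions; 'feature_1' and 'feature_2' are hypothetical column names):

      from sklearn.inspection import PartialDependenceDisplay
      import matplotlib.pyplot as plt
      
      # Show how the model's average prediction changes as each selected feature varies
      PartialDependenceDisplay.from_estimator(model, X_test, features=['feature_1', 'feature_2'])
      plt.show()
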

8. Deployment

Ensures that the machine learning model is not only functional but also reliable and secure in a real-world environment

By carefully managing serialization, integration, monitoring, updating, documentation, and security, you can successfully deploy a model that delivers valuable insights and predictions to end-users

  • Implementation: deploy the model in a production environment
  • Integration: integrate the model with existing systems and workflows
  • Monitoring: continuously monitor the model’s performance and update as needed

Detailed breakdown of the Deployment step:

  • 1. Model Serialization - Objective: Save the trained model in a format that can be easily loaded and used in production
    • Pickle: A Python library used to serialize and deserialize Python objects
    • Joblib: A library optimized for serializing large numpy arrays
  • 2. Model Integration - Objective: Integrate the model into the application or system where it will be used
    • Batch Processing - Objective: Apply the model to a batch of data periodically
      • Scripts: Write scripts to load the model and process new data batches
      • Scheduled Jobs: Use cron jobs or task schedulers to automate batch processing
    • Real-Time Processing - Objective: Use the model to make predictions in real-time
      • APIs: Deploy the model behind a web API using frameworks like Flask, FastAPI, or Django
      • Microservices: Use containerization tools like Docker and orchestration platforms like Kubernetes
  • 3. Model Monitoring - Objective: Continuously monitor the model’s performance and health in production
    • Performance Monitoring - Objective: Track key performance metrics to detect degradation
      • Accuracy Tracking: Monitor metrics like accuracy, precision, recall, etc.
      • Latency Tracking: Measure the time taken for the model to make predictions (a timing sketch follows the drift-detection code below)
    • Data Drift Detection - Objective: Detect changes in the input data distribution that may affect model performance
      • Statistical Tests: Use statistical methods to compare the current data distribution with the training data distribution
      • Monitoring Tools: Utilize tools and platforms like Evidently, DataRobot, or custom scripts
  • 4. Model Updating and Retraining - Objective: Update and retrain the model to maintain or improve performance
    • Scheduled Retraining: Retrain the model at regular intervals using new data
    • Triggered Retraining: Retrain the model when performance metrics indicate degradation
  • 5. Documentation and Reporting - Objective: Document the deployment process, including setup, configurations, and usage
    • Deployment Guide: Create a guide detailing the steps to deploy and run the model
    • API Documentation: Provide clear documentation for any APIs, including endpoints, request/response formats, and examples
    • Monitoring Reports: Regularly report on model performance, issues, and updates
  • 6. Security and Compliance - Objective: Ensure the deployed model adheres to security and compliance standards
    • Data Privacy: Implement measures to protect sensitive data
    • Access Control: Restrict access to the model and data to authorized users
    • Compliance: Ensure the deployment complies with relevant regulations (e.g., GDPR, HIPAA)

    double_arrow Model Serialization

      
      import pickle
      import joblib
      
      # Using pickle
      with open('model.pkl', 'wb') as f:
          pickle.dump(model, f)
      
      # Using joblib
      joblib.dump(model, 'model.joblib')
      
      # Loading the model
      with open('model.pkl', 'rb') as f:
          loaded_model = pickle.load(f)
                                  

    double_arrow Model Integration

      
      # Script to load the model and process a batch of data
      import joblib
      
      def process_batch(data_batch):
          model = joblib.load('model.joblib')
          predictions = model.predict(data_batch)
          return predictions
      # Example of a cron job to run the script daily
      # 0 0 * * * /usr/bin/python /path/to/your/script.py
      
      from flask import Flask, request, jsonify
      import joblib
      
      app = Flask(__name__)
      model = joblib.load('model.joblib')
      
      @app.route('/predict', methods=['POST'])
      def predict():
          data = request.get_json(force=True)
          prediction = model.predict([data['features']])
          return jsonify({'prediction': prediction.tolist()})
      
      if __name__ == '__main__':
          app.run(debug=True)
                                  

    double_arrow Model Monitoring

      
      from scipy.stats import ks_2samp
      
      # Function to detect data drift
      def detect_drift(new_data, reference_data):
          p_value = ks_2samp(new_data, reference_data).pvalue
          return p_value < 0.05  # If p-value is less than 0.05, data drift is detected
      
      # Monitor data drift
      is_drifted = detect_drift(new_data['feature'], reference_data['feature'])
      print(f'Data Drift Detected: {is_drifted}')
                                  
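      Latency tracking can be sketched with a small timing wrapper around the prediction call (the logging configuration below is an illustrative choice):

      import time
      import logging
      
      logging.basicConfig(level=logging.INFO)
      
      def predict_with_monitoring(model, features):
          # Time each prediction call and log the latency so slow responses can be spotted
          start = time.time()
          prediction = model.predict(features)
          latency_ms = (time.time() - start) * 1000
          logging.info(f'Prediction latency: {latency_ms:.2f} ms')
          return prediction
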

    double_arrow Model Updating and Retraining

      
      def retrain_model(new_data, new_labels):
          # Load existing model
          model = joblib.load('model.joblib')
          
          # Retrain model with new data
          model.fit(new_data, new_labels)
          
          # Save the retrained model
          joblib.dump(model, 'model.joblib')
      
      # Example of retraining the model
      new_data = ...  # Load new data
      new_labels = ...  # Load new labels
      retrain_model(new_data, new_labels)
                                  

    double_arrow Security and Compliance

      
      from flask import Flask, request, jsonify
      import joblib
      import jwt  # For authentication tokens
      
      app = Flask(__name__)
      model = joblib.load('model.joblib')
      
      def authenticate(token):
          # Implement token authentication logic
          try:
              payload = jwt.decode(token, 'your-secret-key', algorithms=['HS256'])
              return payload['user'] == 'authorized_user'
          except jwt.ExpiredSignatureError:
              return False
          except jwt.InvalidTokenError:
              return False
      
      @app.route('/predict', methods=['POST'])
      def predict():
          token = request.headers.get('Authorization')
          if not authenticate(token):
              return jsonify({'error': 'Unauthorized'}), 401
          
          data = request.get_json(force=True)
          prediction = model.predict([data['features']])
          return jsonify({'prediction': prediction.tolist()})
      
      if __name__ == '__main__':
          app.run(debug=True)
                                  

9. Communication

Ensures that the results of the data science process are effectively conveyed to stakeholders in a clear, concise, and actionable manner

By tailoring the message to the audience, using appropriate formats and visualizations, and gathering feedback, data scientists can ensure that their work has a meaningful impact on decision-making and business outcomes

  • Reporting: create reports and visualizations to communicate findings to stakeholders
  • Presentation: present insights and recommendations in a clear and actionable manner

Detailed breakdown of the Communication step:

    • 1. Identify the Audience - Objective: Tailor the communication to the knowledge level, interests, and needs of different stakeholders
      • Executives: Focus on high-level insights, business impact, and strategic recommendations
      • Technical Teams: Provide detailed methodology, technical findings, and implications for system integration
      • End Users: Highlight practical benefits and usability of the results
    • 2. Structure the Message - Objective: Organize the communication in a clear and logical manner
      • Executive Summary - Objective: Summarize the key findings, implications, and recommendations
        • Key Insights: Highlight the most critical findings
        • Business Impact: Explain how the findings affect the business
        • Recommendations: Provide actionable suggestions based on the analysis
      • Detailed Analysis - Objective: Provide a thorough explanation of the data, methodology, and results
        • Data Overview: Describe the data sources, volume, and characteristics
        • Methodology: Explain the analytical techniques and models used
        • Results: Present detailed findings with supporting evidence
      • Visualizations - Objective: Use visual aids to enhance understanding and engagement
        • Graphs and Charts: Use bar charts, line graphs, scatter plots, etc., to illustrate key points
        • Tables: Present detailed numerical data in a clear and organized manner
        • Dashboards: Create interactive dashboards for real-time exploration of the results
    • 3. Deliver the Message - Objective: Choose the appropriate format and medium for communication
      • Reports - Objective: Provide a comprehensive document detailing the entire analysis process and findings
        • Written Report: Include sections for executive summary, detailed analysis, visualizations, and appendices
        • Technical Documentation: Provide detailed technical explanations and code snippets for reproducibility
      • Presentations - Objective: Summarize and present key findings in a concise and engaging format
        • Slides: Use presentation software (e.g., PowerPoint, Keynote) to create slides
        • Storytelling: Build a narrative around the findings to maintain interest and relevance
      • Interactive Dashboards - Objective: Allow stakeholders to explore the data and findings interactively
        • Tools: Use tools like Tableau, Power BI, or web-based applications (e.g., Dash, Streamlit) to create dashboards (a Streamlit sketch follows this list)
        • Customization: Tailor dashboards to show relevant metrics and allow filtering and drilling down into the data
    • 4. Feedback and Iteration - Objective: Gather feedback from stakeholders to refine and improve the communication and the model itself
      • Feedback Sessions: Hold meetings or workshops to present findings and gather feedback
      • Surveys: Use surveys to collect structured feedback on the clarity and usefulness of the communication
      • Iteration: Revise the analysis, models, and communication materials based on the feedback received
    • 5. Documentation and Archiving - Objective: Ensure that all materials are properly documented and archived for future reference
      • Version Control: Use version control systems (e.g., Git) to track changes in code and documentation
      • Documentation: Maintain comprehensive documentation of the analysis process, code, and findings
      • Archiving: Store reports, presentations, and datasets in a centralized and accessible location

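      double_arrow Interactive Dashboards

        A minimal, hypothetical Streamlit sketch of an interactive results dashboard (the results file, column names, and metric shown are illustrative assumptions):
        
        import pandas as pd
        import streamlit as st
        
        # Load model results and let stakeholders filter them interactively
        results_df = pd.read_csv('model_results.csv')  # hypothetical results file
        
        st.title('Model Results Dashboard')
        selected_segment = st.selectbox('Segment', results_df['segment'].unique())
        filtered = results_df[results_df['segment'] == selected_segment]
        
        st.metric('Mean predicted value', round(float(filtered['prediction'].mean()), 2))
        st.bar_chart(filtered.set_index('record_id')['prediction'])
        
        # Run with: streamlit run dashboard.py
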

10. Maintenance

Ensures that the machine learning model remains effective and relevant over time

By continuously monitoring performance, retraining as necessary, versioning models, ensuring compliance, and keeping stakeholders informed, you can maintain the reliability and accuracy of the model in a dynamic environment

Proper maintenance helps in sustaining the value provided by the model and adapting to changes in data and business needs

  • Updating: regularly update the model with new data to maintain accuracy
  • Monitoring: monitor for any changes in data patterns and adjust the model accordingly

Detailed breakdown of the Maintenance step:

  • 1. Continuous Monitoring - Objective: Continuously track the model’s performance and health in production to detect issues early
    • Performance Metrics - Objective: Monitor key metrics that indicate the model's performance
      • Accuracy Metrics: Track metrics like accuracy, precision, recall, F1 score, etc., for classification models
      • Error Metrics: Monitor metrics like MAE, MSE, RMSE, etc., for regression models
      • Custom Metrics: Use business-specific metrics that align with the goals of the model
    • Data Quality - Objective: Ensure the incoming data is of high quality and consistent with the training data
      • Data Integrity: Check for missing values, outliers, and anomalies
      • Data Distribution: Monitor the distribution of incoming data to detect shifts or drifts
  • 2. Retraining and Updating - Objective: Regularly update the model to incorporate new data and improve performance
    • Scheduled Retraining - Objective: Retrain the model at regular intervals using new data
      • Frequency: Determine an appropriate retraining schedule (e.g., weekly, monthly, quarterly)
      • Automation: Automate the retraining process using scripts and scheduling tools
    • Triggered Retraining - Objective: Retrain the model when performance metrics fall below a certain threshold
      • Performance Thresholds: Set thresholds for key metrics that trigger retraining
      • Monitoring: Continuously monitor performance metrics and trigger retraining when necessary
  • 3. Model Versioning - Objective: Keep track of different versions of the model to manage updates and rollbacks
    • Version Control: Use version control systems to track changes in the model, code, and data
    • Metadata: Maintain metadata about each version, including training data, hyperparameters, performance metrics, and deployment dates
  • 4. Model Governance - Objective: Ensure the model adheres to regulatory and compliance requirements
    • Compliance - Objective: Maintain compliance with relevant regulations (e.g., GDPR, HIPAA)
      • Data Privacy: Ensure that data used for training and inference complies with data privacy regulations
      • Audit Trails: Maintain logs of model training, updates, and predictions for audit purposes
    • Ethical Considerations - Objective: Ensure the model is fair and does not introduce bias
      • Bias Detection: Implement techniques to detect and mitigate bias in the model
      • Fairness Metrics: Monitor metrics like demographic parity, equal opportunity, and disparate impact
  • 5. Communication and Documentation - Objective: Keep stakeholders informed about the model’s performance, updates, and any issues
    • Regular Reports: Provide periodic reports on model performance, issues detected, and updates made
    • Documentation Updates: Continuously update documentation to reflect changes in the model, data, and processes

    double_arrow
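
      A minimal, hypothetical sketch of threshold-triggered retraining (the accuracy threshold and the recent_data/recent_labels batch are illustrative assumptions; retrain_model is the helper defined in the Deployment step):
      
      from sklearn.metrics import accuracy_score
      import joblib
      
      # Evaluate the deployed model on a recently labeled batch of data
      # (recent_data / recent_labels are assumed to be loaded elsewhere)
      model = joblib.load('model.joblib')
      recent_accuracy = accuracy_score(recent_labels, model.predict(recent_data))
      
      # Trigger retraining when performance drops below an agreed threshold
      ACCURACY_THRESHOLD = 0.85  # illustrative value, set per business requirements
      if recent_accuracy < ACCURACY_THRESHOLD:
          retrain_model(recent_data, recent_labels)
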