Data Science Basics

Multidisciplinary field that involves extracting insights and knowledge from data through various processes, techniques, and tools

Umbrella term for a more comprehensive set of fields that are focused on mining big data sets and discovering innovative new insights, trends, methods, and processes

  • Data Science ...

Data Science process

1. Problem Definition

Essential step for setting a clear direction for the data science project

By thoroughly understanding the business context, defining specific goals, identifying key questions, and assessing feasibility, you lay a solid foundation for the subsequent stages of data collection, analysis, and modeling

A well-defined problem ensures that the project stays focused, resources are used efficiently, and the final outcomes are aligned with the business objectives

  • Objective: Understand the business problem or research question
  • Goals: Define clear objectives and the desired outcomes
  • Stakeholders: Identify stakeholders and their needs

Detailed breakdown of the Problem Definition step:

  • 1. Understanding the Business Context - Objective: Gain a comprehensive understanding of the business environment and the specific issues that need to be addressed
    • Business Goals: Identify the overarching goals of the business or organization. This might include increasing revenue, reducing costs, improving customer satisfaction, etc.
    • Current Challenges: Understand the specific challenges or pain points that the business is facing. This could be high customer churn, inefficient operations, low product sales, etc.
    • Stakeholder Needs: Engage with key stakeholders to understand their expectations, requirements, and how they define success
  • 2. Defining the Problem Statement - Objective: Articulate the problem in a clear and concise manner
    • Specificity: The problem statement should be specific and focused. Avoid vague or broad statements
    • Measurability: Define how success will be measured. What metrics or KPIs will indicate that the problem has been solved or improved?
    • Constraints: Identify any constraints or limitations that need to be considered, such as budget, time, data availability, and technical limitations
  • 3. Formulating Objectives and Goals - Objective: Break down the problem into manageable objectives and goals
    • Short-term and Long-term Goals: Differentiate between immediate objectives and long-term goals
    • SMART Goals: Ensure that the goals are Specific, Measurable, Achievable, Relevant, and Time-bound
  • 4. Identifying Key Questions and Hypotheses - Objective: Outline the key questions that need to be answered and any hypotheses to be tested
    • Research Questions: What are the specific questions that need to be addressed to solve the problem? For example, “What factors are contributing to customer churn?”
    • Hypotheses: Formulate hypotheses based on domain knowledge and preliminary data analysis. For instance, “Customers are more likely to churn if they experience issues with the product in the first month.”
  • 5. Determining the Scope of the Project - Objective: Clearly define the boundaries of the project
    • In-Scope vs. Out-of-Scope: Identify what will be included and what will be excluded from the project. This helps in managing expectations and resources
    • Data Requirements: Determine the types of data needed, sources of data, and the data collection methods. Ensure that the necessary data is accessible and of sufficient quality
  • 6. Understanding the Audience - Objective: Identify the target audience for the results and how the insights will be communicated
    • Stakeholders: Understand who will be the end users of the insights (e.g., executives, marketing teams, product managers)
    • Communication Plan: Plan how the results will be presented. This includes deciding on the format (e.g., reports, dashboards, presentations) and the level of technical detail
  • 7. Assessing Feasibility - Objective: Evaluate the feasibility of the project from a technical and operational perspective
    • Technical Feasibility: Assess the technical challenges and the resources required. Do you have the necessary tools, technologies, and skills?
    • Operational Feasibility: Consider the operational aspects such as timelines, budget, and team capacity. Ensure that the project is practical and can be completed within the given constraints
  • 8. Establishing a Timeline - Objective: Develop a realistic timeline for the project
    • Milestones: Break down the project into key milestones and deliverables. Define what needs to be achieved at each stage
    • Deadlines: Set clear deadlines for each milestone to ensure the project stays on track
  • 9. Risk Assessment - Objective: Identify potential risks and develop mitigation strategies
    • Risk Identification: List possible risks that could impact the project, such as data issues, technical challenges, or changes in business priorities
    • Mitigation Plans: Develop strategies to mitigate these risks. This could include having backup data sources, additional training for team members, or contingency plans for delays
  • 10. Documenting the Plan - Objective: Create a comprehensive project plan that documents all aspects of the Problem Definition step
    • Project Charter: Develop a project charter or proposal that outlines the problem statement, objectives, scope, stakeholders, timeline, and risks
    • Stakeholder Approval: Ensure that all stakeholders review and approve the project plan. This alignment is crucial for project success

2. Data Collection

Foundational step for the success of a data science project

By meticulously identifying data requirements, leveraging appropriate sources, ensuring data quality, and securely storing the data, you set the stage for accurate and reliable analysis

  • Sources: Gather data from various sources such as databases, APIs, web scraping, surveys, etc.
  • Data Types: Collect different types of data, including structured, semi-structured, and unstructured data

Detailed breakdown of the Data Collection step:

  • 1. Identifying Data Requirements - Objective: Determine what data is needed to address the problem and meet the project objectives
    • Types of Data: Identify the types of data required (e.g., numerical, categorical, text, images)
    • Data Sources: Determine the sources of data, such as internal databases, external datasets, APIs, web scraping, surveys, and more
    • Data Attributes: List the specific attributes or features needed (e.g., customer demographics, transaction history)
  • 2. Data Sources - Objective: Explore various sources from where data can be collected
    • Internal Data Sources: Databases, data warehouses, CRM systems, ERP systems, logs, and any other internal repositories
    • External Data Sources: Public datasets, third-party APIs, government databases, social media, research publications, and web scraping
    • Primary Data Collection: Surveys, questionnaires, experiments, and direct observations
  • 3. Data Collection Methods - Objective: Choose appropriate methods to collect the data
    • Automated Data Collection: Using scripts and tools to fetch data from APIs, web scraping, and logging systems (a minimal API-fetch sketch follows this list)
    • Manual Data Collection: Entering data manually through forms, surveys, and other manual processes
    • Real-time Data Collection: Streaming data from IoT devices, sensors, and live feeds
  • 4. Ensuring Data Quality - Objective: Ensure the data collected is of high quality
    • Accuracy: Verify that the data correctly represents the real-world conditions or events
    • Completeness: Ensure all necessary data points are collected, with minimal missing values
    • Consistency: Ensure the data is consistent across different sources and formats
    • Timeliness: Ensure the data is up-to-date and relevant for the analysis
    • Validity: Ensure the data conforms to the defined business rules and constraints
  • 5. Data Storage - Objective: Store the collected data securely and efficiently
    • Databases: Use relational databases (e.g., MySQL, PostgreSQL) or NoSQL databases (e.g., MongoDB) depending on the data structure
    • Data Lakes: Store large volumes of raw data in data lakes using technologies like Hadoop or AWS S3
    • Cloud Storage: Utilize cloud services (e.g., AWS, Google Cloud, Azure) for scalable and secure data storage
    • Data Warehouses: Use data warehouses (e.g., Amazon Redshift, Google BigQuery) for structured and processed data
  • 6. Data Integration - Objective: Combine data from multiple sources into a cohesive dataset
    • ETL (Extract, Transform, Load): Implement ETL processes to extract data from various sources, transform it into a suitable format, and load it into a storage system
    • Data Cleaning: Address any inconsistencies, duplicates, and errors during the integration process
    • Schema Alignment: Ensure that data from different sources is compatible by aligning schemas, data types, and formats
  • 7. Documenting the Data Collection Process - Objective: Keep detailed documentation of the data collection process
    • Data Sources and Methods: Document where and how the data was collected
    • Data Quality Checks: Record any quality checks performed and issues encountered
    • Data Collection Dates: Keep track of when the data was collected to ensure its relevance and timeliness
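
    As a minimal illustration of the automated data collection mentioned in bullet 3 above, the sketch below pulls records from a hypothetical JSON API; the URL, query parameters, and field layout are placeholders, not a real service

      # AUTOMATED DATA COLLECTION FROM A JSON API (hypothetical endpoint):
      import requests
      import pandas as pd

      url = "https://api.example.com/v1/transactions"        # Placeholder endpoint
      response = requests.get(url, params={"limit": 1000}, timeout=30)
      response.raise_for_status()                            # Stop early on HTTP errors

      records = response.json()                              # Assumes the API returns a JSON list of records
      api_df = pd.DataFrame(records)                         # Load the records into a DataFrame
      api_df.to_csv("raw_transactions.csv", index=False)     # Persist the raw extract for later steps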

    double_arrow Import packages

      
      # IMPORT PYTHON PACKAGES
      # Data Science and Analysis
      import numpy as np                # Numerical operations
      import pandas as pd               # Data manipulation and analysis
      import scipy                      # Scientific computing
      import matplotlib.pyplot as plt   # Plotting and visualization
      import seaborn as sns             # Statistical data visualization
      import plotly.express as px       # Interactive data visualization
      import statsmodels.api as sm      # Statistical modeling
      import dask.dataframe as dd       # Parallel computing with Pandas-like DataFrames
      import vaex                       # Out-of-core DataFrames for big data
      
      # Machine Learning and AI
      from sklearn import datasets, model_selection, preprocessing, metrics   # Machine learning tools
      from sklearn import ensemble, decomposition, compose                    # Machine learning tools
      import tensorflow as tf           # Deep learning framework
      from tensorflow import keras      # High-level neural networks API
      import torch                      # Deep learning framework
      import xgboost as xgb             # Gradient boosting library
      import lightgbm as lgb            # Light Gradient Boosting Machine
      import catboost                   # Categorical features gradient boosting library
      
      # Natural Language Processing (NLP)
      import nltk                                                  # Natural language processing toolkit
      import spacy                                                 # Advanced natural language processing
      import gensim                                                # Topic modeling and document similarity
      import transformers                                          # State-of-the-art NLP models
      from sklearn.feature_extraction.text import TfidfVectorizer  # Text feature extraction
      import textblob                                              # Text processing and NLP
      import nltk.sentiment.vader as vader                         # Sentiment analysis
      
      # Data Visualization
      from bokeh.plotting import figure, output_file, show         # Interactive visualization
      import altair as alt                                         # Declarative statistical visualization
      import folium                                                # Interactive maps
      import geopandas as gpd                                      # Geospatial data processing
      
      # Web Development
      from django.http import HttpResponse         # Django web framework
      from flask import Flask, request             # Flask web framework
      import fastapi                               # FastAPI web framework
      
      # General-Purpose Libraries
      import requests                   # HTTP requests
      from bs4 import BeautifulSoup     # Web scraping
      from PIL import Image             # Image processing
      import json                       # JSON data manipulation
      import os                         # Operating system interfaces
      import sys                        # System-specific parameters and functions
      import re                         # Regular expressions
      import datetime                   # Date and time manipulation
      import logging                    # Logging facility for Python
      import itertools                  # Functions creating iterators for efficient looping
      import collections                # Container datatypes
      
      # Data Storage and Databases
      import sqlite3               # SQLite database
      import pymongo               # MongoDB
      import sqlalchemy            # SQL toolkit and Object Relational Mapper
      import psycopg2              # PostgreSQL database adapter
      
      # Networking
      import socket                # Low-level networking interface
      import paramiko              # SSH protocol
      import smtplib               # Simple Mail Transfer Protocol
                      

    double_arrow Read Data

    pandas.read_csv() - function to read the csv file

    • In the brackets, we put the file path inside quotation marks so that pandas reads the file from that address into a dataframe
    • The file path can be either a URL or your local file address
    • If the data does not include headers, we can add the argument header = None inside the read_csv() method so that pandas will not automatically set the first row as a header
    • dtype - data types to apply to either the whole dataset or individual columns
    • coerce_float - attempt to force numbers into floats
    • parse_dates - list of columns to parse as dates
    • chunksize - number of rows to include in each chunk
    • pd.read_excel() - read excel file
    • pd.read_json() - read json file
    • 
      # READING CSV FILES:
      pd.set_option('display.max_columns', None)              # Show all available columns
      
      path = "../1 datasets/iris_data.csv"                    # Set the path to your file
      data_df = pd.read_csv(path)                             # Read the file (import the data)
      
      data_df = pd.read_csv(path, sep="\t")                   # Different delimiters: tab-separated file (.tsv)
      data_df = pd.read_csv(path, delim_whitespace=True)      # Different delimiters: whitespace-separated file
      data_df = pd.read_csv(path, header=None)                # Don't use first row for column names
      data_df = pd.read_csv(path, names=["Name1", "Name2"])   # Specify column names
      data_df = pd.read_csv(path, na_values=["NA", 99])       # Custom missing values
      
      
      # READING SQL DATA:
      import sqlite3 as sq3                               # Import sqlite3
      path = "database path"                              # Set the path to your database
      con = sq3.Connection(path)                          # Create connection SQL database with sqlite3
      query = '''SELECT * FROM table_name;'''
      data_df = pd.read_sql(query, con)                   # Execute query
      
      observations_generator = pd.read_sql(query, con,
                      coerce_float=True,
                      parse_dates=['Release_Year'],       # Parse `Release_Year` as a date
                      chunksize=5)                        # Stream the results as a series of shorter tables
      
      # READING noSQL DATA:
      from pymongo import MongoClient
      con = MongoClient()                             # Create a Mongo connection
      db = con.database_name                          # Choose database (con.list_database_names() will display available databases)
      #cursor = db.collection_name.find(query)        # Create a cursor object using a query (replace query with {} to select all)
      #data_df = pd.DataFrame(list(cursor))           # Expand cursor and construct DataFrame
                              

    double_arrow Add Headers

    • Pandas automatically sets the headers to integers starting from 0 when none are provided
    • If we have to add headers manually, we create a list "headers" that includes all column names in order
    • We use dataframe.columns = headers to replace the headers with the list we created
    • 
      # CREATE  HEADERS:
      headers = ["header_name1","header_name2"..."header_namen"]
      data_df.columns = headers     # Replace headers 
                              

    double_arrow Rename Columns or Row Indexes

      
      # RENAME COLUMNS OR INDEXES USING A MAPPING:
      data_df.rename(columns={"REF_DATE": "DATE", "Type of fuel": "TYPE"})     
      data_df.rename(index={0: "x", 1: "y", 2: "z"})  
      
      data_df.rename(str.lower, axis='columns')           # Lowercase the columns using axis-style parameters
      data_df.rename({1: 2, 2: 4}, axis='index')          # Rename indexes using axis-style parameters
                              

    double_arrow Save Dataframe

    • df.to_csv() - save the dataset to csv
    • df.to_json() - save the dataset to json
    • df.to_excel() - save the dataset to excel
    • df.to_hdf() - save the dataset to hdf
    • df.to_sql() - save the dataset to sql
    • 
      # SAVE DATAFRAME:
      data_df.to_csv("data_df.csv", index=False)      # index=False means the row index will not be written
                              

    double_arrow Basic Insight of Data

    • pandas.df.head() - return the first n rows
    • pandas.df.shape - attribute that returns the number of rows and columns in our dataset (index 0 for rows, index 1 for columns)
    • pandas.df.info() - provides a concise summary of your DataFrame (prints information about a DataFrame including the index dtype and columns, non-null values and memory usage)
    • pandas.df.dtypes - attribute that returns a Series with the data type of each column
    • pandas.df.value_counts() - returns a Series containing counts of unique values (resulting object will be in descending order)
      • normalize - If True then the object returned will contain the relative frequencies of the unique values
      • sort - sort by frequencies when True
      • ascending - sort in ascending order (default False)
    • pandas.Series.tolist() - returns a list of the values (e.g. data_df.columns.tolist() for the column names)
    • pandas.df.describe() - generates descriptive statistics for each numeric-typed (int, float) column, excluding NaN (Not a Number) values
      • the count of that variable
      • the mean
      • the standard deviation (std)
      • the minimum value
      • the quartiles: 25%, 50% (median) and 75%, from which the interquartile range (IQR) can be computed
      • the maximum value
      • include = "all" argument provides the statistical summary of all the columns, including object-typed attributes
      • You can select the columns of a dataframe by indicating the name of each column and you can apply the method to get the statistics of those columns
      • The median is included, labelled as the 50% quantile
      • The range is not included; to get it, add a new row to the describe table equal to max - min (see the code below)
      
      # SHOW FEW ROWS
      data_df.head(5)                             # Show the first 5 rows of the dataframe
      print(data_df.iloc[:5])                     # Print a few rows
      
      # SHOW NUMBER OF ROWS AND COLUMNS 
      print("There are: ", data_df.shape[0], " rows; ", data_df.shape[1], " columns") 
      
      # INFO
      data_df.info()               
      
      # DATA TYPES OF THE COLUMNS
      print(data_df.dtypes)           # Data types; included in info()
      data_df.dtypes.value_counts()   # Counts columns according to data types
      
      # UNIQUE VALUES
      data_df['column'].value_counts()                                    # Counts unique values in a column
      percentages = data_df['column'].value_counts(normalize=True) * 100  # Relative frequencies as percentages
      
      # COLUMN NAMES
      print(data_df.columns.tolist()) # Column names; included in info()
      
      # DESCRIBE
      data_df.describe()
      data_df.describe(include = "all")                   # All the columns
      data_df[['length','width','height']].describe()     # For selected columns
      data_df.describe(include=['object'])                # To include type object
      
      stats_df = data_df.describe()                                          # Get statistical summary
      stats_df.loc['range'] = stats_df.loc['max'] - stats_df.loc['min']   # Calculate range (max-min)
      out_fields = ['mean','25%','50%','75%','range']                     # Select just the rows desired from the 'describe' method
      stats_df = stats_df.loc[out_fields]
      stats_df.rename({'50%': 'median'}, inplace=True)                    # Add name median instead of 50%
                                  

3. Data Preparation

An iterative and interactive process that requires careful attention to detail

Sets the foundation for all subsequent steps in the data science process, as the quality and structure of the data directly impact the accuracy and reliability of the insights derived from it

Effective data preparation ensures that the data is clean, relevant, and ready for analysis, ultimately leading to better decision-making and more robust models

  • Cleaning: Handle missing values, remove duplicates, and correct inconsistencies
  • Transformation: Normalize, scale, and encode data to prepare it for analysis
  • Integration: Combine data from multiple sources and ensure consistency

Detailed breakdown of the Data Preparation step:

  • 1. Data Cleaning - Objective: Improve the quality of the data by handling errors, inconsistencies, and missing values
    • Missing Values: Identify and handle missing data using techniques such as imputation (filling missing values with mean, median, mode, or using algorithms), deletion (removing rows or columns with missing values), or flagging (marking missing values for further analysis)
    • Duplicates: Detect and remove duplicate records to ensure each entry is unique
    • Outliers: Identify and handle outliers that may skew the analysis. Methods include removing them, transforming them, or using robust statistical methods
    • Data Types: Ensure all data is in the correct format (e.g., converting strings to dates, ensuring numerical data is in numerical format)
  • 2. Data Transformation - Objective: Convert data into a format suitable for analysis
    • Normalization: Scale numerical features to a common range (e.g., 0 to 1) to ensure that no single feature dominates the model due to its scale
    • Standardization: Transform data to have a mean of 0 and a standard deviation of 1. This is often necessary for algorithms that assume data is normally distributed
    • Encoding Categorical Data: Convert categorical data into numerical format using techniques like one-hot encoding, label encoding, or target encoding
    • Feature Creation: Create new features from existing data to better capture the underlying patterns. This can include creating interaction terms, polynomial features, or aggregating data over time
    • Discretization: Convert continuous data into discrete bins or categories. This can be useful for certain types of analysis or when dealing with non-linear relationships
  • 3. Data Integration - Objective: Combine data from multiple sources into a cohesive dataset (a short merge/concat sketch follows this list)
    • Merging: Combine datasets based on common keys (e.g., customer ID, product ID). This involves understanding the relationships between different data sources and ensuring consistent keys
    • Concatenation: Stack datasets vertically or horizontally when they have the same structure
    • Handling Conflicts: Resolve discrepancies and conflicts in data from different sources, such as different formats or definitions for the same attribute
    • Schema Alignment: Ensure that the data structures (schemas) from different sources are compatible. This may involve renaming columns, reordering columns, or aligning data types
  • 4. Data Reduction - Objective: Reduce the volume of data while maintaining its integrity
    • Dimensionality Reduction: Reduce the number of features using techniques like Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or t-SNE. This helps in dealing with the "curse of dimensionality" and improves computational efficiency
    • Sampling: Reduce the size of the dataset by selecting a representative subset. This can be done through random sampling, stratified sampling, or cluster sampling
    • Aggregation: Summarize data to a higher level, such as aggregating daily data to monthly data. This helps in reducing the granularity of the data while retaining important patterns
  • 5. Data Enrichment - Objective: Enhance the dataset by adding additional information
    • External Data: Incorporate external data sources to provide more context or additional features. Examples include adding weather data, economic indicators, or social media sentiment data
    • Feature Engineering: Create new features that capture domain knowledge or improve model performance. This involves domain expertise and creativity to derive meaningful features from the raw data
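
    As a minimal sketch of the data integration described in bullet 3 above, the snippet below merges two small, made-up DataFrames on a shared key and concatenates two datasets with the same structure; the DataFrames and column names are illustrative only

      # DATA INTEGRATION: MERGE AND CONCATENATE (illustrative DataFrames)
      import pandas as pd

      customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["West", "East", "East"]})
      orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [120.0, 80.5, 42.0]})

      merged = customers.merge(orders, on="customer_id", how="left")   # Left join keeps customers without orders
      all_orders = pd.concat([orders, orders], ignore_index=True)      # Stack datasets with the same columns vertically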

    double_arrow Identify and Handle Missing Values

    • Identify missing data
      • pandas.df.isnull() - detects missing values (None or numpy.NaN); empty strings "" are not considered NA values
    • Deal with missing data
    • Correct data format. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'
    • 
      # IDENTIFY AND HANDLE MISSING VALUES:
      data_df.isnull().sum().sort_values(ascending=False)         # Count missing values per column, sorted in descending order
      
      missing_data = data_df.isnull()                             # False means there is no missing data in column 
      for column in missing_data.columns.values.tolist():
          print(column)
          print (missing_data[column].value_counts())
          print("") 
      
      data_df.dropna(inplace = True)              # Drop all rows that contain any NaN values
      data_df.dropna(how='all')                   # Drop the rows where all elements are missing
      data_df.dropna(axis='columns')              # Drop the columns where at least one element is missing
      data_df.dropna(subset=["Lot Frontage"])     # Define in which columns to look for missing values
      data_df.drop("Lot Frontage", axis=1)        # Drop the whole attribute (column), e.g. when it contains too many missing values
      
      data_df.price = data_df.price.replace('?',np.nan)                               # Replace ? sign with NaN
      data_df['brand'] = data_df['brand'].replace(['vw', 'vokswagen'], 'volkswagen')  # Fixing typos in the names of the cars
      
      median = data_df["Lot Frontage"].median()
      data_df["Lot Frontage"].fillna(median, inplace = True)  # Replace the missing values with the median value of that column
      mean = data_df["Mas Vnr Area"].mean()
      data_df["Mas Vnr Area"].fillna(mean, inplace = True)    # Replace the missing values with the mean value of that column
      
      data_df.price = data_df.price.astype('int64')           # Convert into the type int
                              

    double_arrow Removing Duplicates

    • pandas.df.duplicated() - check whether there are any duplicates in our data
    • pandas.df.drop_duplicates() - removes all duplicate rows based on all the columns by default
    • pandas.Index.is_unique - property that offers an alternative way to check if there are any duplicated indexes in our dataset
    • 
      # HANDLING DUPLICATES:
      duplicate = data_df[data_df.duplicated(['PID'])]        # To check whether there are any duplicates in the column
      sum(data_df.duplicated(subset = 'car_ID')) == 0         # To find and sum duplicates on specific column
      
      dup_removed = data_df.drop_duplicates()                 # To remove the duplicates
      removed_sub = data_df.drop_duplicates(subset=['Order']) # Remove duplicates on a specific column
      
      data_df.index.is_unique                                 # To check if there are any duplicated Indexes in our dataset
      data_df.brand.nunique()                                 # Count number of distinct elements in specified axis
                              

    double_arrow Splitting the columns

    • pandas.str.split() - splits the string records
      • pat - parameter can be used to split by other characters (if not specified, split on whitespace)
      • n - limit number of splits in output (0 and -1 will be interpreted as return all splits)
      • expand=True - returns a dataframe
      
      # SPLITTING THE COLUMNS:
      data_df[['City', 'Province']] = data_df['GEO'].str.split(',', n=1, expand=True) # Split GEO column to City and Province columns
      data_df['brand'] = data_df.CarName.str.split(' ').str.get(0).str.lower()        # Get all first words of car names in lowercase   
                              

    double_arrow Changing to datetime format

    • pandas.to_datetime() - transforms to date time format
      • We need to specify the format of datetime that we need
      • format='%b-%y' tells pandas to parse an abbreviated month name and a two-digit year
      • str.slice(stop=3) returns the first 3 letters of the month name
      
      # CHANGING TO DATETIME FORMAT:
      data_df['DATE'] = pd.to_datetime(data_df['DATE'], format='%b-%y')
      data_df['Month'] = data_df['DATE'].dt.month_name().str.slice(stop=3)
      data_df['Year'] = data_df['DATE'].dt.year
      
      # Parse 'Duration' strings such as '2h 50m' into separate hour and minute values
      duration = list(data_df['Duration'])
      for i in range(len(duration)):
          if len(duration[i].split()) != 2:
              if 'h' in duration[i]:
                  duration[i] = duration[i].strip() + ' 0m'
              elif 'm' in duration[i] :
                  duration[i] = '0h {}'.format(duration[i].strip())
      dur_hours = []
      dur_minutes = []  
       
      for i in range(len(duration)) :
          dur_hours.append(int(duration[i].split()[0][:-1]))
          dur_minutes.append(int(duration[i].split()[1][:-1]))
           
      data_df['Duration_hours'] = dur_hours
      data_df['Duration_minutes'] = dur_minutes
      data_df['Duration_Total_mins'] = data_df['Duration_hours']*60 + data_df['Duration_minutes']   # Total duration in minutes
      
      data_df["Dep_Hour"]= pd.to_datetime(data_df['Dep_Time']).dt.hour
      data_df["Dep_Min"]= pd.to_datetime(data_df['Dep_Time']).dt.minute
      data_df["Arrival_Hour"]= pd.to_datetime(data_df['Arrival_Time']).dt.hour
      data_df["Arrival_Min"]= pd.to_datetime(data_df['Arrival_Time']).dt.minute
      
      data_df['Month']= pd.to_datetime(data_df["Date_of_Journey"], format="%d/%m/%Y").dt.month
      data_df['Day']= pd.to_datetime(data_df["Date_of_Journey"], format="%d/%m/%Y").dt.day
      data_df['Year']= pd.to_datetime(data_df["Date_of_Journey"], format="%d/%m/%Y").dt.year
      data_df['day_of_week'] = pd.to_datetime(data_df['Date_of_Journey']).dt.day_name()
                              

    double_arrow Identify and handle outliers

    In statistics, an outlier is an observation point that is distant from other observations (can be due to some mistakes in data collection or recording, or due to natural high variability of data points)

    Outliers can markedly affect our models and can be a valuable source of information, providing us insights about specific behaviours

    • Visual methods for identifying outliers:
      • Box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles. Outliers may be plotted as individual points.
      • Scatter plot shows the distribution of data points, helping to spot outliers visually
    • Statistical methods for identifying outliers:
      • Z-score Analysis (how far a data point is from the mean in terms of standard deviations)
        • Z-score - signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured
        • Value that quantifies relationship between a data point and a standard deviation and mean values of a group of points
        • Data points which are too far from zero will be treated as the outliers. In most of the cases, a threshold of 3 or -3 is used
        
        from scipy import stats                                           # scipy submodules must be imported explicitly
        data_df['LQFSF_Stats'] = stats.zscore(data_df['Low Qual Fin SF'])
        data_df[['Low Qual Fin SF','LQFSF_Stats']].describe().round(3)
                                
      • Identify and handle outliers using the Interquartile Range (IQR) method
        • Data points below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are considered outliers
        
        # Detecting outliers: Statistics
        
        # Calculate the interquartile range (Q1: 25th percentile and Q3: 75th percentile)
        Q1, Q50, Q3 = np.percentile(data_df.SalePrice, [25, 50, 75])
        IQR = Q3 - Q1
        
        # Define lower and upper bounds
        lower_bound = Q1 - 1.5*(IQR)
        upper_bound = Q3 + 1.5*(IQR)
        
        print("min:", lower_bound, "Q1:", Q1, "Q50:", Q50, "Q3:", Q3, "max:", upper_bound)
        
        # Identify outliers
        outliers = data_df[(data_df['SalePrice'] < lower_bound) | (data_df['SalePrice'] > upper_bound)]
        print("Outliers:\n", outliers)
        
        # Identify the points
        [x for x in data_df["SalePrice"] if x > upper_bound]
        
        # Remove outliers
        df_cleaned = data_df[(data_df['SalePrice'] >= lower_bound) & (data_df['SalePrice'] <= upper_bound)]
        print("Data after removing outliers:\n", df_cleaned)
                                
    • Uni-variate (using one variable) Analysis
    • 
      sns.boxplot(x=data_df['Lot Area'])   
                              
    • Bi-variate (using two or more variables) Analysis
    • 
      price_area = data_df.plot.scatter(x='Gr Liv Area', y='SalePrice')
                              
    • Deleting the Outliers
    • 
      data_df.sort_values(by = 'Gr Liv Area', ascending = False)[:2]      # Sort values to find last 2 records (index: 1499,2181)
      outliers_dropped = data_df.drop(data_df.index[[1499,2181]])         # If we want to delete some of the outliers
                              

    double_arrow Data Filtering

    We can use logical operators on column values to filter rows. First, we specify the name of the dataframe, then, in square brackets, the column name, a double equals sign ==, and the value to match in single or double quotation marks

    If we want to exclude some entries (e.g. some locations), we use the 'not equal' operator !=. We can also use <, >, <=, >= to filter numeric values, and combine multiple conditions with | (or) and & (and)

      
      calgary = data_df[data_df['GEO']=='Calgary, Alberta']           # select the Calgary, Alberta data
      sel_years = data_df[data_df['Year']==2000]                      # select 2000 year
      
      mult_loc = data_df[(data_df['GEO']=="Toronto, Ontario") | (data_df['GEO']=="Edmonton, Alberta")]                                                       # Select Toronto and Edmonton locations
      cities = ['Calgary', 'Toronto', 'Edmonton']
      CTE = data_df[data_df.City.isin(cities)]                                                                                                               # isin method to select multiple locations
      mult_sel = data_df[(data_df['Year']==1990) & (data_df['TYPE']=="Household heating fuel") & (data_df['City']=='Vancouver')]                             # Select the data that shows the price of the 'household heating fuel', in Vancouver, in 1990
      mult_sel = data_df[(data_df['Year']<=1979) | ((data_df['Year']==2021) & (data_df['TYPE']=="Household heating fuel") & (data_df['City']=='Vancouver'))]  # Select data up to 1979, plus the 2021 'household heating fuel' prices in Vancouver (& binds tighter than |)
                              

4. Exploratory Data Analysis (EDA)

Involves examining and visualizing the dataset to uncover patterns, spot anomalies, test hypotheses, and check assumptions

Helps in understanding the underlying structure of the data, guiding the next steps in the analysis or modeling process

Allows us to get an initial feel for the data, lets us determine if the data makes sense (or if further cleaning or more data is needed), and helps to identify patterns and trends in the data

  • Descriptive Statistics: Calculate measures such as mean, median, mode, standard deviation, etc.
  • Visualization: Use charts, graphs, and plots to understand data distribution, patterns, and anomalies
  • Hypothesis Testing: Formulate and test hypotheses about the data

Detailed breakdown of the EDA step:

  • 1. Understanding the Dataset - Objective: Gain an initial understanding of the dataset and its structure
    • Data Types: Identify the types of data (numerical, categorical, text, dates, etc.)
    • Dimensions: Understand the size and shape of the dataset (number of rows and columns)
    • Summary Statistics: Calculate basic statistics such as mean, median, standard deviation, min, max, and quartiles for numerical features
  • 2. Handling Missing Data - Objective: Identify and handle missing data
    • Missing Value Detection: Identify columns with missing values and the proportion of missing values
    • Imputation: Fill missing values using methods like mean, median, mode, or more sophisticated techniques like KNN imputation or regression
    • Removal: Drop rows or columns with missing values if they are not significant or cannot be imputed reliably
  • 3. Analyzing Distributions - Objective: Understand the distribution of individual features
    • Histograms: Visualize the distribution of numerical features
    • Box Plots: Identify outliers and understand the spread and skewness of the data
    • Density Plots: Visualize the probability density function of a continuous variable
  • 4. Analyzing Relationships Between Variables - Objective: Explore relationships between different features
    • Scatter Plots: Visualize the relationship between two numerical variables
    • Correlation Matrix: Compute and visualize the correlation between numerical features
    • Pair Plots: Visualize pairwise relationships in a dataset
    • Heatmaps: Visualize the correlation matrix
  • 5. Analyzing Categorical Data - Objective: Understand the distribution and relationship of categorical variables
    • Bar Plots: Visualize the frequency distribution of categorical features
    • Count Plots: Count the occurrences of each category in a categorical feature
    • Chi-Square Test: Test for independence between categorical features (see the sketch after this list)
  • 6. Feature Engineering and Transformation - Objective: Create new features or transform existing ones to better capture the information in the dataset
    • Log Transformation: Apply log transformation to reduce skewness
    • Binning: Convert continuous variables into categorical bins
    • Interaction Features: Create new features by combining existing ones (e.g., product or ratio of two features)
    • Encoding Categorical Variables: Convert categorical variables to numerical using one-hot encoding, label encoding, or target encoding
  • 7. Dimensionality Reduction - Objective: Reduce the number of features while preserving the important information
    • Principal Component Analysis (PCA): Reduce dimensionality by projecting data onto principal components
    • t-Distributed Stochastic Neighbor Embedding (t-SNE): Reduce dimensionality for visualization, especially in high-dimensional space
  • 8. Summarizing and Documenting Findings - Objective: Document insights and findings from the EDA process
    • Insights Report: Create a report or presentation summarizing key insights, visualizations, and potential issues identified during EDA
    • Data Quality Issues: Document any data quality issues encountered and how they were addressed
    • Hypothesis Testing: Summarize results of hypothesis tests and their implications
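
    As a minimal sketch of the chi-square test of independence mentioned in bullet 5 above, the snippet below builds a contingency table from two hypothetical categorical columns ('gender' and 'churned' are placeholders) and tests whether they are independent

      # CHI-SQUARE TEST OF INDEPENDENCE (hypothetical categorical columns):
      from scipy.stats import chi2_contingency

      contingency = pd.crosstab(data_df['gender'], data_df['churned'])   # Contingency table of the two variables
      chi2, p_value, dof, expected = chi2_contingency(contingency)       # Test statistic, p-value, degrees of freedom
      print("chi2:", round(chi2, 2), "p-value:", round(p_value, 4))      # A small p-value suggests the variables are related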

    double_arrow Correlation and Causation

    Correlation is a measure of the extent of interdependence between variables; causation is a cause-and-effect relationship between two variables

    Correlation does not imply causation. Determining correlation is much simpler than determining causation, as establishing causation may require independent experimentation

    • Pearson Correlation - measures the linear dependence between two variables X and Y
      • Pearson Correlation is the default method of the function pandas.corr()
      • Correlation coefficient can only be calculated on the numerical attributes (floats and integers)
      • The resulting coefficient is a value between -1 and 1 inclusive, where:
        • 1: Perfect positive linear correlation
        • 0: No linear correlation, the two variables most likely do not affect each other
        • -1: Perfect negative linear correlation
    • P-value
      • Probability of obtaining a correlation at least as extreme as the one observed, assuming the null hypothesis of no true correlation
      • Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant
      • Smallest significance level at which the null hypothesis would be rejected
      • By convention, when the
        • p-value is < 0.001: we say there is strong evidence that the correlation is significant
        • the p-value is < 0.05: there is moderate evidence that the correlation is significant
        • the p-value is < 0.1: there is weak evidence that the correlation is significant
        • the p-value is > 0.1: there is no evidence that the correlation is significant
      
      corr_matrix = data_df.corr(numeric_only=True)
      corr_matrix['SalePrice'].sort_values(ascending=False)                                   # List of top features that have high correlation coefficient
      
      hous_num = data_df.select_dtypes(include = ['float64', 'int64'])                                            # Select only float and int data types
      hous_num_corr = hous_num.corr()['SalePrice'][:-1]                                                           # [:-1] drops the last entry, SalePrice's correlation with itself
      top_features = hous_num_corr[abs(hous_num_corr) > 0.5].sort_values(ascending=False)                         # Pearson correlation coefficients with absolute value greater than 0.5
      print("There is {} strongly correlated values with SalePrice:\n{}".format(len(top_features), top_features))
      for i in range(0, len(hous_num.columns), 5):
          sns.pairplot(data=hous_num, x_vars=hous_num.columns[i:i+5], y_vars=['SalePrice'])
      
      sns.set_context('talk')
      sns.pairplot(data_df, hue='species');
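      
      # The corr() matrix above gives coefficients only; a minimal sketch of getting the
      # Pearson coefficient together with its p-value (reusing the 'Gr Liv Area' and
      # 'SalePrice' columns from the examples above; assumes no missing values in them)
      from scipy.stats import pearsonr
      pearson_coef, p_value = pearsonr(data_df['Gr Liv Area'], data_df['SalePrice'])
      print("Pearson coefficient:", round(pearson_coef, 3), "p-value:", p_value)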
                              

    double_arrow ANOVA - Analysis of Variance

    Statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

    • F-test score - ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means
    • P-value - tells how statistically significant our calculated score value is
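
    As a minimal sketch (assuming a hypothetical 'Neighborhood' grouping column and the 'SalePrice' column used in earlier examples), scipy's f_oneway returns both values:

      # ONE-WAY ANOVA: does mean SalePrice differ across neighborhoods?
      from scipy.stats import f_oneway

      groups = [g['SalePrice'].values for _, g in data_df.groupby('Neighborhood')]   # One array per group
      f_score, p_value = f_oneway(*groups)                                           # F-test score and p-value
      print("F-test score:", round(f_score, 2), "p-value:", p_value)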

    double_arrow Basics of Grouping

    • pandas.df.groupby - groups data by different categories
      • The data is grouped based on one or several variables, and analysis is performed on the individual groups
      • Most commonly, we use groupby to split the data into groups, this will apply some function to each of the groups (e.g. mean, median, min, max, count), then combine the results into a data structure
    • pandas.df.pivot - convert the dataframe to a pivot table
      • The grouped data is much easier to visualize when it is made into a pivot table
      • A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row
    • pandas.df.reset_index - resets the index
    • 
      # Group by single column
      grouped_by_product = data_df.groupby('product')['sales'].sum()
      
      # Group by multiple columns
      group_year = data_df.groupby(['Year'])['VALUE'].mean()                                                  # Calculate the mean of the prices per year
      group_month = data_df.groupby(['Month'])['VALUE'].max()                                                 # Group by the maximum value of prices, for each month
      group_city = data_df.groupby(['Year', 'City'])['VALUE'].median().reset_index(name ='Value').round(2)    # Group by the median value of prices, for each year and each city
      
      # Using multiple aggregation functions
      data_df.groupby('species').agg(['mean', 'median'])                      # Passing a list of recognized strings
      data_df.groupby('species').agg([np.mean, np.median])                    # Passing a list of explicit aggregation functions
      
      grouped_with_agg = data_df.groupby('product')['sales'].agg(['sum', 'mean', 'count'])
      
      # Using pivot
      data_df.pivot(index='col1', columns='col2', values='col3')
                              

    double_arrow Analyzing Individual Feature Patterns Using Visualization

      
      # A simple scatter plot of sepal_length vs sepal_width
      ax = plt.axes()
      ax.scatter(data_df.sepal_length, data_df.sepal_width)
      # Label the axes
      ax.set(xlabel='Sepal Length (cm)', ylabel='Sepal Width (cm)', title='Sepal Length vs Width');
      
      # A histogram of petal length
      ax = plt.axes()
      ax.hist(data_df.petal_length, bins=25);
      ax.set(xlabel='Petal Length (cm)', ylabel='Frequency', title='Distribution of Petal Lengths');
      ax = data_df.petal_length.plot.hist(bins=25)                                                    # Alternatively using Pandas plotting functionality
      
      sns.set_context('notebook') 
      ax = data_df.plot.hist(bins=25, alpha=0.5)                                                      # Single plot with histograms for each feature overlaid
      ax.set_xlabel('Size (cm)');
      
      # To create four separate plots, use Pandas `.hist` method
      axList = data_df.hist(bins=25)
      # Add some x- and y- labels to first column and last row
      for ax in axList.flatten():
          if ax.get_subplotspec().is_last_row():
              ax.set_xlabel('Size (cm)')
              
          if ax.get_subplotspec().is_first_col():
              ax.set_ylabel('Frequency')
      
      data_df.boxplot(by='species')                               # Boxplot of each petal and sepal measurement
      
      # Single boxplot where the features are separated in the x-axis and species are colored with different hues
      plot_data = (data_df
                   .set_index('species')
                   .stack()
                   .to_frame()
                   .reset_index()
                   .rename(columns={0:'size', 'level_1':'measurement'}))
      sns.set_style('white')
      sns.set_context('notebook')
      sns.set_palette('dark')
      f = plt.figure(figsize=(6,4))
      sns.boxplot(x='measurement', y='size', hue='species', data=plot_data);
                              

5. Feature Engineering

Vital step in the data science process that significantly impacts the performance and accuracy of machine learning models

By creating new features, transforming existing ones, selecting the most relevant features, and applying domain knowledge, data scientists can enhance the predictive power of their models

  • Selection: identify and select the most relevant features (variables) for the model
  • Creation: create new features from existing ones to improve model performance
  • Reduction: use techniques like PCA (Principal Component Analysis) to reduce dimensionality

Detailed breakdown of the Feature Engineering step:

  • 1. Understanding the Importance of Features - Objective: Recognize the role of features in influencing model performance
    • Predictive Power: Features are the input variables that models use to make predictions
    • Data Representation: The way features represent the underlying data can significantly affect the accuracy and robustness of the model
  • 2. Types of Feature Engineering - Objective: Identify different types of feature engineering techniques
    • Feature Creation - Objective: Create new features that capture additional information from the data
      • Mathematical Transformations: Apply mathematical operations to create new features (e.g., log, square root, power)
      • Aggregation: Summarize information across groups (e.g., mean, sum, count of transactions per customer)
      • Date and Time Features: Extract information from date and time (e.g., day of the week, month, hour)
    • Feature Transformation - Objective: Transform existing features to improve their representation
      • Normalization: Scale numerical features to a common range (e.g., 0 to 1)
      • Standardization: Center features around zero with a standard deviation of one
      • Encoding Categorical Variables: Convert categorical features into numerical format using techniques like one-hot encoding, label encoding, or target encoding
    • Feature Selection - Objective: Select the most relevant features for modeling
      • Filter Methods: Use statistical techniques to select features (e.g., chi-square test, correlation threshold)
      • Wrapper Methods: Use machine learning models to evaluate the importance of features (e.g., recursive feature elimination)
      • Embedded Methods: Feature selection integrated into model training (e.g., Lasso, decision tree feature importance)
  • 3. Domain Knowledge - Objective: Utilize domain expertise to create meaningful features
    • Industry-Specific Features: Create features based on industry knowledge (e.g., financial ratios in finance, health indicators in healthcare)
    • Business Logic: Incorporate business rules and logic into feature creation (e.g., customer segmentation based on purchasing behavior)
  • 4. Interaction Features - Objective: Create features that capture interactions between existing features
    • Polynomial Features: Create polynomial terms to capture non-linear relationships
    • Product and Ratio Features: Combine features through multiplication, division, etc.
  • 5. Handling Missing Values - Objective: Address missing data in features
    • Imputation: Fill missing values with mean, median, mode, or more advanced techniques
    • Indicator Variables: Create binary indicators for missing values
  • 6. Feature Reduction - Objective: Reduce the number of features while preserving the important information (a PCA sketch follows this list)
    • Principal Component Analysis (PCA): Reduce dimensionality by projecting data onto principal components
    • t-Distributed Stochastic Neighbor Embedding (t-SNE): Reduce dimensionality for visualization, especially in high-dimensional space
  • 7. Feature Encoding - Objective: Encode categorical features to numerical values
    • One-Hot Encoding: Create binary columns for each category
    • Label Encoding: Convert categories to numerical labels
    • Target Encoding: Encode categories based on their target mean
  • 8. Feature Scaling - Objective: Scale features to ensure they contribute equally to the model
    • Normalization: Scale features to a range (e.g., 0 to 1)
    • Standardization: Scale features to have mean 0 and standard deviation 1
  • 9. Feature Selection - Objective: Select the most relevant features to improve model performance and reduce overfitting
    • Filter Methods: Select features based on statistical measures (e.g., correlation, chi-square)
    • Wrapper Methods: Use model performance to select features (e.g., recursive feature elimination)
    • Embedded Methods: Select features during model training (e.g., Lasso, decision trees)
  • 10. Validation and Testing - Objective: Validate the impact of engineered features on model performance
    • Cross-Validation: Evaluate the model using cross-validation techniques
    • Model Comparison: Compare the performance of models with and without engineered features
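
    As a minimal sketch of feature reduction with PCA (bullet 6 above), the snippet below standardizes the numeric columns and keeps the first two principal components; it assumes the numeric features have already been cleaned

      # FEATURE REDUCTION WITH PCA (illustrative sketch):
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import StandardScaler

      numeric_df = data_df.select_dtypes(include=['float64', 'int64']).dropna()
      X_scaled = StandardScaler().fit_transform(numeric_df)       # PCA is sensitive to feature scale
      pca = PCA(n_components=2)                                   # Keep the first two principal components
      components = pca.fit_transform(X_scaled)
      print("Explained variance ratio:", pca.explained_variance_ratio_)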

    double_arrow Feature Creation

    • Log Transformation
    • Aggregation
    • Date and time features
    • 
      # Log transformation
      log_transformed = np.log(data_df['SalePrice'])
      
      # Aggregation
      data_df['total_purchases'] = data_df.groupby('customer_id')['purchase_amount'].transform('sum')
      
      # Date and time features
      data_df['purchase_date'] = pd.to_datetime(data_df['purchase_date'])
      data_df['day_of_week'] = data_df['purchase_date'].dt.dayofweek
      data_df['month'] = data_df['purchase_date'].dt.month
                              

    double_arrow Feature Transformation

    • Normalization or Min-max scaling is the process of transforming values of several variables into a similar range
      • MinMaxScaler from scikit-learn transform features by scaling each feature to a given range
      • Values are shifted and rescaled so they end up ranging from 0 to 1
      • This is done by subtracting the min value and dividing by the max minus min
    • Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons
      • StandardScaler from scikit-learn standardize features by removing the mean and scaling to unit variance
      • First it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation, so that the resulting distribution has unit variance
    • Encoding Categorical Variables
      • One-hot encoding (Indicator variable) - create binary columns for each category
        • pandas.get_dummies convert categorical variable into dummy/indicator variables
        • They are called dummies because the numbers themselves don't have inherent meaning
        • We use indicator variables so we can use categorical variables for regression analysis
      • Label encoding - convert categories to numerical labels
      • Target encoding - encode categories based on their target mean
    • Binning is the process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis
      
      from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, LabelEncoder
      
      # Normalization
      scaler = MinMaxScaler()
      data_df['normalized_feature'] = scaler.fit_transform(data_df[['numerical_feature']])
      
      # Equivalent manual computation, where (min, max) below is the desired feature range (e.g. 0 and 1):
      X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
      X_scaled = X_std * (max - min) + min
      
      # Standardization
      scaler = StandardScaler()
      data_df['standardized_feature'] = scaler.fit_transform(data_df[['numerical_feature']])
      
      # z = (x - u) / s 
      
      # One-hot encoding
      data_df = pd.get_dummies(data=data_df, columns = ['Airline', 'Source', 'Destination'])
      
      # Label encoding
      label_encoder = LabelEncoder()
      data_df['encoded_feature'] = label_encoder.fit_transform(data_df['categorical_feature'])
      
      # Binning
      data_df["Arrival_Hour"]= pd.to_datetime(data_df['Arrival_Time']).dt.hour
      data_df['arr_timezone'] = pd.cut(data_df.Arrival_Hour, [0,6,12,18,24], labels=['Night','Morning','Afternoon','Evening'])
      
      data_df['binned_feature'] = pd.cut(data_df['numerical_feature'], bins=5, labels=False)
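      
      # Target encoding (not shown above): encode each category by the mean of the target
      # for that category. A minimal sketch using the 'Airline' column and a hypothetical
      # 'Price' target; in practice, fit the category means on the training split only
      target_means = data_df.groupby('Airline')['Price'].mean()             # Mean target value per category
      data_df['Airline_target_enc'] = data_df['Airline'].map(target_means)  # Map each category to its mean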
                              

    double_arrow Feature Selection

    • Filter Methods Use statistical techniques to select features (e.g., chi-square test, correlation threshold)
    • Wrapper Methods Use machine learning models to evaluate the importance of features (e.g., recursive feature elimination)
    • Embedded Methods Feature selection integrated into model training (e.g., Lasso, decision tree feature importance)
    • 
      from sklearn.feature_selection import SelectKBest, chi2, RFE
      from sklearn.linear_model import LogisticRegression
      
      # Filter method
      selector = SelectKBest(chi2, k=10)
      selected_features = selector.fit_transform(data_df.drop(columns=['target']), data_df['target'])
      
      # Wrapper method
      model = LogisticRegression()
      rfe = RFE(model, n_features_to_select=10)
      selected_features = rfe.fit_transform(data_df.drop(columns=['target']), data_df['target'])
                                  
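    A minimal sketch of an embedded method, using L1-regularized logistic regression with SelectFromModel (assuming the same data_df with a 'target' column as above):

      from sklearn.feature_selection import SelectFromModel
      from sklearn.linear_model import LogisticRegression
      
      # Embedded method: the L1 penalty drives uninformative coefficients to zero,
      # and SelectFromModel keeps only the features with non-zero weights
      l1_model = LogisticRegression(penalty='l1', solver='liblinear')
      embedded_selector = SelectFromModel(l1_model)
      selected_features = embedded_selector.fit_transform(data_df.drop(columns=['target']), data_df['target'])
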

6. Modeling

This is where the core of the data science work happens: it involves selecting the right algorithms, training models, tuning hyperparameters, and evaluating performance to ensure the model meets the defined objectives

By carefully executing each sub-step and iterating based on feedback, data scientists can develop robust models that provide valuable insights and predictions

  • Algorithms: choose appropriate algorithms based on the problem (e.g., regression, classification, clustering)
  • Training: train the model using the prepared data
  • Validation: evaluate the model's performance using techniques like cross-validation

Detailed breakdown of the Modeling step:

  • 1. Selecting the Modeling Approach - Objective: Choose the appropriate type of model based on the problem and data
    • Types of Machine Learning Models
      • Supervised Learning: Models are trained on labeled data to make predictions or classifications
        • Regression: Predict continuous outcomes (e.g., Linear Regression, Ridge Regression)
        • Classification: Predict discrete classes (e.g., Logistic Regression, Decision Trees, Random Forests, SVMs)
      • Unsupervised Learning: Models are used on unlabeled data to find patterns or groupings
        • Clustering: Group similar data points (e.g., K-Means, Hierarchical Clustering)
        • Dimensionality Reduction: Reduce the number of features while preserving essential information (e.g., PCA, t-SNE)
      • Reinforcement Learning: Models learn by interacting with an environment to maximize cumulative rewards (e.g., Q-Learning, Deep Q-Networks)
  • 2. Splitting the Data - Objective: Divide the data into training and testing sets to evaluate model performance
    • Training Set: Used to train the model
    • Testing Set: Used to evaluate the model’s performance on unseen data
    • Validation Set: Sometimes used to tune hyperparameters and prevent overfitting (often through cross-validation)
  • 3. Training the Model - Objective: Fit the model to the training data
    • Fitting: Adjust the model parameters to minimize the error on the training data
    • Hyperparameters: Parameters that are not learned from the data but are set before training (e.g., number of trees in Random Forest, learning rate in Gradient Boosting)
  • 4. Hyperparameter Tuning - Objective: Optimize model performance by finding the best hyperparameters
    • Grid Search: Exhaustively search over a specified parameter grid
    • Random Search: Randomly sample hyperparameter settings from a specified parameter space (a RandomizedSearchCV sketch follows the grid search code below)
    • Bayesian Optimization: Use probabilistic models to find optimal hyperparameters
    • Cross-Validation: Use cross-validation to evaluate different hyperparameter settings
  • 5. Model Evaluation - Objective: Assess the model’s performance using metrics and validate its generalization ability
    • Evaluation Metrics: Choose metrics based on the problem type (e.g., accuracy, precision, recall, F1-score for classification; mean squared error, R² for regression)
    • Confusion Matrix: For classification problems, show the performance across different classes
    • ROC Curve and AUC: Evaluate classification models’ ability to discriminate between classes
  • 6. Model Interpretation - Objective: Understand and interpret the model’s behavior and predictions
    • Feature Importance: Evaluate which features are most influential in the model
    • Coefficients: Examine the coefficients in linear models to understand feature impacts
    • SHAP and LIME: Use techniques like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) to explain complex models
  • 7. Model Validation - Objective: Validate the model’s performance and robustness
    • Cross-Validation: Perform k-fold cross-validation to assess the model’s stability and performance across different data splits
    • Holdout Set: Use a separate holdout set if available to validate the model’s performance on truly unseen data
  • 8. Model Deployment - Objective: Deploy the model into a production environment for real-world use
    • Model Serialization: Save the trained model using serialization libraries like pickle or joblib
    • Integration: Integrate the model with existing systems or applications
    • Monitoring: Monitor the model’s performance over time and update it as necessary
  • 9. Iteration and Improvement - Objective: Refine and improve the model based on feedback and new data
    • Model Refinement: Adjust model parameters or try different algorithms to improve performance
    • Feature Engineering: Revisit feature engineering to enhance model input
    • Data Augmentation: Collect more data or use techniques to augment existing data

    double_arrow Selecting the Modeling Approach

      
      from sklearn.linear_model import LogisticRegression
      from sklearn.cluster import KMeans
      
      # Classification model
      model = LogisticRegression()
      
      # Clustering model
      kmeans = KMeans(n_clusters=3)
                              

    double_arrow Splitting the Data

      
      from sklearn.model_selection import train_test_split
      
      # Split the data
      X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['target']), df['target'], test_size=0.2, random_state=42)
                              

    double_arrow Training the Model

      
      # Train the model
      model.fit(X_train, y_train)
                              

    double_arrow Hyperparameter Tuning

      
      from sklearn.model_selection import GridSearchCV
      
      # Define parameter grid
      param_grid = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'saga']}
      
      # Grid search
      grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
      grid_search.fit(X_train, y_train)
      
      # Best parameters
      print(grid_search.best_params_)
                              
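    Random search can be sketched with RandomizedSearchCV, which samples a fixed number of settings from the parameter space instead of trying every combination; the parameter values below are illustrative:

      from sklearn.model_selection import RandomizedSearchCV
      
      # Sample n_iter parameter settings at random rather than exhaustively
      param_distributions = {'C': [0.01, 0.1, 1, 10, 100], 'solver': ['liblinear', 'saga']}
      random_search = RandomizedSearchCV(LogisticRegression(), param_distributions, n_iter=5, cv=5, random_state=42)
      random_search.fit(X_train, y_train)
      print(random_search.best_params_)
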

    double_arrow Model Evaluation

      
      from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
      
      # Predict on test set
      y_pred = model.predict(X_test)
      
      # Evaluate performance
      accuracy = accuracy_score(y_test, y_pred)
      print(f'Accuracy: {accuracy}')
      
      # Classification report
      print(classification_report(y_test, y_pred))
      
      # Confusion matrix
      print(confusion_matrix(y_test, y_pred))
                              

    double_arrow Model Interpretation

      
      import shap
      
      # SHAP values
      explainer = shap.Explainer(model, X_train)
      shap_values = explainer(X_test)
      shap.summary_plot(shap_values, X_test)
                              

    double_arrow Model Validation

      
      from sklearn.model_selection import cross_val_score
      
      # Cross-validation
      scores = cross_val_score(model, df.drop(columns=['target']), df['target'], cv=5)
      print(f'Cross-Validation Scores: {scores}')
      print(f'Mean Score: {scores.mean()}')
                              

7. Evaluation

Essential for validating the performance and effectiveness of your machine learning models; proper evaluation helps in making informed decisions and improving model reliability and robustness

By using appropriate metrics, performing cross-validation, comparing different models, analyzing errors, and interpreting results, you ensure that the model not only performs well on training data but also generalizes effectively to new, unseen data

  • Metrics: use evaluation metrics such as accuracy, precision, recall, F1 score, RMSE (Root Mean Square Error), etc.
  • Comparison: compare different models and select the best-performing one
  • Interpretation: interpret the model results in the context of the business problem

Detailed breakdown of the Evaluation step:

  • 1. Define Evaluation Metrics - Objective: Select appropriate metrics based on the type of problem (e.g., classification, regression)
    • Classification Metrics:
      • Accuracy: The proportion of correctly classified instances out of all instances
      • Precision: The proportion of true positive predictions out of all positive predictions (true positives + false positives)
      • Recall: The proportion of true positive predictions out of all actual positive instances (true positives + false negatives)
      • F1 Score: The harmonic mean of precision and recall, balancing both metrics
      • ROC Curve: A graphical plot showing the true positive rate against the false positive rate
      • AUC (Area Under the Curve): The area under the ROC curve, indicating the model’s ability to discriminate between classes
    • Regression Metrics:
      • Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values
      • Mean Squared Error (MSE): The average of the squared differences between predicted and actual values
      • Root Mean Squared Error (RMSE): The square root of MSE, providing error magnitude in the same units as the target variable
      • R-squared (R²): The proportion of variance in the target variable that is predictable from the features
  • 2. Cross-Validation - Objective: Assess the model’s performance on different subsets of the data to ensure it generalizes well
    • K-Fold Cross-Validation: Split the data into k subsets (folds) and train/test the model k times, each time using a different fold as the test set and the remaining k-1 folds as the training set
    • Stratified K-Fold: Ensure that each fold has a representative proportion of each class (for classification problems); a sketch follows the cross-validation code below
  • 3. Model Comparison - Objective: Compare the performance of different models to select the best one
    • Benchmarking: Compare models against each other using the same evaluation metrics
    • Statistical Tests: Use statistical tests to determine if differences in model performance are significant (a paired t-test sketch follows the comparison code below)
  • 4. Model Validation - Objective: Validate the model’s performance to ensure it meets the project’s goals and requirements
    • Holdout Validation: Use a separate holdout set (if available) to test the model’s performance on data it has never seen before
    • Real-World Testing: Test the model in a real-world setting or a production-like environment to ensure it performs as expected
  • 5. Error Analysis - Objective: Analyze model errors to understand where the model is making mistakes and how it can be improved
    • Residual Analysis: For regression, analyze residuals (differences between predicted and actual values) to check for patterns; a plotting sketch follows the error analysis code below
    • Confusion Matrix: For classification, examine the confusion matrix to understand misclassifications
  • 6. Model Interpretation - Objective: Interpret and understand the model’s predictions and decision-making process
    • Feature Importance: Identify which features are most important for the model’s predictions
    • Partial Dependence Plots: Visualize the effect of individual features on the model’s predictions (a sketch follows the interpretation code below)
    • SHAP and LIME: Use tools like SHAP and LIME to provide local explanations for individual predictions
  • 7. Document and Report Findings - Objective: Document the evaluation results, insights, and decisions
    • Evaluation Report: Create a comprehensive report summarizing the model’s performance, strengths, weaknesses, and recommendations
    • Visualizations: Include charts and plots to illustrate the model’s performance and error analysis

    double_arrow Define Evaluation Metrics

      
      # CLASSIFICATION
      # Model predictions
      from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
      y_pred = model.predict(X_test)
      y_prob = model.predict_proba(X_test)[:, 1]  # Probabilities for ROC
      
      # Accuracy
      accuracy = accuracy_score(y_test, y_pred)
      
      # Precision, Recall, F1 Score
      precision = precision_score(y_test, y_pred)
      recall = recall_score(y_test, y_pred)
      f1 = f1_score(y_test, y_pred)
      
      # ROC Curve
      fpr, tpr, _ = roc_curve(y_test, y_prob)
      roc_auc = auc(fpr, tpr)
      print(f'Accuracy: {accuracy}')
      print(f'Precision: {precision}')
      print(f'Recall: {recall}')
      print(f'F1 Score: {f1}')
      print(f'ROC AUC: {roc_auc}')
      
      # REGRESSION
      # Model predictions
      y_pred = model.predict(X_test)
      
      # MAE, MSE, RMSE
      from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
      import numpy as np  # for the RMSE square root below
      mae = mean_absolute_error(y_test, y_pred)
      mse = mean_squared_error(y_test, y_pred)
      rmse = np.sqrt(mse)
      
      # R-squared
      r2 = r2_score(y_test, y_pred)
      print(f'MAE: {mae}')
      print(f'MSE: {mse}')
      print(f'RMSE: {rmse}')
      print(f'R-squared: {r2}')
                              

    double_arrow Cross-Validation

      
      # Cross-validation
      from sklearn.model_selection import cross_val_score
      scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # for classification
      print(f'Cross-Validation Scores: {scores}')
      print(f'Mean Score: {scores.mean()}')
                              
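      Stratified K-Fold preserves class proportions in every fold; a minimal sketch reusing the same model, X, and y:

      from sklearn.model_selection import StratifiedKFold, cross_val_score
      
      # Each fold keeps roughly the same class distribution as the full dataset
      skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
      scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
      print(f'Stratified CV Scores: {scores}')
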

    double_arrow Model Comparison

      
      # Compare multiple models
      from sklearn.model_selection import cross_val_score
      from sklearn.linear_model import LogisticRegression
      from sklearn.ensemble import RandomForestClassifier
      model1 = LogisticRegression()
      model2 = RandomForestClassifier()
      scores1 = cross_val_score(model1, X, y, cv=5)
      scores2 = cross_val_score(model2, X, y, cv=5)
      print(f'Model 1 Scores: {scores1.mean()}')
      print(f'Model 2 Scores: {scores2.mean()}')
                              
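      One simple (if approximate) statistical check is a paired t-test on the per-fold scores; a sketch reusing scores1 and scores2 from the comparison above:

      from scipy.stats import ttest_rel
      
      # Paired t-test on per-fold scores; a small p-value suggests the difference is unlikely to be chance
      # (fold scores are not fully independent, so treat this as a rough check)
      t_stat, p_value = ttest_rel(scores1, scores2)
      print(f'Paired t-test p-value: {p_value}')
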

    double_arrow Error Analysis

      
      # Confusion matrix
      from sklearn.metrics import confusion_matrix
      conf_matrix = confusion_matrix(y_test, y_pred)
      print(f'Confusion Matrix:\n{conf_matrix}')
                              
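      For regression models, residual analysis can be sketched with matplotlib; a pattern-free cloud of residuals around zero is the hoped-for outcome:

      import matplotlib.pyplot as plt
      
      # Residuals = actual - predicted values
      residuals = y_test - y_pred
      plt.scatter(y_pred, residuals)
      plt.axhline(0, color='red', linestyle='--')
      plt.xlabel('Predicted values')
      plt.ylabel('Residuals')
      plt.title('Residual Plot')
      plt.show()
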

    double_arrow Model Interpretation

      
      # SHAP values
      import shap
      explainer = shap.Explainer(model, X_train)
      shap_values = explainer(X_test)
      shap.summary_plot(shap_values, X_test)
                              
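      Partial dependence plots can be sketched with scikit-learn's PartialDependenceDisplay (available in recent scikit-learn versions; 'feature_1' and 'feature_2' are hypothetical column names):

      from sklearn.inspection import PartialDependenceDisplay
      import matplotlib.pyplot as plt
      
      # Show how the model's average prediction changes as each selected feature varies
      PartialDependenceDisplay.from_estimator(model, X_test, features=['feature_1', 'feature_2'])
      plt.show()
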

8. Deployment

Ensures that the machine learning model is not only functional but also reliable and secure in a real-world environment

By carefully managing serialization, integration, monitoring, updating, documentation, and security, you can successfully deploy a model that delivers valuable insights and predictions to end-users

  • Implementation: deploy the model in a production environment
  • Integration: integrate the model with existing systems and workflows
  • Monitoring: continuously monitor the model’s performance and update as needed

Detailed breakdown of the Deployment step:

  • 1. Model Serialization - Objective: Save the trained model in a format that can be easily loaded and used in production
    • Pickle: A Python library used to serialize and deserialize Python objects
    • Joblib: A library optimized for serializing large numpy arrays
  • 2. Model Integration - Objective: Integrate the model into the application or system where it will be used
    • Batch Processing - Objective: Apply the model to a batch of data periodically
      • Scripts: Write scripts to load the model and process new data batches
      • Scheduled Jobs: Use cron jobs or task schedulers to automate batch processing
    • Real-Time Processing - Objective: Use the model to make predictions in real-time
      • APIs: Deploy the model behind a web API using frameworks like Flask, FastAPI, or Django
      • Microservices: Use containerization tools like Docker and orchestration platforms like Kubernetes
  • 3. Model Monitoring - Objective: Continuously monitor the model’s performance and health in production
    • Performance Monitoring - Objective: Track key performance metrics to detect degradation
      • Accuracy Tracking: Monitor metrics like accuracy, precision, recall, etc.
      • Latency Tracking: Measure the time taken for the model to make predictions (a timing sketch follows the drift-detection code below)
    • Data Drift Detection - Objective: Detect changes in the input data distribution that may affect model performance
      • Statistical Tests: Use statistical methods to compare the current data distribution with the training data distribution
      • Monitoring Tools: Utilize tools and platforms like Evidently, DataRobot, or custom scripts
  • 4. Model Updating and Retraining - Objective: Update and retrain the model to maintain or improve performance
    • Scheduled Retraining: Retrain the model at regular intervals using new data
    • Triggered Retraining: Retrain the model when performance metrics indicate degradation
  • 5. Documentation and Reporting - Objective: Document the deployment process, including setup, configurations, and usage
    • Deployment Guide: Create a guide detailing the steps to deploy and run the model
    • API Documentation: Provide clear documentation for any APIs, including endpoints, request/response formats, and examples
    • Monitoring Reports: Regularly report on model performance, issues, and updates
  • 6. Security and Compliance - Objective: Ensure the deployed model adheres to security and compliance standards
    • Data Privacy: Implement measures to protect sensitive data
    • Access Control: Restrict access to the model and data to authorized users
    • Compliance: Ensure the deployment complies with relevant regulations (e.g., GDPR, HIPAA)

    double_arrow Model Serialization

      
      import pickle
      import joblib
      
      # Using pickle
      with open('model.pkl', 'wb') as f:
          pickle.dump(model, f)
      
      # Using joblib
      joblib.dump(model, 'model.joblib')
      
      # Loading the model
      with open('model.pkl', 'rb') as f:
          loaded_model = pickle.load(f)
                                  

    double_arrow Model Integration

      
      # Script to load the model and process a batch of data
      import joblib
      
      def process_batch(data_batch):
          model = joblib.load('model.joblib')
          predictions = model.predict(data_batch)
          return predictions
      # Example of a cron job to run the script daily
      # 0 0 * * * /usr/bin/python /path/to/your/script.py
      
      from flask import Flask, request, jsonify
      import joblib
      
      app = Flask(__name__)
      model = joblib.load('model.joblib')
      
      @app.route('/predict', methods=['POST'])
      def predict():
          data = request.get_json(force=True)
          prediction = model.predict([data['features']])
          return jsonify({'prediction': prediction.tolist()})
      
      if __name__ == '__main__':
          app.run(debug=True)
                                  

    double_arrow Model Monitoring

      
      from scipy.stats import ks_2samp
      
      # Function to detect data drift
      def detect_drift(new_data, reference_data):
          p_value = ks_2samp(new_data, reference_data).pvalue
          return p_value < 0.05  # If p-value is less than 0.05, data drift is detected
      
      # Monitor data drift
      is_drifted = detect_drift(new_data['feature'], reference_data['feature'])
      print(f'Data Drift Detected: {is_drifted}')
                                  
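      Latency tracking can be sketched with a small timing wrapper around the prediction call (the logging configuration below is an illustrative choice):

      import time
      import logging
      
      logging.basicConfig(level=logging.INFO)
      
      def predict_with_monitoring(model, features):
          # Time each prediction call and log the latency so slow responses can be spotted
          start = time.time()
          prediction = model.predict(features)
          latency_ms = (time.time() - start) * 1000
          logging.info(f'Prediction latency: {latency_ms:.2f} ms')
          return prediction
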

    double_arrow Model Updating and Retraining

      
      def retrain_model(new_data, new_labels):
          # Load existing model
          model = joblib.load('model.joblib')
          
          # Retrain model with new data
          model.fit(new_data, new_labels)
          
          # Save the retrained model
          joblib.dump(model, 'model.joblib')
      
      # Example of retraining the model
      new_data = ...  # Load new data
      new_labels = ...  # Load new labels
      retrain_model(new_data, new_labels)
                                  

    double_arrow Security and Compliance

      
      from flask import Flask, request, jsonify
      import joblib
      import jwt  # For authentication tokens
      
      app = Flask(__name__)
      model = joblib.load('model.joblib')
      
      def authenticate(token):
          # Implement token authentication logic
          try:
              payload = jwt.decode(token, 'your-secret-key', algorithms=['HS256'])
              return payload['user'] == 'authorized_user'
          except jwt.ExpiredSignatureError:
              return False
          except jwt.InvalidTokenError:
              return False
      
      @app.route('/predict', methods=['POST'])
      def predict():
          token = request.headers.get('Authorization')
          if not authenticate(token):
              return jsonify({'error': 'Unauthorized'}), 401
          
          data = request.get_json(force=True)
          prediction = model.predict([data['features']])
          return jsonify({'prediction': prediction.tolist()})
      
      if __name__ == '__main__':
          app.run(debug=True)
                                  

9. Communication

Ensures that the results of the data science process are effectively conveyed to stakeholders in a clear, concise, and actionable manner

By tailoring the message to the audience, using appropriate formats and visualizations, and gathering feedback, data scientists can ensure that their work has a meaningful impact on decision-making and business outcomes

  • Reporting: create reports and visualizations to communicate findings to stakeholders
  • Presentation: present insights and recommendations in a clear and actionable manner

Detailed breakdown of the Communication step:

    • 1. Identify the Audience - Objective: Tailor the communication to the knowledge level, interests, and needs of different stakeholders
      • Executives: Focus on high-level insights, business impact, and strategic recommendations
      • Technical Teams: Provide detailed methodology, technical findings, and implications for system integration
      • End Users: Highlight practical benefits and usability of the results
    • 2. Structure the Message - Objective: Organize the communication in a clear and logical manner
      • Executive Summary - Objective: Summarize the key findings, implications, and recommendations
        • Key Insights: Highlight the most critical findings
        • Business Impact: Explain how the findings affect the business
        • Recommendations: Provide actionable suggestions based on the analysis
      • Detailed Analysis - Objective: Provide a thorough explanation of the data, methodology, and results
        • Data Overview: Describe the data sources, volume, and characteristics
        • Methodology: Explain the analytical techniques and models used
        • Results: Present detailed findings with supporting evidence
      • Visualizations - Objective: Use visual aids to enhance understanding and engagement
        • Graphs and Charts: Use bar charts, line graphs, scatter plots, etc., to illustrate key points
        • Tables: Present detailed numerical data in a clear and organized manner
        • Dashboards: Create interactive dashboards for real-time exploration of the results
    • 3. Deliver the Message - Objective: Choose the appropriate format and medium for communication
      • Reports - Objective: Provide a comprehensive document detailing the entire analysis process and findings
        • Written Report: Include sections for executive summary, detailed analysis, visualizations, and appendices
        • Technical Documentation: Provide detailed technical explanations and code snippets for reproducibility
      • Presentations - Objective: Summarize and present key findings in a concise and engaging format
        • Slides: Use presentation software (e.g., PowerPoint, Keynote) to create slides
        • Storytelling: Build a narrative around the findings to maintain interest and relevance
      • Interactive Dashboards - Objective: Allow stakeholders to explore the data and findings interactively
        • Tools: Use tools like Tableau, Power BI, or web-based applications (e.g., Dash, Streamlit) to create dashboards (a Streamlit sketch follows this list)
        • Customization: Tailor dashboards to show relevant metrics and allow filtering and drilling down into the data
    • 4. Feedback and Iteration - Objective: Gather feedback from stakeholders to refine and improve the communication and the model itself
      • Feedback Sessions: Hold meetings or workshops to present findings and gather feedback
      • Surveys: Use surveys to collect structured feedback on the clarity and usefulness of the communication
      • Iteration: Revise the analysis, models, and communication materials based on the feedback received
    • 5. Documentation and Archiving - Objective: Ensure that all materials are properly documented and archived for future reference
      • Version Control: Use version control systems (e.g., Git) to track changes in code and documentation
      • Documentation: Maintain comprehensive documentation of the analysis process, code, and findings
      • Archiving: Store reports, presentations, and datasets in a centralized and accessible location

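      double_arrow Interactive Dashboards

        A minimal, hypothetical Streamlit sketch of an interactive results dashboard (the results file, column names, and metric shown are illustrative assumptions):
        
        import pandas as pd
        import streamlit as st
        
        # Load model results and let stakeholders filter them interactively
        results_df = pd.read_csv('model_results.csv')  # hypothetical results file
        
        st.title('Model Results Dashboard')
        selected_segment = st.selectbox('Segment', results_df['segment'].unique())
        filtered = results_df[results_df['segment'] == selected_segment]
        
        st.metric('Mean predicted value', round(float(filtered['prediction'].mean()), 2))
        st.bar_chart(filtered.set_index('record_id')['prediction'])
        
        # Run with: streamlit run dashboard.py
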

10. Maintenance

Ensures that the machine learning model remains effective and relevant over time

By continuously monitoring performance, retraining as necessary, versioning models, ensuring compliance, and keeping stakeholders informed, you can maintain the reliability and accuracy of the model in a dynamic environment

Proper maintenance helps in sustaining the value provided by the model and adapting to changes in data and business needs

  • Updating: regularly update the model with new data to maintain accuracy
  • Monitoring: monitor for any changes in data patterns and adjust the model accordingly

Detailed breakdown of the Maintenance step:

  • 1. Continuous Monitoring - Objective: Continuously track the model’s performance and health in production to detect issues early
    • Performance Metrics - Objective: Monitor key metrics that indicate the model's performance
      • Accuracy Metrics: Track metrics like accuracy, precision, recall, F1 score, etc., for classification models
      • Error Metrics: Monitor metrics like MAE, MSE, RMSE, etc., for regression models
      • Custom Metrics: Use business-specific metrics that align with the goals of the model
    • Data Quality - Objective: Ensure the incoming data is of high quality and consistent with the training data
      • Data Integrity: Check for missing values, outliers, and anomalies
      • Data Distribution: Monitor the distribution of incoming data to detect shifts or drifts
  • 2. Retraining and Updating - Objective: Regularly update the model to incorporate new data and improve performance
    • Scheduled Retraining - Objective: Retrain the model at regular intervals using new data
      • Frequency: Determine an appropriate retraining schedule (e.g., weekly, monthly, quarterly)
      • Automation: Automate the retraining process using scripts and scheduling tools
    • Triggered Retraining - Objective: Retrain the model when performance metrics fall below a certain threshold
      • Performance Thresholds: Set thresholds for key metrics that trigger retraining
      • Monitoring: Continuously monitor performance metrics and trigger retraining when necessary
  • 3. Model Versioning - Objective: Keep track of different versions of the model to manage updates and rollbacks
    • Version Control: Use version control systems to track changes in the model, code, and data
    • Metadata: Maintain metadata about each version, including training data, hyperparameters, performance metrics, and deployment dates
  • 4. Model Governance - Objective: Ensure the model adheres to regulatory and compliance requirements
    • Compliance - Objective: Maintain compliance with relevant regulations (e.g., GDPR, HIPAA)
      • Data Privacy: Ensure that data used for training and inference complies with data privacy regulations
      • Audit Trails: Maintain logs of model training, updates, and predictions for audit purposes
    • Ethical Considerations - Objective: Ensure the model is fair and does not introduce bias
      • Bias Detection: Implement techniques to detect and mitigate bias in the model
      • Fairness Metrics: Monitor metrics like demographic parity, equal opportunity, and disparate impact
  • 5. Communication and Documentation - Objective: Keep stakeholders informed about the model’s performance, updates, and any issues
    • Regular Reports: Provide periodic reports on model performance, issues detected, and updates made
    • Documentation Updates: Continuously update documentation to reflect changes in the model, data, and processes

    double_arrow
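
      A minimal, hypothetical sketch of threshold-triggered retraining (the accuracy threshold and the recent_data/recent_labels batch are illustrative assumptions; retrain_model is the helper defined in the Deployment step):
      
      from sklearn.metrics import accuracy_score
      import joblib
      
      # Evaluate the deployed model on a recently labeled batch of data
      # (recent_data / recent_labels are assumed to be loaded elsewhere)
      model = joblib.load('model.joblib')
      recent_accuracy = accuracy_score(recent_labels, model.predict(recent_data))
      
      # Trigger retraining when performance drops below an agreed threshold
      ACCURACY_THRESHOLD = 0.85  # illustrative value, set per business requirements
      if recent_accuracy < ACCURACY_THRESHOLD:
          retrain_model(recent_data, recent_labels)
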