AI - Artificial Intelligence

Artificial Intelligence Basics

Simulation of human intelligence in machines that are programmed to think, learn, and make decisions autonomously

These systems are designed to mimic human cognitive functions like problem-solving, language understanding, learning from experience, and pattern recognition

Branch of computer science dealing with the simulation of intelligent behavior in computers

History of Artificial Intelligence

  • 1940s-1950s: The Birth of AI Concepts
    • 1943: McCulloch & Pitts Model: Warren McCulloch and Walter Pitts created a mathematical model for neural networks, laying the foundation for AI concepts
    • 1950: Turing Test: Alan Turing proposed the Turing Test, a criterion to determine if a machine can exhibit intelligent behavior indistinguishable from that of a human
    • 1956: Dartmouth Conference: John McCarthy coined the term "Artificial Intelligence," marking the formal beginning of AI as a field of study
    • 1957: Frank Rosenblatt built the Mark 1 Perceptron, the first computer based on a neural network that 'learned' through trial and error
    • 1959: Arthur Samuel published an algorithm for a checkers program using machine learning
  • 1960s-1970s: The Rise of Early AI Systems
    • 1961: First Industrial Robot: Unimate, the first industrial robot, was introduced, automating assembly line tasks at General Motors
    • 1966: ELIZA: Joseph Weizenbaum developed ELIZA, one of the first chatbots that could mimic human conversation, albeit in a limited way
    • 1969: Shakey the Robot: Shakey was developed by SRI International, capable of planning, reasoning, and problem-solving in simple environments
  • 1980s: AI Winter and Expert Systems
    • 1980s: Expert Systems Boom: Systems like XCON by Digital Equipment Corporation were used commercially, using rules-based logic to simulate expert-level decision-making in specific fields
    • 1980s: Neural networks that use the backpropagation algorithm to train themselves become widely used in AI applications
    • Late 1980s: AI Winter: Hype and unrealistic expectations led to reduced funding and interest in AI research due to limited progress and the failure of early systems
  • 1990s: Machine Learning and Data-Driven AI
    • 1997: Deep Blue Defeats Kasparov: IBM’s Deep Blue defeated world chess champion Garry Kasparov, showcasing AI's potential in complex problem-solving
    • 1990s: Introduction of Machine Learning: The focus shifted to machine learning, leveraging statistical techniques and large datasets to improve AI capabilities
  • 2000s: Rise of Data and Neural Networks
    • 2006: Deep Learning Renaissance: Geoffrey Hinton popularized deep learning, leading to significant advancements in neural networks and AI capabilities
    • 2009: The ImageNet database of human-tagged images is presented at the CVPR conference
  • 2010s: AI in Everyday Applications
    • 2011: IBM Watson: Watson won the quiz show "Jeopardy!" against human champions, demonstrating the power of natural language processing
    • 2012: ImageNet Competition: AlexNet, a deep convolutional neural network, drastically improved image recognition accuracy, marking a breakthrough in computer vision
    • 2014: Generative Adversarial Networks (GANs): Ian Goodfellow introduced GANs, which allowed AI to generate new, realistic data, such as images and music
    • 2016: AlphaGo Defeats Lee Sedol: Google DeepMind’s AlphaGo defeated world champion Go player Lee Sedol, showcasing AI’s strategic thinking capabilities
    • 2018: Waymo launches a commercial self-driving car service in the suburbs of Phoenix
    • 2019: IBM Project Debater holds a full debate, including rebuttals, with a champion human debater
  • 2020s: AI in the Real World
    • 2020: GPT-3 by OpenAI: A significant leap in natural language processing, GPT-3 demonstrated the ability to generate human-like text, improving chatbots, translation, and content creation
    • 2022: DALL-E and Stable Diffusion: AI models like DALL-E and Stable Diffusion allowed for the creation of detailed images from text prompts, revolutionizing digital art and design

Types of Artificial Intelligence

  • Based on Functionalities:
    • Purely Reactive: These AI machines have no memory or past data to work with and specialize in a single field of work, such as playing chess
    • Limited Memory: These AI machines collect previous data and continue adding it to their memory. They have enough memory or experience to make proper decisions, but memory is minimal (suggesting a restaurant in the neighborhood)
      • Reinforcement learning, which learns to make better predictions through repeated trial-and-error
      • Long Short Term Memory (LSTM), which utilizes past data to help predict the next item in a sequence. LSTMs view more recent information as most important when making predictions and discount data from further in the past, though still utilizing it to form conclusions
      • Evolutionary Generative Adversarial Networks (E-GAN), which evolves over time, growing to explore slightly modified paths based on previous experiences with every new decision. This model is constantly in pursuit of a better path and utilizes simulations and statistics, or chance, to predict outcomes throughout its evolutionary mutation cycle
    • Theory of Mind: A projected future type of AI that would understand emotions and thoughts in order to interact socially
    • Self-Aware: Another projected AI machine with self-awareness capabilities to make conscious decisions
  • Based on Capabilities:
    • Weak (narrow) AI (ANI) - AI systems that are designed to perform specific tasks and are limited to those tasks only. These AI systems excel at their designated functions but lack general intelligence. Weak AI operates within predefined boundaries and cannot generalize beyond its specialized domain (voice assistants like Siri or Alexa, recommendation algorithms, image recognition systems, self-driving cars)
    • Strong (general) AI (AGI) - AI systems that possess human-level intelligence or even surpass human intelligence across a wide range of tasks. Strong AI would be capable of understanding, reasoning, learning, and applying knowledge to solve complex problems in a manner similar to human cognition
    • Artificial Superintelligence (ASI): Hypothetical AI surpassing human intelligence in all aspects, potentially capable of solving complex problems and making advancements beyond human comprehension
  • Based on Technologies:
    • Machine Learning (ML) - AI systems capable of self-improvement through experience, without direct programming; concentrate on creating software that can independently learn by accessing and utilizing data
    • Deep Learning - A subset of ML involving many layers of neural networks; used for learning from large amounts of data and is the technology behind voice control in consumer devices, image recognition, and many other applications
    • Natural Language Processing (NLP) - Enables machines to understand and interpret human language; used in chatbots, translation services, and sentiment analysis applications
    • Robotics - Designing, constructing, operating, and using robots and computer systems for controlling them, sensory feedback, and information processing
    • Computer Vision - Allows machines to interpret the world visually, and it's used in various applications such as medical image analysis, surveillance, and manufacturing
    • Expert Systems - Answer questions and solve problems in a specific domain of expertise using rule-based systems

Applications of Artificial Intelligence

  • Natural Language Processing (NLP) - speech recognition, machine translation, sentiment analysis, and virtual assistants like Siri and Alexa
  • Image and Video Analysis - facial recognition, object detection and tracking, content moderation, medical imaging, and autonomous vehicles
  • Robotics and Automation - manufacturing, healthcare, logistics, and exploration
  • Recommendation Systems - e-commerce, streaming platforms, and social media to personalize user experiences
  • Financial Services - fraud detection, algorithmic trading, credit scoring, and risk assessment
  • Healthcare - disease diagnosis, medical imaging analysis, drug discovery, personalized medicine, and patient monitoring
  • Virtual Assistants and Chatbots - customer support, information retrieval, and personalized assistance
  • Gaming - creating realistic virtual characters, opponent behavior, and intelligent decision-making
  • Smart Homes and IoT - smart home systems that can automate tasks, control devices, and learn from user preferences
  • Cybersecurity - detecting and preventing cyber threats by analyzing network traffic, identifying anomalies, and predicting potential attacks

ML - Machine Learning

ML Basics

Study of programs that are not explicitly programmed; instead, these algorithms learn patterns from data

Application of AI that provides systems the ability to learn on their own and improve from experiences without being programmed externally

  • Factors that have contributed to the current state of Machine Learning are:
    • bigger data sets
    • faster computers
    • open source packages
    • wide range of neural network architectures
  • Machine Learning Workflow:
    • Problem definition - Define the problem, goals, and the expected outcome. What are we trying to predict or classify, what is the metric for success?
    • Data collection - Identify and gather relevant data from sources like databases, APIs, third-party sources, or web scraping. Machines initially learn from the data that you give them, so it is of the utmost importance to collect reliable data so that your machine learning model can find the correct patterns. The quality of the data that you feed to the machine will determine how accurate your model is.
    • Data preprocessing and exploration
      • Putting together all the data you have and randomizing it
      • Cleaning the data - handling unwanted data, missing values, unneeded rows and columns, duplicate values, and data type conversion
      • Transformation - convert data into a suitable format. This may involve normalization, scaling, encoding categorical variables, and creating new features
      • Feature Engineering - create new features or select the most relevant features for the model. Techniques may include binning, polynomial features, or aggregations
      • Understand data patterns, distributions, and relationships
      • Visualize the data to understand how it is structured and understand the relationship between various variables and classes present
      • Statistical Analysis - calculate basic statistics (mean, median, standard deviation) and investigate correlations or outliers
      • Insights - identify patterns that could impact model performance, such as class imbalance, outliers, or multicollinearity
    • Modeling - A machine learning model determines the output you get after running a machine learning algorithm on the collected data. It is important to choose a model which is relevant to the task at hand. Training is the most important step in machine learning. In training, you pass the prepared data to your machine learning model to find patterns and make predictions.
      • Choose a suitable algorithm based on the problem type, test several algorithms on a subset of the data to see which ones might yield the best performance
      • Splitting Data - divide data into training (the set your model learns from), validation, and test sets (used to check the accuracy of your model after training) to avoid overfitting and ensure generalization
      • Hyperparameter Tuning - adjust parameters to optimize model performance, often using techniques like grid search or random search
      • Training - feed the training set into the model and adjust parameters to minimize error. Various training methods like batch, mini-batch, or online learning may be used
    • Evaluation
      • Validation - Testing the performance of the model on previously unseen data. The unseen data used is the testing set that you split your data into earlier
      • Metrics Selection - Use appropriate evaluation metrics based on the problem type (e.g., accuracy, precision, recall, F1-score for classification; RMSE, MAE for regression)
      • Error Analysis - Identify where the model is underperforming and analyze misclassified instances or high-error predictions
    • Decision Making and Deployment
  • Common taxonomy:
    • target - category or value you are trying to predict
    • features - explanatory variables used for prediction
    • example - an observation or single data point within the data
    • label - the value of the target for a single data point
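  • A minimal pandas sketch of this taxonomy (the column names and values below are made up for illustration):

    import pandas as pd
    
    # Each row of the DataFrame is an example (a single observation)
    data_df = pd.DataFrame({
        "sqft": [1400, 2000, 850],           # feature (explanatory variable)
        "bedrooms": [3, 4, 2],               # feature
        "price": [240000, 355000, 150000],   # target (the value we try to predict)
    })
    
    X = data_df[["sqft", "bedrooms"]]  # features
    y = data_df["price"]               # targets; y.iloc[0] is the label of the first example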

Types of Machine Learning

  • Supervised learning - used to predict data. Method in which a model learns from a labeled dataset containing input-output pairs. Each input in the dataset has a corresponding correct output (the label), and the model's task is to learn the relationship between the inputs and outputs. This enables the model to make predictions on new, unseen data by applying the learned mapping.
    • Categories:
      • Regression - The model anticipates a continuous value or quantity
      • Classification - The model predicts a discrete class label or category
    • Algorithms:
      • Linear Regression: used for predicting a continuous target variable, models the relationship between input features and the output by fitting a linear equation
      • Logistic Regression: used for binary classification tasks, estimates probabilities using the logistic function and applies a threshold for classification
      • K-Nearest Neighbors (KNN): classifies data points based on their proximity to neighboring points, often used for both classification and regression
      • Support Vector Machines (SVM): finds the hyperplane that maximizes the margin between different classes, effective for classification, especially in high-dimensional spaces
      • Decision Trees: tree-like model where each node represents a feature, and the branches represent decision rules, works for both classification and regression
      • Random Forests: ensemble of decision trees, random forests reduce overfitting by averaging multiple decision trees
      • Gradient Boosting Machines (GBMs): ensemble technique that builds trees sequentially, where each tree corrects the errors of the previous one, popular variations include XGBoost and LightGBM
    • Applications:
      • Healthcare: Used to predict patient diagnoses based on symptoms and past medical history
      • Finance: For credit scoring and predicting stock prices
      • Retail: To forecast sales, recommend products, and personalize marketing
      • Autonomous Vehicles: These are used to recognize traffic signs and pedestrians
      • Speech Recognition: In virtual assistants and transcription services
    • Advantages:
      • Effectiveness: Supervised learning can predict outcomes based on past data
      • Simplicity: It's relatively easy to understand and implement
      • Performance Evaluation: It is easy to measure the performance of a supervised learning model since the ground truth (labels) is known
      • Applications: Can be used in various fields like finance, healthcare, marketing, etc
      • Feature Importance: It allows an understanding of which features are most important in making predictions
    • Disadvantages:
      • Dependency on Labeled Data: Supervised learning requires a large amount of labeled data, which can be expensive and time-consuming
      • Overfitting: Models can become too complex and fit the noise in the training data rather than the actual signal, which degrades their performance on new data
      • Generalization: Sometimes, these models do not generalize well to unseen data if the data they were trained on does not represent the broader context
  • Unsupervised learning - used to find hidden patterns or structures in data. The training data is unlabeled - no target values or annotations are provided. This data is fed to the Machine Learning algorithm and is used to train the model. The trained model tries to search for a pattern and give the desired response.
    • Categories:
      • Clustering - Data is grouped into subsets (clusters) such that data in each cluster are more similar than those in others
      • Association - Discovering rules that capture interesting relationships between variables in large databases (e.g., market basket analysis)
      • Dimensionality Reduction - Reducing the number of random variables under consideration (e.g., PCA, t-SNE), which helps to simplify the data without losing important information
    • Algorithms:
      • K-Means Clustering: Groups data into K clusters by minimizing the variance within each cluster
      • Hierarchical Clustering: builds a hierarchy of clusters either by agglomerative (bottom-up) or divisive (top-down) approaches
      • Principal Component Analysis (PCA): reduces the dimensionality of data by transforming it into a set of uncorrelated variables called principal components, useful for data compression and visualization
      • Association Rule Learning: identifies relationships or associations between variables in large datasets, commonly used in market basket analysis, algorithms include Apriori and Eclat
      • Gaussian Mixture Models (GMM): a probabilistic approach to clustering, where data points are assumed to belong to a mixture of Gaussian distributions
    • Applications:
      • Customer Segmentation: Businesses use clustering to segment customers based on behaviors and preferences for targeted marketing
      • Anomaly Detection: Identifying unusual data points can be critical in fraud detection or network security
      • Recommendation Systems: Associative models help build recommendation systems that suggest products based on user behavior
      • Feature Extraction: Used in preprocessing steps to extract new features from raw data which can improve the accuracy of predictive models
      • Image Segmentation: Applied in computer vision to divide an image into meaningful segments and analyze each segment individually
    • Advantages:
      • Discovering Hidden Patterns: It can identify patterns and relationships in data that are not initially evident
      • No Need for Labelled Data: Works with unlabeled data, making it useful where obtaining labels is expensive or impractical
      • Reduction of Complexity in Data: Helps reduce the dimensionality of data, making complex data more comprehensible
      • Feature Discovery: This can be used to find useful features that can improve the performance of supervised learning algorithms
      • Flexibility: Can handle changes in input data or the environment since it doesn’t rely on predefined labels
    • Disadvantages:
      • Interpretation of Results: The results can be ambiguous and harder to interpret than those from supervised learning models
      • Dependency on Input Data: The output quality heavily depends on the quality of the input data
      • Lack of Precise Objectives: Without specific tasks like prediction or classification, the direction of learning is less focused, leading to less actionable insights
  • Reinforcement learning - learns from its mistakes and experiences. The algorithm discovers data through a process of trial and error and then decides what action results in higher rewards. Three major components make up reinforcement learning: the agent is the learner or decision-maker, the environment includes everything that the agent interacts with, and the actions are what the agent does.
    • Categories:
      • Model-based RL - The agent builds a model of the environment and uses it to predict future rewards and states. This allows the agent to plan by considering potential future situations before taking action
      • Model-free RL - The agent learns to act without explicitly constructing a model of the environment. It directly learns the value of actions or action policies from experience, using methods like Q-learning or policy gradients
      • Partially Observable RL - The agent doesn't have access to the full state of the environment. The agent must learn to make decisions based on incomplete information, often using strategies that involve maintaining internal state estimates
    • Algorithms:
      • Q-Learning: a value-based method where the agent learns a policy that tells it the best action to take in each state, aiming to maximize cumulative reward
      • Deep Q Networks (DQN): combines Q-Learning with deep neural networks, enabling it to handle more complex, high-dimensional state spaces
      • Policy Gradient Methods: optimizes the agent's policy directly by maximizing expected rewards, algorithms include REINFORCE and Proximal Policy Optimization (PPO)
    • Applications:
      • Autonomous Vehicles: RL is used to develop autonomous driving systems, helping vehicles learn to navigate complex traffic environments safely
      • Robotics: RL enables robots to learn complex tasks like walking, picking up and manipulating objects, and interacting with humans and other robots in a dynamic environment
      • Gaming: In the gaming industry, RL is used to develop AI that can challenge human players, adapt to their strategies, and provide engaging gameplay
      • Finance: RL can be applied to trading and investment strategies where the algorithm learns to make buying and selling decisions to maximize financial return
      • Healthcare: RL algorithms are being explored for various applications in healthcare, including personalized treatment recommendation systems and management of healthcare logistics
    • Advantages:
      • Adaptability: RL agents can adapt to new environments or changes within their environment, making them suitable for dynamic and uncertain situations
      • Decision-Making Autonomy: RL agents make decisions based on learned experiences rather than pre-defined rules, which can be advantageous in complex environments where manual behavior specification is impractical
      • Continuous Learning: Since the learning process is continuous, RL agents can improve their performance over time as they gain more experience
      • Handling Complexity: RL can handle problems with high complexity and numerous possible states and actions, which might be infeasible for traditional algorithms
      • Optimization: RL is geared towards optimization of the decision-making process, aiming to find the best sequence of actions for any given situation
    • Disadvantages:
      • Dependency on Reward Design: The effectiveness of an RL agent is heavily dependent on the design of the reward system. Poorly designed rewards can lead to unwanted behaviors
      • High Computational Cost: Training RL models often requires significant computational resources and time, especially as the complexity of the environment increases
      • Sample Inefficiency: RL algorithms typically require many interactions with the environment to learn effective policies, which can be impractical in real-world scenarios where each interaction could be costly or time-consuming
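To make the first two paradigms concrete, here is a minimal scikit-learn sketch contrasting supervised and unsupervised learning on synthetic data (reinforcement learning is omitted because it requires an interactive environment):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans
    
    # Synthetic data: X holds the features, y holds the labels
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    
    # Supervised: learn the mapping from inputs X to known labels y
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print("training accuracy:", clf.score(X, y))
    
    # Unsupervised: ignore the labels and look for structure in X alone
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster assignments (first 10 points):", km.labels_[:10])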

Assessing Performance

Error Types

  • Training error
    • Error that a machine learning model makes on the same dataset it was trained on
    • Measures how well the model has fit the training data by calculating the discrepancy between the predicted values and the actual values in this dataset
    • Training error is overly optimistic
    • A small training error does not imply good predictions unless the training data includes everything the model might ever see
    • A model with a very low training error may be overfitting, meaning it has "memorized" the training data rather than learning a general pattern
    • Overfitting often leads to low training error but high test error (the error on unseen data), indicating poor generalization.
    • Training error is typically lower than test error, since the model is optimized directly on the training data
    • High bias (underfitting) will result in high training error, whereas high variance (overfitting) can produce low training error but high test error
  • Generalization error
    • Measure of how accurately a machine learning model predicts outcomes for new, unseen data
    • Reflects the model's ability to generalize the patterns it learned during training and apply them to data that wasn't part of the training set
    • Lower generalization error indicates that a model has learned well from its training data and can make reliable predictions on new data
    • Often measured using the model’s performance on a test or validation dataset, which the model hasn’t seen during training
    • Arises from three main sources: bias (error due to overly simplistic assumptions in the learning algorithm), variance (error due to excessive sensitivity to small fluctuations in the training set) and irreducible error (natural noise in the data that cannot be reduced by any model)
    • High generalization error is often a result of overfitting, where the model performs well on the training data but poorly on the test data, indicating it has learned noise rather than true patterns
    • Underfitting, where both training and test errors are high, also leads to a high generalization error since the model fails to capture the patterns in the data
    • Techniques to minimize generalization error include cross-validation, regularization, using simpler models, and gathering more training data
  • Test error
    • Error that a machine learning model makes on a test dataset, which is a set of data that the model has never seen during training
    • Estimate of the model’s ability to generalize to new, unseen data and provides a measure of how well the model will likely perform in real-world applications
    • Low test error indicates that the model has learned the underlying patterns well without overfitting or underfitting the training data
    • Significant difference between training and test errors indicates potential overfitting (if test error is much higher) or underfitting (if both errors are high)
    • Test error should be close to the training error to show good generalization
    • Low test error means the model generalizes well, while a high test error may imply poor generalization and the need for model adjustments
    • Methods to reduce test error and improve generalization include cross-validation, regularization, and tuning model complexity (such as adjusting polynomial degrees or pruning decision trees)
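  • A minimal sketch of the gap between training and test error on synthetic data; an unconstrained decision tree is used here because it can memorize the training set, which makes the gap easy to see:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error
    
    X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    # An unconstrained tree can fit the training data almost perfectly (overfitting)
    model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print("training MSE:", train_mse)  # near zero: overly optimistic
    print("test MSE:", test_mse)       # estimate of the generalization error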

Error Sources

  • Noise
    • Refers to random, unpredictable variations or errors in data that do not represent the true underlying patterns or signals the model aims to capture
    • Can arise from many sources, including measurement errors, data entry mistakes, external environmental factors, or natural randomness
    • Noise can cause a model to learn irrelevant details in the training data (overfitting), reducing its ability to generalize to new data
    • Choosing models that are too complex can cause them to be overly sensitive to noise, while simpler models may ignore noise but also miss capturing genuine patterns
    • Measurement Noise: errors in data collection, such as imprecise sensors or human errors in data entry
    • Environmental Noise: external, uncontrollable factors that influence the data, like fluctuating market conditions affecting sales data
    • Irreducible Noise: the portion of noise that cannot be eliminated, even with an ideal model, often due to inherent randomness in the phenomenon being studied
  • Bias
    • Refers to the error introduced by approximating a real-world problem, which may have complex relationships, with a simpler model
    • Measures how far off the predictions are from the actual values on average
    • Model with high bias makes strong assumptions about the data and is too simple to capture the underlying pattern accurately
    • High Bias → Underfitting: the model is too simplistic, leading to poor performance both on the training data and on new (test) data, fails to capture the true relationship
  • Variance
    • Measures how sensitive the model is to fluctuations in the training data
    • The average squared difference between the predictions of each model you train (on different training samples) and the average prediction across those models
    • High variance models adapt too closely to the training data and may capture noise along with the true pattern
    • Model with high variance is typically too complex and overly tuned to the training data
    • High Variance → Overfitting: the model performs well on the training data but fails to generalize to new data, resulting in poor test performance
  • Bias-variance tradeoff refers to the tradeoff between two sources of error that affect the performance of a model: bias and variance
  • \[ \text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]

    • \[\text{Bias}^2\] : error from incorrect model assumptions

    • Variance : error from sensitivity to small fluctuations in the training data

    • Irreducible Error : random noise in the data that cannot be eliminated

    • The key challenge is finding the optimal balance between bias and variance to minimize the total error
    • If the model is too simple (high bias), it will miss relevant patterns in the data (underfit)
    • If the model is too complex (high variance), it will fit the training data too well, capturing noise and performing poorly on unseen data (overfit)
    • Strategies to Address Bias-Variance Tradeoff
      • Cross-Validation: Use techniques like k-fold cross-validation to assess model performance on different subsets of the data
      • Regularization (Ridge/Lasso): Introduce penalty terms in the objective function to reduce model complexity (variance) while retaining essential patterns
      • Feature Engineering: Carefully select features to prevent overfitting and reduce bias
      • Ensemble Models: Use models like bagging or boosting (e.g., random forests, gradient boosting) to reduce variance
      • Early Stopping: In iterative models like neural networks, stop training when performance on a validation set starts to degrade
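    • A minimal sketch of the tradeoff using scikit-learn's validation_curve, sweeping a decision tree's max_depth as the complexity knob (the synthetic data and parameter range are arbitrary choices for illustration):

      import numpy as np
      from sklearn.datasets import make_regression
      from sklearn.model_selection import validation_curve
      from sklearn.tree import DecisionTreeRegressor
      
      X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
      
      depths = np.arange(1, 11)  # model complexity: shallow = high bias, deep = high variance
      train_scores, val_scores = validation_curve(
          DecisionTreeRegressor(random_state=0), X, y,
          param_name="max_depth", param_range=depths,
          cv=5, scoring="neg_mean_squared_error")
      
      # Training error keeps falling with depth, while validation error
      # eventually rises again once the tree starts fitting noise
      for d, tr, va in zip(depths, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
          print(f"max_depth={d}: train MSE={tr:.1f}, validation MSE={va:.1f}")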

ML techniques

  • Training and Test Splits
    • Splitting your data into a training and a test set can help you choose a model that has better chances at generalizing and is not overfitted
    • The training data is used to fit the model, while the test data is used to measure error and performance
    • Training error tends to decrease with a more complex model
    • 
      from sklearn.model_selection import train_test_split
      
      # Define the feature(s) and target variable
      y_col = "price" # define y column
      X = data_df.drop(y_col, axis=1) # drop y column from features data
      y = data_df[y_col] # select the y (target) column
      
      # split the data into training and testing sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      print("number of test samples :", X_test.shape[0])
      print("number of training samples:",X_train.shape[0])
                              

      model_selection.train_test_split - split arrays or matrices into random train and test subsets

      • X, y: the allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas data frames
      • test_size: if float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int (integer), it represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25
      • random_state: controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls
  • Transforming Target
    • Making the target variable normally distributed often will lead to better results
      • Using a visual approach - by plotting a histogram
      • Using a statistical test - the test outputs a p-value; the higher the p-value, the weaker the evidence that the distribution deviates from normal
    • Linear Regression assumes normally distributed residuals, which can be aided by transforming the y (target) variable
      • Log Transformation - can transform data that is significantly skewed right to be more normally distributed
      • Square root Transformation
      • Box-Cox Transformation - parametrized transformation that tries to get distributions "as close to a normal distribution as possible"
      • \[ \text{boxcox}(y_i) = \frac{y_i^{\lambda} - 1}{\lambda}\]

      scipy.stats.boxcox - return a dataset transformed by a Box-Cox power transformation

    
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats.mstats import normaltest # D'Agostino K^2 Test
    normaltest(data_df.target.values)
    
    log_target = np.log(data_df.target)
    log_target.hist();
    
    sqrt_target = np.sqrt(data_df.target)
    sqrt_target.hist();
    
    from scipy.stats import boxcox
    bc_result = boxcox(data_df.target)
    boxcox_target = bc_result[0]
    lam = bc_result[1]
    plt.hist(boxcox_target);
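
    # To use predictions on the original scale after training on the Box-Cox-transformed
    # target, invert the transform with the fitted lambda (np.exp would invert a log
    # transform) - a minimal sketch continuing from the variables above:
    from scipy.special import inv_boxcox
    original_scale_target = inv_boxcox(boxcox_target, lam)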
                            

Regression (Supervised Learning)

Statistical method used to model the relationship between a dependent (target or outcome) variable and one or more independent (predictors or features) variables

Predicts the continuous output variables based on the independent input variable

Goal is to understand how changes in the independent variables are associated with changes in the dependent variable and to make predictions about the dependent variable based on known values of the independent variables

  • Application:
    • Sales forecasting
    • Satisfaction analysis
    • Price estimation
    • Employment income
  • Regression models:
    • Linear Regression
    • Polynomial Regression
    • Ridge Regression (L2 Regularization)
    • Lasso Regression (L1 Regularization)
    • Elastic Net Regression
    • Support Vector Regression (SVR)
    • Decision Tree Regression
    • Random Forest Regression
    • Gradient Boosting Regression
    • XGBoost Regression
    • K-Nearest Neighbors (KNN) Regression
    • Bayesian Regression
    • Principal Component Regression (PCR)
    • Partial Least Squares Regression (PLSR)
    • Quantile Regression

Assessing Performance

  • Cross-Validation
  • Statistical method used in machine learning and data science to assess the performance of a model by splitting the data into multiple subsets (folds) and training/testing the model on different combinations of these subsets.

    The purpose is to provide a more accurate and robust estimate of the model’s performance on unseen data, helping to prevent overfitting and improve generalization

    • K-Fold Cross-Validation
      • dataset is randomly split into k equally sized folds or subsets
      • model is trained on k−1 folds and tested on the remaining fold, rotating through all folds so that each fold serves as a test set once
      • final performance score is the average of the scores from each fold, providing a reliable estimate of model accuracy
    • Leave-One-Out Cross-Validation (LOOCV)
      • k equals the number of data points
      • model is trained on all data points except one, and this process is repeated for each data point in the dataset
      • provides an unbiased performance estimate but is computationally intensive for large datasets
    • Stratified Cross-Validation
      • useful for imbalanced datasets, as it maintains the same proportion of classes (for classification problems) in each fold as in the full dataset, ensuring consistent class distribution across training and validation sets
    • Nested Cross-Validation
      • used when hyperparameter tuning is involved. It has two levels of cross-validation: the outer loop estimates model performance, while the inner loop is used for hyperparameter tuning, ensuring an unbiased evaluation of tuned models
    
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.linear_model import LinearRegression
    from sklearn.datasets import make_regression
    
    # Sample data
    X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
    
    # Model
    model = LinearRegression()
    
    # KFold Cross-Validation
    kfold = KFold(n_splits=5)
    results = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
    
    # Average MSE across folds
    average_mse = -results.mean()
    print("Average MSE:", average_mse)
                        

model_selection.KFold K-Fold cross-validator that splits the dataset into k folds, allowing cross-validation

  • n_splits - number of folds, must be at least 2

model_selection.cross_val_score evaluates model's score through cross validation

  • estimator - the object to use to fit the data
  • X, y - the data to fit, the target variable to try to predict in the case of supervised learning
  • cv - determines the cross-validation splitting strategy (None, int, CV splitter)
  • scoring - a str or a scorer callable object / function with signature scorer (estimator, X, y) which should return only a single value

model_selection.cross_val_predict produces the out-of-sample (cross-validated) prediction for each row

model_selection.GridSearchCV scans over parameters to select the best hyperparameter set with the best out-of-sample score
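A minimal GridSearchCV sketch, scanning the regularization strength of a Ridge model (the parameter grid and synthetic data are arbitrary choices for illustration):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, KFold
    
    X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
    
    # Scan a grid of hyperparameter values and keep the best cross-validated setting
    param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
    grid = GridSearchCV(Ridge(), param_grid, cv=KFold(n_splits=5),
                        scoring="neg_mean_squared_error")
    grid.fit(X, y)
    
    print("best hyperparameters:", grid.best_params_)
    print("best cross-validated score (neg MSE):", grid.best_score_)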

  • Regularization
  • Technique used to prevent overfitting by penalizing high-valued coefficients (shrinks coefficient values and reduces model complexity)

    Especially useful in complex models like high-degree polynomial regression, deep neural networks, and decision trees, where the risk of overfitting is high

    Modifies the cost function used to train the model by adding a penalty term, typically multiplied by a hyperparameter λ (regularization strength)

    Regularization techniques have an analytical, a geometric, and a probabilistic interpretation

    • Ridge (L2 Regularization)
    • \[ \text{L2 penalty} = \lambda \sum_{i=1}^{n} w_i^2 \]

      • penalizes the magnitude of the regression coefficients by adding a squared term
      • complexity penalty λ is applied proportionally to squared coefficient values
      • shrinks the coefficients toward zero, but not exactly to 0
      • minimizes the influence of irrelevant features but does not remove them
      • faster to train
      • this imposes bias on the model, but also reduces variance
      • we can select the best regularization strength lambda via cross-validation
      • it’s a best practice to scale features (e.g. using StandardScaler) so penalties aren’t impacted by variable scale
    • LASSO (L1 Regularization) - Least Absolute Shrinkage and Selection Operator
    • \[ \text{L1 penalty} = \lambda \sum_{i=1}^{n} |w_i| \]

      • penalizes the absolute value of the coefficients
      • complexity penalty λ is proportional to the absolute value of coefficients
      • sets irrelevant features to 0
      • finds features you don't need
      • LASSO is more likely than Ridge to perform feature selection, for a fixed λ LASSO is more likely to result in coefficients being set to zero
      • LASSO’s feature selection property yields an interpretability advantage, but may underperform if the target truly depends on many of the features
    • Elastic Net (L1+L2 Regularization)
    • \[ \text{Elastic Net penalty} = \lambda_1 \sum_{i=1}^{n} |w_i| + \lambda_2 \sum_{i=1}^{n} w_i^2 \]

      • combines both L1 and L2 regularization, balancing between feature selection (L1) and feature shrinkage (L2)
      • penalizes both the squared magnitude and the absolute value of the coefficients
      • sets irrelevant features to 0 and enforces the coefficients to be lower
      • introduces a new parameter α (alpha) that determines a weighted average of L1 and L2 penalties
    • Dropout Regularization (for Neural Networks)
      • temporarily “drops out” random neurons during each training step by setting their outputs to zero with a specified probability
      • prevents any single neuron from becoming too important, encouraging the network to learn redundant representations and reducing overfitting
    • Early Stopping
      • training is stopped as soon as the model’s performance on a validation set begins to deteriorate (prevents the model from learning noise in the training data)
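    A minimal sketch comparing the Ridge, LASSO, and Elastic Net penalties described above with scikit-learn (note that sklearn's alpha plays the role of the regularization strength λ and l1_ratio sets the L1/L2 mix; the synthetic data and alpha values are arbitrary choices for illustration):

    from sklearn.datasets import make_regression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge, Lasso, ElasticNet
    from sklearn.model_selection import cross_val_score
    
    X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                           noise=10.0, random_state=0)
    
    # Scale features first so the penalty treats all coefficients comparably
    models = {
        "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
        "lasso": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
        "elastic net": make_pipeline(StandardScaler(), ElasticNet(alpha=1.0, l1_ratio=0.5)),
    }
    
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
        print(name, "cross-validated MSE:", -scores.mean())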

    Evaluation

    • Mean Absolute Error (MAE) - Computes the average absolute differences between the predicted and actual values. It provides a measure of the average magnitude of errors. Treats all errors equally by taking their absolute values.
    • \[ MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \]

      • \[n: \text{Total number of observations}\]

      • \[y_i: \text{Actual value for the}\] \[i^{th} \text{ data point}\]

      • \[\hat{y}_i: \text{Predicted value for the}\] \[i^{th} \text{ data point}\]

      • \[\left| y_i - \hat{y}_i \right|: \text{Absolute error for the}\] \[i^{th} \text{ prediction}\]

      • Properties:
        • Can take any non-negative value
        • A perfect model will have an MAE of 0, meaning no error between actual and predicted values
        • Gives equal importance to small and large errors, making it less sensitive to outliers
        • Retains the same unit as the dependent variable, providing a direct interpretation of the average error
      • Advantages/Disadvantages:
        • A: Robust to Outliers - it doesn’t square the errors
        • A: Simple to Understand and Calculate
        • D: Not Differentiable at Zero
        • D: Insensitive to Large Errors - may not penalize large errors
      • When to use:
        • When Outliers are Present - If the dataset contains outliers or noisy data
        • When an easily interpretable error metric is needed in the same unit as the target variable
        • Often used in time series forecasting to evaluate how close predictions are to the observed values
      • How it works
        • Actual (y) Predicted (y_hat) Error (y-y_hat) Absolute Error
          3.0 2.5 0.5 0.5
          5.0 4.5 0.5 0.5
          2.0 2.5 -0.5 0.5
          7.0 6.0 1.0 1.0
        • Calculate Errors: For each data point, find the difference between the actual value and the predicted value
        • Take the Absolute Value: Ignore whether the error is positive or negative by taking the absolute value of each error
        • Average the Absolute Errors: Sum all the absolute errors and divide by the number of observations
        • (0.5 + 0.5 + 0.5 + 1.0) / 4 = 2.5 / 4 = 0.625 - on average, the model’s predictions deviate from the actual values by 0.625 units
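      • Example in Python (a minimal sketch mirroring the MSE and RMSE examples below; mean_absolute_error is sklearn's built-in MAE metric)
      • 
        import numpy as np
        from sklearn.metrics import mean_absolute_error
        
        # Example usage
        y_true = np.array([3.0, -0.5, 2.0, 7.0])
        y_pred = np.array([2.5, 0.0, 2.1, 7.8])
        
        # Calculate MAE with sklearn
        mae = mean_absolute_error(y_true, y_pred)
        print("Mean Absolute Error:", mae)
        
        # Equivalent manual calculation
        error = np.mean(np.abs(y_true - y_pred))
        print("Mean Absolute Error:", error)
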
    • Mean Squared Error (MSE) - Squares the differences between predicted and actual values before averaging, emphasizing larger errors. It captures how well a model’s predictions align with the true outcomes, with more weight given to larger errors due to squaring.
    • \[ MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

      • \[n: \text{Total number of observations}\]

      • \[y_i: \text{Actual value for the}\] \[i^{th} \text{ data point}\]

      • \[\hat{y}_i: \text{Predicted value for the}\] \[i^{th} \text{ data point}\]

      • \[\left( y_i - \hat{y}_i \right)^2: \text{Squared error for the}\] \[i^{th} \text{ prediction}\]

      • Properties:
        • Always non-negative
        • A perfect model has an MSE of 0, indicating no difference between actual and predicted values
        • Penalizes large deviations more than small ones, making it sensitive to outliers
        • Expressed in the square of the unit of the target variable
      • Advantages/Disadvantages:
        • A: Popular in Machine Learning - used extensively in model evaluation and optimization
        • A: Mathematically Convenient - differentiable, which makes it suitable for gradient-based optimization techniques to minimize error during training
        • D: Sensitive to Outliers - large errors have a significant impact on MSE, making it less robust to datasets with noisy data or outliers
        • D: Squared units of MSE can make it harder to interpret
      • When to use:
        • Training Machine Learning Models - common loss function for many algorithms
        • Evaluating Models - provides insight into the magnitude of the average squared error
        • When Large Errors are Critical
      • How it works
        • Actual (y) Predicted (y_hat) Error (y-y_hat) Squared Error
          3.0 2.5 0.5 0.25
          5.0 4.5 0.5 0.25
          2.0 2.5 -0.5 0.25
          7.0 6.0 1.0 1.00
        • Calculate Errors: Subtract the predicted value from the actual value for each observation
        • Square the Errors: Squaring ensures that all errors are positive and emphasizes larger errors more than smaller ones
        • Average the Squared Errors: Sum the squared errors and divide by the total number of observations
        • (0.25 + 0.25 + 0.25 + 1.00) / 4 = 1.75 / 4 = 0.4375
      • Example in Python
      • 
        from sklearn.metrics import mean_squared_error
        import numpy as np
        
        # Example usage
        y_true = np.array([3.0, -0.5, 2.0, 7.0])
        y_pred = np.array([2.5, 0.0, 2.1, 7.8])
        
        # Calculate MSE with sklearn
        mse = mean_squared_error(y_true, y_pred)
        print("Mean Squared Error:", mse)
        
        # Define a manual MSE function
        def mse_manual(y_true, y_pred):
            return np.mean((y_true - y_pred) ** 2)
        
        # Calculate MSE manually
        error = mse_manual(y_true, y_pred)
        print("Mean Squared Error:", error)
                                

        metrics.mean_squared_error(y_true, y_pred) - mean squared error regression loss

        • y_true: ground truth (correct) target values
        • y_pred: estimated target values
    • Root Mean Squared Error (RMSE) - The square root of Mean Squared Error (MSE) shares the same unit as the target variable, enhancing its interpretability
    • \[ RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 } \]

      • \[n: \text{Total number of observations}\]

      • \[y_i: \text{Actual value for the}\] \[i^{th} \text{ data point}\]

      • \[\hat{y}_i: \text{Predicted value for the}\] \[i^{th} \text{ data point}\]

      • Properties:
        • Can take any non-negative value
        • A perfect model would have an RMSE of 0, meaning there is no difference between actual and predicted values
        • Gives higher weight to large errors because of the squaring step, this makes it sensitive to outliers
        • Retains the same unit of measurement as the dependent variable
      • Advantages/Disadvantages:
        • A: Easy Interpretation - same units as the target variable
        • A: Effective for Penalizing Large Errors - particularly useful when large errors are undesirable and need to be minimized
        • D: Sensitive to Outliers - single large error can disproportionately increase the RMSE
        • D: Not Robust - if the data contains noise or outliers, RMSE may not be the best performance measure
        • D: Comparison Limitation - it is difficult to compare RMSE across different datasets with varying scales
      • When to use:
        • Regression Models
        • Forecasting - helps in understanding how much the forecast deviates from the observed values
        • Model Selection - often used to compare models, the one with the lowest RMSE is considered better
      • How it works
        • Actual (y) Predicted (y_hat) Error (y-y_hat) Squared Error
          3.0 2.5 0.5 0.25
          5.0 4.5 0.5 0.25
          2.0 2.5 -0.5 0.25
          7.0 6.0 1.0 1.00
        • Calculate Errors: For each data point, find the difference between the actual value and the predicted value
        • Square the Errors: Ensures that negative errors don’t cancel out positive ones and gives more weight to larger errors
        • Average the Squared Errors: Compute the mean of these squared errors
        • Take the Square Root: The square root brings the result back to the original unit of measurement
        • (0.25 + 0.25 + 0.25 + 1.00) / 4 = 0.4375; √0.4375 ≈ 0.661
      • Example in Python
      • 
        import numpy as np
        from sklearn.metrics import mean_squared_error
        
        # Example usage
        y_true = np.array([3.0, -0.5, 2.0, 7.0])
        y_pred = np.array([2.5, 0.0, 2.1, 7.8])
        
        # Calculate RMSE with sklearn
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        print("Root Mean Squared Error:", rmse)
        
        # Define a manual RMSE function
        def rmse_manual(y_true, y_pred):
            return np.sqrt(np.mean((y_true - y_pred) ** 2))
        
        # Calculate RMSE manually
        error = rmse_manual(y_true, y_pred)
        print("Root Mean Squared Error:", error)
                                
    • R-squared (R2) - Quantifies the predictability of the dependent variable based on the independent variables, typically ranging from 0 to 1, with 1 denoting flawless predictions. The coefficient of determination is a statistical measure that indicates the proportion of variance in the dependent (target) variable that is explained by the independent (predictor) variables in a regression model.
    • \[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]

      • \(SS_{res} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\) : Residual Sum of Squares (the sum of squared errors between actual and predicted values).

      • \(SS_{tot} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2\) : Total Sum of Squares (the total variance of the actual values from their mean).

      • \[y_i: \text{Actual value for the}\] \[i^{th} \text{ data point}\]

      • \[\hat{y}_i: \text{Predicted value for the}\] \[i^{th} \text{ data point}\]

      • \[\bar{y}: \text{Mean of the actual values}\]

      • Interpretation:
        • R² = 1: model explains all the variance in the target variable (perfect fit)
        • R² = 0: model explains none of the variance, meaning predictions are no better than simply using the mean of the actual values
        • R² < 0: indicates that the model performs worse than a simple mean-based model. It can happen if the predictions are far from the actual values
      • Advantages/Disadvantages:
        • A: Easy Interpretation - provides a clear percentage of how much variation in the target variable is explained by the model
        • A: Model Comparison - helps compare multiple models, model with a higher R² is generally better at explaining variance in the target variable
        • A: Widely Used - standard measure in regression analysis and is useful for determining model quality
        • D: Does Not Measure Predictive Accuracy - high R² does not guarantee that the model will make accurate predictions for new data
        • D: Sensitive to the Number of Predictors - adding more predictors can artificially inflate R²
        • D: Only Works with Linear Relationships - may be misleading if the relationship between variables is non-linear
      • How it works
        • Actual (y) Predicted (y_hat) Mean of Actual (y_bar)
          3.0 2.5 4.25
          5.0 4.5 4.25
          2.0 2.5 4.25
          7.0 6.0 4.25
        • Calculate Total Variance: measures how much the actual values vary from the mean. It reflects the overall variability in the data
        • Calculate Residual Variance: measures how far the predictions are from the actual values (prediction errors)
        • Compute R²: compares the proportion of unexplained variance (residual variance) to the total variance. If the residual variance is small compared to the total variance, R² will be close to 1, indicating that the model fits the data well.
        • SStot: (3.0 - 4.25) ^ 2 + (5.0 - 4.25) ^ 2 + (2.0 - 4.25) ^ 2 + (7.0 - 4.25) ^ 2 = 14.75
        • SSres: (3.0 - 2.5) ^ 2 + (5.0 - 4.5) ^ 2 + (2.0 - 2.5) ^ 2 + (7.0 - 6.0) ^ 2 = 1.75
        • R²: 1 - (1.75 / 14.75) ≈ 0.881 - meaning the model explains about 88.1% of the variance in the target variable
      • Example in Python
      • 
        import numpy as np
        from sklearn.metrics import r2_score
        
        # Example true and predicted values
        y_true = [3.0, -0.5, 2.0, 7.0]
        y_pred = [2.5, 0.0, 2.1, 7.8]
        
        # Calculate R-squared
        r_squared = r2_score(y_true, y_pred)
        print("R-squared:", r_squared)
        
        # Convert to numpy arrays
        y_true = np.array(y_true)
        y_pred = np.array(y_pred)
        
        # Calculate the mean of the true values
        y_mean = np.mean(y_true)
        
        # Calculate the total sum of squares (TSS) and residual sum of squares (RSS)
        ss_total = np.sum((y_true - y_mean) ** 2)
        ss_residual = np.sum((y_true - y_pred) ** 2)
        
        # Calculate R-squared
        r_squared = 1 - (ss_residual / ss_total)
        print("R-squared:", r_squared)
                                

        metrics.r2_score(y_true, y_pred) - R2 (coefficient of determination) regression score function (best possible score is 1.0)

        • y_true: ground truth (correct) target values
        • y_pred: estimated target values
    • Adjusted R-squared - An adjustment of R-squared that penalizes the addition of unnecessary predictors, providing a better measure of model complexity
    • \[ R^2_{adj} = 1 - \left( \frac{SS_{res} / (n - k - 1)}{SS_{tot} / (n - 1)} \right) \]

      • \(SS_{res} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\) : Residual Sum of Squares.

      • \(SS_{tot} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2\) : Total Sum of Squares.

      • \[n: \text{Total number of observations}\]

      • \[k: \text{Total number of predictors}\]

      • Interpretation:
        • Adjusted R² will be smaller than R² if additional predictors do not significantly improve the model
      • Example in Python
      • 
        import numpy as np
        from sklearn.metrics import r2_score
        
        # Example true and predicted values
        y_true = [3.0, -0.5, 2.0, 7.0]
        y_pred = [2.5, 0.0, 2.1, 7.8]
        
        # Calculate R-squared
        r_squared = r2_score(y_true, y_pred)
        print("R-squared:", r_squared)
        
        # Number of observations and predictors
        n = len(y_true)  # number of data points
        p = 1  # number of predictors (set to 1 for simplicity, adjust for more features)
        
        # Calculate Adjusted R-squared
        r_squared_adjusted = 1 - ((1 - r_squared) * (n - 1) / (n - p - 1))
        print("Adjusted R-squared:", r_squared_adjusted)
                                

    Linear Regression (simple / multiple)

    Model the relationship between a dependent (response/target) continuous variable and one or more independent (predictors/features) variables by fitting a linear equation to observed data

    Simple Linear Regression involves a single independent variable (relationship between this variable and the target is represented as a straight line)

    Multiple Linear Regression extends simple linear regression by including multiple predictors

    Used when the relationship between the dependent and independent variables is linear, the dataset is small to medium-sized and does not have too many complex features, outliers are minimal, and the assumptions of linearity and homoscedasticity are reasonable

    \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon \)

    • \[y\] : dependent (response/target) variable we aim to predict (changes when there is any change in the values of the independent variables)

    • \[x_i\] : independent (predictors/features) variables (does not change based on the effects of other variables)

    • \[\beta_0\] : y-intercept (the value of y when all \[x_i=0\])

    • \[\beta_i\] : coefficients corresponding to each independent variable (change (increase/decrease) in y for a unit increase in \[x_i\])

    • \[\epsilon\] : error term (variability in y that cannot be explained by the linear relationship with x)

    • Fitting a Linear Regression model (cost function)
    • Ordinary Least Squares (OLS) minimizes the sum of squared residuals (errors), i.e. the sum of squared differences between observed values and predicted values

      \( \text{Minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)

      • \[y_i\] : actual value

      • \[\hat{y}_i\] : predicted value

    • Assumptions
      • Linearity: Relationship between predictors and the target variable is linear
      • Independence: Observations are independent of each other
      • Homoscedasticity: Constant variance of errors across values of the independent variables
      • Normality of Errors: Residuals (differences between observed and predicted values) are normally distributed
    • Applications
      • Finance: Predicting stock prices or return on investment based on market indicators
      • Healthcare: Estimating health outcomes based on risk factors
      • Economics: Modeling economic indicators like GDP growth or unemployment rates
      • Marketing: Predicting sales based on advertising expenditure or customer demographics
    • Limitations
      • Assumes Linearity: Poor performance if the relationship is non-linear
      • Sensitive to Outliers: Outliers can disproportionately affect the model
      • Assumes Homoscedasticity: If the variance of the errors is not constant, the model's standard errors and confidence intervals become unreliable
      • Multicollinearity: High correlation between independent variables can lead to unreliable estimates
    • Example in Python
    • 
      
      import numpy as np
      import pandas as pd
      
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_squared_error, r2_score
      
      # Sample advertising spend vs. sales data (illustrative values)
      data = {'Advertising': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
              'Sales': [25, 31, 48, 55, 62, 74, 80, 88, 95, 107]}
      df = pd.DataFrame(data)
      
      # Define predictor and response variable
      X = df[['Advertising']]
      y = df['Sales']
      
      # Split into train and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
      
      # Initialize and fit the model
      model = LinearRegression()
      model.fit(X_train, y_train)
      
      # Predictions
      y_pred = model.predict(X_test)
      
      # Evaluation
      mse = mean_squared_error(y_test, y_pred)
      r2 = r2_score(y_test, y_pred)
      
      print("Mean Squared Error:", mse)
      print("R-squared:", r2)
                          

      linear_model.LinearRegression - ordinary least squares Linear Regression

      • coef_ - estimated coefficients for the linear regression problem
      • intercept_ - independent term in the linear model
      • fit(X, y) - fit linear model, X: training data; y: target values
      • predict(X) - predict using the linear model
      • score(X, y) - return the coefficient of determination R2 of the prediction
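
      A minimal sketch (made-up numbers, not from the notes above) showing how these attributes and methods fit together:

        import numpy as np
        from sklearn.linear_model import LinearRegression

        # Hypothetical data roughly following y = 2x + 1
        X = np.array([[1], [2], [3], [4], [5]])
        y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

        model = LinearRegression()
        model.fit(X, y)                  # fit(X, y): estimate the coefficients

        print(model.coef_)               # estimated slope(s)
        print(model.intercept_)          # estimated intercept
        print(model.predict([[6]]))      # prediction for a new observation
        print(model.score(X, y))         # R^2 on the training data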

    Polynomial Regression (simple / multiple)

    Type of regression analysis in which the relationship between the independent variable (predictor) and the dependent variable (response) is modeled as an n-th degree polynomial

    Polynomial regression is capable of capturing non-linear relationships by fitting a curved line to the data

    As the degree n increases, the curve becomes more flexible, allowing the model to capture more complex patterns in the data

    Used when relationship between variables is non-linear, but still smooth (can be captured by a polynomial curve)

    \( y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_n x^n + \epsilon \)

    • \[y\] : dependent (response/target) variable

    • \[x\] : independent (predictor/feature) variable

    • \[\beta_i\] : coefficients of the polynomial

    • \[n\] : degree of the polynomial (quadratic for n=2, cubic for n=3)

    • \[\epsilon\] : error term

    • Fitting a Polynomial Regression model (cost function)
    • Ordinary Least Squares (OLS) minimizes the sum of squared residuals (errors). The sum of squared differences between observed values and predicted values

      \( \text{Minimize } \sum_{i=1}^{n} \left( y_i - \left( \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_n x_i^n \right) \right)^2 \)

      • \[y_i\] : actual value

      • \[\hat{y}_i\] : predicted value

    • Steps
      • Choose the Degree of the Polynomial: The degree n determines the complexity of the curve
      • Transform the Independent Variable: The original variable x is expanded to include higher powers (x, x^2, x^3...)
      • Fit the Model Using Linear Regression: Treat the transformed terms as new features and apply ordinary least squares (OLS) regression
      • Evaluate and Refine: Check the model’s performance, possibly adjusting the polynomial degree to balance bias and variance
    • Choosing the Degree
      • Underfitting: When the degree is too low, the model might not capture the underlying trend in the data, leading to high bias
      • Overfitting: When the degree is too high, the model might capture noise or random fluctuations in the data rather than the underlying pattern, leading to high variance
      • Optimal Degree: Cross-validation or evaluating metrics on test data can help determine the optimal polynomial degree, balancing bias and variance (bias-variance tradeoff); see the sketch after the attribute list below
    • Assumptions
      • Linearity of Parameters: Although the model itself is non-linear, it is linear in terms of the parameters (coefficients)
      • Independence of Errors: Observations are independent of each other
      • Homoscedasticity: Constant variance of errors across values of the independent variables
      • Normality of Errors: Residuals (differences between observed and predicted values) are normally distributed
    • Applications
      • Economics: Modeling complex relationships between economic indicators
      • Environmental Science: Modeling growth patterns, such as population growth over time
      • Engineering: Estimating stress-strain relationships in material science
      • Marketing: Understanding the impact of advertising spending on sales, where effects are non-linear
    • Advantages
      • Captures Non-Linear Relationships: By introducing polynomial terms, the model can represent curved patterns in the data
      • Flexibility: Polynomial regression is more flexible than simple linear regression, making it suitable for data with a non-linear trend
      • Easy to Implement: Can be implemented using simple linear regression tools, with the only modification being the transformation of the input variables
    • Limitations
      • Overfitting: High-degree polynomials can overfit the training data, leading to poor generalization to new data
      • Sensitive to Outliers: Polynomial regression is particularly sensitive to outliers, which can skew the polynomial curve
      • Extrapolation Challenges: Predictions outside the range of the data can be highly unreliable, as polynomial regression is often unstable in these regions
      • Complexity Increases Quickly: As the polynomial degree increases, the model complexity increases, making it harder to interpret
    • Example in Python
    • 
      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt
      
      from sklearn.linear_model import LinearRegression
      from sklearn.preprocessing import PolynomialFeatures
      from sklearn.metrics import mean_squared_error
      
      # Sample data
      x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
      y = np.array([1, 4, 9, 16, 25, 36, 49, 64, 81])
      
      # Transform the data to include polynomial terms (e.g., x, x^2)
      poly = PolynomialFeatures(degree=2)  # Change degree as needed
      x_poly = poly.fit_transform(x)
      
      # Fit the polynomial regression model
      model = LinearRegression()
      model.fit(x_poly, y)
      
      # Predict values
      y_pred = model.predict(x_poly)
      
      # Evaluation
      mse = mean_squared_error(y, y_pred)
      print("Mean Squared Error:", mse)
      
      # Plotting the results
      plt.scatter(x, y, color='blue', label='Actual data')
      plt.plot(x, y_pred, color='red', label='Polynomial regression fit')
      plt.xlabel('X')
      plt.ylabel('Y')
      plt.legend()
      plt.show()
                          

      preprocessing.PolynomialFeatures - generate polynomial and interaction features

      • degree - int specifies the maximal degree of the polynomial features, or tuple for (min_degree, max_degree)
      • fit_transform - fit to data, then transform it (X: input samples; y: target values)
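
      As a sketch of the degree-selection step described above (cross-validation over candidate degrees), on a synthetic quadratic dataset invented for illustration:

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import cross_val_score, KFold

        # Synthetic data with a quadratic trend plus noise (illustrative only)
        rng = np.random.default_rng(0)
        x = np.linspace(0, 5, 40).reshape(-1, 1)
        y = 2 + 3 * x.ravel() - 0.5 * x.ravel() ** 2 + rng.normal(0, 0.5, 40)

        # fit_transform expands x into [1, x, x^2] for degree=2
        print(PolynomialFeatures(degree=2).fit_transform(x[:3]))

        # Compare degrees by cross-validated R^2; the best degree balances bias and variance
        cv = KFold(n_splits=5, shuffle=True, random_state=0)
        for degree in [1, 2, 3, 5, 9]:
            model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
            scores = cross_val_score(model, x, y, cv=cv, scoring='r2')
            print(degree, round(scores.mean(), 3))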

    Ridge Regression (L2 Regularization)

    Linear regression technique that includes a regularization term to mitigate overfitting, especially when working with multicollinear data

    Penalty is added to the linear regression objective function based on the squared magnitude of the coefficients, effectively “shrinking” coefficients towards zero but never allowing them to be exactly zero

    Used for scenarios with multicollinearity among features, high-dimensional data where overfitting is a concern, situations where interpretability is less critical than predictive performance

    \[ \text{Cost} = \text{MSE} + \lambda \sum_{j=1}^{p} w_j^2 \]

    • \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] : the Mean Squared Error

    • \[\lambda\] : regularization parameter that controls the strength of the penalty

    • \[w_j\] : coefficients (weights) of the model

    • As 𝜆 increases, the influence of the regularization term also increases, forcing the values of the coefficients 𝑤𝑗 to shrink, but not to exactly zero

    • How Ridge Regression Works
      • reduces the model’s variance by shrinking the coefficient estimates, making the model less sensitive to small fluctuations in the data
      • increasing 𝜆 introduces more bias, the goal is to find a balance that minimizes overall prediction error
      • when 𝜆 = 0 Ridge Regression is equivalent to ordinary least squares regression since no penalty is applied
      • as 𝜆 approaches infinity the model shrinks all coefficients towards zero, resulting in a model that only predicts the mean of the target variable
      • the optimal value for 𝜆 is determined through cross-validation
      • in datasets where features are correlated (multicollinear), Ridge Regression reduces the variance caused by multicollinearity by penalizing large coefficients
      • Ridge does not perform feature selection, so all features remain in the model with adjusted coefficients
    • Applications
      • Credit Risk Scoring: Predicting the probability of loan default with many features (e.g., customer demographics, financial history)
      • Marketing Campaign Response: Estimating customer response to an email campaign when some features (e.g., past purchases) are correlated
      • Biological Research: Modeling disease outcomes from thousands of correlated genetic features, where the L2 penalty stabilizes the coefficient estimates
    • Advantages
      • Reduces Overfitting: The L2 penalty stabilizes the model by controlling high-variance coefficients
      • Works with High-Dimensional Data: Performs well when there are more features than observations
      • Closed-Form Solution: Ridge Regression has a closed-form solution, making it computationally efficient for moderate numbers of features (see the sketch after the attribute list below)
    • Disadvantages
      • Does Not Perform Feature Selection: Unlike Lasso Regression, Ridge does not eliminate features, which can lead to less interpretable models
      • Sensitivity to Feature Scaling: Like many distance-based techniques, Ridge Regression is sensitive to the scale of features, so standardization or normalization is typically required
    • Example in Python
    • 
      from sklearn.linear_model import Ridge
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import mean_squared_error
      import numpy as np
      
      # Generate or load data
      X = np.random.rand(100, 3)  # example features
      y = X @ np.array([3, 5, 2]) + np.random.normal(0, 1, 100)  # example target with added noise
      
      # Split data into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Scale the features (important for regularized models)
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
      
      # Fit the Ridge Regression model
      ridge = Ridge(alpha=1.0)  # alpha is the regularization parameter (lambda)
      ridge.fit(X_train_scaled, y_train)
      
      # Predict and evaluate the model
      y_pred = ridge.predict(X_test_scaled)
      mse = mean_squared_error(y_test, y_pred)
      print("Mean Squared Error:", mse)
      print("Model Coefficients:", ridge.coef_)
                          

      linear_model.Ridge - linear least squares with l2 regularization

      • alpha - constant that multiplies the L2 term, controlling regularization strength
      • coef_ - weight vector(s)
      • intercept_ - independent term in decision function
      • fit(X, y) - fit Ridge regression model, X: training data; y: target values
      • predict(X) - predict using the linear model
      • score(X, y) - return the coefficient of determination R2 of the prediction
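
      The closed-form solution mentioned under Advantages can be written out directly; a minimal sketch on synthetic data (intercept disabled so both estimates solve the same problem), compared against sklearn's Ridge:

        import numpy as np
        from sklearn.linear_model import Ridge

        # Synthetic data (illustrative only)
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 3))
        y = X @ np.array([3.0, 5.0, 2.0]) + rng.normal(0, 1, 100)

        alpha = 1.0

        # Closed-form ridge weights: w = (X^T X + alpha * I)^(-1) X^T y
        w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

        ridge = Ridge(alpha=alpha, fit_intercept=False)
        ridge.fit(X, y)

        print(w_closed)
        print(ridge.coef_)  # should match the closed-form weights up to numerical precision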

    Lasso Regression (L1 Regularization) - Least Absolute Shrinkage and Selection Operator

    Regression technique that introduces a penalty equal to the absolute value of the magnitude of coefficients

    Uses an L1 penalty (the sum of absolute values of coefficients) that results in feature selection as it drives some coefficients to zero, effectively removing less important features from the model

    Used when feature selection is required (high-dimensional datasets with many irrelevant features), sparse solutions are desirable (only a few significant features should have non-zero coefficients), data overfitting is a concern

    \[ \text{Cost} = \text{MSE} + \lambda \sum_{j=1}^{p} |w_j| \]

    • \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] : the Mean Squared Error

    • \[\lambda\] : regularization parameter that controls the strength of the penalty

    • \[w_j\] : coefficients (weights) of the model

    • As 𝜆 increases, the penalty term grows and shrinks coefficients toward zero; when the penalty is sufficiently strong, some coefficients are forced to exactly zero, resulting in automatic feature selection

    • How Lasso Regression Works
      • can reduce some coefficients to exactly zero, which is particularly helpful in sparse datasets where only a subset of features influences the target variable (demonstrated in the sketch after the attribute list below)
      • By tuning 𝜆 you can control the number of features retained, balancing interpretability with predictive performance
      • increasing 𝜆 introduces more bias, as it forces the model to generalize by shrinking coefficients
      • Lowering 𝜆 allows more variance in the coefficients, but this can lead to overfitting if there is high noise in the data
      • If the data has highly correlated features, Lasso may arbitrarily select only one of them, ignoring the others
    • Applications
      • Credit Risk Scoring: Predicting the probability of loan default with many features (e.g., customer demographics, financial history)
      • Marketing Campaign Response: Estimating customer response to an email campaign when some features (e.g., past purchases) are correlated
      • Biological Research: Identifying significant genes for a disease from thousands of genetic features (Lasso helps with feature selection)
    • Advantages
      • Feature Selection and Sparsity: Lasso automatically performs feature selection, producing simpler and more interpretable models
      • Prevents Overfitting: The regularization term reduces the risk of overfitting by penalizing complex models
      • Useful for High-Dimensional Data: Particularly helpful in scenarios with many irrelevant features
    • Disadvantages
      • Feature Exclusion with Correlated Data: Lasso can exclude correlated features, leading to biased feature importance
      • Sensitivity to Feature Scaling: Lasso requires standardized or normalized data to perform effectively
    • Example in Python
    • 
      from sklearn.linear_model import Lasso
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import mean_squared_error
      import numpy as np
      
      # Generate or load sample data
      X = np.random.rand(100, 3)  # example features
      y = X @ np.array([3, 5, 0]) + np.random.normal(0, 1, 100)  # target with sparse signal
      
      # Split data into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Scale the features (important for regularized models)
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
      
      # Fit the Lasso Regression model
      lasso = Lasso(alpha=0.1)  # alpha is the regularization parameter (lambda); too large a value can zero out every coefficient
      lasso.fit(X_train_scaled, y_train)
      
      # Predict and evaluate the model
      y_pred = lasso.predict(X_test_scaled)
      mse = mean_squared_error(y_test, y_pred)
      print("Mean Squared Error:", mse)
      print("Model Coefficients:", lasso.coef_)
                          

      linear_model.Lasso - linear model trained with L1 prior as regularizer

      • alpha - constant that multiplies the L1 term, controlling regularization strength
      • coef_ - weight vector(s)
      • intercept_ - independent term in decision function
      • fit(X, y) - fit Lasso regression model, X: training data; y: target values
      • predict(X) - predict using the linear model
      • score(X, y) - return the coefficient of determination R2 of the prediction
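
      To make the sparsity behaviour concrete, a minimal sketch on synthetic data (only two of five features carry signal) showing how a larger alpha drives more coefficients to exactly zero:

        import numpy as np
        from sklearn.linear_model import Lasso
        from sklearn.preprocessing import StandardScaler

        # Synthetic data: only the first two features influence the target (illustrative only)
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))
        y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, 200)

        X_scaled = StandardScaler().fit_transform(X)

        # Stronger L1 penalty -> more coefficients forced to exactly zero
        for alpha in [0.01, 0.1, 1.0]:
            lasso = Lasso(alpha=alpha)
            lasso.fit(X_scaled, y)
            print(alpha, np.round(lasso.coef_, 3))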

    K-Nearest Neighbors (KNN) Regression

    Non-parametric, instance-based algorithm that predicts the value of a target variable by averaging the values of the 𝑘-closest data points (neighbors) to a given input point

    Doesn’t assume any underlying distribution of the data or a linear relationship between variables (flexible choice, especially for non-linear datasets)

    Used when the data has local patterns and you need a non-parametric model, works well with small datasets but struggles with large datasets

    • \[ d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x_i')^2} \] : Euclidean Distance between two points

    • \[ d(x, x') = \sum_{i=1}^{n} |x_i - x_i'| \] : Manhattan Distance

    • \[ \hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i \] : Prediction for KNN Regression as the average of the target values of the \( k \) nearest neighbors

    • \[ \hat{y} = \frac{\sum_{i=1}^{k} \frac{y_i}{d(x, x_i)}}{\sum_{i=1}^{k} \frac{1}{d(x, x_i)}} \] : Weighted KNN prediction

    • Key Concepts
      • relies on measuring the distance between data points to identify neighbors, commonly using the Euclidean distance
      • Manhattan, Minkowski, or Mahalanobis can also be used depending on the nature of the data and dimensionality
      • parameter 𝑘 represents the number of closest points used to make predictions, smaller 𝑘 values make the model sensitive to noise (high variance), while larger 𝑘 values lead to smoother predictions but can underfit (high bias)
      • typically 𝑘 is chosen through cross-validation, where the optimal value minimizes the prediction error on a validation set
      • for a given input x, KNN regression identifies the 𝑘-nearest neighbors, retrieves their target values, and averages them to produce the prediction; averaging over several neighbors makes the prediction less sensitive to any single noisy data point (see the sketch after the attribute list below)
      • in weighted KNN neighbors closer to the query point have a higher influence on the prediction (particularly useful when neighbors vary significantly in their distance from the query point)
    • Applications
      • House Rent Estimation: Estimating rent based on nearby properties with similar characteristics
      • User Rating Prediction: Predicting product ratings by aggregating the ratings of similar users
      • Restaurant Recommendation: Predicting restaurant ratings for a user based on their similarity to other users' preferences
    • Advantages
      • Simplicity: KNN regression is straightforward to understand and implement
      • Non-Parametric: It doesn’t assume a specific form for the data distribution, which makes it adaptable to different patterns and useful for non-linear data
      • Interpretability: Each prediction can be explained by the nearest neighbors, which can provide insights into the relationships within the data
    • Disadvantages
      • Computationally Intensive: KNN regression requires calculating the distance from the query point to all other points, making it slow for large datasets
      • Sensitivity to Feature Scaling: Since KNN relies on distance calculations, feature scaling (e.g., normalization or standardization) is crucial for ensuring all features contribute equally to the distance metric
      • High Variance: The model’s performance depends heavily on the choice of 𝑘 and can fluctuate with small changes in the training data
    • Example in Python
    • 
      from sklearn.neighbors import KNeighborsRegressor
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import mean_squared_error
      import numpy as np
      
      # Generate or load sample data
      X = np.arange(1, 21).reshape(-1, 1)  # example feature (e.g., month)
      y = np.random.normal(50, 10, 20)     # example target (e.g., sales)
      
      # Split data into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Scale the features
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
      
      # Create and fit the KNN Regressor model
      knn = KNeighborsRegressor(n_neighbors=3, weights='distance')  # using weighted KNN
      knn.fit(X_train_scaled, y_train)
      
      # Predict and evaluate the model
      y_pred = knn.predict(X_test_scaled)
      mse = mean_squared_error(y_test, y_pred)
      print("Mean Squared Error:", mse)
                          

      neighbors.KNeighborsRegressor - regression based on k-nearest neighbors

      • n_neighbors - number of neighbors to use by default for kneighbors queries
      • weights - weight function used in prediction (uniform, distance or callable)
      • fit(X, y) - fit k-nearest neighbors regression model, X: training data; y: target values
      • predict(X) - predict the target for the provided data
      • score(X, y) - return the coefficient of determination R2 of the prediction
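
      A minimal sketch of the mechanics described under Key Concepts (compute distances to every training point, take the k closest, average their targets), using hypothetical numbers:

        import numpy as np

        # Hypothetical training data: 1 feature, 6 observations
        X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
        y_train = np.array([10.0, 12.0, 15.0, 21.0, 24.0, 30.0])

        def knn_predict(x_query, k=3):
            # Euclidean distances from the query point to every training point
            distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
            # Indices of the k nearest neighbors
            nearest = np.argsort(distances)[:k]
            # Unweighted KNN regression: average of the neighbors' targets
            return y_train[nearest].mean()

        print(knn_predict(np.array([3.4])))  # averages the targets of the 3 closest points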

    Support Vector Regression (SVR)

    Type of Support Vector Machine (SVM) adapted for regression tasks, where the goal is to predict continuous values rather than classify data points

    Attempts to find a function that fits the data within a specified margin of error (epsilon-insensitive margin), where the model disregards errors that fall within a distance 𝜖 of the true values

    • \[ L(y, f(x)) = \begin{cases} 0 & \text{if } |y - f(x)| \leq \epsilon \\ |y - f(x)| - \epsilon & \text{otherwise} \end{cases} \] : Epsilon-Insensitive Loss Function

    • \[ \min \frac{1}{2} ||w||^2 + C \sum_{i=1}^{n} \max(0, |y_i - f(x_i)| - \epsilon) \] : Objective function for SVR, balancing flatness and error minimization

    • \[ ||w||^2 \] : controls the flatness of the regression line

    • \[ C \] : regularization parameter that balances margin maximization and error minimization

    • Key Concepts
      • a margin 𝜖 is set around the predicted function; errors within this margin are not penalized, making the model “insensitive” to small deviations, and only points outside the margin contribute to the error calculation (see the loss-function sketch after the attribute list below)
      • goal of SVR is to find a function that minimizes a combined objective: maximizing the margin while keeping errors outside the epsilon margin small
      • can handle non-linear relationships by using the kernel trick, allowing it to project data into higher-dimensional spaces where linear separability (or linear regression) is easier to achieve
      • support vectors are the data points that lie outside the epsilon margin, influencing the position and orientation of the regression line (only these points affect the SVR model)
      • regularization parameter 𝐶 controls the trade-off between maximizing the margin and minimizing the error beyond 𝜖 (a high 𝐶 assigns a larger penalty to errors, which can lead to overfitting)
    • Applications
      • House Rent Estimation: Estimating rent from property characteristics
      • User Rating Prediction: Predicting product ratings from user and product features
      • Restaurant Recommendation: Predicting a user's restaurant ratings from their preference profile
    • Advantages
      • Robust to Outliers: SVR is robust to outliers because of its epsilon-insensitive margin, which can ignore minor deviations within the 𝜖-margin
      • Effective in High-Dimensional Spaces: SVR performs well in high-dimensional data, making it suitable for complex and sparse datasets
      • Non-Linear Capabilities: With kernels, SVR can capture non-linear patterns effectively, offering flexibility across different types of data
    • Example in Python
    • 
      from sklearn.svm import SVR
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import mean_squared_error
      import numpy as np
      
      # Generate or load data
      # X = features, y = target
      X = np.arange(1, 21).reshape(-1, 1)  # example feature (e.g., months)
      y = np.random.normal(50, 10, 20)     # example target (e.g., sales)
      
      # Split data
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Standardize data (SVR often performs better with scaled data)
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
      
      # Fit SVR model with RBF kernel
      svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
      svr.fit(X_train_scaled, y_train)
      
      # Predict and evaluate
      y_pred = svr.predict(X_test_scaled)
      mse = mean_squared_error(y_test, y_pred)
      print("Mean Squared Error:", mse)
                          

      sklearn.svm.SVR - epsilon-support vector regression

      • kernel - specifies the kernel type to be used in the algorithm (linear, poly, rbf, sigmoid, precomputed)
      • C - regularization parameter
      • epsilon - specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value
      • fit(X, y) - fit the SVM model according to the given training data, X: training vectors; y: target values
      • predict(X) - perform regression on samples in X
      • score(X, y) - return the coefficient of determination R2 of the prediction
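
      The epsilon-insensitive loss defined above is straightforward to write out; a minimal sketch with made-up values:

        import numpy as np

        def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
            # Zero loss inside the epsilon-tube, linear loss outside it
            return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

        y_true = np.array([3.0, 5.0, 7.0])
        y_pred = np.array([3.05, 5.5, 6.0])
        print(epsilon_insensitive_loss(y_true, y_pred))  # [0.0, 0.4, 0.9]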

    Logistic Regression

    Statistical and machine learning method commonly used for binary classification problems, where the goal is to predict the probability of a binary outcome (e.g., yes/no, true/false, 0/1) based on one or more predictor variables

    Logistic regression is fundamentally a classification algorithm that applies a logistic (or sigmoid) function to estimate probabilities

    • \[ \sigma(z) = \frac{1}{1 + e^{-z}} \] : logistic or sigmoid function is used to map any real-valued number to a probability between 0 and 1

    • \[ z = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p \] : linear combination for z

    • \[ P(y = 1 | X) = \sigma(z) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p)}} \] : The probability of the positive class (class 1)

    • \[ P(y = 0 | X) = 1 - \sigma(z) = \frac{e^{-(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p)}}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p)}} \] : The probability of the negative class (class 0)

    • \[ y = \begin{cases} 1 & \text{if } P(y=1 | X) \geq 0.5 \\ 0 & \text{if } P(y=1 | X) < 0.5 \end{cases} \] : decision threshold (commonly 0.5) to classify the output

    • \[ \text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right] \] : Log Loss (Binary Cross-Entropy Loss) cost function

    • \[ \hat{w} = \arg \min_{w} -\sum_{i=1}^{n} \left[ y_i \cdot \log(\sigma(z_i)) + (1 - y_i) \cdot \log(1 - \sigma(z_i)) \right] \] : Maximum Likelihood Estimation - the objective of logistic regression is to maximize the likelihood of the observed data, often by minimizing the negative log likelihood

    • Key Concepts
      • logistic function, or sigmoid function maps any real-valued number to a range between 0 and 1, which can then be interpreted as a probability
      • 𝑧 is a linear combination of input features, where 𝑤 represents the model’s coefficients and 𝑥 the input features
      • sigmoid function transforms this linear output to a probability score, making it suitable for binary classification
      • outputs a probability score between 0 and 1, a threshold of 0.5 is used to classify the output: values greater than or equal to 0.5 are classified as one class (often labeled as 1)
      • optimized using a loss function known as log loss or binary cross-entropy, which measures the difference between the predicted probability and the actual label
      • logistic regression coefficients are estimated using maximum likelihood estimation, which adjusts the weights so that the predicted probabilities align as closely as possible with the observed labels
      • assumes a linear relationship between the independent variables and the log-odds of the outcome
      • each observation should be independent of others, as logistic regression does not handle correlated data well
      • highly correlated features can affect the model's stability, so feature selection or dimensionality reduction techniques (e.g., PCA) are often applied beforehand
    • Advantages
      • Simple and Interpretable: Logistic regression is easy to interpret and provides straightforward probabilistic outputs
      • Efficient for Binary Classification: It is computationally efficient and performs well on linearly separable datasets
      • Works Well with Small Datasets: Logistic regression can perform well with relatively small datasets and is less prone to overfitting compared to more complex models when regularization is used
    • Disadvantages
      • Linear Decision Boundary: Logistic regression works best for linear decision boundaries. For more complex boundaries, it may underperform unless features are transformed or nonlinear techniques are used
      • Sensitive to Outliers: Logistic regression can be sensitive to outliers, particularly if regularization is not applied
      • Limited to Binary Classification: Standard logistic regression is inherently binary, although it can be extended to multiclass classification using techniques such as One-vs-Rest (OvR) and Softmax Regression
    • Example in Python
    • 
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score, confusion_matrix
      
      # Generate or load example data
      X = [[1.1], [2.5], [3.3], [4.5], [5.1], [6.2], [7.4]]  # example features
      y = [0, 0, 1, 1, 0, 1, 1]  # binary target
      
      # Split data into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Initialize and train the model
      model = LogisticRegression()
      model.fit(X_train, y_train)
      
      # Make predictions
      y_pred = model.predict(X_test)
      
      # Evaluate the model
      accuracy = accuracy_score(y_test, y_pred)
      conf_matrix = confusion_matrix(y_test, y_pred)
      print("Accuracy:", accuracy)
      print("Confusion Matrix:\n", conf_matrix)
                          
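
    The sigmoid and log-loss formulas above can also be computed by hand; a minimal sketch with made-up scores and labels:

      import numpy as np

      def sigmoid(z):
          # Maps any real number to a probability in (0, 1)
          return 1.0 / (1.0 + np.exp(-z))

      # Hypothetical linear scores z = w0 + w1*x for four observations
      z = np.array([-2.0, -0.5, 0.5, 2.0])
      p = sigmoid(z)                      # predicted probabilities P(y=1 | X)
      y = np.array([0, 0, 1, 1])          # actual labels

      # Binary cross-entropy (log loss) from the formula above
      log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
      print(p)
      print(log_loss)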

    Classification (Supervised Learning)

    A supervised learning approach that determines the class label for an unlabeled test case.

    Categorizing some unknown items into a discrete set of categories or "classes"

    The target attribute is a categorical or discrete variable

    • Application:
      • Which category a customer belongs to
      • Whether a customer switches to another provider/brand
      • Whether a customer responds to a particular advertising campaign
    • Evaluation:
      • Accuracy - The proportion of correctly classified instances out of the total instances. It's a simple and intuitive metric but can be misleading with imbalanced datasets
      • Precision - Precision quantifies the proportion of accurately predicted positive instances among all instances classified as positive, showcasing the model's capacity to identify true positives correctly
      • Recall (Sensitivity) - The ratio of correctly predicted positive observations to all actual positives. It measures the model's ability to find all the positive instances
      • F1 Score - The harmonic mean of precision and recall. It balances precision and recall, especially when dealing with imbalanced datasets
      • ROC Curve and AUC - ROC curves visualize the trade-off between true positive rate (TPR) and false positive rate (FPR) at various thresholds. AUC summarizes the ROC curve into a single value, indicating the model's ability to discriminate between positive and negative classes
      • Confusion Matrix - A table summarizing the number of correct and incorrect predictions provides insights into the model's performance across different classes
      • Classification Report - It comprehensively summarizes various classification metrics for each class, such as precision, recall, F1 score, and support
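
    A minimal sketch (hypothetical labels and probabilities) of how these evaluation metrics are typically computed with scikit-learn:

      from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                   f1_score, confusion_matrix, classification_report,
                                   roc_auc_score)

      # Hypothetical true labels, predicted labels and predicted probabilities
      y_true = [0, 0, 1, 1, 1, 0, 1, 0]
      y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
      y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]

      print("Accuracy:", accuracy_score(y_true, y_pred))
      print("Precision:", precision_score(y_true, y_pred))
      print("Recall:", recall_score(y_true, y_pred))
      print("F1 Score:", f1_score(y_true, y_pred))
      print("ROC AUC:", roc_auc_score(y_true, y_prob))
      print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
      print("Classification Report:\n", classification_report(y_true, y_pred))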