AI - Artificial Intelligence

Artificial Intelligence Basics

Simulation of human intelligence in machines that are programmed to think, learn, and make decisions autonomously

These systems are designed to mimic human cognitive functions like problem-solving, language understanding, learning from experience, and pattern recognition

Branch of computer science dealing with the simulation of intelligent behavior in computers

History of Artificial Intelligence

  • 1940s-1950s: The Birth of AI Concepts
    • 1943: McCulloch & Pitts Model: Warren McCulloch and Walter Pitts created a mathematical model for neural networks, laying the foundation for AI concepts
    • 1950: Turing Test: Alan Turing proposed the Turing Test, a criterion to determine if a machine can exhibit intelligent behavior indistinguishable from that of a human
    • 1956: Dartmouth Conference: John McCarthy coined the term "Artificial Intelligence," marking the formal beginning of AI as a field of study
    • 1957: Frank Rosenblatt built the Mark 1 Perceptron, the first computer based on a neural network that 'learned' through trial and error
    • 1959: Arthur Samuel published an algorithm for a checkers program using machine learning
  • 1960s-1970s: The Rise of Early AI Systems
    • 1961: First Industrial Robot: Unimate, the first industrial robot, was introduced, automating assembly line tasks at General Motors
    • 1966: ELIZA: Joseph Weizenbaum developed ELIZA, one of the first chatbots that could mimic human conversation, albeit in a limited way
    • 1969: Shakey the Robot: Shakey was developed by SRI International, capable of planning, reasoning, and problem-solving in simple environments
  • 1980s: AI Winter and Expert Systems
    • 1980s: Expert Systems Boom: Systems like XCON by Digital Equipment Corporation were used commercially, using rules-based logic to simulate expert-level decision-making in specific fields
    • 1980s: Neural networks that use the backpropagation algorithm to train themselves become widely used in AI applications
    • Late 1980s: AI Winter: Hype and unrealistic expectations led to reduced funding and interest in AI research due to limited progress and the failure of early systems
  • 1990s: Machine Learning and Data-Driven AI
    • 1997: Deep Blue Defeats Kasparov: IBM’s Deep Blue defeated world chess champion Garry Kasparov, showcasing AI's potential in complex problem-solving
    • 1990s: Introduction of Machine Learning: The focus shifted to machine learning, leveraging statistical techniques and large datasets to improve AI capabilities
  • 2000s: Rise of Data and Neural Networks
    • 2006: Deep Learning Renaissance: Geoffrey Hinton popularized deep learning, leading to significant advancements in neural networks and AI capabilities
    • 2009: The ImageNet database of human-tagged images is presented at the CVPR conference
  • 2010s: AI in Everyday Applications
    • 2011: IBM Watson: Watson won the quiz show "Jeopardy!" against human champions, demonstrating the power of natural language processing
    • 2012: ImageNet Competition: AlexNet, a deep convolutional neural network, drastically improved image recognition accuracy, marking a breakthrough in computer vision
    • 2014: Generative Adversarial Networks (GANs): Ian Goodfellow introduced GANs, which allowed AI to generate new, realistic data, such as images and music
    • 2016: AlphaGo Defeats Lee Sedol: Google DeepMind’s AlphaGo defeated world champion Go player Lee Sedol, showcasing AI’s strategic thinking capabilities
    • 2018: Waymo launches a commercial self-driving car service in the suburbs of Phoenix
    • 2019: IBM Project Debater holds a full debate, including rebuttals, with a champion human debater
  • 2020s: AI in the Real World
    • 2020: GPT-3 by OpenAI: A significant leap in natural language processing, GPT-3 demonstrated the ability to generate human-like text, improving chatbots, translation, and content creation
    • 2022: DALL-E and Stable Diffusion: AI models like DALL-E and Stable Diffusion allowed for the creation of detailed images from text prompts, revolutionizing digital art and design

Types of Artificial Intelligence

  • Based on Functionalities:
    • Purely Reactive: These AI machines have no memory or past data to work with and specialize in a single field of work, such as playing chess
    • Limited Memory: These AI machines collect previous data and continue adding it to their memory. They have enough memory or experience to make proper decisions, but memory is minimal (suggesting a restaurant in the neighborhood)
      • Reinforcement learning, which learns to make better predictions through repeated trial-and-error
      • Long Short Term Memory (LSTM), which utilizes past data to help predict the next item in a sequence. LSTMs view more recent information as most important when making predictions and discount data from further in the past, though still utilizing it to form conclusions
      • Evolutionary Generative Adversarial Networks (E-GAN), which evolves over time, growing to explore slightly modified paths based on previous experiences with every new decision. This model is constantly in pursuit of a better path and utilizes simulations and statistics, or chance, to predict outcomes throughout its evolutionary mutation cycle
    • Theory of Mind: A projected future type of AI that would understand emotions and thoughts in order to interact socially
    • Self-Aware: Another projected AI machine with self-awareness capabilities to make conscious decisions
  • Based on Capabilities:
    • Weak (narrow) AI (ANI) - AI systems that are designed to perform specific tasks and are limited to those tasks only. These AI systems excel at their designated functions but lack general intelligence. Weak AI operates within predefined boundaries and cannot generalize beyond its specialized domain (voice assistants like Siri or Alexa, recommendation algorithms, image recognition systems, self-driving cars)
    • Strong (general) AI (AGI) - AI systems that possess human-level intelligence or even surpass human intelligence across a wide range of tasks. Strong AI would be capable of understanding, reasoning, learning, and applying knowledge to solve complex problems in a manner similar to human cognition
    • Artificial Superintelligence (ASI): Hypothetical AI surpassing human intelligence in all aspects, potentially capable of solving complex problems and making advancements beyond human comprehension
  • Based on Technologies:
    • Machine Learning (ML) - AI systems capable of self-improvement through experience, without direct programming; concentrate on creating software that can independently learn by accessing and utilizing data
    • Deep Learning - A subset of ML involving many layers of neural networks; used for learning from large amounts of data and is the technology behind voice control in consumer devices, image recognition, and many other applications
    • Natural Language Processing (NLP) - Enables machines to understand and interpret human language; used in chatbots, translation services, and sentiment analysis applications
    • Robotics - Designing, constructing, operating, and using robots and computer systems for controlling them, sensory feedback, and information processing
    • Computer Vision - Allows machines to interpret the world visually, and it's used in various applications such as medical image analysis, surveillance, and manufacturing
    • Expert Systems - Answer questions and solve problems in a specific domain of expertise using rule-based systems

Applications of Artificial Intelligence

  • Natural Language Processing (NLP) - speech recognition, machine translation, sentiment analysis, and virtual assistants like Siri and Alexa
  • Image and Video Analysis - facial recognition, object detection and tracking, content moderation, medical imaging, and autonomous vehicles
  • Robotics and Automation - manufacturing, healthcare, logistics, and exploration
  • Recommendation Systems - e-commerce, streaming platforms, and social media to personalize user experiences
  • Financial Services - fraud detection, algorithmic trading, credit scoring, and risk assessment
  • Healthcare - disease diagnosis, medical imaging analysis, drug discovery, personalized medicine, and patient monitoring
  • Virtual Assistants and Chatbots - customer support, information retrieval, and personalized assistance
  • Gaming - creating realistic virtual characters, opponent behavior, and intelligent decision-making
  • Smart Homes and IoT - smart home systems that can automate tasks, control devices, and learn from user preferences
  • Cybersecurity - detecting and preventing cyber threats by analyzing network traffic, identifying anomalies, and predicting potential attacks

ML - Machine Learning

ML Basics

Study of programs that are not explicitly programmed; instead, these algorithms learn patterns from data

Application of AI that provides systems the ability to learn on their own and improve from experiences without being programmed externally

  • Factors that have contributed to the current state of Machine Learning are:
    • bigger data sets
    • faster computers
    • open source packages
    • wide range of neural network architectures
  • Machine Learning Workflow:
    • Problem definition - Define the problem, goals, and the expected outcome. What are we trying to predict or classify, what is the metric for success?
    • Data collection - Identify and gather relevant data from sources like databases, APIs, third-party sources, or web scraping. Machines initially learn from the data that you give them, so it is of the utmost importance to collect reliable data so that your machine learning model can find the correct patterns. The quality of the data that you feed to the machine will determine how accurate your model is.
    • Data preprocessing and exploration
      • Putting together all the data you have and randomizing it
      • Cleaning the data - handling unwanted data, missing values, unneeded rows and columns, duplicate values, and data type conversion
      • Transformation - convert data into a suitable format. This may involve normalization, scaling, encoding categorical variables, and creating new features
      • Feature Engineering - create new features or select the most relevant features for the model. Techniques may include binning, polynomial features, or aggregations
      • Understand data patterns, distributions, and relationships
      • Visualize the data to understand how it is structured and understand the relationship between various variables and classes present
      • Statistical Analysis - calculate basic statistics (mean, median, standard deviation) and investigate correlations or outliers
      • Insights - identify patterns that could impact model performance, such as class imbalance, outliers, or multicollinearity
    • Modeling - A machine learning model determines the output you get after running a machine learning algorithm on the collected data. It is important to choose a model which is relevant to the task at hand. Training is the most important step in machine learning. In training, you pass the prepared data to your machine learning model to find patterns and make predictions.
      • Choose a suitable algorithm based on the problem type, test several algorithms on a subset of the data to see which ones might yield the best performance
      • Splitting Data - divide data into training (the set your model learns from), validation, and test sets (used to check the accuracy of your model after training) to avoid overfitting and ensure generalization
      • Hyperparameter Tuning - adjust parameters to optimize model performance, often using techniques like grid search or random search
      • Training - feed the training set into the model and adjust parameters to minimize error. Various training methods like batch, mini-batch, or online learning may be used
    • Evaluation
      • Validation - Testing the performance of the model on previously unseen data. The unseen data used is the testing set that you split your data into earlier
      • Metrics Selection - Use appropriate evaluation metrics based on the problem type (e.g., accuracy, precision, recall, F1-score for classification; RMSE, MAE for regression)
      • Error Analysis - Identify where the model is underperforming and analyze misclassified instances or high-error predictions
    • Decision Making and Deployment
  • Common taxonomy:
    • target - category or value you are trying to predict
    • features - explanatory variables used for prediction
    • example - an observation or single data point within the data
    • label - the value of the target for a single data point
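  • A minimal pandas sketch of this taxonomy (the column names and values below are made up for illustration):

    import pandas as pd
    
    # Each row of the DataFrame is an example (a single observation)
    data_df = pd.DataFrame({
        "sqft": [1400, 2000, 850],           # feature (explanatory variable)
        "bedrooms": [3, 4, 2],               # feature
        "price": [240000, 355000, 150000],   # target (the value we try to predict)
    })
    
    X = data_df[["sqft", "bedrooms"]]  # features
    y = data_df["price"]               # targets; y.iloc[0] is the label of the first example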

Types of Machine Learning

  • Supervised learning - used to predict data. Method in which a model learns from a labeled dataset containing input-output pairs. Each input in the dataset has a corresponding correct output (the label), and the model's task is to learn the relationship between the inputs and outputs. This enables the model to make predictions on new, unseen data by applying the learned mapping.
    • Categories:
      • Regression - The model anticipates a continuous value or quantity
      • Classification - The model predicts a discrete class label or category
    • Algorithms:
      • Linear Regression: used for predicting a continuous target variable, models the relationship between input features and the output by fitting a linear equation
      • Logistic Regression: used for binary classification tasks, estimates probabilities using the logistic function and applies a threshold for classification
      • K-Nearest Neighbors (KNN): classifies data points based on their proximity to neighboring points, often used for both classification and regression
      • Support Vector Machines (SVM): finds the hyperplane that maximizes the margin between different classes, effective for classification, especially in high-dimensional spaces
      • Decision Trees: tree-like model where each node represents a feature, and the branches represent decision rules, works for both classification and regression
      • Random Forests: ensemble of decision trees, random forests reduce overfitting by averaging multiple decision trees
      • Gradient Boosting Machines (GBMs): ensemble technique that builds trees sequentially, where each tree corrects the errors of the previous one, popular variations include XGBoost and LightGBM
    • Applications:
      • Healthcare: Used to predict patient diagnoses based on symptoms and past medical history
      • Finance: For credit scoring and predicting stock prices
      • Retail: To forecast sales, recommend products, and personalize marketing
      • Autonomous Vehicles: These are used to recognize traffic signs and pedestrians
      • Speech Recognition: In virtual assistants and transcription services
    • Advantages:
      • Effectiveness: Supervised learning can predict outcomes based on past data
      • Simplicity: It's relatively easy to understand and implement
      • Performance Evaluation: It is easy to measure the performance of a supervised learning model since the ground truth (labels) is known
      • Applications: Can be used in various fields like finance, healthcare, marketing, etc
      • Feature Importance: It allows an understanding of which features are most important in making predictions
    • Disadvantages:
      • Dependency on Labeled Data: Supervised learning requires a large amount of labeled data, which can be expensive and time-consuming
      • Overfitting: Models can become too complex and fit the noise in the training data rather than the actual signal, which degrades their performance on new data
      • Generalization: Sometimes, these models do not generalize well to unseen data if the data they were trained on does not represent the broader context
  • Unsupervised learning - used to find hidden patterns or structures in data. The training data is unlabeled - no target values or annotations are provided. This data is fed to the Machine Learning algorithm and is used to train the model. The trained model tries to search for a pattern and give the desired response.
    • Categories:
      • Clustering - Data is grouped into subsets (clusters) such that data in each cluster are more similar than those in others
      • Association - Discovering rules that capture interesting relationships between variables in large databases (e.g., market basket analysis)
      • Dimensionality Reduction - Reducing the number of random variables under consideration (e.g., PCA, t-SNE), which helps to simplify the data without losing important information
    • Algorithms:
      • K-Means Clustering: Groups data into K clusters by minimizing the variance within each cluster
      • Hierarchical Clustering: builds a hierarchy of clusters either by agglomerative (bottom-up) or divisive (top-down) approaches
      • Principal Component Analysis (PCA): reduces the dimensionality of data by transforming it into a set of uncorrelated variables called principal components, useful for data compression and visualization
      • Association Rule Learning: identifies relationships or associations between variables in large datasets, commonly used in market basket analysis, algorithms include Apriori and Eclat
      • Gaussian Mixture Models (GMM): a probabilistic approach to clustering, where data points are assumed to belong to a mixture of Gaussian distributions
    • Applications:
      • Customer Segmentation: Businesses use clustering to segment customers based on behaviors and preferences for targeted marketing
      • Anomaly Detection: Identifying unusual data points can be critical in fraud detection or network security
      • Recommendation Systems: Associative models help build recommendation systems that suggest products based on user behavior
      • Feature Extraction: Used in preprocessing steps to extract new features from raw data which can improve the accuracy of predictive models
      • Image Segmentation: Applied in computer vision to divide an image into meaningful segments and analyze each segment individually
    • Advantages:
      • Discovering Hidden Patterns: It can identify patterns and relationships in data that are not initially evident
      • No Need for Labelled Data: Works with unlabeled data, making it useful where obtaining labels is expensive or impractical
      • Reduction of Complexity in Data: Helps reduce the dimensionality of data, making complex data more comprehensible
      • Feature Discovery: This can be used to find useful features that can improve the performance of supervised learning algorithms
      • Flexibility: Can handle changes in input data or the environment since it doesn’t rely on predefined labels
    • Disadvantages:
      • Interpretation of Results: The results can be ambiguous and harder to interpret than those from supervised learning models
      • Dependency on Input Data: The output quality heavily depends on the quality of the input data
      • Lack of Precise Objectives: Without specific tasks like prediction or classification, the direction of learning is less focused, leading to less actionable insights
  • Reinforcement learning - learns from its mistakes and experiences. The algorithm discovers data through a process of trial and error and then decides what action results in higher rewards. Three major components make up reinforcement learning: the agent is the learner or decision-maker, the environment includes everything that the agent interacts with, and the actions are what the agent does.
    • Categories:
      • Model-based RL - The agent builds a model of the environment and uses it to predict future rewards and states. This allows the agent to plan by considering potential future situations before taking action
      • Model-free RL - The agent learns to act without explicitly constructing a model of the environment. It directly learns the value of actions or action policies from experience, using methods like Q-learning or policy gradients
      • Partially Observable RL - The agent doesn't have access to the full state of the environment. The agent must learn to make decisions based on incomplete information, often using strategies that involve maintaining internal state estimates
    • Algorithms:
      • Q-Learning: a value-based method where the agent learns a policy that tells it the best action to take in each state, aiming to maximize cumulative reward
      • Deep Q Networks (DQN): combines Q-Learning with deep neural networks, enabling it to handle more complex, high-dimensional state spaces
      • Policy Gradient Methods: optimizes the agent's policy directly by maximizing expected rewards, algorithms include REINFORCE and Proximal Policy Optimization (PPO)
    • Applications:
      • Autonomous Vehicles: RL is used to develop autonomous driving systems, helping vehicles learn to navigate complex traffic environments safely
      • Robotics: RL enables robots to learn complex tasks like walking, picking up and manipulating objects, and interacting with humans and other robots in a dynamic environment
      • Gaming: In the gaming industry, RL is used to develop AI that can challenge human players, adapt to their strategies, and provide engaging gameplay
      • Finance: RL can be applied to trading and investment strategies where the algorithm learns to make buying and selling decisions to maximize financial return
      • Healthcare: RL algorithms are being explored for various applications in healthcare, including personalized treatment recommendation systems and management of healthcare logistics
    • Advantages:
      • Adaptability: RL agents can adapt to new environments or changes within their environment, making them suitable for dynamic and uncertain situations
      • Decision-Making Autonomy: RL agents make decisions based on learned experiences rather than pre-defined rules, which can be advantageous in complex environments where manual behavior specification is impractical
      • Continuous Learning: Since the learning process is continuous, RL agents can improve their performance over time as they gain more experience
      • Handling Complexity: RL can handle problems with high complexity and numerous possible states and actions, which might be infeasible for traditional algorithms
      • Optimization: RL is geared towards optimization of the decision-making process, aiming to find the best sequence of actions for any given situation
    • Disadvantages:
      • Dependency on Reward Design: The effectiveness of an RL agent is heavily dependent on the design of the reward system. Poorly designed rewards can lead to unwanted behaviors
      • High Computational Cost: Training RL models often requires significant computational resources and time, especially as the complexity of the environment increases
      • Sample Inefficiency: RL algorithms typically require many interactions with the environment to learn effective policies, which can be impractical in real-world scenarios where each interaction could be costly or time-consuming
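To make the first two paradigms concrete, here is a minimal scikit-learn sketch contrasting supervised and unsupervised learning on synthetic data (reinforcement learning is omitted because it requires an interactive environment):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans
    
    # Synthetic data: X holds the features, y holds the labels
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    
    # Supervised: learn the mapping from inputs X to known labels y
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print("training accuracy:", clf.score(X, y))
    
    # Unsupervised: ignore the labels and look for structure in X alone
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster assignments (first 10 points):", km.labels_[:10])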

Assessing Performance

Error Types

  • Training error
    • Error that a machine learning model makes on the same dataset it was trained on
    • Measures how well the model has fit the training data by calculating the discrepancy between the predicted values and the actual values in this dataset
    • Training error is overly optimistic
    • A small training error does not imply good predictions unless the training data includes everything the model might ever see
    • A model with a very low training error may be overfitting, meaning it has "memorized" the training data rather than learning a general pattern
    • Overfitting often leads to low training error but high test error (the error on unseen data), indicating poor generalization.
    • Training error is typically lower than test error, since the model is optimized directly on the training data
    • High bias (underfitting) will result in high training error, whereas high variance (overfitting) can produce low training error but high test error
  • Generalization error
    • Measure of how accurately a machine learning model predicts outcomes for new, unseen data
    • Reflects the model's ability to generalize the patterns it learned during training and apply them to data that wasn't part of the training set
    • Lower generalization error indicates that a model has learned well from its training data and can make reliable predictions on new data
    • Often measured using the model’s performance on a test or validation dataset, which the model hasn’t seen during training
    • Arises from three main sources: bias (error due to overly simplistic assumptions in the learning algorithm), variance (error due to excessive sensitivity to small fluctuations in the training set) and irreducible error (natural noise in the data that cannot be reduced by any model)
    • High generalization error is often a result of overfitting, where the model performs well on the training data but poorly on the test data, indicating it has learned noise rather than true patterns
    • Underfitting, where both training and test errors are high, also leads to a high generalization error since the model fails to capture the patterns in the data
    • Techniques to minimize generalization error include cross-validation, regularization, using simpler models, and gathering more training data
  • Test error
    • Error that a machine learning model makes on a test dataset, which is a set of data that the model has never seen during training
    • Estimate of the model’s ability to generalize to new, unseen data and provides a measure of how well the model will likely perform in real-world applications
    • Low test error indicates that the model has learned the underlying patterns well without overfitting or underfitting the training data
    • Significant difference between training and test errors indicates potential overfitting (if test error is much higher) or underfitting (if both errors are high)
    • Test error should be close to the training error to show good generalization
    • Low test error means the model generalizes well, while a high test error may imply poor generalization and the need for model adjustments
    • Methods to reduce test error and improve generalization include cross-validation, regularization, and tuning model complexity (such as adjusting polynomial degrees or pruning decision trees)
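  • A minimal sketch of the gap between training and test error on synthetic data; an unconstrained decision tree is used here because it can memorize the training set, which makes the gap easy to see:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error
    
    X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    # An unconstrained tree can fit the training data almost perfectly (overfitting)
    model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print("training MSE:", train_mse)  # near zero: overly optimistic
    print("test MSE:", test_mse)       # estimate of the generalization error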

Error Sources

  • Noise
    • Refers to random, unpredictable variations or errors in data that do not represent the true underlying patterns or signals the model aims to capture
    • Can arise from many sources, including measurement errors, data entry mistakes, external environmental factors, or natural randomness
    • Noise can cause a model to learn irrelevant details in the training data (overfitting), reducing its ability to generalize to new data
    • Choosing models that are too complex can cause them to be overly sensitive to noise, while simpler models may ignore noise but also miss capturing genuine patterns
    • Measurement Noise: errors in data collection, such as imprecise sensors or human errors in data entry
    • Environmental Noise: external, uncontrollable factors that influence the data, like fluctuating market conditions affecting sales data
    • Irreducible Noise: the portion of noise that cannot be eliminated, even with an ideal model, often due to inherent randomness in the phenomenon being studied
  • Bias
    • Refers to the error introduced by approximating a real-world problem, which may have complex relationships, with a simpler model
    • Measures how far off the predictions are from the actual values on average
    • Model with high bias makes strong assumptions about the data and is too simple to capture the underlying pattern accurately
    • High Bias → Underfitting: the model is too simplistic, leading to poor performance both on the training data and on new (test) data, fails to capture the true relationship
  • Variance
    • Measures how sensitive the model is to fluctuations in the training data
    • The average squared difference between the predictions of each model you train (on different training samples) and the average prediction across those models
    • High variance models adapt too closely to the training data and may capture noise along with the true pattern
    • Model with high variance is typically too complex and overly tuned to the training data
    • High Variance → Overfitting: the model performs well on the training data but fails to generalize to new data, resulting in poor test performance
  • Bias-variance tradeoff refers to the tradeoff between two sources of error that affect the performance of a model: bias and variance
  • \[ \text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]

    • \[\text{Bias}^2\] : error from incorrect model assumptions

    • Variance : error from sensitivity to small fluctuations in the training data

    • Irreducible Error : random noise in the data that cannot be eliminated

    • The key challenge is finding the optimal balance between bias and variance to minimize the total error
    • If the model is too simple (high bias), it will miss relevant patterns in the data (underfit)
    • If the model is too complex (high variance), it will fit the training data too well, capturing noise and performing poorly on unseen data (overfit)
    • Strategies to Address Bias-Variance Tradeoff
      • Cross-Validation: Use techniques like k-fold cross-validation to assess model performance on different subsets of the data
      • Regularization (Ridge/Lasso): Introduce penalty terms in the objective function to reduce model complexity (variance) while retaining essential patterns
      • Feature Engineering: Carefully select features to prevent overfitting and reduce bias
      • Ensemble Models: Use models like bagging or boosting (e.g., random forests, gradient boosting) to reduce variance
      • Early Stopping: In iterative models like neural networks, stop training when performance on a validation set starts to degrade
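    • A minimal sketch of the tradeoff using scikit-learn's validation_curve, sweeping a decision tree's max_depth as the complexity knob (the synthetic data and parameter range are arbitrary choices for illustration):

      import numpy as np
      from sklearn.datasets import make_regression
      from sklearn.model_selection import validation_curve
      from sklearn.tree import DecisionTreeRegressor
      
      X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
      
      depths = np.arange(1, 11)  # model complexity: shallow = high bias, deep = high variance
      train_scores, val_scores = validation_curve(
          DecisionTreeRegressor(random_state=0), X, y,
          param_name="max_depth", param_range=depths,
          cv=5, scoring="neg_mean_squared_error")
      
      # Training error keeps falling with depth, while validation error
      # eventually rises again once the tree starts fitting noise
      for d, tr, va in zip(depths, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
          print(f"max_depth={d}: train MSE={tr:.1f}, validation MSE={va:.1f}")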

ML techniques

  • Training and Test Splits
    • Splitting your data into a training and a test set can help you choose a model that has better chances at generalizing and is not overfitted
    • The training data is used to fit the model, while the test data is used to measure error and performance
    • Training error tends to decrease with a more complex model
    • 
      from sklearn.model_selection import train_test_split
      
      # Define the feature(s) and target variable
      y_col = "price" # define y column
      X = data_df.drop(y_col, axis=1) # drop y column from features data
      y = data_df[y_col] # select the y (target) column
      
      # split the data into training and testing sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      print("number of test samples :", X_test.shape[0])
      print("number of training samples:",X_train.shape[0])
                              

      model_selection.train_test_split - split arrays or matrices into random train and test subsets

      • X, y: the allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas data frames
      • test_size: if float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int (integer), it represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25
      • random_state: controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls
  • Transforming Target
    • Making the target variable normally distributed often will lead to better results
      • Using a visual approach - by plotting a histogram
      • Using a statistical test - the test outputs a p-value; the higher the p-value, the weaker the evidence that the distribution deviates from normal
    • Linear Regression assumes normally distributed residuals, which can be aided by transforming the y (target) variable
      • Log Transformation - can transform data that is significantly skewed right to be more normally distributed
      • Square root Transformation
      • Box-Cox Transformation - parametrized transformation that tries to get distributions "as close to a normal distribution as possible"
      • \[ \text{boxcox}(y_i) = \frac{y_i^{\lambda} - 1}{\lambda}\]

      scipy.stats.boxcox - return a dataset transformed by a Box-Cox power transformation

    
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats.mstats import normaltest # D'Agostino K^2 Test
    normaltest(data_df.target.values)
    
    log_target = np.log(data_df.target)
    log_target.hist();
    
    sqrt_target = np.sqrt(data_df.target)
    sqrt_target.hist();
    
    from scipy.stats import boxcox
    bc_result = boxcox(data_df.target)
    boxcox_target = bc_result[0]
    lam = bc_result[1]
    plt.hist(boxcox_target);
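
    # To use predictions on the original scale after training on the Box-Cox-transformed
    # target, invert the transform with the fitted lambda (np.exp would invert a log
    # transform) - a minimal sketch continuing from the variables above:
    from scipy.special import inv_boxcox
    original_scale_target = inv_boxcox(boxcox_target, lam)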
                            

Regression (Supervised Learning)

Statistical method used to model the relationship between a dependent (target or outcome) variable and one or more independent (predictors or features) variables

Predicts the continuous output variables based on the independent input variable

Goal is to understand how changes in the independent variables are associated with changes in the dependent variable and to make predictions about the dependent variable based on known values of the independent variables

  • Application:
    • Sales forecasting
    • Satisfaction analysis
    • Price estimation
    • Employment income
  • Regression models:
    • Linear Regression
    • Polynomial Regression
    • Ridge Regression (L2 Regularization)
    • Lasso Regression (L1 Regularization)
    • Elastic Net Regression
    • Support Vector Regression (SVR)
    • Decision Tree Regression
    • Random Forest Regression
    • Gradient Boosting Regression
    • XGBoost Regression
    • K-Nearest Neighbors (KNN) Regression
    • Bayesian Regression
    • Principal Component Regression (PCR)
    • Partial Least Squares Regression (PLSR)
    • Quantile Regression

Assessing Performance

  • Cross-Validation
  • Statistical method used in machine learning and data science to assess the performance of a model by splitting the data into multiple subsets (folds) and training/testing the model on different combinations of these subsets.

    The purpose is to provide a more accurate and robust estimate of the model’s performance on unseen data, helping to prevent overfitting and improve generalization

    • K-Fold Cross-Validation
      • dataset is randomly split into k equally sized folds or subsets
      • model is trained on k−1 folds and tested on the remaining fold, rotating through all folds so that each fold serves as a test set once
      • final performance score is the average of the scores from each fold, providing a reliable estimate of model accuracy
    • Leave-One-Out Cross-Validation (LOOCV)
      • k equals the number of data points
      • model is trained on all data points except one, and this process is repeated for each data point in the dataset
      • provides an unbiased performance estimate but is computationally intensive for large datasets
    • Stratified Cross-Validation
      • useful for imbalanced datasets, as it maintains the same proportion of classes (for classification problems) in each fold as in the full dataset, ensuring consistent class distribution across training and validation sets
    • Nested Cross-Validation
      • used when hyperparameter tuning is involved. It has two levels of cross-validation: the outer loop estimates model performance, while the inner loop is used for hyperparameter tuning, ensuring an unbiased evaluation of tuned models
    
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.linear_model import LinearRegression
    from sklearn.datasets import make_regression
    
    # Sample data
    X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
    
    # Model
    model = LinearRegression()
    
    # KFold Cross-Validation
    kfold = KFold(n_splits=5)
    results = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
    
    # Average MSE across folds
    average_mse = -results.mean()
    print("Average MSE:", average_mse)
                        

model_selection.KFold K-Fold cross-validator that splits the dataset into k folds, allowing cross-validation

  • n_splits - number of folds, must be at least 2

model_selection.cross_val_score evaluates model's score through cross validation

  • estimator - the object to use to fit the data
  • X, y - the data to fit, the target variable to try to predict in the case of supervised learning
  • cv - determines the cross-validation splitting strategy (None, int, CV splitter)
  • scoring - a str or a scorer callable object / function with signature scorer (estimator, X, y) which should return only a single value

model_selection.cross_val_predict produces the out-of-sample (cross-validated) prediction for each row

model_selection.GridSearchCV scans over parameters to select the best hyperparameter set with the best out-of-sample score
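A minimal GridSearchCV sketch, scanning the regularization strength of a Ridge model (the parameter grid and synthetic data are arbitrary choices for illustration):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, KFold
    
    X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
    
    # Scan a grid of hyperparameter values and keep the best cross-validated setting
    param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
    grid = GridSearchCV(Ridge(), param_grid, cv=KFold(n_splits=5),
                        scoring="neg_mean_squared_error")
    grid.fit(X, y)
    
    print("best hyperparameters:", grid.best_params_)
    print("best cross-validated score (neg MSE):", grid.best_score_)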

  • Regularization
  • Technique used to prevent overfitting by penalizing high-valued coefficients (shrinks coefficient values and reduces model complexity)

    Especially useful in complex models like high-degree polynomial regression, deep neural networks, and decision trees, where the risk of overfitting is high

    Modifies the cost function used to train the model by adding a penalty term, typically multiplied by a hyperparameter λ (regularization strength)

    Regularization techniques have an analytical, a geometric, and a probabilistic interpretation

    • Ridge (L2 Regularization)
    • \[ \text{L2 penalty} = \lambda \sum_{i=1}^{n} w_i^2 \]

      • penalizes the magnitude of the regression coefficients by adding a squared term
      • complexity penalty λ is applied proportionally to squared coefficient values
      • shrinks the coefficients toward zero, but not exactly to 0
      • minimizes the influence of irrelevant features but does not remove them
      • faster to train
      • this imposes bias on the model, but also reduces variance
      • we can select the best regularization strength lambda via cross-validation
      • it’s a best practice to scale features (e.g. using StandardScaler) so penalties aren’t impacted by variable scale
    • LASSO (L1 Regularization) - Least Absolute Shrinkage and Selection Operator
    • \[ \text{L1 penalty} = \lambda \sum_{i=1}^{n} |w_i| \]

      • penalizes the absolute value of the coefficients
      • complexity penalty λ is proportional to the absolute value of coefficients
      • sets irrelevant features to 0
      • finds features you don't need
      • LASSO is more likely than Ridge to perform feature selection, for a fixed λ LASSO is more likely to result in coefficients being set to zero
      • LASSO’s feature selection property yields an interpretability advantage, but may underperform if the target truly depends on many of the features
    • Elastic Net (L1+L2 Regularization)
    • \[ \text{Elastic Net penalty} = \lambda_1 \sum_{i=1}^{n} |w_i| + \lambda_2 \sum_{i=1}^{n} w_i^2 \]

      • combines both L1 and L2 regularization, balancing between feature selection (L1) and feature shrinkage (L2)
      • penalizes both the squared magnitude and the absolute value of the coefficients
      • sets irrelevant features to 0 and enforces the coefficients to be lower
      • introduces a new parameter α (alpha) that determines a weighted average of L1 and L2 penalties
    • Dropout Regularization (for Neural Networks)
      • temporarily “drops out” random neurons during each training step by setting their outputs to zero with a specified probability
      • prevents any single neuron from becoming too important, encouraging the network to learn redundant representations and reducing overfitting
    • Early Stopping
      • training is stopped as soon as the model’s performance on a validation set begins to deteriorate (prevents the model from learning noise in the training data)
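    A minimal sketch comparing the Ridge, LASSO, and Elastic Net penalties described above with scikit-learn (note that sklearn's alpha plays the role of the regularization strength λ and l1_ratio sets the L1/L2 mix; the synthetic data and alpha values are arbitrary choices for illustration):

    from sklearn.datasets import make_regression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge, Lasso, ElasticNet
    from sklearn.model_selection import cross_val_score
    
    X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                           noise=10.0, random_state=0)
    
    # Scale features first so the penalty treats all coefficients comparably
    models = {
        "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
        "lasso": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
        "elastic net": make_pipeline(StandardScaler(), ElasticNet(alpha=1.0, l1_ratio=0.5)),
    }
    
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
        print(name, "cross-validated MSE:", -scores.mean())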

    Evaluation

    • Mean Absolute Error (MAE) - Computes the average absolute differences between the predicted and actual values. It provides a measure of the average magnitude of errors. Treats all errors equally by taking their absolute values.
    • \[ MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \]

      • \[n: \text{Total number of observations}\]

      • \[y_i: \text{Actual value for the}\] \[i^{th} \text{ data point}\]

      • \[\hat{y}_i: \text{Predicted value for the}\] \[i^{th} \text{ data point}\]

      • \[\left| y_i - \hat{y}_i \right|: \text{Absolute error for the}\] \[i^{th} \text{ prediction}\]

      • Properties:
        • Can take any non-negative value
        • A perfect model will have an MAE of 0, meaning no error between actual and predicted values
        • Gives equal importance to small and large errors, making it less sensitive to outliers
        • Retains the same unit as the dependent variable, providing a direct interpretation of the average error
      • Advantages/Disadvantages:
        • A: Robust to Outliers - it doesn’t square the errors
        • A: Simple to Understand and Calculate
        • D: Not Differentiable at Zero
        • D: Insensitive to Large Errors - may not penalize large errors
      • When to use:
        • When Outliers are Present - If the dataset contains outliers or noisy data
        • When an easily interpretable error metric is needed in the same unit as the target variable
        • Often used in time series forecasting to evaluate how close predictions are to the observed values
      • How it works
        • Actual (y) Predicted (y_hat) Error (y-y_hat) Absolute Error
          3.0 2.5 0.5 0.5
          5.0 4.5 0.5 0.5
          2.0 2.5 -0.5 0.5
          7.0 6.0 1.0 1.0
        • Calculate Errors: For each data point, find the difference between the actual value and the predicted value
        • Take the Absolute Value: Ignore whether the error is positive or negative by taking the absolute value of each error
        • Average the Absolute Errors: Sum all the absolute errors and divide by the number of observations
        • (0.5 + 0.5 + 0.5 + 1.0) / 4 = 2.5 / 4 = 0.625 - on average, the model’s predictions deviate from the actual values by 0.625 units
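      • Example in Python (a minimal sketch mirroring the MSE and RMSE examples below; mean_absolute_error is sklearn's built-in MAE metric)
      • 
        import numpy as np
        from sklearn.metrics import mean_absolute_error
        
        # Example usage
        y_true = np.array([3.0, -0.5, 2.0, 7.0])
        y_pred = np.array([2.5, 0.0, 2.1, 7.8])
        
        # Calculate MAE with sklearn
        mae = mean_absolute_error(y_true, y_pred)
        print("Mean Absolute Error:", mae)
        
        # Equivalent manual calculation
        error = np.mean(np.abs(y_true - y_pred))
        print("Mean Absolute Error:", error)
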
    • Mean Squared Error (MSE) - Squares the differences between predicted and actual values before averaging, emphasizing larger errors. It captures how well a model’s predictions align with the true outcomes, with more weight given to larger errors due to squaring.
    • \[ MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

      • \[n: \text{Total number of observations}\]

      • \[y_i: \text{Actual value for the}\] \[i^{th} \text{ data point}\]

      • \[\hat{y}_i: \text{Predicted value for the}\] \[i^{th} \text{ data point}\]

      • \[\left( y_i - \hat{y}_i \right)^2: \text{Squared error for the}\] \[i^{th} \text{ prediction}\]

      • Properties:
        • Always non-negative
        • A perfect model has an MSE of 0, indicating no difference between actual and predicted values
        • Penalizes large deviations more than small ones, making it sensitive to outliers
        • Expressed in the square of the unit of the target variable
      • Advantages/Disadvantages:
        • A: Popular in Machine Learning - used extensively in model evaluation and optimization
        • A: Mathematically Convenient - differentiable, which makes it suitable for gradient-based optimization techniques to minimize error during training
        • D: Sensitive to Outliers - large errors have a significant impact on MSE, making it less robust to datasets with noisy data or outliers
        • D: Squared units of MSE can make it harder to interpret
      • When to use:
        • Training Machine Learning Models - common loss function for many algorithms
        • Evaluating Models - provides insight into the magnitude of the average squared error
        • When Large Errors are Critical
      • How it works
        • Actual (y) Predicted (y_hat) Error (y-y_hat) Squared Error
          3.0 2.5 0.5 0.25
          5.0 4.5 0.5 0.25
          2.0 2.5 -0.5 0.25
          7.0 6.0 1.0 1.00
        • Calculate Errors: Subtract the predicted value from the actual value for each observation
        • Square the Errors: Squaring ensures that all errors are positive and emphasizes larger errors more than smaller ones
        • Average the Squared Errors: Sum the squared errors and divide by the total number of observations
        • (0.25 + 0.25 + 0.25 + 1.00) / 4 = 1.75 / 4 = 0.4375
      • Example in Python
      • 
        from sklearn.metrics import mean_squared_error
        import numpy as np
        
        # Example usage
        y_true = np.array([3.0, -0.5, 2.0, 7.0])
        y_pred = np.array([2.5, 0.0, 2.1, 7.8])
        
        # Calculate MSE with sklearn
        mse = mean_squared_error(y_true, y_pred)
        print("Mean Squared Error:", mse)
        
        # Define a manual MSE function
        def mse_manual(y_true, y_pred):
            return np.mean((y_true - y_pred) ** 2)
        
        # Calculate MSE manually
        error = mse_manual(y_true, y_pred)
        print("Mean Squared Error:", error)
                                

        metrics.mean_squared_error(y_true, y_pred) - mean squared error regression loss

        • y_true: ground truth (correct) target values
        • y_pred: estimated target values
    • Root Mean Squared Error (RMSE) - The square root of Mean Squared Error (MSE) shares the same unit as the target variable, enhancing its interpretability
    • \[ RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 } \]

      • \[n: \text{Total number of observations}\]

      • \[y_i: \text{Actual value for the}\] \[i^{th} \text{ data point}\]

      • \[\hat{y}_i: \text{Predicted value for the}\] \[i^{th} \text{ data point}\]

      • Properties:
        • Can take any non-negative value
        • A perfect model would have an RMSE of 0, meaning there is no difference between actual and predicted values
        • Gives higher weight to large errors because of the squaring step, this makes it sensitive to outliers
        • Retains the same unit of measurement as the dependent variable
      • Advantages/Disadvantages:
        • A: Easy Interpretation - same units as the target variable
        • A: Effective for Penalizing Large Errors - particularly useful when large errors are undesirable and need to be minimized
        • D: Sensitive to Outliers - single large error can disproportionately increase the RMSE
        • D: Not Robust - if the data contains noise or outliers, RMSE may not be the best performance measure
        • D: Comparison Limitation - it is difficult to compare RMSE across different datasets with varying scales
      • When to use:
        • Regression Models
        • Forecasting - helps in understanding how much the forecast deviates from the observed values
        • Model Selection - often used to compare models, the one with the lowest RMSE is considered better
      • How it works
        • Actual (y) Predicted (y_hat) Error (y-y_hat) Squared Error
          3.0 2.5 0.5 0.25
          5.0 4.5 0.5 0.25
          2.0 2.5 -0.5 0.25
          7.0 6.0 1.0 1.00
        • Calculate Errors: For each data point, find the difference between the actual value and the predicted value
        • Square the Errors: Ensures that negative errors don’t cancel out positive ones and gives more weight to larger errors
        • Average the Squared Errors: Compute the mean of these squared errors
        • Take the Square Root: The square root brings the result back to the original unit of measurement
        • (0.25 + 0.25 + 0.25 + 1.00) / 4 = 0.4375; √0.4375 ≈ 0.661
      • Example in Python
      • 
        import numpy as np
        from sklearn.metrics import mean_squared_error
        
        # Example usage
        y_true = np.array([3.0, -0.5, 2.0, 7.0])
        y_pred = np.array([2.5, 0.0, 2.1, 7.8])
        
        # Calculate RMSE with sklearn
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        print("Root Mean Squared Error:", rmse)
        
        # Define a manual RMSE function
        def rmse_manual(y_true, y_pred):
            return np.sqrt(np.mean((y_true - y_pred) ** 2))
        
        # Calculate RMSE manually
        error = rmse_manual(y_true, y_pred)
        print("Root Mean Squared Error:", error)
                                
    • R-squared (R2) - Quantifies the predictability of the dependent variable based on the independent variables, typically ranging from 0 to 1, with 1 denoting flawless predictions. The coefficient of determination is a statistical measure that indicates the proportion of variance in the dependent (target) variable that is explained by the independent (predictor) variables in a regression model.
    • \[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]

      • \(SS_{res} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\) : Residual Sum of Squares (the sum of squared errors between actual and predicted values).

      • \(SS_{tot} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2\) : Total Sum of Squares (the total variance of the actual values from their mean).

      • \[y_i: \text{Actual value for the}\] \[i^{th} \text{ data point}\]

      • \[\hat{y}_i: \text{Predicted value for the}\] \[i^{th} \text{ data point}\]

      • \[\bar{y}: \text{Mean of the actual values}\]

      • Interpretation:
        • R² = 1: model explains all the variance in the target variable (perfect fit)
        • R² = 0: model explains none of the variance, meaning predictions are no better than simply using the mean of the actual values
        • R² < 0: indicates that the model performs worse than a simple mean-based model. It can happen if the predictions are far from the actual values
      • Advantages/Disadvantages:
        • A: Easy Interpretation - provides a clear percentage of how much variation in the target variable is explained by the model
        • A: Model Comparison - helps compare multiple models, model with a higher R² is generally better at explaining variance in the target variable
        • A: Widely Used - standard measure in regression analysis and is useful for determining model quality
        • D: Does Not Measure Predictive Accuracy - high R² does not guarantee that the model will make accurate predictions for new data
        • D: Sensitive to the Number of Predictors - adding more predictors can artificially inflate R²
        • D: Only Works with Linear Relationships - may be misleading if the relationship between variables is non-linear
      • How it works
        • Actual (y) Predicted (y_hat) Mean of Actual (y_bar)
          3.0 2.5 4.25
          5.0 4.5 4.25
          2.0 2.5 4.25
          7.0 6.0 4.25
        • Calculate Total Variance: measures how much the actual values vary from the mean. It reflects the overall variability in the data
        • Calculate Residual Variance: measures how far the predictions are from the actual values (prediction errors)
        • Compute R²: compares the proportion of unexplained variance (residual variance) to the total variance. If the residual variance is small compared to the total variance, R² will be close to 1, indicating that the model fits the data well.
        • SStot: (3.0 - 4.25) ^ 2 + (5.0 - 4.25) ^ 2 + (2.0 - 4.25) ^ 2 + (7.0 - 4.25) ^ 2 = 14.75
        • SSres: (3.0 - 2.5) ^ 2 + (5.0 - 4.5) ^ 2 + (2.0 - 2.5) ^ 2 + (7.0 - 6.0) ^ 2 = 1.75
        • R²: 1 - (1.75 / 14.75) ≈ 0.881 - meaning the model explains about 88.1% of the variance in the target variable
      • Example in Python
      • 
        import numpy as np
        from sklearn.metrics import r2_score
        
        # Example true and predicted values
        y_true = [3.0, -0.5, 2.0, 7.0]
        y_pred = [2.5, 0.0, 2.1, 7.8]
        
        # Calculate R-squared
        r_squared = r2_score(y_true, y_pred)
        print("R-squared:", r_squared)
        
        # Convert to numpy arrays
        y_true = np.array(y_true)
        y_pred = np.array(y_pred)
        
        # Calculate the mean of the true values
        y_mean = np.mean(y_true)
        
        # Calculate the total sum of squares (TSS) and residual sum of squares (RSS)
        ss_total = np.sum((y_true - y_mean) ** 2)
        ss_residual = np.sum((y_true - y_pred) ** 2)
        
        # Calculate R-squared
        r_squared = 1 - (ss_residual / ss_total)
        print("R-squared:", r_squared)
                                

        metrics.r2_score(y_true, y_pred) - R2 (coefficient of determination) regression score function (best possible score is 1.0)

        • y_true: ground truth (correct) target values
        • y_pred: estimated target values
    • Adjusted R-squared - An adjustment of R-squared that penalizes the addition of unnecessary predictors, providing a better measure of model complexity
    • \[ R^2_{adj} = 1 - \left( \frac{SS_{res} / (n - k - 1)}{SS_{tot} / (n - 1)} \right) \]

      • \(SS_{res} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\) : Residual Sum of Squares.

      • \(SS_{tot} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2\) : Total Sum of Squares.

      • \[n: \text{Total number of observations}\]

      • \[k: \text{Total number of predictors}\]

      • Interpretation:
        • Adjusted R² will be smaller than R² if additional predictors do not significantly improve the model
      • Example in Python
      • 
        import numpy as np
        from sklearn.metrics import r2_score
        
        # Example true and predicted values
        y_true = [3.0, -0.5, 2.0, 7.0]
        y_pred = [2.5, 0.0, 2.1, 7.8]
        
        # Calculate R-squared
        r_squared = r2_score(y_true, y_pred)
        print("R-squared:", r_squared)
        
        # Number of observations and predictors
        n = len(y_true)  # number of data points
        p = 1  # number of predictors (set to 1 for simplicity, adjust for more features)
        
        # Calculate Adjusted R-squared
        r_squared_adjusted = 1 - ((1 - r_squared) * (n - 1) / (n - p - 1))
        print("Adjusted R-squared:", r_squared_adjusted)
                                

    Linear Regression (simple / multiple)

    Model the relationship between a dependent (response/target) continuous variable and one or more independent (predictors/features) variables by fitting a linear equation to observed data

    Simple Linear Regression involves a single independent variable (relationship between this variable and the target is represented as a straight line)

    Multiple Linear Regression extends simple linear regression by including multiple predictors

    Used when the relationship between the dependent and independent variables is linear, the dataset is small to medium-sized and does not have too many complex features, outliers are minimal, and the assumptions of linearity and homoscedasticity are reasonable

    \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon \)

    • \[y\] : dependent (response/target) variable we aim to predict (changes when there is any change in the values of the independent variables)

    • \[x_i\] : independent (predictors/features) variables (does not change based on the effects of other variables)

    • \[\beta_0\] : y-intercept (the value of y when all \[x_i=0\])

    • \[\beta_i\] : coefficients corresponding to each independent variable (change (increase/decrease) in y for a unit increase in \[x_i\])

    • \[\epsilon\] : error term (variability in y that cannot be explained by the linear relationship with x)

    • Fitting a Linear Regression model (cost function)
    • Ordinary Least Squares (OLS) minimizes the sum of squared residuals (errors), i.e. the sum of squared differences between observed values and predicted values

      \( \text{Minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)

      • \[y_i\] : actual value

      • \[\hat{y}_i\] : predicted value

    • Assumptions
      • Linearity: Relationship between predictors and the target variable is linear
      • Independence: Observations are independent of each other
      • Homoscedasticity: Constant variance of errors across values of the independent variables
      • Normality of Errors: Residuals (differences between observed and predicted values) are normally distributed
    • Applications
      • Finance: Predicting stock prices or return on investment based on market indicators
      • Healthcare: Estimating health outcomes based on risk factors
      • Economics: Modeling economic indicators like GDP growth or unemployment rates
      • Marketing: Predicting sales based on advertising expenditure or customer demographics
    • Limitations
      • Assumes Linearity: Poor performance if the relationship is non-linear
      • Sensitive to Outliers: Outliers can disproportionately affect the model
      • Assumes Homoscedasticity: If the variance of the errors is not constant, the model's standard errors and confidence intervals become unreliable
      • Multicollinearity: High correlation between independent variables can lead to unreliable estimates
    • Example in Python
    • 
      
      import numpy as np
      import pandas as pd
      
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_squared_error, r2_score
      
      # Sample advertising spend vs. sales data (illustrative values)
      data = {'Advertising': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
              'Sales': [25, 31, 48, 55, 62, 74, 80, 88, 95, 107]}
      df = pd.DataFrame(data)
      
      # Define predictor and response variable
      X = df[['Advertising']]
      y = df['Sales']
      
      # Split into train and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
      
      # Initialize and fit the model
      model = LinearRegression()
      model.fit(X_train, y_train)
      
      # Predictions
      y_pred = model.predict(X_test)
      
      # Evaluation
      mse = mean_squared_error(y_test, y_pred)
      r2 = r2_score(y_test, y_pred)
      
      print("Mean Squared Error:", mse)
      print("R-squared:", r2)
                          

      linear_model.LinearRegression - ordinary least squares Linear Regression

      • coef_ - estimated coefficients for the linear regression problem
      • intercept_ - independent term in the linear model
      • fit(X, y) - fit linear model, X: training data; y: target values
      • predict(X) - predict using the linear model
      • score(X, y) - return the coefficient of determination R2 of the prediction
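
      A minimal sketch (made-up numbers, not from the notes above) showing how these attributes and methods fit together:

        import numpy as np
        from sklearn.linear_model import LinearRegression

        # Hypothetical data roughly following y = 2x + 1
        X = np.array([[1], [2], [3], [4], [5]])
        y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

        model = LinearRegression()
        model.fit(X, y)                  # fit(X, y): estimate the coefficients

        print(model.coef_)               # estimated slope(s)
        print(model.intercept_)          # estimated intercept
        print(model.predict([[6]]))      # prediction for a new observation
        print(model.score(X, y))         # R^2 on the training data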

    Polynomial Regression (simple / multiple)

    Type of regression analysis in which the relationship between the independent variable (predictor) and the dependent variable (response) is modeled as an n-th degree polynomial

    Polynomial regression is capable of capturing non-linear relationships by fitting a curved line to the data

    As the degree n increases, the curve becomes more flexible, allowing the model to capture more complex patterns in the data

    Used when relationship between variables is non-linear, but still smooth (can be captured by a polynomial curve)

    \( y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_n x^n + \epsilon \)

    • \[y\] : dependent (response/target) variable

    • \[x\] : independent (predictor/feature) variable

    • \[\beta_i\] : coefficients of the polynomial

    • \[n\] : degree of the polynomial (quadratic for n=2, cubic for n=3)

    • \[\epsilon\] : error term

    • Fitting a Polynomial Regression model (cost function)
    • Ordinary Least Squares (OLS) minimizes the sum of squared residuals (errors). The sum of squared differences between observed values and predicted values

      \( \text{Minimize } \sum_{i=1}^{n} \left( y_i - \left( \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_n x_i^n \right) \right)^2 \)

      • \[y_i\] : actual value

      • \[\hat{y}_i\] : predicted value

    • Steps
      • Choose the Degree of the Polynomial: The degree n determines the complexity of the curve
      • Transform the Independent Variable: The original variable x is expanded to include higher powers (x, x^2, x^3...)
      • Fit the Model Using Linear Regression: Treat the transformed terms as new features and apply ordinary least squares (OLS) regression
      • Evaluate and Refine: Check the model’s performance, possibly adjusting the polynomial degree to balance bias and variance
    • Choosing the Degree
      • Underfitting: When the degree is too low, the model might not capture the underlying trend in the data, leading to high bias
      • Overfitting: When the degree is too high, the model might capture noise or random fluctuations in the data rather than the underlying pattern, leading to high variance
      • Optimal Degree: Cross-validation or evaluating metrics on test data can help determine the optimal polynomial degree, balancing bias and variance (bias-variance tradeoff); see the sketch after the attribute list below
    • Assumptions
      • Linearity of Parameters: Although the model itself is non-linear, it is linear in terms of the parameters (coefficients)
      • Independence of Errors: Observations are independent of each other
      • Homoscedasticity: Constant variance of errors across values of the independent variables
      • Normality of Errors: Residuals (differences between observed and predicted values) are normally distributed
    • Applications
      • Economics: Modeling complex relationships between economic indicators
      • Environmental Science: Modeling growth patterns, such as population growth over time
      • Engineering: Estimating stress-strain relationships in material science
      • Marketing: Understanding the impact of advertising spending on sales, where effects are non-linear
    • Advantages
      • Captures Non-Linear Relationships: By introducing polynomial terms, the model can represent curved patterns in the data
      • Flexibility: Polynomial regression is more flexible than simple linear regression, making it suitable for data with a non-linear trend
      • Easy to Implement: Can be implemented using simple linear regression tools, with the only modification being the transformation of the input variables
    • Limitations
      • Overfitting: High-degree polynomials can overfit the training data, leading to poor generalization to new data
      • Sensitive to Outliers: Polynomial regression is particularly sensitive to outliers, which can skew the polynomial curve
      • Extrapolation Challenges: Predictions outside the range of the data can be highly unreliable, as polynomial regression is often unstable in these regions
      • Complexity Increases Quickly: As the polynomial degree increases, the model complexity increases, making it harder to interpret
    • Example in Python
    • 
      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt
      
      from sklearn.linear_model import LinearRegression
      from sklearn.preprocessing import PolynomialFeatures
      from sklearn.metrics import mean_squared_error
      
      # Sample data
      x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
      y = np.array([1, 4, 9, 16, 25, 36, 49, 64, 81])
      
      # Transform the data to include polynomial terms (e.g., x, x^2)
      poly = PolynomialFeatures(degree=2)  # Change degree as needed
      x_poly = poly.fit_transform(x)
      
      # Fit the polynomial regression model
      model = LinearRegression()
      model.fit(x_poly, y)
      
      # Predict values
      y_pred = model.predict(x_poly)
      
      # Evaluation
      mse = mean_squared_error(y, y_pred)
      print("Mean Squared Error:", mse)
      
      # Plotting the results
      plt.scatter(x, y, color='blue', label='Actual data')
      plt.plot(x, y_pred, color='red', label='Polynomial regression fit')
      plt.xlabel('X')
      plt.ylabel('Y')
      plt.legend()
      plt.show()
                          

      preprocessing.PolynomialFeatures - generate polynomial and interaction features

      • degree - int specifies the maximal degree of the polynomial features, or tuple for (min_degree, max_degree)
      • fit_transform - fit to data, then transform it (X: input samples; y: target values)
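
      As a sketch of the degree-selection step described above (cross-validation over candidate degrees), on a synthetic quadratic dataset invented for illustration:

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import cross_val_score, KFold

        # Synthetic data with a quadratic trend plus noise (illustrative only)
        rng = np.random.default_rng(0)
        x = np.linspace(0, 5, 40).reshape(-1, 1)
        y = 2 + 3 * x.ravel() - 0.5 * x.ravel() ** 2 + rng.normal(0, 0.5, 40)

        # fit_transform expands x into [1, x, x^2] for degree=2
        print(PolynomialFeatures(degree=2).fit_transform(x[:3]))

        # Compare degrees by cross-validated R^2; the best degree balances bias and variance
        cv = KFold(n_splits=5, shuffle=True, random_state=0)
        for degree in [1, 2, 3, 5, 9]:
            model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
            scores = cross_val_score(model, x, y, cv=cv, scoring='r2')
            print(degree, round(scores.mean(), 3))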

    Ridge Regression (L2 Regularization)

    Linear regression technique that includes a regularization term to mitigate overfitting, especially when working with multicollinear data

    Penalty is added to the linear regression objective function based on the squared magnitude of the coefficients, effectively “shrinking” coefficients towards zero but never allowing them to be exactly zero

    Used for scenarios with multicollinearity among features, high-dimensional data where overfitting is a concern, situations where interpretability is less critical than predictive performance

    \[ \text{Cost} = \text{MSE} + \lambda \sum_{j=1}^{p} w_j^2 \]

    • \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] : the Mean Squared Error

    • \[\lambda\] : regularization parameter that controls the strength of the penalty

    • \[w_j\] : coefficients (weights) of the model

    • As 𝜆 increases, the influence of the regularization term also increases, forcing the values of the coefficients 𝑤𝑗 to shrink, but not to exactly zero

    • How Ridge Regression Works
      • reduces the model’s variance by shrinking the coefficient estimates, making the model less sensitive to small fluctuations in the data
      • increasing 𝜆 introduces more bias, the goal is to find a balance that minimizes overall prediction error
      • when 𝜆 = 0 Ridge Regression is equivalent to ordinary least squares regression since no penalty is applied
      • as 𝜆 approaches infinity the model shrinks all coefficients towards zero, resulting in a model that only predicts the mean of the target variable
      • the optimal value for 𝜆 is determined through cross-validation
      • in datasets where features are correlated (multicollinear), Ridge Regression reduces the variance caused by multicollinearity by penalizing large coefficients
      • Ridge does not perform feature selection, so all features remain in the model with adjusted coefficients
    • Applications
      • Credit Risk Scoring: Predicting the probability of loan default with many features (e.g., customer demographics, financial history)
      • Marketing Campaign Response: Estimating customer response to an email campaign when some features (e.g., past purchases) are correlated
      • Biological Research: Modeling disease outcomes from thousands of correlated genetic features, where the L2 penalty stabilizes the coefficient estimates
    • Advantages
      • Reduces Overfitting: The L2 penalty stabilizes the model by controlling high-variance coefficients
      • Works with High-Dimensional Data: Performs well when there are more features than observations
      • Closed-Form Solution: Ridge Regression has a closed-form solution, making it computationally efficient for moderate numbers of features (see the sketch after the attribute list below)
    • Disadvantages
      • Does Not Perform Feature Selection: Unlike Lasso Regression, Ridge does not eliminate features, which can lead to less interpretable models
      • Sensitivity to Feature Scaling: Like many distance-based techniques, Ridge Regression is sensitive to the scale of features, so standardization or normalization is typically required
    • Example in Python
    • 
      from sklearn.linear_model import Ridge
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import mean_squared_error
      import numpy as np
      
      # Generate or load data
      X = np.random.rand(100, 3)  # example features
      y = X @ np.array([3, 5, 2]) + np.random.normal(0, 1, 100)  # example target with added noise
      
      # Split data into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Scale the features (important for regularized models)
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
      
      # Fit the Ridge Regression model
      ridge = Ridge(alpha=1.0)  # alpha is the regularization parameter (lambda)
      ridge.fit(X_train_scaled, y_train)
      
      # Predict and evaluate the model
      y_pred = ridge.predict(X_test_scaled)
      mse = mean_squared_error(y_test, y_pred)
      print("Mean Squared Error:", mse)
      print("Model Coefficients:", ridge.coef_)
                          

      linear_model.Ridge - linear least squares with l2 regularization

      • alpha - constant that multiplies the L2 term, controlling regularization strength
      • coef_ - weight vector(s)
      • intercept_ - independent term in decision function
      • fit(X, y) - fit Ridge regression model, X: training data; y: target values
      • predict(X) - predict using the linear model
      • score(X, y) - return the coefficient of determination R2 of the prediction
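
      The closed-form solution mentioned under Advantages can be written out directly; a minimal sketch on synthetic data (intercept disabled so both estimates solve the same problem), compared against sklearn's Ridge:

        import numpy as np
        from sklearn.linear_model import Ridge

        # Synthetic data (illustrative only)
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 3))
        y = X @ np.array([3.0, 5.0, 2.0]) + rng.normal(0, 1, 100)

        alpha = 1.0

        # Closed-form ridge weights: w = (X^T X + alpha * I)^(-1) X^T y
        w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

        ridge = Ridge(alpha=alpha, fit_intercept=False)
        ridge.fit(X, y)

        print(w_closed)
        print(ridge.coef_)  # should match the closed-form weights up to numerical precision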

    Lasso Regression (L1 Regularization) - Least Absolute Shrinkage and Selection Operator

    Regression technique that introduces a penalty equal to the absolute value of the magnitude of coefficients

    Uses an L1 penalty (the sum of absolute values of coefficients) that results in feature selection as it drives some coefficients to zero, effectively removing less important features from the model

    Used when feature selection is required (high-dimensional datasets with many irrelevant features), sparse solutions are desirable (only a few significant features should have non-zero coefficients), data overfitting is a concern

    \[ \text{Cost} = \text{MSE} + \lambda \sum_{j=1}^{p} |w_j| \]

    • \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] : the Mean Squared Error

    • \[\lambda\] : regularization parameter that controls the strength of the penalty

    • \[w_j\] : coefficients (weights) of the model

    • As 𝜆 increases, the penalty term grows and shrinks coefficients toward zero; when the penalty is sufficiently strong, some coefficients are forced to exactly zero, resulting in automatic feature selection

    • How Lasso Regression Works
      • can reduce some coefficients to exactly zero, which is particularly helpful in sparse datasets where only a subset of features influences the target variable (demonstrated in the sketch after the attribute list below)
      • By tuning 𝜆 you can control the number of features retained, balancing interpretability with predictive performance
      • increasing 𝜆 introduces more bias, as it forces the model to generalize by shrinking coefficients
      • Lowering 𝜆 allows more variance in the coefficients, but this can lead to overfitting if there is high noise in the data
      • If the data has highly correlated features, Lasso may arbitrarily select only one of them, ignoring the others
    • Applications
      • Credit Risk Scoring: Predicting the probability of loan default with many features (e.g., customer demographics, financial history)
      • Marketing Campaign Response: Estimating customer response to an email campaign when some features (e.g., past purchases) are correlated
      • Biological Research: Identifying significant genes for a disease from thousands of genetic features (Lasso helps with feature selection)
    • Advantages
      • Feature Selection and Sparsity: Lasso automatically performs feature selection, producing simpler and more interpretable models
      • Prevents Overfitting: The regularization term reduces the risk of overfitting by penalizing complex models
      • Useful for High-Dimensional Data: Particularly helpful in scenarios with many irrelevant features
    • Disadvantages
      • Feature Exclusion with Correlated Data: Lasso can exclude correlated features, leading to biased feature importance
      • Sensitivity to Feature Scaling: Lasso requires standardized or normalized data to perform effectively
    • Example in Python
    • 
      from sklearn.linear_model import Lasso
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import mean_squared_error
      import numpy as np
      
      # Generate or load sample data
      X = np.random.rand(100, 3)  # example features
      y = X @ np.array([3, 5, 0]) + np.random.normal(0, 1, 100)  # target with sparse signal
      
      # Split data into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Scale the features (important for regularized models)
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
      
      # Fit the Lasso Regression model
      lasso = Lasso(alpha=0.1)  # alpha is the regularization parameter (lambda); too large a value can zero out every coefficient
      lasso.fit(X_train_scaled, y_train)
      
      # Predict and evaluate the model
      y_pred = lasso.predict(X_test_scaled)
      mse = mean_squared_error(y_test, y_pred)
      print("Mean Squared Error:", mse)
      print("Model Coefficients:", lasso.coef_)
                          

      linear_model.Lasso - linear model trained with L1 prior as regularizer

      • alpha - constant that multiplies the L1 term, controlling regularization strength
      • coef_ - weight vector(s)
      • intercept_ - independent term in decision function
      • fit(X, y) - fit Lasso regression model, X: training data; y: target values
      • predict(X) - predict using the linear model
      • score(X, y) - return the coefficient of determination R2 of the prediction
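
      To make the sparsity behaviour concrete, a minimal sketch on synthetic data (only two of five features carry signal) showing how a larger alpha drives more coefficients to exactly zero:

        import numpy as np
        from sklearn.linear_model import Lasso
        from sklearn.preprocessing import StandardScaler

        # Synthetic data: only the first two features influence the target (illustrative only)
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))
        y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, 200)

        X_scaled = StandardScaler().fit_transform(X)

        # Stronger L1 penalty -> more coefficients forced to exactly zero
        for alpha in [0.01, 0.1, 1.0]:
            lasso = Lasso(alpha=alpha)
            lasso.fit(X_scaled, y)
            print(alpha, np.round(lasso.coef_, 3))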

    K-Nearest Neighbors (KNN) Regression

    Non-parametric, instance-based algorithm that predicts the value of a target variable by averaging the values of the 𝑘-closest data points (neighbors) to a given input point

    Doesn’t assume any underlying distribution of the data or a linear relationship between variables (flexible choice, especially for non-linear datasets)

    Used when the data has local patterns and you need a non-parametric model, works well with small datasets but struggles with large datasets

    • \[ d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x_i')^2} \] : Euclidean Distance between two points

    • \[ d(x, x') = \sum_{i=1}^{n} |x_i - x_i'| \] : Manhattan Distance

    • \[ \hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i \] : Prediction for KNN Regression as the average of the target values of the \( k \) nearest neighbors

    • \[ \hat{y} = \frac{\sum_{i=1}^{k} \frac{y_i}{d(x, x_i)}}{\sum_{i=1}^{k} \frac{1}{d(x, x_i)}} \] : Weighted KNN prediction

    • Key Concepts
      • relies on measuring the distance between data points to identify neighbors, commonly using the Euclidean distance
      • Manhattan, Minkowski, or Mahalanobis can also be used depending on the nature of the data and dimensionality
      • parameter 𝑘 represents the number of closest points used to make predictions, smaller 𝑘 values make the model sensitive to noise (high variance), while larger 𝑘 values lead to smoother predictions but can underfit (high bias)
      • typically 𝑘 is chosen through cross-validation, where the optimal value minimizes the prediction error on a validation set
      • for a given input x, KNN regression identifies the 𝑘-nearest neighbors, retrieves their target values, and averages them to produce the prediction; averaging over several neighbors makes the prediction less sensitive to any single noisy data point (see the sketch after the attribute list below)
      • in weighted KNN neighbors closer to the query point have a higher influence on the prediction (particularly useful when neighbors vary significantly in their distance from the query point)
    • Applications
      • House Rent Estimation: Estimating rent based on nearby properties with similar characteristics
      • User Rating Prediction: Predicting product ratings by aggregating the ratings of similar users
      • Restaurant Recommendation: Predicting restaurant ratings for a user based on their similarity to other users' preferences
    • Advantages
      • Simplicity: KNN regression is straightforward to understand and implement
      • Non-Parametric: It doesn’t assume a specific form for the data distribution, which makes it adaptable to different patterns and useful for non-linear data
      • Interpretability: Each prediction can be explained by the nearest neighbors, which can provide insights into the relationships within the data
    • Disadvantages
      • Computationally Intensive: KNN regression requires calculating the distance from the query point to all other points, making it slow for large datasets
      • Sensitivity to Feature Scaling: Since KNN relies on distance calculations, feature scaling (e.g., normalization or standardization) is crucial for ensuring all features contribute equally to the distance metric
      • High Variance: The model’s performance depends heavily on the choice of 𝑘 and can fluctuate with small changes in the training data
    • Example in Python
    • 
      from sklearn.neighbors import KNeighborsRegressor
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import mean_squared_error
      import numpy as np
      
      # Generate or load sample data
      X = np.arange(1, 21).reshape(-1, 1)  # example feature (e.g., month)
      y = np.random.normal(50, 10, 20)     # example target (e.g., sales)
      
      # Split data into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Scale the features
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
      
      # Create and fit the KNN Regressor model
      knn = KNeighborsRegressor(n_neighbors=3, weights='distance')  # using weighted KNN
      knn.fit(X_train_scaled, y_train)
      
      # Predict and evaluate the model
      y_pred = knn.predict(X_test_scaled)
      mse = mean_squared_error(y_test, y_pred)
      print("Mean Squared Error:", mse)
                          

      neighbors.KNeighborsRegressor - regression based on k-nearest neighbors

      • n_neighbors - number of neighbors to use by default for kneighbors queries
      • weights - weight function used in prediction (uniform, distance or callable)
      • fit(X, y) - fit k-nearest neighbors regression model, X: training data; y: target values
      • predict(X) - predict the target for the provided data
      • score(X, y) - return the coefficient of determination R2 of the prediction
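
      A minimal sketch of the mechanics described under Key Concepts (compute distances to every training point, take the k closest, average their targets), using hypothetical numbers:

        import numpy as np

        # Hypothetical training data: 1 feature, 6 observations
        X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
        y_train = np.array([10.0, 12.0, 15.0, 21.0, 24.0, 30.0])

        def knn_predict(x_query, k=3):
            # Euclidean distances from the query point to every training point
            distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
            # Indices of the k nearest neighbors
            nearest = np.argsort(distances)[:k]
            # Unweighted KNN regression: average of the neighbors' targets
            return y_train[nearest].mean()

        print(knn_predict(np.array([3.4])))  # averages the targets of the 3 closest points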

    Support Vector Regression (SVR)

    Type of Support Vector Machine (SVM) adapted for regression tasks, where the goal is to predict continuous values rather than classify data points

    Attempts to find a function that fits the data within a specified margin of error (epsilon-insensitive margin), where the model disregards errors that fall within a distance 𝜖 of the true values

    • \[ L(y, f(x)) = \begin{cases} 0 & \text{if } |y - f(x)| \leq \epsilon \\ |y - f(x)| - \epsilon & \text{otherwise} \end{cases} \] : Epsilon-Insensitive Loss Function

    • \[ \min \frac{1}{2} ||w||^2 + C \sum_{i=1}^{n} \max(0, |y_i - f(x_i)| - \epsilon) \] : Objective function for SVR, balancing flatness and error minimization

    • \[ ||w||^2 \] : controls the flatness of the regression line

    • \[ C \] : regularization parameter that balances margin maximization and error minimization

    • Key Concepts
      • a margin 𝜖 is set around the predicted function; errors within this margin are not penalized, making the model “insensitive” to small deviations, and only points outside the margin contribute to the error calculation (see the loss-function sketch after the attribute list below)
      • goal of SVR is to find a function that minimizes a combined objective: maximizing the margin while keeping errors outside the epsilon margin small
      • can handle non-linear relationships by using the kernel trick, allowing it to project data into higher-dimensional spaces where linear separability (or linear regression) is easier to achieve
      • support vectors are the data points that lie outside the epsilon margin, influencing the position and orientation of the regression line (only these points affect the SVR model)
      • regularization parameter 𝐶 controls the trade-off between maximizing the margin and minimizing the error beyond 𝜖 (a high 𝐶 assigns a larger penalty to errors, which can lead to overfitting)
    • Applications
      • House Rent Estimation: Estimating rent from property characteristics
      • User Rating Prediction: Predicting product ratings from user and product features
      • Restaurant Recommendation: Predicting a user's restaurant ratings from their preference profile
    • Advantages
      • Robust to Outliers: SVR is robust to outliers because of its epsilon-insensitive margin, which can ignore minor deviations within the 𝜖-margin
      • Effective in High-Dimensional Spaces: SVR performs well in high-dimensional data, making it suitable for complex and sparse datasets
      • Non-Linear Capabilities: With kernels, SVR can capture non-linear patterns effectively, offering flexibility across different types of data
    • Example in Python
    • 
      from sklearn.svm import SVR
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import mean_squared_error
      import numpy as np
      
      # Generate or load data
      # X = features, y = target
      X = np.arange(1, 21).reshape(-1, 1)  # example feature (e.g., months)
      y = np.random.normal(50, 10, 20)     # example target (e.g., sales)
      
      # Split data
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Standardize data (SVR often performs better with scaled data)
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
      
      # Fit SVR model with RBF kernel
      svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
      svr.fit(X_train_scaled, y_train)
      
      # Predict and evaluate
      y_pred = svr.predict(X_test_scaled)
      mse = mean_squared_error(y_test, y_pred)
      print("Mean Squared Error:", mse)
                          

      sklearn.svm.SVR - epsilon-support vector regression

      • kernel - specifies the kernel type to be used in the algorithm (linear, poly, rbf, sigmoid, precomputed)
      • C - regularization parameter
      • epsilon - specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value
      • fit(X, y) - fit the SVM model according to the given training data, X: training vectors; y: target values
      • predict(X) - perform regression on samples in X
      • score(X, y) - return the coefficient of determination R2 of the prediction
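
      The epsilon-insensitive loss defined above is straightforward to write out; a minimal sketch with made-up values:

        import numpy as np

        def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
            # Zero loss inside the epsilon-tube, linear loss outside it
            return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

        y_true = np.array([3.0, 5.0, 7.0])
        y_pred = np.array([3.05, 5.5, 6.0])
        print(epsilon_insensitive_loss(y_true, y_pred))  # [0.0, 0.4, 0.9]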

    Logistic Regression

    Statistical and machine learning method commonly used for binary classification problems, where the goal is to predict the probability of a binary outcome (e.g., yes/no, true/false, 0/1) based on one or more predictor variables

    Logistic regression is fundamentally a classification algorithm that applies a logistic (or sigmoid) function to estimate probabilities

    • \[ \sigma(z) = \frac{1}{1 + e^{-z}} \] : logistic or sigmoid function is used to map any real-valued number to a probability between 0 and 1

    • \[ z = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p \] : linear combination for z

    • \[ P(y = 1 | X) = \sigma(z) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p)}} \] : The probability of the positive class (class 1)

    • \[ P(y = 0 | X) = 1 - \sigma(z) = \frac{e^{-(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p)}}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p)}} \] : The probability of the negative class (class 0)

    • \[ y = \begin{cases} 1 & \text{if } P(y=1 | X) \geq 0.5 \\ 0 & \text{if } P(y=1 | X) < 0.5 \end{cases} \] : decision threshold (commonly 0.5) to classify the output

    • \[ \text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right] \] : Log Loss (Binary Cross-Entropy Loss) cost function

    • \[ \hat{w} = \arg \min_{w} -\sum_{i=1}^{n} \left[ y_i \cdot \log(\sigma(z_i)) + (1 - y_i) \cdot \log(1 - \sigma(z_i)) \right] \] : Maximum Likelihood Estimation - the objective of logistic regression is to maximize the likelihood of the observed data, often by minimizing the negative log likelihood

    • Key Concepts
      • logistic function, or sigmoid function maps any real-valued number to a range between 0 and 1, which can then be interpreted as a probability
      • 𝑧 is a linear combination of input features, where 𝑤 represents the model’s coefficients and 𝑥 the input features
      • sigmoid function transforms this linear output to a probability score, making it suitable for binary classification
      • outputs a probability score between 0 and 1, a threshold of 0.5 is used to classify the output: values greater than or equal to 0.5 are classified as one class (often labeled as 1)
      • optimized using a loss function known as log loss or binary cross-entropy, which measures the difference between the predicted probability and the actual label
      • logistic regression coefficients are estimated using maximum likelihood estimation, which adjusts the weights so that the predicted probabilities align as closely as possible with the observed labels
      • assumes a linear relationship between the independent variables and the log-odds of the outcome
      • each observation should be independent of others, as logistic regression does not handle correlated data well
      • highly correlated features can affect the model's stability, so feature selection or dimensionality reduction techniques (e.g., PCA) are often applied beforehand
    • Advantages
      • Simple and Interpretable: Logistic regression is easy to interpret and provides straightforward probabilistic outputs
      • Efficient for Binary Classification: It is computationally efficient and performs well on linearly separable datasets
      • Works Well with Small Datasets: Logistic regression can perform well with relatively small datasets and is less prone to overfitting compared to more complex models when regularization is used
    • Disadvantages
      • Linear Decision Boundary: Logistic regression works best for linear decision boundaries. For more complex boundaries, it may underperform unless features are transformed or nonlinear techniques are used
      • Sensitive to Outliers: Logistic regression can be sensitive to outliers, particularly if regularization is not applied
      • Limited to Binary Classification: Standard logistic regression is inherently binary, although it can be extended to multiclass classification using techniques such as One-vs-Rest (OvR) and Softmax Regression
    • Example in Python
    • 
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score, confusion_matrix
      
      # Generate or load example data
      X = [[1.1], [2.5], [3.3], [4.5], [5.1], [6.2], [7.4]]  # example features
      y = [0, 0, 1, 1, 0, 1, 1]  # binary target
      
      # Split data into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
      # Initialize and train the model
      model = LogisticRegression()
      model.fit(X_train, y_train)
      
      # Make predictions
      y_pred = model.predict(X_test)
      
      # Evaluate the model
      accuracy = accuracy_score(y_test, y_pred)
      conf_matrix = confusion_matrix(y_test, y_pred)
      print("Accuracy:", accuracy)
      print("Confusion Matrix:\n", conf_matrix)
                          
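
    The sigmoid and log-loss formulas above can also be computed by hand; a minimal sketch with made-up scores and labels:

      import numpy as np

      def sigmoid(z):
          # Maps any real number to a probability in (0, 1)
          return 1.0 / (1.0 + np.exp(-z))

      # Hypothetical linear scores z = w0 + w1*x for four observations
      z = np.array([-2.0, -0.5, 0.5, 2.0])
      p = sigmoid(z)                      # predicted probabilities P(y=1 | X)
      y = np.array([0, 0, 1, 1])          # actual labels

      # Binary cross-entropy (log loss) from the formula above
      log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
      print(p)
      print(log_loss)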

    Classification (Supervised Learning)

    A supervised learning approach that determines the class label for an unlabeled test case.

    Categorizing some unknown items into a discrete set of categories or "classes"

    The target attribute is a categorical or discrete variable

    • Application:
      • Which category a customer belongs to
      • Whether a customer switches to another provider/brand
      • Whether a customer responds to a particular advertising campaign
    • Evaluation:
      • Accuracy - The proportion of correctly classified instances out of the total instances. It's a simple and intuitive metric but can be misleading with imbalanced datasets
      • Precision - Precision quantifies the proportion of accurately predicted positive instances among all instances classified as positive, showcasing the model's capacity to identify true positives correctly
      • Recall (Sensitivity) - The ratio of correctly predicted positive observations to all actual positives. It measures the model's ability to find all the positive instances
      • F1 Score - The harmonic mean of precision and recall. It balances precision and recall, especially when dealing with imbalanced datasets
      • ROC Curve and AUC - ROC curves visualize the trade-off between true positive rate (TPR) and false positive rate (FPR) at various thresholds. AUC summarizes the ROC curve into a single value, indicating the model's ability to discriminate between positive and negative classes
      • Confusion Matrix - A table summarizing the number of correct and incorrect predictions provides insights into the model's performance across different classes
      • Classification Report - It comprehensively summarizes various classification metrics for each class, such as precision, recall, F1 score, and support
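
    A minimal sketch (hypothetical labels and probabilities) of how these evaluation metrics are typically computed with scikit-learn:

      from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                   f1_score, confusion_matrix, classification_report,
                                   roc_auc_score)

      # Hypothetical true labels, predicted labels and predicted probabilities
      y_true = [0, 0, 1, 1, 1, 0, 1, 0]
      y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
      y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]

      print("Accuracy:", accuracy_score(y_true, y_pred))
      print("Precision:", precision_score(y_true, y_pred))
      print("Recall:", recall_score(y_true, y_pred))
      print("F1 Score:", f1_score(y_true, y_pred))
      print("ROC AUC:", roc_auc_score(y_true, y_prob))
      print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
      print("Classification Report:\n", classification_report(y_true, y_pred))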