---
id: k-nearest-neighbors
title: k-Nearest Neighbors Algorithm
sidebar_label: k-Nearest Neighbors
description: "In this post, we'll explore the k-Nearest Neighbors (k-NN) Algorithm, one of the simplest and most intuitive algorithms in machine learning."
tags: [machine learning, algorithms, classification, regression, k-NN]
---
### Definition:
**k-Nearest Neighbors (k-NN)** is a simple and widely used **supervised learning algorithm** that can be applied to both **classification** and **regression** tasks. It classifies or predicts a data point based on how closely it resembles its neighbors. k-NN has no explicit training phase; instead, it stores the entire dataset and makes predictions by finding the **k nearest neighbors** of a given input and using their majority class (for classification) or their average value (for regression).

### Characteristics:
- **Instance-Based Learning**:
  k-NN is a **lazy learner**, meaning it stores all training instances and delays computation until prediction time.

- **Non-Parametric**:
  It makes no assumptions about the underlying data distribution, which makes it highly flexible but sensitive to noisy data.

- **Distance-Based**:
  The algorithm relies on a **distance metric** (e.g., Euclidean distance) to measure how close or similar data points are.

### How k-NN Works:
1. **Data Collection**:
   k-NN requires a labeled dataset in which each example consists of feature values and a corresponding label (for classification) or continuous target (for regression).

2. **Prediction**:
   To predict the label or value for a new, unseen data point:
   - Measure the distance between the new point and all points in the training set using a distance metric.
   - Select the **k nearest neighbors** based on the shortest distances.
   - For classification, assign the most frequent class (majority voting) among the k neighbors. For regression, take the average of the k neighbors' values.
|
||
### Distance Metrics:
The most commonly used distance metrics in k-NN are listed below (a short code sketch of each follows the list):

- **Euclidean Distance** (default for continuous variables):

  $$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

- **Manhattan Distance**:

  $$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

- **Minkowski Distance** (generalization of Euclidean and Manhattan):

  $$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

- **Hamming Distance** (used for categorical variables):
  Measures the number of positions at which two binary strings differ.
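To make these metrics concrete, here is a minimal sketch of how they could be computed with NumPy; the function names are illustrative, not taken from any particular library:

```python
import numpy as np

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_distance(x, y):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def minkowski_distance(x, y, p=3):
    # Generalizes Euclidean (p=2) and Manhattan (p=1)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

def hamming_distance(x, y):
    # Number of positions at which the two vectors differ
    return np.sum(x != y)

a, b = np.array([150, 8]), np.array([160, 7])
print(euclidean_distance(a, b))   # ~10.05
print(manhattan_distance(a, b))   # 11
```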
### Choosing k:
- **Small k**:
  A smaller k (e.g., k=1 or k=3) makes the model sensitive to noise and can lead to **overfitting**, because the prediction depends on very few neighbors.

- **Large k**:
  A larger k produces a smoother, more generalized prediction but may lead to **underfitting** if it pulls in too many neighbors from other classes.

- **Optimal k**:
  The ideal value of k is often found through experimentation or by using techniques like **cross-validation**, as sketched below.
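One common way to pick k in practice is to cross-validate over a small grid of candidate values. A hedged sketch is shown below; the feature matrix `X` and labels `y` are assumed to be defined elsewhere, and the candidate range is illustrative:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X, y: feature matrix and labels, assumed to be defined elsewhere
best_k, best_score = None, -1.0
for k in range(1, 16, 2):  # odd values of k help avoid ties in binary problems
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

print(f"Best k: {best_k} (mean CV accuracy: {best_score:.3f})")
```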
### Classification with k-NN:
In classification tasks, k-NN assigns the class label based on the majority class among the k nearest neighbors. Each neighbor votes for its class, and the most frequent class becomes the prediction.

**Example**:
Suppose we are classifying an unknown data point based on its three nearest neighbors (k=3), whose classes are:
- Neighbor 1: Class A
- Neighbor 2: Class A
- Neighbor 3: Class B

Since Class A occurs more frequently, the new point is assigned to **Class A**.
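The majority vote itself can be expressed in a couple of lines; a minimal sketch using `collections.Counter`:

```python
from collections import Counter

neighbor_labels = ['Class A', 'Class A', 'Class B']  # labels of the k=3 nearest neighbors
predicted_class = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted_class)  # 'Class A'
```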
### Regression with k-NN:
In regression tasks, k-NN predicts the target value as the **average** of the values of its k nearest neighbors.

**Example**:
To predict the price of a house, k-NN finds the k houses with the most similar features (square footage, number of rooms) and returns the average price of those k houses as the predicted price.
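For the regression case, scikit-learn provides `KNeighborsRegressor`, which averages the neighbors' targets. A small sketch with made-up house data (the numbers are illustrative only):

```python
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Illustrative features: [square footage, number of rooms] and prices
X = np.array([[1500, 3], [1700, 3], [1800, 4], [2000, 4]])
y = np.array([200_000, 230_000, 250_000, 280_000])

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)

# The prediction is the mean price of the 3 most similar houses
print(reg.predict([[1850, 4]]))
```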
### Steps in k-NN Algorithm:
1. **Data Storage**:
   k-NN stores the entire dataset of training examples.

2. **Distance Calculation**:
   For a new input data point, compute the distance between the input and every point in the training set using a chosen distance metric.

3. **Identify Neighbors**:
   Sort the distances and identify the k nearest neighbors.

4. **Prediction** (see the from-scratch sketch after this list):
   - For classification, count the occurrences of each class among the k neighbors and assign the class with the majority of votes.
   - For regression, take the average of the k neighbors' target values.
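These four steps map almost line-for-line onto a from-scratch implementation. The minimal sketch below handles the classification case with Euclidean distance; the function name is illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. The "stored" dataset is simply X_train / y_train.
    # 2. Distance calculation: Euclidean distance to every training point.
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # 3. Identify neighbors: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # 4. Prediction: majority vote among the k neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[150, 8], [170, 7], [130, 6], [140, 5]])
y_train = np.array(['Apple', 'Apple', 'Orange', 'Orange'])
print(knn_predict(X_train, y_train, np.array([160, 7])))  # 'Apple'
```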
### Problem Statement:
Given a dataset with labeled examples (for classification) or continuous targets (for regression), the goal is to classify or predict new input data points by finding the k most similar data points in the training set and using them to infer the output.

### Key Concepts:
- **Lazy Learning**:
  k-NN is called a lazy learner because it doesn't explicitly learn a model during training but simply stores the training dataset.

- **Similarity**:
  The similarity between data points is quantified by calculating the distance between their feature vectors.

- **Majority Voting**:
  For classification, the class of a new data point is determined by the majority class among its k nearest neighbors.

- **Averaging**:
  For regression, the predicted value is the average of the k nearest neighbors' target values.
### Time Complexity:
- **Training Time Complexity: $O(1)$**
  k-NN has no training phase beyond storing the data, so training takes constant time.

- **Prediction Time Complexity: $O(n \cdot d)$**
  Predicting the class or value of a new data point requires computing the distance between the new point and all $n$ training points, each of which has $d$ dimensions.

### Space Complexity:
- **Space Complexity: $O(n \cdot d)$**
  The algorithm stores the entire dataset, which consists of $n$ points in $d$ dimensions.
### Example:
Consider a simple k-NN classification example for predicting whether a fruit is an apple or an orange based on its features (weight and color):

- Dataset:
```
| Weight (g) | Color (scale 1-10) | Fruit   |
|------------|--------------------|---------|
| 150        | 8                  | Apple   |
| 170        | 7                  | Apple   |
| 130        | 6                  | Orange  |
| 140        | 5                  | Orange  |
```
**Step-by-Step Execution**:

1. **Store Dataset**:
   Store the dataset as-is.

2. **Calculate Distances**:
   For a new fruit with a weight of 160g and a color value of 7, compute the distance from this point to all existing data points.

3. **Find k Nearest Neighbors**:
   With k=3, identify the 3 closest fruits to the new one based on the shortest distances.

4. **Make Prediction**:
   Count the class occurrences (Apple or Orange) among the 3 nearest neighbors and assign the new fruit the most frequent class; a quick manual check of these numbers follows below.
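Assuming plain Euclidean distance on the raw feature values, the worked example can be checked by hand with a few lines of NumPy (printed values are approximate):

```python
import numpy as np

data = np.array([[150, 8], [170, 7], [130, 6], [140, 5]])
labels = np.array(['Apple', 'Apple', 'Orange', 'Orange'])
new_fruit = np.array([160, 7])

# Distance from the new fruit to every stored fruit
distances = np.sqrt(np.sum((data - new_fruit) ** 2, axis=1))
print(np.round(distances, 2))              # approx [10.05, 10.0, 30.02, 20.1]
print(labels[np.argsort(distances)[:3]])   # ['Apple' 'Apple' 'Orange'] -> majority: Apple
```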
### Python Implementation:
Here's a simple implementation of the k-NN algorithm using **scikit-learn**:

```python
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Sample data
X = np.array([[150, 8], [170, 7], [130, 6], [140, 5]])  # Features (Weight, Color)
y = np.array(['Apple', 'Apple', 'Orange', 'Orange'])     # Target (Fruit)

# Create k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X, y)

# Make a prediction for a new fruit
new_fruit = np.array([[160, 7]])  # New fruit with weight 160g and color 7
predicted_fruit = knn.predict(new_fruit)
print(f"The predicted fruit is: {predicted_fruit[0]}")
```
### Summary:
The **k-Nearest Neighbors Algorithm** is a straightforward and intuitive method for both classification and regression tasks. It works by finding the k most similar examples in the training dataset and using them to predict the class or value of a new data point. While easy to implement, k-NN can be computationally expensive, especially on large datasets, as it requires calculating the distance to every training point at prediction time.

---
---
id: linear-regression
title: Linear Regression Algorithm
sidebar_label: Linear Regression
description: "In this post, we'll explore the Linear Regression Algorithm, one of the most basic and commonly used algorithms in machine learning."
tags: [machine learning, algorithms, linear regression, regression]
---
### Definition:
**Linear Regression** is a supervised learning algorithm used for **predictive modeling** of continuous numerical variables. It establishes a linear relationship between a dependent variable (the target) and one or more independent variables (the features). The goal of linear regression is to model this relationship using a straight line (linear equation) to predict the target values based on input features.

### Characteristics:
- **Regression Model**:
  Unlike classification, linear regression predicts **continuous** values rather than categories.

- **Linear Relationship**:
  Assumes a linear relationship between the features and the target variable, where changes in feature values proportionally affect the target.

- **Minimizing Error**:
  The model minimizes the difference between the actual values and predicted values using a method called **Ordinary Least Squares** (OLS).

### Types of Linear Regression:
1. **Simple Linear Regression**:
   Used when there is one independent variable to predict the target.
   Example: Predicting house price based solely on square footage.

2. **Multiple Linear Regression**:
   Used when there are two or more independent variables to predict the target.
   Example: Predicting house price based on square footage, number of rooms, and age of the house.
### Linear Equation:
In linear regression, the relationship between the target $y$ and the input features $X$ is modeled using the equation of a straight line:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Where:
- $y$ is the dependent variable (target).
- $x_1, x_2, \dots, x_n$ are the independent variables (features).
- $\beta_0$ is the intercept (the value of $y$ when all features are zero).
- $\beta_1, \beta_2, \dots, \beta_n$ are the coefficients (slopes), representing the change in $y$ for a unit change in the corresponding $x_i$.
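In vector form this equation is just a dot product plus an intercept. A minimal sketch with made-up, purely illustrative coefficients:

```python
import numpy as np

beta_0 = 50_000.0                   # intercept (illustrative value)
beta = np.array([120.0, 15_000.0])  # one coefficient per feature (illustrative)
x = np.array([1600.0, 3.0])         # feature vector: [square footage, rooms]

y_hat = beta_0 + beta @ x           # y = beta_0 + beta_1*x_1 + ... + beta_n*x_n
print(y_hat)                        # 50000 + 120*1600 + 15000*3 = 287000.0
```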
### How Linear Regression Works:
1. **Data Collection**:
   Gather a dataset with one or more features (independent variables) and the corresponding target variable (dependent variable).

2. **Model Training**:
   The algorithm attempts to find the best-fitting line by optimizing the parameters (intercept and slopes). This is achieved using **Ordinary Least Squares** (OLS), which minimizes the sum of squared residuals (the differences between actual and predicted values); a closed-form sketch follows this list.

3. **Making Predictions**:
   Once trained, the model can predict the target value $y$ for new data points by applying the learned linear equation.

4. **Residuals**:
   The residual is the difference between the actual and predicted value:

   $$e_i = y_i - \hat{y}_i$$

   The goal is to minimize these residuals.
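For simple linear regression (one feature), the OLS solution has a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal NumPy sketch using the square-footage data from the example later in this post:

```python
import numpy as np

x = np.array([1500.0, 1700.0, 1800.0, 1900.0])               # square footage
y = np.array([200_000.0, 230_000.0, 250_000.0, 270_000.0])   # price

# Closed-form OLS for one feature:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

y_pred = intercept + slope * x
residuals = y - y_pred       # the quantities OLS drives toward zero overall
print(slope, intercept)
print(residuals)
```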
### Problem Statement:
Given a dataset with independent variables (features), the objective is to learn a linear regression model that can predict the target variable based on new input values.

### Key Concepts:
- **Slope (Coefficient)**:
  The slope represents how much the target variable changes when the corresponding feature increases by one unit. In multiple regression, each feature has its own slope.

- **Intercept**:
  The intercept is the predicted value of the target when all feature values are zero.

- **Best-Fit Line**:
  Linear regression aims to find the line (or hyperplane, for multiple regression) that best fits the data, meaning it minimizes the overall distance between the data points and the line.
### Loss Function:
The loss function used in linear regression is the **Mean Squared Error (MSE)**, which calculates the average of the squared differences between the actual and predicted values:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:
- $y_i$ is the actual target value of the $i$-th data point.
- $\hat{y}_i$ is the predicted target value.
- $n$ is the total number of samples.
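As a quick check of the formula, MSE is essentially one line of NumPy; the sample values below are illustrative:

```python
import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

print(mse(np.array([3.0, 5.0, 7.0]), np.array([2.5, 5.0, 8.0])))  # (0.25 + 0 + 1) / 3 ≈ 0.4167
```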
### Gradient Descent (Alternative Training Method):
Another approach to finding the best-fit line is **gradient descent**, which iteratively updates the model parameters by moving in the direction of the steepest decrease in the loss function.

- **Update rule** for each parameter:

  $$\beta_j := \beta_j - \alpha \frac{\partial J(\beta)}{\partial \beta_j}$$

Where:
- $\alpha$ is the learning rate (controls step size).
- $J(\beta)$ is the loss function (MSE).

The parameters are updated in each iteration to reduce the error.
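A hedged sketch of batch gradient descent for simple linear regression (one feature) using the MSE gradient; the data, learning rate, and iteration count are all illustrative:

```python
import numpy as np

# Illustrative data generated from y = 1 + 2x, so the optimum is slope=2, intercept=1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

slope, intercept = 0.0, 0.0
alpha = 0.05  # learning rate (step size)

for _ in range(2000):
    y_hat = intercept + slope * x
    error = y_hat - y
    # Partial derivatives of the MSE loss with respect to each parameter
    grad_slope = 2 * np.mean(error * x)
    grad_intercept = 2 * np.mean(error)
    # Update rule: move each parameter against its gradient
    slope -= alpha * grad_slope
    intercept -= alpha * grad_intercept

print(round(slope, 3), round(intercept, 3))  # approaches 2.0 and 1.0
```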
### Time Complexity:
- **Training Time Complexity: $O(n \cdot p)$ per gradient-descent pass**
  Training cost grows linearly with the number of samples $n$ and the number of features $p$; the closed-form OLS solution costs roughly $O(n \cdot p^2 + p^3)$. For simple linear regression ($p = 1$) both reduce to $O(n)$, making training efficient even for large datasets.

### Space Complexity:
- **Space Complexity: $O(p)$**
  The fitted model stores one coefficient per feature plus the intercept, so its size is proportional to the number of features $p$.

### Example:
Consider a dataset for predicting the price of a house based on **square footage**:

- Dataset:
```
| Square Footage | Price ($)    |
|----------------|--------------|
| 1500           | 200,000      |
| 1700           | 230,000      |
| 1800           | 250,000      |
| 1900           | 270,000      |
```
Step-by-Step Execution:

1. **Fit the model**:
   Linear regression will learn the relationship between **square footage** (independent variable) and **price** (dependent variable).

2. **Equation**:
   The model will output an equation of the form:

   $$\text{Price} = \beta_0 + \beta_1 \times \text{Square Footage}$$

3. **Predict price**:
   For a new house with 2000 square feet, the model will predict the price using this equation.
### Python Implementation:
Here is a basic implementation of Linear Regression in Python using **scikit-learn**:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1500], [1700], [1800], [1900]])  # Features (Square Footage)
y = np.array([200000, 230000, 250000, 270000])  # Target (Price)

# Create linear regression model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Make predictions
predicted_price = model.predict([[2000]])  # Predict price for 2000 square footage
print(f"Predicted price: ${predicted_price[0]:,.2f}")

# Display the model's coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_[0]}")
```
### Summary:
The **Linear Regression Algorithm** is one of the most fundamental techniques for predicting continuous outcomes. Its simplicity and interpretability make it a powerful tool for many real-world applications, particularly in finance, economics, and engineering. However, it assumes a linear relationship between variables and may not work well for datasets with non-linear patterns.

---