
Commit

Merge branch 'main' into dev-2
ajay-dhangar authored Oct 16, 2024
2 parents f5d415b + 47d4765 commit b0b934b
Showing 56 changed files with 6,197 additions and 683 deletions.
171 changes: 171 additions & 0 deletions MachineLearning/K-NearestNeighbours.md
@@ -0,0 +1,171 @@
---
id: k-nearest-neighbors
title: k-Nearest Neighbors Algorithm
sidebar_label: k-Nearest Neighbors
description: "In this post, we'll explore the k-Nearest Neighbors (k-NN) Algorithm, one of the simplest and most intuitive algorithms in machine learning."
tags: [machine learning, algorithms, classification, regression, k-NN]

---

### Definition:
**k-Nearest Neighbors (k-NN)** is a simple and widely used **supervised learning algorithm**. It can be applied to both **classification** and **regression** tasks. The algorithm classifies or predicts a data point based on how closely it resembles its neighbors. k-NN has no explicit training phase; instead, it stores the entire dataset and, at prediction time, finds the **k nearest neighbors** of a given input and uses their majority class (for classification) or average value (for regression) as the prediction.

### Characteristics:
- **Instance-Based Learning**:
k-NN is a **lazy learner**, meaning it stores all training instances and delays computation until prediction time.

- **Non-Parametric**:
It makes no assumptions about the underlying data distribution, making it highly flexible but sensitive to noisy data.

- **Distance-Based**:
The algorithm relies on a **distance metric** (e.g., Euclidean distance) to measure how close or similar the data points are.

### How k-NN Works:
1. **Data Collection**:
k-NN requires a labeled dataset of examples, where each example consists of feature values and a corresponding label (for classification) or continuous target (for regression).

2. **Prediction**:
To predict the label or value for a new, unseen data point:
- Measure the distance between the new point and all points in the training set using a distance metric.
- Select the **k nearest neighbors** based on the shortest distances.
- For classification, assign the most frequent class (majority voting) among the k neighbors. For regression, calculate the average of the k neighbors’ values.

### Distance Metrics:
The most commonly used distance metrics in k-NN are:
- **Euclidean Distance** (default for continuous variables):

  $d(p, q) = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2}$

- **Manhattan Distance**:

  $d(p, q) = \sum_{i=1}^{d} |p_i - q_i|$

- **Minkowski Distance** (generalization of Euclidean and Manhattan):

  $d(p, q) = \left( \sum_{i=1}^{d} |p_i - q_i|^m \right)^{1/m}$

- **Hamming Distance** (used for categorical variables):
  Measures the number of positions at which two binary strings differ.
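
A minimal sketch of these metrics in NumPy (the helper names are just for illustration):

```python
import numpy as np

def euclidean(p, q):
    # Square root of the summed squared coordinate differences
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(p - q))

def minkowski(p, q, m=3):
    # Generalizes Manhattan (m=1) and Euclidean (m=2)
    return np.sum(np.abs(p - q) ** m) ** (1 / m)

def hamming(p, q):
    # Number of positions at which the two vectors differ
    return int(np.sum(p != q))

a, b = np.array([150, 8]), np.array([160, 7])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 2))
```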

### Choosing k:
- **Small k**:
A smaller k (e.g., k=1 or k=3) makes the model sensitive to noise and can lead to **overfitting** because it considers fewer neighbors.

- **Large k**:
A larger k provides a more generalized prediction but may lead to **underfitting** if it includes too many neighbors from different classes.

- **Optimal k**:
The ideal value of k is often found through experimentation or by using techniques like **cross-validation**.
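
A hedged sketch of that tuning step using scikit-learn's `cross_val_score` (the candidate values of k and the Iris dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare a few candidate values of k with 5-fold cross-validation
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")
```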

### Classification with k-NN:
In classification tasks, k-NN assigns the class label based on the majority class among the k-nearest neighbors. Each neighbor votes for its class, and the most frequent class becomes the prediction.

**Example**:
Suppose we are classifying an unknown data point based on three nearest neighbors (k=3), and the classes of these neighbors are:
- Neighbor 1: Class A
- Neighbor 2: Class A
- Neighbor 3: Class B

Since Class A occurs more frequently, the new point is assigned to **Class A**.

### Regression with k-NN:
In regression tasks, k-NN predicts the target value based on the **average** of the values of its k nearest neighbors.

**Example**:
To predict the price of a house, k-NN will find k houses with similar features (square footage, number of rooms) and return the average price of those k houses as the predicted price.
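
A small sketch of that idea with scikit-learn's `KNeighborsRegressor` (the feature values below are invented for illustration):

```python
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Hypothetical houses: (square footage, number of rooms) and their prices
X = np.array([[1500, 3], [1700, 3], [1800, 4], [1900, 4]])
y = np.array([200000, 230000, 250000, 270000])

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)

# The prediction is the average price of the 3 most similar houses
print(reg.predict([[1750, 3]]))
```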

### Steps in k-NN Algorithm:
1. **Data Storage**:
k-NN stores the entire dataset of training examples.

2. **Distance Calculation**:
For a new input data point, compute the distance between the input and every point in the training set using a chosen distance metric.

3. **Identify Neighbors**:
Sort the distances and identify the k-nearest neighbors.

4. **Prediction**:
- For classification, count the occurrences of each class among the k neighbors and assign the class with the majority votes.
- For regression, take the average of the k neighbors' target values.
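
These four steps can be sketched directly in NumPy without a library classifier (a simplified illustration, not an optimized implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every stored point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[150, 8], [170, 7], [130, 6], [140, 5]])
y_train = np.array(['Apple', 'Apple', 'Orange', 'Orange'])
print(knn_predict(X_train, y_train, np.array([160, 7]), k=3))
```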

### Problem Statement:
Given a dataset with labeled examples (for classification) or continuous targets (for regression), the goal is to classify or predict new input data points by finding the k most similar data points in the training set and using them to infer the output.

### Key Concepts:
- **Lazy Learning**:
k-NN is called a lazy learner because it doesn’t explicitly learn a model during training but simply stores the training dataset.

- **Similarity**:
The similarity between data points is quantified by calculating the distance between their feature vectors.

- **Majority Voting**:
For classification, the class of a new data point is determined by the majority class among its k nearest neighbors.

- **Averaging**:
For regression, the predicted value is the average of the k nearest neighbors' target values.

### Time Complexity:
- **Training Time Complexity: $O(1)$**
k-NN doesn’t require a training phase, so it takes constant time.

- **Prediction Time Complexity: $O(n \cdot d)$**
Predicting the class or value for a new data point requires computing the distance between the new point and all n training points, each of which has d dimensions.

### Space Complexity:
- **Space Complexity: $O(n \cdot d)$**
The algorithm stores the entire dataset, which consists of n points in d dimensions.

### Example:
Consider a simple k-NN classification example for predicting whether a fruit is an apple or an orange based on its features (weight and color):

- Dataset:

  | Weight (g) | Color (scale 1-10) | Fruit  |
  |------------|--------------------|--------|
  | 150        | 8                  | Apple  |
  | 170        | 7                  | Apple  |
  | 130        | 6                  | Orange |
  | 140        | 5                  | Orange |

**Step-by-Step Execution**:

1. **Store Dataset**:
Store the dataset as-is.

2. **Calculate Distances**:
For a new fruit with a weight of 160g and color value 7, compute the distance from this point to all existing data points.

3. **Find k Nearest Neighbors**:
If k=3, identify the 3 closest fruits to the new one based on the shortest distances.

4. **Make Prediction**:
Count the class occurrences (Apple or Orange) among the 3 nearest neighbors and assign the new fruit the most frequent class.

### Python Implementation:
Here’s a simple implementation of the k-NN algorithm using **scikit-learn**:

```python
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Sample data
X = np.array([[150, 8], [170, 7], [130, 6], [140, 5]]) # Features (Weight, Color)
y = np.array(['Apple', 'Apple', 'Orange', 'Orange']) # Target (Fruit)

# Create k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X, y)

# Make a prediction for a new fruit
new_fruit = np.array([[160, 7]]) # New fruit with weight 160g and color 7
predicted_fruit = knn.predict(new_fruit)
print(f"The predicted fruit is: {predicted_fruit[0]}")
```

### Summary:
The **k-Nearest Neighbors Algorithm** is a straightforward and intuitive method for both classification and regression tasks. It works by finding the k most similar examples in the training dataset and using them to predict the class or value of a new data point. While easy to implement, k-NN can be computationally expensive, especially on large datasets, as it requires calculating the distance to every training point at prediction time.

---
167 changes: 167 additions & 0 deletions MachineLearning/LinearRegression.md
@@ -0,0 +1,167 @@
---
id: linear-regression
title: Linear Regression Algorithm
sidebar_label: Linear Regression
description: "In this post, we'll explore the Linear Regression Algorithm, one of the most basic and commonly used algorithms in machine learning."
tags: [machine learning, algorithms, linear regression, regression]

---

### Definition:
**Linear Regression** is a supervised learning algorithm used for **predictive modeling** of continuous numerical variables. It establishes a linear relationship between a dependent variable (the target) and one or more independent variables (the features). The goal of linear regression is to model this relationship using a straight line (linear equation) to predict the target values based on input features.

### Characteristics:
- **Regression Model**:
Unlike classification, linear regression predicts **continuous** values rather than categories.

- **Linear Relationship**:
Assumes a linear relationship between the features and the target variable, where changes in feature values proportionally affect the target.

- **Minimizing Error**:
The model minimizes the difference between the actual values and predicted values using a method called **Ordinary Least Squares** (OLS).

### Types of Linear Regression:
1. **Simple Linear Regression**:
Used when there is one independent variable to predict the target.
Example: Predicting house price based solely on square footage.

2. **Multiple Linear Regression**:
Used when there are two or more independent variables to predict the target.
Example: Predicting house price based on square footage, number of rooms, and age of the house.
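
A brief sketch of the multiple-regression case with scikit-learn (the feature values are invented for illustration):

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Hypothetical houses: (square footage, number of rooms, age in years)
X = np.array([[1500, 3, 20], [1700, 3, 15], [1800, 4, 10], [1900, 4, 5]])
y = np.array([200000, 230000, 250000, 270000])

model = LinearRegression().fit(X, y)

# One learned coefficient per feature, plus an intercept
print(model.coef_, model.intercept_)
print(model.predict([[2000, 4, 8]]))
```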

### Linear Equation:
In linear regression, the relationship between the target \( y \) and the input features \( X \) is modeled using the equation of a straight line:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

Where:
- \( y \) is the dependent variable (target).
- \( x_1, x_2, \dots, x_p \) are the independent variables (features).
- \( \beta_0 \) is the intercept (the value of \( y \) when all features are zero).
- \( \beta_1, \dots, \beta_p \) are the coefficients (slopes), representing the change in \( y \) for a unit change in the corresponding feature.


### How Linear Regression Works:
1. **Data Collection**:
Gather a dataset with one or more features (independent variables) and the corresponding target variable (dependent variable).

2. **Model Training**:
The algorithm attempts to find the best-fitting line by optimizing the parameters (intercept and slopes). This is achieved using **Ordinary Least Squares** (OLS), which minimizes the sum of squared residuals (the differences between actual and predicted values).

3. **Making Predictions**:
Once trained, the model can predict the target value \( y \) for new data points by applying the learned linear equation.

4. **Residuals**:
The residual is the difference between the actual and predicted value:
$e_i = y_i - \hat{y}_i$

The goal is to minimize these residuals.
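
As a sketch of the OLS idea, the closed-form solution can be computed directly with the normal equation in NumPy (scikit-learn handles this internally; the data mirrors the square-footage example further below):

```python
import numpy as np

# One-feature dataset: square footage vs. price
X = np.array([[1500.0], [1700.0], [1800.0], [1900.0]])
y = np.array([200000.0, 230000.0, 250000.0, 270000.0])

# Prepend a column of ones so the intercept is learned as a coefficient
X_b = np.c_[np.ones((X.shape[0], 1)), X]

# Normal equation: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
intercept, slope = beta
print(f"Intercept: {intercept:.2f}, Slope: {slope:.2f}")
```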

### Problem Statement:
Given a dataset with independent variables (features), the objective is to learn a linear regression model that can predict the target variable based on new input values.

### Key Concepts:
- **Slope (Coefficient)**:
The slope represents how much the target variable changes when the corresponding feature increases by one unit. In multiple regression, each feature has its own slope.

- **Intercept**:
The intercept is the predicted value of the target when all feature values are zero.

- **Best-Fit Line**:
Linear regression aims to find the line (or hyperplane for multiple regression) that best fits the data, meaning it minimizes the overall distance between the data points and the line.

### Loss Function:
The loss function used in linear regression is the **Mean Squared Error (MSE)**, which calculates the average of the squared differences between the actual and predicted values:

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Where:
- \( y_i \) is the actual target value of the \( i \)-th data point.
- \( \hat{y}_i \) is the predicted target value.
- \( n \) is the total number of samples.
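
A quick numerical illustration of the MSE (the predicted values below are made up):

```python
import numpy as np

y_true = np.array([200000, 230000, 250000, 270000])
y_pred = np.array([205000, 228000, 252000, 265000])

# Average of the squared differences between actual and predicted values
mse = np.mean((y_true - y_pred) ** 2)
print(mse)
```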

### Gradient Descent (Alternative Training Method):
Another approach to finding the best-fit line is **gradient descent**, which iteratively updates the model parameters by moving in the direction of the steepest decrease in the loss function.

- **Update rule** for each parameter:
$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$

Where:
- \( \alpha \) is the learning rate (controls step size).
- \( J(\theta) \) is the loss function (MSE).

The parameters are updated in each iteration to reduce the error.
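
A minimal gradient-descent sketch for simple linear regression (the feature scaling, learning rate, and iteration count are arbitrary choices made so the loop behaves well):

```python
import numpy as np

# One-feature data, rescaled to thousands so the gradient steps stay stable
X = np.array([1.5, 1.7, 1.8, 1.9])          # square footage (thousands)
y = np.array([200.0, 230.0, 250.0, 270.0])  # price (thousands of dollars)

b0, b1 = 0.0, 0.0   # intercept and slope, initialized at zero
alpha = 0.1         # learning rate
for _ in range(10000):
    error = (b0 + b1 * X) - y
    # Gradients of the MSE with respect to the intercept and slope
    b0 -= alpha * 2 * error.mean()
    b1 -= alpha * 2 * (error * X).mean()

print(f"Intercept: {b0:.2f}, Slope: {b1:.2f}")
```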

### Time Complexity:
- **Best, Average, and Worst Case: $O(n \cdot p)$**
  The training cost grows linearly with the number of samples \( n \) and the number of features \( p \) (per pass over the data), making linear regression efficient for large datasets.

### Space Complexity:
- **Space Complexity: $O(p)$**
The space complexity is proportional to the number of features \( p \), as the model stores one coefficient per feature, plus the intercept.

### Example:
Consider a dataset for predicting the price of a house based on **square footage**:

- Dataset:

  | Square Footage | Price ($) |
  |----------------|-----------|
  | 1500           | 200,000   |
  | 1700           | 230,000   |
  | 1800           | 250,000   |
  | 1900           | 270,000   |

Step-by-Step Execution:

1. **Fit the model**:
Linear regression will learn the relationship between **square footage** (independent variable) and **price** (dependent variable).

2. **Equation**:
The model will output an equation like:
$\text{Price} = \beta_0 + \beta_1 \times \text{Square Footage}$

For this dataset, the fitted line works out to roughly $\text{Price} \approx 174.29 \times \text{Square Footage} - 63{,}143$.


3. **Predict price**:
For a new house with 2000 square feet, the model will predict the price using the equation.

### Python Implementation:
Here is a basic implementation of Linear Regression in Python using **scikit-learn**:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1500], [1700], [1800], [1900]]) # Features (Square Footage)
y = np.array([200000, 230000, 250000, 270000]) # Target (Price)

# Create linear regression model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Make predictions
predicted_price = model.predict([[2000]]) # Predict price for 2000 square footage
print(f"Predicted price: ${predicted_price[0]:,.2f}")

# Display the model's coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_[0]}")
```

### Summary:
The **Linear Regression Algorithm** is one of the most fundamental techniques for predicting continuous outcomes. Its simplicity and interpretability make it a powerful tool for many real-world applications, particularly in finance, economics, and engineering. However, it assumes a linear relationship between variables and may not work well for datasets with non-linear patterns.

---