This repo covers the basic concepts of Apache PySpark.
- Create a new virtual environment before starting the session: "python -m venv myenv"
- Activate the virtual environment: "source <path_location>/Scripts/activate" (Windows, for example from Git Bash) or "source <path_location>/bin/activate" (Linux/macOS)
The basic concepts covered in these files are:
- PySpark DataFrame
- Reading the dataset
- Checking the datatypes (Schemas)
- Selecting columns
- Checking summary statistics with describe()
- Adding columns
- Renaming columns
- Dropping rows and columns
- Various parameters of the drop functions
- Handling missing values (mean, median, mode)
- Filter operations
- Group by and Aggregate functions