The Research and Development Team at GitHub wants to know what time of day they get the most traffic and which resources are not popular enough. They will send these details to the Marketing Team.
- You have been hired to give insights on GitHub developer activity for June 2022.
- Here are some visualizations you need to produce:
- Traffic per hour
- Events popularity chart (a PySpark sketch of these two aggregations appears just after this list).
GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
- Create a dashboard
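To make the two required charts concrete, here is a minimal PySpark sketch of the aggregations behind them (the stack described below runs Spark on EMR). It is illustrative only, not the project's actual job: the bucket, the paths, and the choice to write Parquet back to S3 are assumptions.

```python
# Illustrative sketch only: aggregations for "traffic per hour" and "events popularity",
# assuming the raw GH Archive JSON has already been staged in S3.
# The bucket name and output paths are placeholders, not the project's real ones.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gharchive-insights").getOrCreate()

# Each GH Archive record is one public GitHub event with `type` and `created_at` fields.
events = spark.read.json("s3://<your-bucket>/raw/")

# Traffic per hour: count events per hour of the day.
traffic_per_hour = (
    events
    .withColumn("hour", F.hour(F.to_timestamp("created_at")))
    .groupBy("hour")
    .count()
    .orderBy("hour")
)

# Events popularity: count events per event type (PushEvent, WatchEvent, ...).
events_popularity = events.groupBy("type").count().orderBy(F.desc("count"))

# In the pipeline these results end up in Redshift; writing intermediate output
# back to S3 is one common way to hand them over.
traffic_per_hour.write.mode("overwrite").parquet("s3://<your-bucket>/output/traffic_per_hour/")
events_popularity.write.mode("overwrite").parquet("s3://<your-bucket>/output/events_popularity/")
```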
The pipeline could be streaming or batch: this is the first thing you'll need to decide.
- If you want to run things periodically (e.g. hourly/daily), go with batch
- Containerisation: Docker
- Cloud: AWS
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Airflow
- Data Warehouse: Redshift
- Batch processing: EMR
- Visualisation: Google Data Studio
- GitHub Archive data is ingested daily into an AWS S3 bucket, starting from the 1st of May (a rough orchestration sketch of this daily flow follows this list).
- A Spark job is run on the data stored in the S3 bucket using AWS Elastic MapReduce (EMR).
- The results are written to a table defined in Redshift.
- A dashboard is created from the Redshift tables.
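Below is a rough orchestration sketch of that daily flow. It is not the DAG shipped in batch_gh_archive: only the DAG name, the daily schedule, and the May 1 start date come from this README; the task breakdown, the bucket name, and the use of BashOperator (instead of the AWS provider operators the real DAG may use) are assumptions.

```python
# Hedged sketch of a daily batch DAG for this pipeline; not the project's actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="Batch_Github_Archives",   # DAG name shown in the Airflow UI (per this README)
    start_date=datetime(2022, 5, 1),  # backfills daily runs from May 1, 12:00 am UTC
    schedule_interval="@daily",
    catchup=True,
) as dag:
    # Pull one hourly archive for the run date and stage it in S3.
    # The real pipeline loops over all 24 hourly files; the bucket is a placeholder.
    ingest_to_s3 = BashOperator(
        task_id="ingest_to_s3",
        bash_command=(
            "wget -q https://data.gharchive.org/{{ ds }}-0.json.gz -O /tmp/{{ ds }}-0.json.gz "
            "&& aws s3 cp /tmp/{{ ds }}-0.json.gz s3://<your-bucket>/raw/"
        ),
    )

    # Placeholder for the EMR Spark step and the Redshift load; the actual DAG
    # would wire these up with AWS operators or boto3 calls.
    transform_and_load = BashOperator(
        task_id="emr_spark_and_redshift_load",
        bash_command="echo 'submit EMR step, then COPY results into Redshift'",
    )

    ingest_to_s3 >> transform_and_load
```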
- Create an AWS account if you do not have one. AWS offers a free tier for some services like S3 and Redshift.
- Open the IAM console here
- In the navigation pane, choose Users and then choose Add users. More information here
- Select Programmatic access. For Console password, choose Custom password.
- On the Set permissions page, attach the AdministratorAccess policy.
- Download the credentials.csv file with the login information and store it in ${HOME}/.aws/credentials.csv
Terraform is used to set up most of the services used for this project, i.e. the S3 bucket and the Redshift cluster. This section contains steps to set up these aspects of the project.
You can use any virtual machine of your choice (Azure, GCP, etc.), but AWS EC2 is preferable because of faster upload and download speeds to AWS services. To set up an AWS EC2 VM that works for this project, you will need to pay for it. Here is a link to help. Ubuntu is the preferred OS.
To download and set up the AWS CLI:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
- To configure AWS credentials, run
$ aws configure
AWS Access Key ID [None]: fill with value from credentials.csv
AWS Secret Access Key [None]: fill with value from credentials.csv
Default region name [None]: your region
Default output format [None]: json
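To sanity-check the configuration you can run `aws sts get-caller-identity`. Alternatively, if you happen to have Python and boto3 on the machine (an assumption, boto3 is not installed by the steps above), a quick check looks like this:

```python
# Quick sanity check that the AWS credentials configured above are picked up.
# Assumes boto3 is installed (pip install boto3); it reads the ~/.aws credentials by default.
import boto3

identity = boto3.client("sts").get_caller_identity()
print("Authenticated as:", identity["Arn"])
```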
- Connect to your VM
- Install Docker
sudo apt-get update
sudo apt-get install docker.io
- Docker needs to be configured so that it can run without sudo
sudo groupadd docker
sudo gpasswd -a $USER docker
sudo service docker restart
- Logout of your SSH session and log back in
- Test that docker works successfully by running
docker run hello-world
- Check and copy the latest release for Linux from the official GitHub repository
- Create a folder called bin/ in the home directory. Navigate into the bin/ directory and download the binary file there
wget <copied-file> -O docker-compose
- Make the file executable
chmod +x docker-compose
- Add the bin/ directory to PATH permanently
- Open the .bashrc file in the HOME directory
nano .bashrc
- Go to the end of the file and paste this there
export PATH="${HOME}/bin:${PATH}"
- Save the file (CTRL-O) and exit nano (CTRL-X)
- Reload the PATH variable
source .bashrc
- You should be able to run docker-compose from anywhere now. Test this with
docker-compose --version
- Navigate to the bin/ directory that you created and run this
wget https://releases.hashicorp.com/terraform/1.1.7/terraform_1.1.7_linux_amd64.zip
- Unzip the file
unzip terraform_1.1.7_linux_amd64.zip
You might have to install unzip
sudo apt-get install unzip
- Remove the zip file
rm terraform_1.1.7_linux_amd64.zip
- Terraform is now installed. Test it with
terraform -v
To work with folders on a remote machine in Visual Studio Code, you need the Remote-SSH extension. This extension also simplifies the forwarding of ports.
- Install the Remote-SSH extension from the Extensions Marketplace
- At the bottom left-hand corner, click the Open a Remote Window icon
- Click Connect to Host, then click the name of the host from your SSH config file.
- In the Explorer tab, open any folder on your virtual machine. Now you can use VS Code entirely to run this project.
git clone https://github.com/Nerdward/batch_gh_archive
We use Terraform to create an S3 bucket and a Redshift cluster.
- Navigate to the terraform folder. Set the username and password for your Redshift cluster using this
# Set secrets via environment variables
export TF_VAR_username=(the username)
export TF_VAR_password=(the password)
- Initialise terraform
terraform init
- Check the infrastructure plan
terraform plan
- Create the new infrastructure
terraform apply
- Confirm that the infrastructure has been created on the AWS console.
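Besides clicking through the console, you can script a quick confirmation. This is only a sketch, assuming boto3 and the credentials configured earlier; the resources listed will depend on what your Terraform variables created.

```python
# Rough confirmation that Terraform created the S3 bucket and the Redshift cluster.
# Assumes boto3 and the AWS credentials configured earlier in this README.
import boto3

s3 = boto3.client("s3")
print("Buckets:", [b["Name"] for b in s3.list_buckets()["Buckets"]])

redshift = boto3.client("redshift")
for cluster in redshift.describe_clusters()["Clusters"]:
    print(cluster["ClusterIdentifier"], "->", cluster["ClusterStatus"])
```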
Airflow is run in a Docker container. This section contains steps for initialising Airflow resources.
- Navigate to the airflow folder
- Create a logs folder airflow/logs/
mkdir logs/
- Build the docker image
docker-compose build
- The names of some project resources are hardcoded in the docker-compose.yaml file. Change these values to suit your use case
- Initialise Airflow resources
docker-compose up airflow-init
- Kick up all other services
docker-compose up
- Open another terminal instance and check the running Docker services
docker ps
- Check if all the services are healthy
- Forward port 8080 from VS Code. Open localhost:8080 in your browser and sign in to Airflow. Both the username and password are airflow
You are already signed in to Airflow, so now it's time to run the pipeline.
- Click on the DAG Batch_Github_Archives that you see there
- You should see a tree-like structure of the DAG you're about to run
- At the top right-hand corner, trigger the DAG. Make sure Auto-refresh is turned on before doing this
The DAG will run from May 1 at 12:00 am UTC till May 7. This should take a while.
- While this is going on, check the AWS console to confirm that everything is working accordingly
If you face any problems or errors, confirm that you have followed all the above instructions religiously. If the problems still persist, raise an issue.
- When the pipeline is finished and you've confirmed that everything went well, shut down docker-compose with CTRL-C and kill all containers with
docker-compose down
- Take a well-deserved break. This has been a long ride.
Here are a few things I could still do to improve this project:
- Add tests
- Use make
- Add CI/CD pipeline
Some links to refer to: