First, fork the nira-interview repo into your own repo.
- Install Pyenv to switch between different Python versions.
https://github.com/pyenv/pyenv, Windows: https://github.com/pyenv-win/pyenv-win
- Install the Python version specified in
/nira-interview/.python-version
using Pyenv.
pyenv install 3.9.6
- Navigate into the /nira-interview directory and double-check that you've set up Pyenv correctly.
When you run
python --version
, the version should match the version specified in
/nira-interview/.python-version
, which is 3.9.6.
Pyenv works by reading the .python-version file and automatically switching to the right Python version.
- Install Poetry: https://python-poetry.org/docs/
- Configure Poetry to create .venv folders in the project
poetry config virtualenvs.in-project true
- Navigate into the pipeline folder
cd /nira-interview/pipeline
- Install dependencies
poetry install
- Activate the virtual environment.
poetry shell
- Double check that the right version of python is being used in the virtual environment.
python --version
- Make sure that the dependencies were installed.
poetry show
- Spin up dagit.
poetry run dagit
- Navigate to
localhost:3000
. You should see Dagster running there.
- In the jobs pane on the left, click the "nira_smoke_test_job" job. Click "Launchpad" and then "Launch run". You should see the job print "Successfully ran smoketest".
- Specify the Python interpreter in VS Code. Run the "Python: Select Interpreter" command in VS Code and enter your own path, which should be
./pipeline/.venv/bin/python
- If you've made it this far, you should be ready to go.
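As an optional sanity check for the version steps above, the comparison between the running interpreter and the pinned version can be scripted. This is a minimal sketch, not part of the repo: the helper names `read_pinned_version` and `interpreter_matches` are made up for illustration, and it assumes the first non-empty line of .python-version holds the version string.

```python
import platform
from pathlib import Path


def read_pinned_version(version_file: Path) -> str:
    """Return the version string stored in a .python-version file."""
    # Take the first non-empty line, ignoring surrounding whitespace.
    for line in version_file.read_text().splitlines():
        if line.strip():
            return line.strip()
    raise ValueError(f"{version_file} is empty")


def interpreter_matches(version_file: Path) -> bool:
    """Compare the running interpreter's version with the pinned one."""
    return platform.python_version() == read_pinned_version(version_file)


if __name__ == "__main__":
    # Assumption: this is run from inside /nira-interview.
    version_file = Path(".python-version")
    if version_file.exists():
        print("pinned: ", read_pinned_version(version_file))
        print("running:", platform.python_version())
        print("match:  ", interpreter_matches(version_file))
    else:
        print("No .python-version file found in the current directory.")
```

Running it once from /nira-interview and once inside the Poetry shell in the pipeline folder should report a match in both places if the setup worked.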
Dagster is an open source tool we use to orchestrate our pipelines. You can learn more about Dagster at dagster.io. They're an awesome company.
A Dagster job is essentially a list of steps that make up a pipeline. Each step is called an op. If you open smoke_test_job.py
, you'll see the nira_smoke_test_job
Python function, which is decorated with @job.
The job is made from a series of calls to ops. The two ops are also defined there. The output of smoke_test_op1
is passed into smoke_test_op2
.
It's that easy: Dagster jobs are constructed from ops.
Your task is to edit the interview_job
defined in interview_job.py
. First, let's see what's going on inside interview_job.
- First, we read in a raw CSV of the buses we need to run the pipeline on in
raw_buses_to_run
.
- Then we calculate the MW available for each bus in
get_mw_available_for_each_bus_very_slow
. You can see in the code that calculating this takes 5 minutes per bus! Super slow.
- Then we convert MW to GW in
add_gw_available_column
.
- Lastly, we write the final DF to disk in
output_interview_job
.
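The data flow through these four steps can be sketched in plain Python, without Dagster. This is only an illustration: the column names `bus_id`, `mw_available`, and `gw_available` are assumptions, and the MW formula is a fast placeholder standing in for the real slow calculation. The one firm fact is the unit conversion, 1 GW = 1000 MW.

```python
import csv
import io


def read_buses(csv_text: str) -> list:
    """Step 1: read the raw CSV of buses (assumed 'bus_id' column)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def add_mw_available(rows: list) -> list:
    """Step 2: stand-in for the slow per-bus calculation."""
    # Placeholder formula, deterministic per bus: 100 MW per character of the ID.
    return [{**row, "mw_available": 100.0 * len(row["bus_id"])} for row in rows]


def add_gw_available(rows: list) -> list:
    """Step 3: convert MW to GW (1 GW = 1000 MW)."""
    return [{**row, "gw_available": row["mw_available"] / 1000.0} for row in rows]


def write_output(rows: list) -> str:
    """Step 4: serialize the final rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["bus_id", "mw_available", "gw_available"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    raw = "bus_id\nA1\nB22\n"
    print(write_output(add_gw_available(add_mw_available(read_buses(raw)))))
```

Each function takes the previous step's output as its input, which mirrors how the output of one op is passed into the next inside a job.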
This pipeline has already been run and has results inside of pipeline/interview_job/output
. This pipeline has been run the slow way with the initial set of buses.
Sometimes, we have a new bus we need to run as well. But we don't want to rerun all the buses because that's too slow.
Your task is to figure out how to construct this pipeline so that we don't have to rerun all the buses, only the new ones, while still outputting one single CSV to disk.
A few constraints:
- You can tweak
get_mw_available_for_each_bus_very_slow
for testing purposes, but you are not allowed to change the code inside this file in the final submission. Don't get clever and just decrease the sleep() call to one second.
- We are only ever adding new buses; you do not need to worry about buses being removed.
- For any given bus, the values calculated in
get_mw_available_for_each_bus_very_slow
will always be exactly the same (you can see this in the code).
Final deliverable:
- Send over a link to the forked repo you made the modifications in.
- Inside
raw_buses_to_run.py
, comment out line 4 and uncomment line 5. This will switch the raw buses CSV to a new CSV. You can go look in the CSVs; the only difference is one additional bus in the new one. Remember, there will only ever be bus additions in the new CSV.
- There should be only one new file inside the /output folder, containing the results for all the buses defined in
new_raw_buses.csv
. You should delete the original CSV in the output folder that the repo started with. There should never be two CSVs in the output folder.
- Any new ops you need should be added to
interview_job.py
and also be implemented in their own file in the /ops folder.
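Before sending the link, the "never two CSVs in the output folder" requirement can be checked with a few lines of Python. This is a small sketch rather than anything the repo ships: `find_single_output_csv` is a hypothetical helper, and the output path is the one mentioned earlier in this document, assumed to be relative to the repo root.

```python
from pathlib import Path


def find_single_output_csv(output_dir: Path) -> Path:
    """Return the only CSV in output_dir, or raise if there is not exactly one."""
    csv_files = sorted(output_dir.glob("*.csv"))
    if len(csv_files) != 1:
        names = [p.name for p in csv_files]
        raise RuntimeError(
            f"expected exactly 1 CSV in {output_dir}, found {len(csv_files)}: {names}"
        )
    return csv_files[0]


if __name__ == "__main__":
    # Path taken from the text above; adjust if you run from another directory.
    output_dir = Path("pipeline/interview_job/output")
    if output_dir.is_dir():
        print(find_single_output_csv(output_dir))
    else:
        print(f"{output_dir} not found; run this from the repo root.")
```

It only checks the file count; confirming that the file covers every bus in new_raw_buses.csv still needs a look at the contents.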