In Project 2, students have the option of using the pyspark command line (as we did in the synchronous sessions) or of running Jupyter Notebook against a pyspark kernel.

I'm providing instructions here for how to make the modifications necessary to run Jupyter Notebook against a pyspark kernel.

## Step 1 - Check and, if necessary, modify your docker-compose.yml file

Check your docker-compose.yml file to make sure that the spark container has an expose section and a ports section with the following entries:

```yml
expose:
  - "8888"
ports:
  - "8888:8888"
```
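For reference, here is a sketch of how the spark service might look with these sections in place. The image and volumes lines are placeholders for whatever is already in your file; only the expose and ports sections matter here:

```yml
spark:
  image: midsw205/spark-python:0.0.5   # placeholder - keep your existing image line
  stdin_open: true
  tty: true
  volumes:
    - ~/w205:/w205                     # placeholder - keep your existing volumes
  expose:
    - "8888"
  ports:
    - "8888:8888"
```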
Also, if you have a cloudera container, verify that it does NOT have the above entries (they will cause a conflict on port 8888).
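If 8888 was the only entry in the cloudera container's expose and ports sections, the end result would look something like this (a sketch; any other entries in those sections should stay):

```yml
cloudera:
  # expose:
  #   - "8888"
  # ports:
  #   - "8888:8888"
```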
If you need to make changes to your docker-compose.yml file, you will need to first tear down your cluster:

```
docker-compose down
```

If necessary, add the sections to the spark container and comment out the sections in the cloudera container.

Then bring the cluster back up:

```
docker-compose up -d
```

## Step 2 - Create a symbolic link in the Spark container to the /w205 mount point

Jupyter Notebook will start in the spark container's default directory, which is not /w205, so the w205 directory will not be listed. A quick way to remedy this is to create a symbolic link from that directory to the /w205 mount point.

First exec a bash shell into the spark container:

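A minimal sketch of the commands, assuming the symbolic link should be created in the directory where the shell lands (which is where Jupyter Notebook will start; adjust if your setup differs):

```
docker-compose exec spark bash
```

Then, inside the container, create the symbolic link (the name w205 is what will show up in Jupyter's file listing):

```
ln -s /w205 w205
```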
Exit the container:

```
exit
```

## Step 3 - Run an enhanced version of the pyspark command line to target Jupyter Notebook

Instead of starting a pyspark command line, use the following command to start a Jupyter Notebook against a pyspark kernel. In this command we set the ip address to 0.0.0.0:

Multi-line for readability:
```
docker-compose exec spark \
env \
PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 8888 --ip 0.0.0.0 --allow-root' \
pyspark
```

For convenience, here is the command above on one line (remember to leave the ip address as 0.0.0.0):
```
docker-compose exec spark env PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 8888 --ip 0.0.0.0 --allow-root' pyspark
```

You will get the usual URL for Jupyter Notebook with 0.0.0.0 for the host / ip address. Copy this and paste it into a notepad or similar.

Since we will be connecting from a Chrome browser on your laptop / desktop, in the notepad you will need to change 0.0.0.0 to the external IP address of your Google Cloud virtual machine. Be sure to use the External IP (not the Internal IP), and remember that it changes every time you stop and start the virtual machine.
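For example (the IP address and token here are made up for illustration), a URL printed as:

```
http://0.0.0.0:8888/?token=1a2b3c4d5e6f
```

would be edited to:

```
http://35.230.10.20:8888/?token=1a2b3c4d5e6f
```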

If you have been running a Jupyter Notebook against another source, there will be cookie conflicts between them. The solution is to use a new incognito window in the Google Chrome browser.

Open a new Google Chrome incognito window, and copy and paste the URL with the modified ip address from your notepad; the Jupyter Notebook should come up.

## Troubleshooting Suggestions

Make sure you are using an incognito window in the Google Chrome browser.
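If the browser cannot connect at all, it is worth verifying that the spark container is actually mapping port 8888. Run the following from the directory containing your docker-compose.yml; the spark entry should show 0.0.0.0:8888->8888/tcp in its Ports column:

```
docker-compose ps
```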