-
Managed Jupyter notebooks
-
Great for use with a Dataproc cluster to write PySpark jobs
-
Powerful interactive tool to explore, analyze, transform and visualize data and build machine learning models on GCP.
-
Supports Python, SQL (BQ), and JavaScript (for BQ user-defined functions)
- In Cell
- %%bq query –name queryname
- SQL underneath
- %%bq execute -q queryname
- %%bq query –name queryname
- In Cell
-
Runs on GCE instance, dedicated VPC and Cloud Source Repository
- datalab create
- datalab-network (VPC) is created
- datalab connect
- Cloud Source Repository
- Used for sharing notebook between users
- Locally
- Good if only one person using
- Docker on GCE
- Better
- Use by multiple people through SSH or CloudShell
- Uses resources on GCE
- Docker + Gateway
- Best
- Uses a gateway and proxy
- Runs locally
- Can be in Cloud Storage Repository (git repo)
- Use ungit to commit changes to notebooks
- Notebooks can be cloned from GCS to VM persistent disk.
- This clone => workspace => add/remove/modify files
- Notebooks autosave, but you need to commit.
- Opening a notebook => Backend kernel process manages session and variables.
- Each notebook has 1 python kernel
- Kernels are single-threaded
- Memory usage is heavy – execution is slow – pick machine type accordingly
- Enable Compute Engine API
- GCE access based on GCE IAM roles:
- Must have Compute Instance Admin and Service Account Actor roles to connect to datalab instance.
- Service Account Actor role deprecated. Use Service Account Token Creator instead.
- Must have Compute Instance Admin and Service Account Actor roles to connect to datalab instance.
- Notebook access per user only
- Sharing data performed via shared Cloud Source Repository
- Sharing is at the project level.
- 2 Options
- Team lead creates notebooks for users using –for user option:
- datalab create [instance] –for-user [email protected]
- Each user creates their own datalab instance/notebook
- Team lead creates notebooks for users using –for user option:
- Everyone accesses same shared repository of datalab/notebooks
- NO web console option
- Machine Type
- Standard n1 by default
- Multi-threading does not work, but can use high memory
- Custom machine types supported as well.
- Can disable creating shared cloud repository
- --no-create-repository
- SSH tunnels to notebook on port 8081
- datalab connect
- RSA key is passphrase
- Can configure idle timeouts on the actual webpage
- Free
- Only pay for GCE resources Datalab runs on and other GCP services you interact with.