Exercise 1: Data Engineering/Data Factory experience - Data ingestion from a spectrum of analytical data sources into OneLake
Before we start executing the steps, we will open a backup Click-by-Click lab using the following hyperlink in a new tab and navigate back to the VM browser:
Now, let's trigger the Simulator App to start streaming data to EventHub (to be used later in exercise 4).
Open the new tab in the browser and copy paste the below URL to verify app service streaming data.
Wait for the page to load. You will see a page like the one shown below.
In this exercise, you will act as the Data Engineer, Eva, to transfer Contoso's data from Azure SQL Database into the Lakehouse and initiate data preparation for the upcoming merger between Contoso and Litware Inc.
- Open Microsoft Fabric in a new tab by copy pasting the below link.
https://app.fabric.microsoft.com/home
Note: After pasting the link into the browser, the page will automatically log in if the user has completed the earlier steps of the Get Started instructions. In some cases, Microsoft Fabric may prompt the user to enter their login details to verify the account.
Note: Wait for the Microsoft Fabric page to load and close the top bar for a better view.
Note: The screenshots in the exercises might differ slightly from the actual lab interface. Adjust your screen resolution as needed to locate and select the required items.
- Copy paste the Worksapce name in the Name field and click on Apply to create the workspace.
Note: Only use the workspace name provided above.
Note: Close any pop-up that appears on the screen.
Now, let's see how each department can easily create a Lakehouse in the Contoso workspace without any provision. They simply provide a name, given the proper access rights of course!
Note: Verify that you are in the Workspace created in the earlier steps.
- Click on + New item button.
- In the pop-up window search for Lakehouse in the search bar and click on Lakehouse.
-
Copy the name lakehouse from the following and paste it in the Name field.
-
Click on the Lakehouse schemas checkbox and then click on the Create button.
lakehouse
In just a few seconds, the Lakehouse is ready. With the right access, you, as a Data Engineer, can effortlessly create a new Lakehouse. There is no need to set up any storage accounts or worry about network, infrastructure, key vault, Azure subscriptions, etc.
Now, this is something exciting! This section shows how easy it is to create Shortcuts without moving data. That is the power of OneLake! In this exercise, you will ingest the curated bounce rate data for Litware from ADLS Gen2 using the New Shortcut option. Let’s see how!
-
Click on the three dots (ellipses) on the right side of Files.
-
Click on New shortcut.
Note: Make sure you create a shortcut under files and not under tables in the lakehouse explorer pane.
- In the pop-up window, under External sources, select the Azure Data Lake Storage Gen2 source.
Note: Wait for the screen to load.
-
Select Create new Connection
-
In the screen below, we need to enter the connection details for the ADLS Gen2 shortcut, For this, we need to get the details from the Storage Account resource.
6.Copy the Data Lake Storage endpoint from the below and paste it into the URL field. Select Organization account for the Authentication Kind, and then click on Sign in.
- Click on your ID to complete the Sign in.
- Click on Next button.
- Select the data and litwaredata checkbox and then Click on the Next button.
- Click on the Create button.
- And there you go! Your shortcut is now ready! Click (do not expand) on the newly created shortcut named litwaredata.
Prior to Microsoft Fabric, departments in Contoso had to move the data they needed from other departments via time-consuming ETL processes. But look, now they have created shortcuts. No need to move any of this data. That is the power of OneLake!
Now, let’s see how Data Engineer, Eva, got the remaining data into OneLake by creating Delta tables using Spark Notebook. By using a Spark Notebook to create Delta tables, Eva can ensure more reliable, scalable, and efficient data management, which is essential for handling big data workflows.
- Click on Workspace
- Click on New Item and then select Notebook
Note: If the Pop-up appears click on Skip tour
- Click on the + Data Sources button and then select Lakehouses
- Select Existing Lakehouse with Schema and then click on Add.
- Select the lakehouse and then click on Add
- Once the notebook is created, paste the below code in the existing cell and run the cell by clicking on the Run cell icon.
import os
import pandas as pd
# List all CSV files in the 'litwaredata' folder
file_path = '/lakehouse/default/Files/litwaredata/'
csv_files = [file for file in os.listdir(file_path) if file.endswith('.csv')]
# Load each CSV file into a table
for file in csv_files:
table_name = file.split('.')[0]
df = pd.read_csv(file_path + file)
spark.createDataFrame(df).write.mode("ignore").format("delta").saveAsTable(table_name)
- Once the execution is successful you'll see a Green tick appears at the bottom of cell.
- Click on the Stop icon in the ribbon at the top to Stop the Spark session.
- Click on Lakehouse in the left navigation bar.
- Expand Tables and expand dbo under Tables. Click on the three dots (Ellipses) next to dbo and click Refresh from dropdown options.
- View the successfully loaded tables.
- Click on website_bounce_rate delta table and view the website bounce rate data.
You now have all the table in OneLake for Contoso to leverage. Next, we proceed with data transformation using Dataflow Gen2 to transform the sales data ingested from Litware.
Task 1.4: Leverage Dataflow Gen2 and Data pipelines for a "No Code-Low Code" experience to quickly ingest data with Fast Copy and transform it using Copilot
Using another great feature in Fabric’s Data Factory, called Fast Copy, Contoso’s Data Engineer, Eva, quickly ingests terabytes of data with dataflows, thanks to the scalable Copy Activity in the pipeline. With so much data from Litware, there is bound to be a lot of clean up needed. Let’s step into Eva’s shoes to explore how she used fast copy to ingest data and Copilot to transform it, just in time to derive meaningful customer insights before their big Thanksgiving Sale!
You will experience how easy it is to use Fast Copy to transform Litware's sales data into the Lakehouse.
- Click on the experience button at the bottom left corner of the screen (In this screenshot, Data Engineering is selected as an "Experience") and then select Data Factory.
- Click on Dataflow Gen2.
- Click on the New Query drop down and click on the Get data icon (not on the dropdown arrow at the bottom of the icon).
Note: If the New query dropdown is not visible click on Get data icon available at the same place.
There are 150+ source connections available to in Dataflow Gen2.
- In the pop-up window, scroll down to OneLake data hub and click on lakehouse.
- If you see a screen similar to the one shown below, click on the Next button otherwise move to the next step.
- Expand lakehouse, expand Files and expand data then scroll down.
- Scroll down and select the sales_data.csv checkbox, then click on the Create button.
- Collapse the Queries pane and take a look at the sales dataset (note that the first row of this dataset is not a header).
Let's use Copilot to perform data cleansing.
- Click on the Copilot button, paste the prompt provided below in the following text box and click on the send icon.
In the table sales_data csv, apply first row as headers.
Note: If Copilot needs additional context to understand your query, consider rephrasing the prompt to include more details.
- Scroll to the right-hand side and observe the GrossRevenue and NetRevenue columns. You notice that some rows contain empty or null values.
Let's use Copilot to remove empty rows.
- Similarly, paste the prompt below in Copilot and click on the send icon.
Remove empty rows from GrossRevenue and NetRevenue columns.
- Scroll to the right hand side and observe the GrossRevenue and NetRevenue columns (there are no empty rows with null values).
Note: Expand the queries pane collapsed earlier.
- Right-click on the query sales_data.csv, and select Require Fast Copy.
Note: Fast copy enhances the data handling capabilities within Fabric, making data transfers faster and more seamless across the platform.
Note: Due to time constraints, we will not publish and run the Dataflow from the Pipeline.
Note: If a pop-up page Options appears, scroll down to select Scale and tick Allow use of fast copy connectors checkbox then click on OK.
- Click on the close icon at top right of the Dataflow window.
Note: If necessary, scroll up to show the close icon.
- Click on Yes.
Congrats on completing this data transformation exercise!