Azure Databricks: Creating Clusters, Notebooks, and Mounting Azure Blob Storage



I will show you how to create Databricks and Blob Storage resources on the Azure portal. Then we will create a cluster on Databricks, and in part 2 of this blog we will write a script to connect Azure Databricks to Azure Blob Storage.

Let’s get started.

Create an Azure Databricks Service

You will need an Azure subscription to begin with. If you do not have one, create one for free by going here.

Sign in to the Azure portal, click on Create a resource, and type Databricks.

Click on Create.

You will see the following screen:

  • Subscription: Select your subscription
  • Resource Group: Create a new resource group or use an existing one.
  • Workspace name: Enter a workspace name of your choice
  • Region: Select a region closest to you
  • Pricing tier: Go for the trial version if you are just getting started with Databricks.

Click on Review + create. In a couple of minutes your workspace should be created in the specified Resource group. Go to the resource and click on Launch Workspace. You will see the following screen:

Create a cluster

On the left-hand menu, click on the Clusters option. You will then see the screen below, where you can create a cluster.

  • Cluster name: Give a suitable cluster name
  • Cluster mode: There are 3 options.
    • Standard – the default option. Choose this if your workload is not extensive and you are fine with resources being hogged by a single user when multiple people run workloads on the same cluster.
    • High Concurrency – Choose this option when you know many people will be working on the cluster simultaneously. This mode does not support the Scala language.
    • Single Node – This cluster has no workers and runs Spark jobs on the driver node. In contrast, Standard mode clusters require at least one Spark worker node in addition to the driver node to execute Spark jobs. Use this for lightweight workloads.
  • Pool: To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. We will keep it as None for this tutorial.
  • Databricks Runtime version: Databricks runtimes are the set of core components that run on your clusters. There are many options available here, some of them optimized for specific purposes. We will use a stable runtime instead of the latest one here.
  • Autoscaling: Enable autoscaling if your workload is unpredictable. When you create an Azure Databricks cluster, you can either provide a fixed number of workers or a minimum and a maximum number of workers. With a fixed-size cluster, Azure Databricks ensures that your cluster has the specified number of workers. With a range, Databricks chooses the appropriate number of workers required to run your job; this is referred to as autoscaling. We will keep it disabled here.
  • Auto termination: If you keep a cluster running, it is going to cost you. So Azure Databricks provides the option to automatically terminate a cluster if there has been no activity for a certain duration. Set a value between 10 and 60 minutes depending on your requirements.
  • Select a Worker and Driver type depending upon your workload.

Then hit Create Cluster at the top. It will take some time to create and start the cluster.
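If you prefer scripting these steps, the same settings can also be submitted through the Databricks Clusters REST API (api/2.0/clusters/create). Below is a minimal Python sketch using the requests package; the workspace URL, access token, runtime version, and node type are placeholders you would replace with your own values.

    # Minimal sketch: create the cluster via the Databricks Clusters REST API
    # instead of the UI. Host, token, runtime, and node type are placeholders.
    import requests

    DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    TOKEN = "<personal-access-token>"  # generated under User Settings > Access Tokens

    cluster_spec = {
        "cluster_name": "blog-demo-cluster",
        "spark_version": "7.3.x-scala2.12",                 # a stable (LTS) runtime, as above
        "node_type_id": "Standard_DS3_v2",                  # worker/driver VM size
        "autoscale": {"min_workers": 1, "max_workers": 3},  # autoscaling range
        "autotermination_minutes": 30,                      # terminate after 30 idle minutes
    }

    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    resp.raise_for_status()
    print("Created cluster:", resp.json()["cluster_id"])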

Create a notebook

A notebook in Databricks is a web-based interface that lets you run code and visualizations using different languages. Once the cluster is up and running, you can create notebooks in it and run Spark jobs. In the Workspace tab on the left vertical menu bar, click Create and select Notebook:

Name the notebook, choose your preferred language, and select the cluster you want your notebook attached to. Make sure the cluster is running.
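Once the notebook is attached, you can sanity-check the setup with a first cell. The spark session and the display function are predefined in Databricks notebooks:

    # First cell: confirm the notebook is attached to a running cluster.
    # `spark` (the SparkSession) and `display` are built into Databricks notebooks.
    print(spark.version)           # Spark version of the attached cluster

    df = spark.range(5).toDF("n")  # tiny DataFrame to confirm Spark jobs run
    display(df)                    # Databricks' rich table/visualization output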

Create Blob Storage and upload a CSV file into the container

First, you need to create a Storage account on Azure. Go here if you are new to the Azure Storage service. Then we will upload a .csv file to a container in this Blob Storage account, which we will access from Azure Databricks.

After creating the storage account, we create a container inside it.

Create a new container by clicking on + Container as shown below.

Type a name for the container, keep the default access level of Private, and finally hit the Create button.

After creating the container, open it and upload a CSV file there by clicking the Upload button as shown below. You can find the two CSV files here. We will use these files to run a Python script in the next part of the blog.
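If you would rather script the upload, here is a minimal sketch using the azure-storage-blob Python package (v12). The connection string comes from the storage account's Access keys blade; the file names below are hypothetical placeholders for the two CSV files.

    # Minimal sketch: upload the CSV files with azure-storage-blob (v12)
    # instead of the portal. Connection string and file names are placeholders.
    from azure.storage.blob import BlobServiceClient

    conn_str = "<storage-account-connection-string>"  # from the Access keys blade
    service = BlobServiceClient.from_connection_string(conn_str)
    container = service.get_container_client("blogdemo")

    for filename in ["data1.csv", "data2.csv"]:       # hypothetical file names
        with open(filename, "rb") as f:
            container.upload_blob(name=filename, data=f, overwrite=True)
            print(f"Uploaded {filename}")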

Now the Azure Storage account structure looks somewhat like this: Azure Storage Account (databricksblog1) >> Container (blogdemo) >> Block Blob (.csv files).

In the next part of the blog, we will mount blob storage to Databricks and do some transformations in the notebook.
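As a quick preview of part 2, mounting the container in a Databricks notebook typically looks like the sketch below. It assumes the account and container names from the structure above (databricksblog1 and blogdemo) and a placeholder access key; in practice you would store the key in a Databricks secret scope rather than in plain text.

    # Preview of part 2: mount the "blogdemo" container from the
    # "databricksblog1" storage account in a Databricks notebook.
    # The access key is a placeholder -- prefer a Databricks secret scope.
    dbutils.fs.mount(
        source="wasbs://blogdemo@databricksblog1.blob.core.windows.net",
        mount_point="/mnt/blogdemo",
        extra_configs={
            "fs.azure.account.key.databricksblog1.blob.core.windows.net": "<storage-account-access-key>"
        },
    )

    # Read one of the uploaded CSV files from the mount point.
    df = spark.read.csv("/mnt/blogdemo/data1.csv", header=True, inferSchema=True)
    display(df)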

