
Using the Intel Cloud Optimization Modules for Databricks

Shreejan_Mistry

Part 4 of a 4-part series: Unlocking Cloud Automation with Intel’s Cloud Optimization Modules for Terraform

In the last installment of our series, Intel Cloud Optimization Modules for Databricks, we built an understanding of Terraform, the Intel Cloud Optimization Modules, and the Databricks modules. With those topics covered, let's look at how we can use the Intel Cloud Optimization Module for Databricks to deploy a Databricks Workspace and Cluster in AWS.

Defining Infrastructure Configurations:

Create a new Terraform configuration file (e.g., main.tf) and define your desired infrastructure state using the Intel Cloud Optimization Modules. Specify the required inputs, such as the Databricks workspace name, region, cluster size, and any additional customizations you require. For example, check out the example folder in the Databricks AWS Workspace Cluster repo, which contains usage instructions and the file structure needed to define and apply the infrastructure configuration for Databricks. Please make sure to read the README file provided in the example folder.

As you may have noticed, there are a few Terraform configuration files within the example folder. Let's take a brief look at those files. A Terraform module typically has the file structure shown below.

module/

   |----main.tf

This file contains the main configuration for your Terraform module. It defines the resources, data sources, variables, and any other Terraform constructs specific to your module.

   |----variables.tf

In this file, you define the input variables for your module. These variables allow users of your module to customize its behavior.

   |----outputs.tf

This file defines the output values of your module. Output values are used to expose specific information or results from your module for other Terraform configurations to consume.

   |----providers.tf

If your module requires specific provider configurations, you can specify them in this file. Providers define the target infrastructure platform or service for Terraform to interact with.

   |----versions.tf

In this file you will typically find the terraform block, where you define the required Terraform version using the required_version attribute. You can also specify provider version constraints in the required_providers block.
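For illustration, a minimal versions.tf sketch might look like the one below. The required_version constraint and the exact provider versions shown here are assumptions chosen for illustration only; the actual versions.tf used in this example appears later in this post.

terraform {
  # Pin the Terraform CLI version the module was written against (illustrative constraint).
  required_version = ">= 1.3.0"

  required_providers {
    # Provider source addresses are real; the version constraints are illustrative.
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.14"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.15"
    }
  }
}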

Now that we understand the Terraform configuration and file structure, let's implement an example of our own. Using the Intel Cloud Optimization Modules, we will follow the Databricks example and create an AWS Databricks Workspace and Cluster, keeping to the file structure discussed above.

 

STEP 1: Create the file structure defined above in a folder and open the project folder in your IDE. We will walk you through the contents of each file below.

STEP 2: Complete this prerequisite to deploy the AWS Databricks Workspace:

  1. Create a Databricks account.
  2. After logging in to the account, you can find your Databricks Account ID in the top right corner.
  3. Create a VPC with subnets and security groups in your AWS console (or provision them with Terraform; see the sketch after this list).
  4. Configure the providers.tf file as shown in the example below. It is important to configure both providers as Databricks Workspace and Cluster use separate providers to deploy resources. You can also see how to use the databricks.cluster provider for the Databricks Cluster module in the main.tf file.
  5. See the main.tf example below showing you how to pass the value for dbx_host (i.e. the URL of the Databricks workspace) in the Databricks Cluster Module.
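If you would rather manage the step 3 networking prerequisite with Terraform instead of the AWS console, a minimal sketch could look like the following. The resource names, CIDR blocks, and availability zones are placeholder assumptions, and Databricks has its own networking requirements (for example, subnets in at least two availability zones), so consult the Databricks AWS documentation before adapting this.

# Hypothetical prerequisite networking - names, CIDRs, and AZs are placeholders.
resource "aws_vpc" "dbx" {
  cidr_block = "10.0.0.0/16"
}

# Databricks generally expects subnets in at least two availability zones.
resource "aws_subnet" "dbx_a" {
  vpc_id            = aws_vpc.dbx.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}

resource "aws_subnet" "dbx_b" {
  vpc_id            = aws_vpc.dbx.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "us-east-1b"
}

# A permissive security group for illustration; tighten the rules for real deployments.
resource "aws_security_group" "dbx" {
  vpc_id = aws_vpc.dbx.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

The resulting aws_vpc.dbx.id, subnet IDs, and security group ID would then feed the vpc_id, vpc_subnet_ids, and security_group_ids variables used below.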

 

Main.tf

One of the advantages of using the Intel Cloud Optimization Modules is that all of the resource configuration and optimizations have already been done for you. All we need to do is leverage the modules to deploy the infrastructure using the 'module' construct of Terraform in main.tf. Calling the Intel Terraform modules and providing a few required variables is enough to accelerate your automation pipeline.

# This example creates an AWS Databricks workspace with the default credentials, storage, and network configurations, and a Databricks cluster with Intel optimizations. For more information on usage and configuration, see the README.md.
module "aws_databricks_workspace" {
  source               = "intel/aws-databricks-workspace/intel"
  vpc_id               = var.vpc_id
  dbx_account_id       = var.dbx_account_id
  dbx_account_password = var.dbx_account_password
  dbx_account_username = var.dbx_account_username
  vpc_subnet_ids       = var.vpc_subnet_ids
  security_group_ids   = var.security_group_ids
}

# This module example creates a Databricks cluster on the AWS dbx workspace created above.
module "databricks_cluster" {
  source    = "intel/databricks-cluster/intel"
  dbx_host  = module.aws_databricks_workspace.dbx_host
  dbx_cloud = var.dbx_cloud
  providers = {
    databricks = databricks.cluster
  }
  depends_on = [
    module.aws_databricks_workspace
  ]
  tags = {
    "owner"  = "user@example.com"
    "module" = "Intel-Cloud-Optimization-Module"
  }
}

 

Variables.tf

The variables.tf file contains the declarations for all of the input variables required by the AWS Databricks Workspace and Databricks Cluster modules.

variable "dbx_account_password" {
  type        = string
  description = "Account Login Password for the Databricks Account"
}

variable "dbx_account_username" {
  type        = string
  description = "Account Login Username/Email for the Databricks Account"
}

variable "dbx_account_id" {
  type        = string
  description = "Account ID Number for the Databricks Account"
}

variable "dbx_cloud" {
  type        = string
  description = "Flag that decides which Cloud to use for the instance type in Databricks Cluster"
}

variable "vpc_id" {
  type        = string
  description = "ID for the VPC that Databricks will be attaching to."
}

variable "vpc_subnet_ids" {
  type        = set(string)
  description = "List of subnet IDs that will be utilized by Databricks."
}

variable "security_group_ids" {
  type        = set(string)
  description = "List of security group IDs that will be utilized by Databricks."
}

 

To pass values to these variables, I recommend creating a terraform.tfvars file. Creating this file means you do not have to pass the values at runtime.

dbx_cloud            = "aws"
dbx_account_id       = <"ENTER YOUR DATABRICKS ACCT ID NUMBER">
dbx_account_password = <"ENTER YOUR DATABRICKS ACCT PASSWORD">
dbx_account_username = <"ENTER YOUR DATABRICKS ACCT USERNAME">
vpc_id               = <"vpc-XXXXXX-XXX">
vpc_subnet_ids       = <["subnet-XXXX", "subnet-XXXXX"]>
security_group_ids   = <["sg-XXXX"]>

 

Outputs.tf

// Capture the Databricks workspace's URL.
output "dbx_host" {
   description = "URL of the Databricks workspace"
   value       = module.aws_databricks_workspace.dbx_host
}

output "dbx_id" {
   description = "ID of the Databricks workspace"
   value       = module.aws_databricks_workspace.dbx_id
}

output "dbx_account_id" {
   description = "Account ID for the Databricks Account"
   value       = module.aws_databricks_workspace.dbx_account_id
   sensitive   = true
}

### Credentials #####
output "dbx_role_arn" {
description = "ARN that will be used for databricks cross account IAM role."
value = module.aws_databricks_workspace.dbx_role_arn
}

output "dbx_credentials_name" {
   description = "Name that will be associated with the credential configuration in Databricks."
   value       = module.aws_databricks_workspace.dbx_credentials_name
}

output "dbx_create_role" {
   description = "Flag to create AWS IAM Role or not"
   value       = module.aws_databricks_workspace.dbx_create_role
}
### Network #####
output "dbx_network_name" {
   description = "Name that will be associated with the network configuration in Databricks."
   value       = module.aws_databricks_workspace.dbx_network_name
}

output "dbx_vpc_id" {
   description = "ID for the VPC that Databricks will be attaching to."
   value       = module.aws_databricks_workspace.dbx_vpc_id
}

output "dbx_vpc_subnet_ids" {
   description = "List of subnet IDs that will be utilized by  Databricks."
   value       = module.aws_databricks_workspace.dbx_vpc_subnet_ids
}

output "dbx_security_group_ids" {
   description = "List of security group IDs that will be utilized by Databricks."
   value       = module.aws_databricks_workspace.dbx_security_group_ids
}

### Storage #####

output "dbx_bucket_name" {
   description = "Name of the existing S3 bucket that Databricks will consume."
   value       = module.aws_databricks_workspace.dbx_bucket_name
}

output "dbx_storage_configuration_name" {
   description = "Name of the existing S3 bucket that Databricks will consume."
   value       = module.aws_databricks_workspace.dbx_storage_configuration_name
}

output "dbx_create_bucket" {
   description = "Flag to create AWS S3 bucket or not"
   value       = module.aws_databricks_workspace.dbx_create_bucket
}

### Databricks Cluster #####
output "dbx_cluster_name" {
   description = "Name of the databricks cluster"
   value       = module.databricks_cluster.dbx_cluster_name
}

output "dbx_cluster_spark_version" {
description = "Spark version of the databricks cluster"

 

Providers.tf

The example uses two separate modules to deploy the Databricks environment, and it is important to note that each module uses its own provider configuration. See below for how to configure the providers for this example.

 // Initialize the Databricks provider in "normal" (workspace) mode.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs#authentication
provider "databricks" {
  host     = "https://accounts.cloud.databricks.com"
  username = var.dbx_account_username
  password = var.dbx_account_password
}

// Initializing the following provider is a REQUIRED step in order to add the databricks_global_init_script and databricks_cluster resources to your Databricks Workspace.
provider "databricks" {
  alias    = "cluster"
  host     = module.aws_databricks_workspace.dbx_host
  username = var.dbx_account_username
  password = var.dbx_account_password
}

 

Versions.tf

The versions.tf file is where you add version constraints on the providers being used, as well as any Terraform version constraints.

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.14.2"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.15.0"
    }
    time = {
      source  = "hashicorp/time"
      version = "~> 0.9.1"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.4.3"
    }
  }
}

 

Initializing and Deploying:

Run the terraform init command in the directory containing your configuration files to initialize the Terraform project. This command downloads the necessary provider plugins and modules. After initialization, run terraform plan, which creates an execution plan and lets you preview the changes Terraform intends to make to your infrastructure. Then run terraform apply to create the Databricks Workspace, Cluster, and associated resources based on your configuration. Terraform will provision the necessary resources and present a summary of the changes to apply. Confirm the changes to deploy the Databricks cluster.
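For reference, assuming your AWS and Databricks credentials are already configured, the full sequence from the example directory is just:

# Download the provider plugins and the Intel modules referenced in main.tf
terraform init

# Preview the resources Terraform intends to create
terraform plan

# Create the workspace, cluster, and associated resources (confirm with "yes" when prompted)
terraform apply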

Once the apply finishes, the Databricks Workspace and Cluster, configured with all of the Intel optimizations, have been deployed, and you can access the workspace using the URL printed in your terminal. To confirm the list of configurations and optimizations, look at your terminal after terraform apply completes: it lists all of the configurations along with the URL for your Databricks Workspace, similar to the image below.

[Image: terminal output from terraform apply listing the deployed configurations and the Databricks Workspace URL]
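If you need the workspace URL again later, you can re-print any value defined in outputs.tf from the same directory, for example:

# Re-print a single output value from the Terraform state
terraform output dbx_host

# Or list every output defined in outputs.tf
terraform output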

 

Congratulations on making it all the way to the end of the series. That took some patience and commitment for sure. You should now have the knowledge and guidance to deploy your own Databricks environment. I urge you to try out the Intel Cloud Optimization Module for Databricks and ENJOY DATABRICKING!!

By combining the power of Terraform’s infrastructure-as-code approach with Intel's Cloud Optimization Modules for Databricks on AWS and Azure, organizations can streamline the deployment, management, and optimization of their Databricks clusters. This integration allows for efficient resource utilization, enhanced performance, and optimized cost management. With the ability to automate best practices and easily scale infrastructure, businesses can unlock the full potential of their cloud investments while focusing on their core data analytics and processing tasks.

 

Here are some useful links if you'd like to learn more: