In most enterprise Databricks environments (like in MSC or large analytics ecosystems), system tables such as system.job_run_logs or system.cluster_events may beIn most enterprise Databricks environments (like in MSC or large analytics ecosystems), system tables such as system.job_run_logs or system.cluster_events may be

Optimizing Databricks Cluster Cost and Utilization Without System Tables

2026/01/09 14:41
7분 읽기
이 콘텐츠에 대한 의견이나 우려 사항이 있으시면 crypto.news@mexc.com으로 연락주시기 바랍니다

In most enterprise Databricks environments (like in MSC or large analytics ecosystems), system tables such as system.jobrunlogs or system.cluster_events may be restricted or disabled due to security or governance policies.

However, tracking cluster utilization and cost is crucial for :

  • Understanding how efficiently jobs use compute
  • Identifying idle clusters or cost leaks
  • Forecasting infrastructure budget
  • Building custom cost dashboards

This blog demonstrates a step-by-step approach to compute cluster utilization and cost using only Databricks REST APIs — no system tables required.

Project Use Case

In our MSC data platform, we run multiple Databricks clusters across development, test, and production. \n We had three major challenges:

  1. No access to system tables (restricted by admin policies)
  2. Ephemeral clusters for jobs created dynamically by ADF or orchestration pipelines
  3. No direct view of how cluster utilization translates to cost

Hence, we built a lightweight utilization analyzer that :

  • Pulls data from Databricks REST APIs
  • Calculates job runtime vs cluster runtime
  • Estimates cost using DBU and VM rates
  • Outputs an easy-to-consume DataFrame

The problem & approach

The identified challenge

Teams often need to know:

  • Which clusters are idle (running with low job activity)?
  • What is the utilization % (job runtime vs cluster uptime)?
  • How much is each cluster costing (DBU + VM)?

When Unity Catalog system tables (e.g., system.jobrunlogs) are unavailable, the default SQL-based approach fails. The REST API becomes the reliable fallback.

High-level approach used in the notebook

  1. List clusters via /api/2.0/clusters/list.
  2. Estimate cluster uptime using timestamps inside the cluster JSON (created/start/terminated fields). (This is a pragmatic fallback when /clusters/events is unavailable.)
  3. Get recent job runs using /api/2.1/jobs/runs/list with time filters (or limit).
  4. Match job runs to clusters using clusterinstance.clusterid (or other cluster metadata).
  5. Compute utilization: utilization % = totaljobruntime / totalclusteruptime.
  6. Estimate cost using a simple formula: cost = runninghours × (DBU/hr × assumed DBU) + runninghours × nodes × VM $/hr.

This notebook purposely uses bounded queries (last N runs, time window) so it runs fast.

\ 1. Setup & Configuration

# Databricks Cluster Utilization & Cost Analyzer (no system tables) # Author: GPT-5 | Works on any workspace with REST API access # Requirements: Databricks Personal Access Token, Workspace URL # You can run this inside a Databricks notebook or externally.   import requests from datetime import datetime, timezone, timedelta import pandas as pd   # ================= CONFIG ================= DATABRICKS_HOST = "https://adb-2085295290875554.14.azuredatabricks.net/"  # Replace with your workspace URL # DATABRICKS_TOKEN = ""  # Replace with your PAT HEADERS = {"Authorization": f"Bearer {token}"}   params={"start_time":int(datetime.now().timestamp()*1000),"end_time":int((datetime.now()+timedelta(days=1)).timestamp()*1000),"order":"DESCENDING"}   # Time window (e.g., last 7 days) DAYS_BACK = 7 SINCE_TS_MS = int((datetime.now(timezone.utc) - timedelta(days=DAYS_BACK)).timestamp() * 1000) UNTIL_TS_MS = int(datetime.now(timezone.utc).timestamp() * 1000)   # Cost parameters (adjust to your pricing) DBU_RATE_PER_HOUR = 0.40         # $ per DBU/hr VM_COST_PER_NODE_PER_HOUR = 0.60 # $ per cloud VM node/hr DEFAULT_DBU_PER_CLUSTER_PER_HOUR = 8  # Typical for small-medium jobs cluster     # ==========================================

\ This section initializes:

  • Workspace URL & token for authentication
  • Time range for which you want to analyze utilization
  • Cost assumptions:
  • DBU rate ($/hr per DBU)
  • VM node cost
  • Approximate DBU consumption

In enterprise setups, these rates can be fetched dynamically via your FinOps or billing APIs.

  1. API Wrapper Function

    \

#  Api GET request def api_get(path, params=None):     url = f"{DATABRICKS_HOST.rstrip('/')}{path}"     try:         r = requests.get(url, headers=HEADERS, params=params, timeout=60)         if r.status_code == 404:             print(f"Skipping :{path} (404 Not Found)")             return {}         r.raise_for_status()         return r.json()     except Exception as e:         print(f"Error: {e}")         return {}

\ This helper function standardizes all REST API GET calls. \n It:

  • Builds the full endpoint URL

  • Handles 404 gracefully (important when clusters or runs have expired)

  • Returns parsed JSON

    Why it matters: This function ensures clean API communication without breaking your notebook flow if any cluster data is missing.

    \

  1. List All Active Clusters

    \

# ---------- STEP 1: Get All Clusters Related Details ---------- def list_clusters():     clusters = []     res = api_get("/api/2.0/clusters/list")     return res.get("clusters", [])

\ This retrieves all clusters available in your workspace. \n It’s equivalent to viewing your “Compute” tab programmatically. \n The response contains:

  • Cluster IDs

  • Names

  • Node counts

  • Creator information

  • Creation & termination times

    Use case: Helps identify which clusters are consuming resources in the selected window.

    4. Estimate Cluster Runtime

    \

# ---------- STEP 2: Get Cluster Events Runtime ---------- def get_cluster_runtime(cluster):     events = []     offset = 0     limit = 200     # while True:     # params = {"cluster_id": cluster_id}       created = cluster.get("creator_user_name")     created_time = cluster.get("start_time") or cluster.get("created_time")     terminated_time = cluster.get("terminated_time")     if not created_time:         return 0     end_ts = terminated_time or UNTIL_TS_MS     start_ms = max(created_time, SINCE_TS_MS)     runtime_ms = max(0, end_ts - start_ms)     return runtime_ms /1000/3600

\ We calculate total running hours for each cluster:

  • Uses creation and termination timestamps

  • Handles currently running clusters (terminated_time missing)

  • Normalizes to hours

    Why it’s important: This value is the denominator for utilization — representing total cluster uptime during the window.

    5. Get Recent Job Runs

    \

# ------------------Get Recent Job Runs ---------------------------- def get_recent_job_runs():     params ={"start_time":int(datetime.now().timestamp()*1000),"end_time":int((datetime.now()+timedelta(days=1)).timestamp()*1000),"order":"DESCENDING"}     res = api_get("/api/2.1/jobs/runs/list", params)     return res.get("runs", [])

\ Instead of fetching the entire job history (which is slow), \n This function retrieves the mostrecent 10 job runs for quick diagnostics.

In production, you can filter by:

  • Specific job_id
  • completed_only=true
  • Date window (starttimefrom, starttimeto)

\

  1. Compute Utilization and Cost

    \

# -------------------------------------Compute Cost and parse cluster utilization detials ---------------------   def compute_utilization_and_cost(clusters, job_runs):     records =[]     now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)     for c in clusters:         cid = c.get("cluster_id")         cname = c.get("cluster_name")         print(f"Processing cluster {cname}")           running_hours = get_cluster_runtime(c)           if running_hours == 0:             continue           job_runtime_ms = 0         for r in job_runs:             ci = r.get("cluster_instance",{})             if ci.get("cluster_id") == cid:                 s = r.get("start_time") or SINCE_TS_MS                 e = r.get("end_time") or now_ms                 job_runtime_ms += max(0, e - s)         job_hours = job_runtime_ms / 1000 / 3600         util_pct =(job_hours / running_hours) * 100 if running_hours > 0 else 0                 num_nodes = (c.get("num_workers") or c.get("autoscale",{}).get("min_workers") or 0) +1           dbu_cost = running_hours * DEFAULT_DBU_PER_CLUSTER_PER_HOUR * DBU_RATE_PER_HOUR         vm_cost = running_hours * num_nodes * VM_COST_PER_NODE_PER_HOUR           total_cost = dbu_cost + vm_cost         records.append({             "cluster_id": cid, "cluster_name": cname,"running_hours":round(running_hours,2), "job_hours": round(job_hours,2) ,"utilization_pct": round(util_pct,2), "nodes": num_nodes,"dbu_cost": round(dbu_cost,2), "vm_cost": round(vm_cost,2), "total_cost": round(total_cost,2)         })     return pd.DataFrame(records)

This is the heart of the logic:

  • Loops through each cluster

  • Calculates total job runtime per cluster (using job runs API)

  • Derives utilization percentage = (jobhours / clusterrunning_hours) × 100

  • Estimate cost:

    • DBU cost based on rate × DBU/hr
    • VM cost = nodecount × nodecost/hr × running_hours

    Why this matters: \n This gives a unified picture ofefficiency and expense — useful for identifying clusters with high cost but low utilization.

    7. Orchestrate the Pipeline

    \

# ---------- MAIN ---------- print(f"Collecting data for last {DAYS_BACK} days...") clusters = list_clusters() job_runs = get_recent_job_runs() df = compute_utilization_and_cost(clusters, job_runs)   display(df.sort_values("utilization_pct", ascending=False))

\ This final block:

  • Retrieves data

  • Performs cost computation

  • Displays the sorted Data Frame

    In practice, this Data Frame can be :

  • Exported to Excel or Delta Table

  • Sent to Power BI dashboards

  • Integrated into FinOps automation pipelines

    \

    Results Example

| clustername | runninghours | jobhours | utilizationpct | nodes | total_cost | |----|----|----|----|----|----| | etl-job-prod | 36.5 | 28.0 | 76.7% | 4 | $142.8 | | dev-debug | 12.0 | 1.2 | 10.0% | 2 | $18.4 | | nightly-adf | 48.0 | 45.0 | 93.7% | 6 | $260.4 |

\

\ \

  1. Real-World Benefit

    By implementing this analyzer:

  • Engineering teams can track cluster cost even without audit access.

  • Managers get visibility into underutilized clusters.

  • DevOps can automatically terminate low-usage clusters.

  • Finance can validate Databricks invoices with internal metrics.

    In our MSC project, we used this as part of our data platform observability stack — combining REST API data, ADF job logs, and cost trends into a unified dashboard.

\

면책 조항: 본 사이트에 재게시된 글들은 공개 플랫폼에서 가져온 것으로 정보 제공 목적으로만 제공됩니다. 이는 반드시 MEXC의 견해를 반영하는 것은 아닙니다. 모든 권리는 원저자에게 있습니다. 제3자의 권리를 침해하는 콘텐츠가 있다고 판단될 경우, crypto.news@mexc.com으로 연락하여 삭제 요청을 해주시기 바랍니다. MEXC는 콘텐츠의 정확성, 완전성 또는 시의적절성에 대해 어떠한 보증도 하지 않으며, 제공된 정보에 기반하여 취해진 어떠한 조치에 대해서도 책임을 지지 않습니다. 본 콘텐츠는 금융, 법률 또는 기타 전문적인 조언을 구성하지 않으며, MEXC의 추천이나 보증으로 간주되어서는 안 됩니다.

USD1 Genesis: 0 Fees + 12% APR

USD1 Genesis: 0 Fees + 12% APRUSD1 Genesis: 0 Fees + 12% APR

New users: stake for up to 600% APR. Limited time!