<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://schiiss.github.io/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://schiiss.github.io/blog/" rel="alternate" type="text/html" /><updated>2026-04-20T08:02:25-07:00</updated><id>https://schiiss.github.io/blog/feed.xml</id><title type="html">Conner Schiissler - AI - DevOps - Cloud - Data</title><subtitle>Tech Blog</subtitle><author><name>Conner Schiissler</name></author><entry><title type="html">Building an AI Energy Analyst: LangGraph + Databricks Genie for Multi-Domain Market Research</title><link href="https://schiiss.github.io/blog/energy/building-an-ai-energy-analyst-langgraph-databricks-genie-for-multi-domain-market-research/" rel="alternate" type="text/html" title="Building an AI Energy Analyst: LangGraph + Databricks Genie for Multi-Domain Market Research" /><published>2026-04-19T09:00:00-07:00</published><updated>2026-04-19T09:00:00-07:00</updated><id>https://schiiss.github.io/blog/energy/building-an-ai-energy-analyst-langgraph-databricks-genie-for-multi-domain-market-research</id><content type="html" xml:base="https://schiiss.github.io/blog/energy/building-an-ai-energy-analyst-langgraph-databricks-genie-for-multi-domain-market-research/"><![CDATA[<p><img src="/blog/assets/images/blog_images/building-an-ai-energy-analyst-langgraph-databricks-genie-for-multi-domain-market-research/blog_image.png" alt="Building an AI Energy Analyst: LangGraph + Databricks Genie for Multi-Domain Market Research" style="display:block; margin:0 auto;" /></p>

<p>It is the ‘Week of Agents’ at Databricks, and this has inspired me to create a blog around leveraging agents to help dissect and analyze complex energy markets. Like I mentioned in my <a href="https://www.linkedin.com/posts/conner-schiissler_energy-oilandgas-dataengineering-activity-7444367912158687233-KvXE?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACXEibYBngZiCRvQiwlsg8p1A85--baPNfw">previous blog post</a>, I’ve been intrigued by energy markets as of late and I am trying to identify ways to keep up to date with what is happening within North America and across the world when it comes to energy markets. The more I dig in, the more I appreciate just how fragmented the information landscape is.</p>

<p>An energy market analyst on any given Wednesday morning is watching the EIA petroleum status report drop, tracking overnight geopolitical developments out of the Middle East, monitoring ERCOT load forecasts ahead of a Texas heat wave, and trying to synthesize all of it into a coherent view across commodities. These signals don’t live in the same place, don’t update on the same cadence, and require genuine domain knowledge to interpret together.</p>

<p>I also wanted to look at this as an opportunity to get more hands-on experience with novel cutting-edge technology like agents. Instead of a trader switching between five tabs and three data providers, what if a single conversational interface could pull from all those sources as they sit in your data lake like live inventory data from the EIA, latest available public data on ERCOT curtailment risk, breaking news with AI-classified sentiment and synthesize a coherent answer on demand?</p>

<p>This post walks through how it was built using LangGraph, Databricks Genie, and Databricks Model Serving.</p>

<hr />

<h2 id="what-has-been-built">What Has Been Built</h2>

<p>The research agent routes questions to specialised sub-agents, each backed by a Genie space with domain-specific data, and synthesises the results into a single answer.</p>

<div style="background:#1a1a2e; border-radius:12px; padding:24px 16px; margin:24px 0;">
  <div class="mermaid">
%%{init: {'theme': 'dark', 'themeVariables': {'background': '#1a1a2e', 'mainBkg': '#161b22', 'nodeBorder': '#30363d', 'nodeTextColor': '#e2e8f0', 'lineColor': '#4a5568', 'fontFamily': 'monospace', 'fontSize': '14px'}}}%%
graph LR
    A["💬 User Question"]:::user --&gt; B["🤖 LangGraph ReAct Agent\nLlama 3.3 70B on Databricks"]:::agent
    B --&gt; C["query_market_data"]:::tool
    B --&gt; D["query_ercot_4cp"]:::tool
    B --&gt; E["query_news_sentiment"]:::tool
    C --&gt; F["Genie: commodity prices\n+ EIA fundamentals"]:::genie
    D --&gt; G["Genie: curtailment risk\n+ 4CP history"]:::genie
    E --&gt; H["Genie: news articles\n+ AI sentiment"]:::genie

    classDef user   fill:#1e3a5f,stroke:#63b3ed,color:#e2e8f0
    classDef agent  fill:#2d1b69,stroke:#a371f7,color:#e2e8f0
    classDef tool   fill:#1a2a1a,stroke:#68d391,color:#e2e8f0
    classDef genie  fill:#2a1a1a,stroke:#f6ad55,color:#e2e8f0
  </div>
</div>

<p>For a single-domain question (“What’s the WTI price?”), the agent calls one tool. For a cross-domain question (“Give me a full market brief”), it calls all three and synthesises. The routing decision is made by the LLM and it reads the tool descriptions and decides what’s needed.</p>

<hr />

<h2 id="the-data-foundation">The Data Foundation</h2>

<p>Before there’s an agent, there’s a lakehouse. This project runs 7 Lakeflow Spark Declarative Pipelines on Databricks serverless, ingesting from public energy APIs into Unity Catalog:</p>

<table>
  <thead>
    <tr>
      <th>Pipeline</th>
      <th>Source</th>
      <th>Key tables</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">commodity_prices_pipeline</code></td>
      <td>Yahoo Finance (<code class="language-plaintext highlighter-rouge">yfinance</code>)</td>
      <td><code class="language-plaintext highlighter-rouge">silver_commodity_ohlcv</code> — WTI, Brent, Henry Hub, RBOB, Heating Oil</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">eia_fundamentals_pipeline</code></td>
      <td>EIA API v2 + DPR Excel</td>
      <td><code class="language-plaintext highlighter-rouge">silver_petroleum_inventory</code>, <code class="language-plaintext highlighter-rouge">silver_natural_gas_storage</code>, <code class="language-plaintext highlighter-rouge">silver_eia_dpr</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">ercot_data_pipeline</code></td>
      <td>ERCOT Public API</td>
      <td><code class="language-plaintext highlighter-rouge">silver_ercot_load_hourly</code>, <code class="language-plaintext highlighter-rouge">silver_ercot_hub_prices</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">ercot_4cp_pipeline</code></td>
      <td>ERCOT historical XLS (1998–2020)</td>
      <td><code class="language-plaintext highlighter-rouge">gold_ercot_4cp_risk_model</code>, <code class="language-plaintext highlighter-rouge">gold_ercot_4cp_candidates</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">weather_data_pipeline</code></td>
      <td>Open-Meteo</td>
      <td><code class="language-plaintext highlighter-rouge">silver_weather_hdd_cdd</code> — HDD/CDD by ERCOT zone</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">news_sentiment_pipeline</code></td>
      <td>NewsAPI + Databricks AI Functions</td>
      <td><code class="language-plaintext highlighter-rouge">silver_news_sentiment</code> — classified events + sentiment scores</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">gold_market_insights_pipeline</code></td>
      <td>Joins across silver tables</td>
      <td><code class="language-plaintext highlighter-rouge">gold_crude_fundamentals</code>, <code class="language-plaintext highlighter-rouge">gold_ercot_spark_spread</code></td>
    </tr>
  </tbody>
</table>

<p>All pipelines are deployed and scheduled via a Databricks Asset Bundle, with two scheduled jobs running on weekday evenings and Wednesday mornings to keep the data fresh.
On top of this lakehouse sit three Genie spaces, each a natural language SQL interface scoped to a specific domain:</p>

<ul>
  <li>Market Data: prices, EIA inventory/production, ERCOT hub prices and spark spreads</li>
  <li>ERCOT 4CP Analysis: historical peak demand, 2024–2025 curtailment risk model, weather data</li>
  <li>News &amp; Sentiment: rolling ~4-week news window with AI event classification and sentiment scores</li>
</ul>

<p>Genie handles the SQL generation and execution. The agent’s job is to know which Genie to ask.</p>

<hr />

<h2 id="building-the-agent">Building the Agent</h2>

<h3 id="step-1-wrap-each-genie-as-a-langchain-tool">Step 1: Wrap Each Genie as a LangChain Tool</h3>

<p>Each tool is a thin wrapper around the Databricks SDK’s Genie Conversation API. We start a conversation, poll until it completes, and extract the text response and SQL result rows.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">databricks.sdk</span> <span class="kn">import</span> <span class="n">WorkspaceClient</span>
<span class="kn">from</span> <span class="nn">langchain_core.tools</span> <span class="kn">import</span> <span class="n">tool</span>

<span class="n">GENIE_SPACES</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"market_data"</span><span class="p">:</span>    <span class="s">"01f120a563181069924649e4eadfb75e"</span><span class="p">,</span>
    <span class="s">"ercot_4cp"</span><span class="p">:</span>      <span class="s">"01f1227bc4c2113bbb95bcccd6dd634c"</span><span class="p">,</span>
    <span class="s">"news_sentiment"</span><span class="p">:</span> <span class="s">"01f12d623a761d688dd9ab160d273a94"</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">def</span> <span class="nf">_query_genie</span><span class="p">(</span><span class="n">space_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">timeout</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">120</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="n">w</span> <span class="o">=</span> <span class="n">WorkspaceClient</span><span class="p">()</span>
    <span class="n">resp</span> <span class="o">=</span> <span class="n">w</span><span class="p">.</span><span class="n">genie</span><span class="p">.</span><span class="n">start_conversation</span><span class="p">(</span><span class="n">space_id</span><span class="o">=</span><span class="n">space_id</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="n">question</span><span class="p">)</span>

    <span class="n">deadline</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">+</span> <span class="n">timeout</span>
    <span class="k">while</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">&lt;</span> <span class="n">deadline</span><span class="p">:</span>
        <span class="n">msg</span> <span class="o">=</span> <span class="n">w</span><span class="p">.</span><span class="n">genie</span><span class="p">.</span><span class="n">get_message</span><span class="p">(</span>
            <span class="n">space_id</span><span class="o">=</span><span class="n">space_id</span><span class="p">,</span>
            <span class="n">conversation_id</span><span class="o">=</span><span class="n">resp</span><span class="p">.</span><span class="n">conversation_id</span><span class="p">,</span>
            <span class="n">message_id</span><span class="o">=</span><span class="n">resp</span><span class="p">.</span><span class="n">message_id</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="n">status</span> <span class="o">=</span> <span class="n">msg</span><span class="p">.</span><span class="n">status</span><span class="p">.</span><span class="n">value</span> <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">msg</span><span class="p">.</span><span class="n">status</span><span class="p">,</span> <span class="s">"value"</span><span class="p">)</span> <span class="k">else</span> <span class="nb">str</span><span class="p">(</span><span class="n">msg</span><span class="p">.</span><span class="n">status</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">status</span> <span class="o">==</span> <span class="s">"COMPLETED"</span><span class="p">:</span>
            <span class="c1"># Extract natural language response + formatted SQL result rows
</span>            <span class="n">parts</span> <span class="o">=</span> <span class="p">[]</span>
            <span class="k">for</span> <span class="n">att</span> <span class="ow">in</span> <span class="n">msg</span><span class="p">.</span><span class="n">attachments</span> <span class="ow">or</span> <span class="p">[]:</span>
                <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">att</span><span class="p">,</span> <span class="s">"text"</span><span class="p">)</span> <span class="ow">and</span> <span class="n">att</span><span class="p">.</span><span class="n">text</span><span class="p">:</span>
                    <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">att</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
                <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">att</span><span class="p">,</span> <span class="s">"query"</span><span class="p">)</span> <span class="ow">and</span> <span class="n">att</span><span class="p">.</span><span class="n">query</span><span class="p">:</span>
                    <span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="n">c</span><span class="p">.</span><span class="n">name</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">att</span><span class="p">.</span><span class="n">query</span><span class="p">.</span><span class="n">result</span><span class="p">.</span><span class="n">statement_response</span><span class="p">.</span><span class="n">manifest</span><span class="p">.</span><span class="n">schema</span><span class="p">.</span><span class="n">columns</span><span class="p">]</span>
                    <span class="n">rows</span> <span class="o">=</span> <span class="n">att</span><span class="p">.</span><span class="n">query</span><span class="p">.</span><span class="n">result</span><span class="p">.</span><span class="n">statement_response</span><span class="p">.</span><span class="n">result</span><span class="p">.</span><span class="n">data_array</span> <span class="ow">or</span> <span class="p">[]</span>
                    <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="s">" | "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">cols</span><span class="p">)]</span> <span class="o">+</span> <span class="p">[</span><span class="s">" | "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">r</span><span class="p">)</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rows</span><span class="p">[:</span><span class="mi">25</span><span class="p">]]))</span>
            <span class="k">return</span> <span class="s">"</span><span class="se">\n\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">parts</span><span class="p">)</span> <span class="ow">or</span> <span class="s">"No data returned."</span>

        <span class="k">elif</span> <span class="n">status</span> <span class="ow">in</span> <span class="p">(</span><span class="s">"FAILED"</span><span class="p">,</span> <span class="s">"QUERY_RESULT_EXPIRED"</span><span class="p">):</span>
            <span class="k">return</span> <span class="sa">f</span><span class="s">"Genie query failed: </span><span class="si">{</span><span class="n">status</span><span class="si">}</span><span class="s">"</span>

        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>

    <span class="k">return</span> <span class="s">"Genie query timed out."</span>


<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">query_market_data</span><span class="p">(</span><span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Query live commodity prices and EIA supply/demand fundamentals.
    Covers: WTI crude, Brent, Henry Hub, RBOB, heating oil prices;
    US petroleum inventory; natural gas storage; EIA drilling productivity;
    ERCOT hub prices, load actuals, and spark spreads."""</span>
    <span class="k">return</span> <span class="n">_query_genie</span><span class="p">(</span><span class="n">GENIE_SPACES</span><span class="p">[</span><span class="s">"market_data"</span><span class="p">],</span> <span class="n">question</span><span class="p">)</span>


<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">query_ercot_4cp</span><span class="p">(</span><span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Query ERCOT 4CP curtailment risk and peak demand data.
    Covers: historical Four Coincident Peaks 1998-2020; 2024-2025 curtailment risk
    model with enhanced_risk_score and curtailment_alert flags; weather (HDD/CDD)
    by ERCOT load zone; projected load thresholds vs actuals."""</span>
    <span class="k">return</span> <span class="n">_query_genie</span><span class="p">(</span><span class="n">GENIE_SPACES</span><span class="p">[</span><span class="s">"ercot_4cp"</span><span class="p">],</span> <span class="n">question</span><span class="p">)</span>


<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">query_news_sentiment</span><span class="p">(</span><span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Query recent energy market news with AI sentiment analysis.
    Covers: rolling ~4-week news window; AI-classified event types (supply disruption,
    geopolitical event, weather event, regulatory change, etc.); sentiment labels
    (positive, negative, neutral, mixed); commodity and country tags."""</span>
    <span class="k">return</span> <span class="n">_query_genie</span><span class="p">(</span><span class="n">GENIE_SPACES</span><span class="p">[</span><span class="s">"news_sentiment"</span><span class="p">],</span> <span class="n">question</span><span class="p">)</span>
</code></pre></div></div>

<p>The tool docstrings are really important here, this is how the LLM decides which tool to call. The more precise the description, the better the routing.</p>

<h3 id="step-2-wire-it-together-with-langgraph">Step 2: Wire It Together with LangGraph</h3>

<p>With the tools defined, we can wire them together in LangGraph:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.agents</span> <span class="kn">import</span> <span class="n">create_agent</span>
<span class="kn">from</span> <span class="nn">databricks_langchain</span> <span class="kn">import</span> <span class="n">ChatDatabricks</span>

<span class="n">llm</span> <span class="o">=</span> <span class="n">ChatDatabricks</span><span class="p">(</span>
    <span class="n">endpoint</span><span class="o">=</span><span class="s">"databricks-meta-llama-3-3-70b-instruct"</span><span class="p">,</span>
    <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">max_tokens</span><span class="o">=</span><span class="mi">4096</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">agent</span> <span class="o">=</span> <span class="n">create_agent</span><span class="p">(</span><span class="n">llm</span><span class="p">,</span> <span class="n">tools</span><span class="o">=</span><span class="p">[</span><span class="n">query_market_data</span><span class="p">,</span> <span class="n">query_ercot_4cp</span><span class="p">,</span> <span class="n">query_news_sentiment</span><span class="p">],</span>
                     <span class="n">system_prompt</span><span class="o">=</span><span class="n">SYSTEM_PROMPT</span><span class="p">)</span>
</code></pre></div></div>

<p><a href="https://docs.langchain.com/oss/python/migrate/langchain-v1#migrate-to-create_agent"><code class="language-plaintext highlighter-rouge">create_agent</code></a> is the LangChain v1 API for building agents, replacing the deprecated <code class="language-plaintext highlighter-rouge">langgraph.prebuilt.create_react_agent</code>. It builds a ReAct graph where the LLM reasons about which tool to call, calls it, observes the result, and repeats until it has enough information to answer. <code class="language-plaintext highlighter-rouge">ChatDatabricks</code> is provided by the <a href="https://python.langchain.com/docs/integrations/chat/databricks/"><code class="language-plaintext highlighter-rouge">databricks-langchain</code></a> package and handles authentication and routing to any Databricks Model Serving endpoint.</p>

<p>The system prompt defines the agent’s identity and reinforces the routing logic:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are an expert energy market research assistant...

Tool routing:
- query_market_data    → prices, EIA supply/demand, ERCOT power prices/load/spark spreads
- query_ercot_4cp      → 4CP peaks, curtailment risk scores, weather, load forecasting
- query_news_sentiment → recent headlines, event classification, sentiment trends

For multi-domain questions, call multiple tools and synthesise the results into a single answer.
</code></pre></div></div>

<h3 id="step-3-deploy-to-databricks-model-serving">Step 3: Deploy to Databricks Model Serving</h3>

<p>We wrap the agent in an MLflow <code class="language-plaintext highlighter-rouge">PythonModel</code> so it can be deployed as a REST endpoint and queried from the playground, notebooks, or any HTTP client.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">EnergyResearchAgent</span><span class="p">(</span><span class="n">mlflow</span><span class="p">.</span><span class="n">pyfunc</span><span class="p">.</span><span class="n">PythonModel</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">load_context</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
        <span class="n">_configure_credentials</span><span class="p">()</span>  <span class="c1"># inject DATABRICKS_HOST from env
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">agent</span> <span class="o">=</span> <span class="n">_build_graph</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">context</span><span class="p">,</span> <span class="n">model_input</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">model_input</span><span class="p">,</span> <span class="s">"to_dict"</span><span class="p">):</span>
            <span class="n">model_input</span> <span class="o">=</span> <span class="n">model_input</span><span class="p">.</span><span class="n">to_dict</span><span class="p">(</span><span class="n">orient</span><span class="o">=</span><span class="s">"records"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>

        <span class="n">messages</span> <span class="o">=</span> <span class="p">[</span><span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="n">m</span><span class="p">[</span><span class="s">"content"</span><span class="p">])</span>
                    <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">model_input</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"messages"</span><span class="p">,</span> <span class="p">[])</span>
                    <span class="k">if</span> <span class="n">m</span><span class="p">[</span><span class="s">"role"</span><span class="p">]</span> <span class="o">==</span> <span class="s">"user"</span><span class="p">]</span>

        <span class="n">result</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">agent</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"messages"</span><span class="p">:</span> <span class="n">messages</span><span class="p">})</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"choices"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"message"</span><span class="p">:</span> <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"assistant"</span><span class="p">,</span>
                                     <span class="s">"content"</span><span class="p">:</span> <span class="n">result</span><span class="p">[</span><span class="s">"messages"</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">content</span><span class="p">},</span>
                         <span class="s">"finish_reason"</span><span class="p">:</span> <span class="s">"stop"</span><span class="p">,</span> <span class="s">"index"</span><span class="p">:</span> <span class="mi">0</span><span class="p">}],</span>
            <span class="s">"object"</span><span class="p">:</span> <span class="s">"chat.completion"</span><span class="p">,</span>
        <span class="p">}</span>
</code></pre></div></div>

<p>Logging and registering to Unity Catalog:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mlflow</span><span class="p">.</span><span class="n">set_registry_uri</span><span class="p">(</span><span class="s">"databricks-uc"</span><span class="p">)</span>

<span class="k">with</span> <span class="n">mlflow</span><span class="p">.</span><span class="n">start_run</span><span class="p">():</span>
    <span class="n">mlflow</span><span class="p">.</span><span class="n">pyfunc</span><span class="p">.</span><span class="n">log_model</span><span class="p">(</span>
        <span class="n">name</span><span class="o">=</span><span class="s">"energy_research_agent"</span><span class="p">,</span>
        <span class="n">python_model</span><span class="o">=</span><span class="n">EnergyResearchAgent</span><span class="p">(),</span>
        <span class="n">code_paths</span><span class="o">=</span><span class="p">[</span><span class="s">"agents/energy_research_agent/agent.py"</span><span class="p">],</span>
        <span class="n">pip_requirements</span><span class="o">=</span><span class="p">[</span><span class="s">"langgraph&gt;=1.0"</span><span class="p">,</span> <span class="s">"langchain&gt;=1.0"</span><span class="p">,</span>
                          <span class="s">"databricks-langchain"</span><span class="p">,</span> <span class="s">"databricks-sdk&gt;=0.40"</span><span class="p">],</span>
        <span class="n">registered_model_name</span><span class="o">=</span><span class="s">"energy_research_agent"</span><span class="p">,</span>
    <span class="p">)</span>
</code></pre></div></div>

<p>One important detail: the serving container has no default Databricks credentials. The fix is to inject them via secrets when creating the endpoint:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ServedEntityInput</span><span class="p">(</span>
    <span class="n">entity_name</span><span class="o">=</span><span class="s">"energy_research_agent"</span><span class="p">,</span>
    <span class="n">entity_version</span><span class="o">=</span><span class="s">"3"</span><span class="p">,</span>
    <span class="n">workload_size</span><span class="o">=</span><span class="s">"Small"</span><span class="p">,</span>
    <span class="n">scale_to_zero_enabled</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">environment_vars</span><span class="o">=</span><span class="p">{</span>
        <span class="s">"DATABRICKS_HOST"</span><span class="p">:</span> <span class="s">"https://&lt;your-workspace&gt;.cloud.databricks.com"</span><span class="p">,</span>
        <span class="s">"DATABRICKS_TOKEN"</span><span class="p">:</span> <span class="s">"{{secrets/&lt;your-scope&gt;/databricks_pat}}"</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">)</span>
</code></pre></div></div>

<hr />

<h2 id="seeing-inside-mlflow-tracing">Seeing Inside: MLflow Tracing</h2>

<p>Deploying the agent is one thing. Understanding what it’s actually doing is another. With <code class="language-plaintext highlighter-rouge">mlflow.langchain.autolog</code>, every run produces a full execution trace. Tool calls, LLM steps, inputs, outputs, and latency, all viewable in the Experiments UI.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mlflow</span><span class="p">.</span><span class="n">langchain</span><span class="p">.</span><span class="n">autolog</span><span class="p">(</span><span class="n">log_traces</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="k">with</span> <span class="n">mlflow</span><span class="p">.</span><span class="n">start_run</span><span class="p">(</span><span class="n">run_name</span><span class="o">=</span><span class="s">"full market brief"</span><span class="p">):</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"messages"</span><span class="p">:</span> <span class="p">[</span><span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span>
        <span class="s">"Give me a full energy market brief: WTI price, ERCOT curtailment risk, "</span>
        <span class="s">"and any recent news that could move prices."</span>
    <span class="p">)]})</span>
</code></pre></div></div>

<p>Opening the trace for the cross-domain question, you can see the full ReAct loop:</p>

<ol>
  <li><strong>LLM thinks</strong>: “This question spans prices, ERCOT risk, and news. I need all three tools.”</li>
  <li><strong>Calls <code class="language-plaintext highlighter-rouge">query_market_data</code></strong>: asks Genie for latest WTI and nat gas prices; gets back SQL results</li>
  <li><strong>Calls <code class="language-plaintext highlighter-rouge">query_ercot_4cp</code></strong>: asks Genie for the current risk score and latest curtailment alert</li>
  <li><strong>Calls <code class="language-plaintext highlighter-rouge">query_news_sentiment</code></strong>: asks Genie for recent articles and aggregate sentiment on crude</li>
  <li><strong>LLM synthesises</strong>: writes a coherent market brief grounded in the actual data values</li>
</ol>

<p>Each step shows the exact question sent to Genie, the SQL it generated, the raw result rows, and how long each step took. For a question that hits all three Genie spaces, total latency is typically 30–60 seconds which is dominated by Genie query time, not LLM reasoning.</p>

<hr />

<h2 id="tuning-your-genie-sub-agents-from-instructions-to-certified-sql">Tuning Your Genie Sub-Agents: From Instructions to Certified SQL</h2>

<p>I had to do a ton of fine-tuning on the Genie spaces to get the correct responses back.</p>

<p>There are three tiers of guidance you can give a Genie space, in increasing order of reliability:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Mechanism</th>
      <th>How it works</th>
      <th>Reliability</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td><strong>Instructions</strong></td>
      <td>Natural language guidance block</td>
      <td>Genie <em>may</em> follow it</td>
    </tr>
    <tr>
      <td>2</td>
      <td><strong>Sample questions</strong></td>
      <td>Example questions shown in the UI</td>
      <td>Steers question interpretation</td>
    </tr>
    <tr>
      <td>3</td>
      <td><strong>Certified SQL examples</strong> (<code class="language-plaintext highlighter-rouge">example_question_sqls</code>)</td>
      <td>Exact question → SQL pairs</td>
      <td>Genie uses this SQL verbatim</td>
    </tr>
  </tbody>
</table>

<h3 id="the-news-genie-problem-wrong-join">The News Genie Problem: Wrong Join</h3>

<p>Out of the box, when asked “What geopolitical events are affecting energy markets?”, the news Genie returned this:</p>

<blockquote>
  <p><em>“The only geopolitical event type affecting energy markets in the data is geopolitical_supply_risk, which is referenced in 50 articles.”</em></p>
</blockquote>

<p>Technically correct. Completely useless. Genie took the path of least resistance and it counted rows in <code class="language-plaintext highlighter-rouge">silver_news_sentiment</code> and called it done. The actual article titles, descriptions, and publication dates sitting in <code class="language-plaintext highlighter-rouge">bronze_news_articles</code> were never touched.</p>

<p><strong>Instructions fixed this.</strong> Adding explicit guidance to always JOIN the two tables and never return counts when the user wants headlines produced the response you’d actually want:</p>

<blockquote>
  <p><em>“Recent geopolitical events affecting energy markets include ongoing conflicts in the Middle East, particularly the U.S.-Israeli war with Iran, which has led to supply disruptions and volatility in oil prices. Notable data points include:</em></p>
  <ul>
    <li><em>Strait of Hormuz closure impacting global oil and gas transit</em></li>
    <li><em>Middle East volatility causing oil price swings</em></li>
    <li><em>EU postponing Russian oil ban due to Iran war</em></li>
    <li><em>Big Oil seeking new resources as infrastructure is damaged”</em></li>
  </ul>
</blockquote>

<p>The key instructions for the news Genie space:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Always JOIN bronze_news_articles with silver_news_sentiment on
bronze_news_articles.article_id = silver_news_sentiment.article_id when answering
questions about news events, headlines, or market developments.

Never return only counts or category summaries — show actual article titles
and descriptions. ORDER BY published_at DESC.

For geopolitical questions, filter WHERE event_type IN
('geopolitical_event', 'geopolitical_supply_risk') and return the full
article titles and descriptions joined from bronze_news_articles.
</code></pre></div></div>

<h3 id="the-resulting-agent-response">The Resulting Agent Response</h3>

<p>With both fixes in place, a cross-domain question that previously returned stale data and category counts now produces:</p>

<p><strong>Question:</strong> <em>“What are the most recent energy market headlines and is overall sentiment bullish or bearish on crude oil, as well as the current prices?”</em></p>

<p><strong>Agent response:</strong></p>
<blockquote>
  <p><em>Based on the recent energy market headlines, the overall sentiment on crude oil is leaning towards neutral and positive, with 5 neutral and 3 positive headlines out of 8 total headlines. The current prices of crude oil are $111.54 for WTI and $109.03 for Brent, as of April 2, 2026. These prices reflect the market’s response to the geopolitical events and other factors mentioned in the headlines, such as the record highs in Brent crude oil price and the halt in crude oil production by Kuwait.</em></p>
</blockquote>

<p>Sentiment from news articles should probably be treated as directional, not predictive.</p>

<p>This is the agent firing both tools, <code class="language-plaintext highlighter-rouge">query_market_data</code> for current prices (now correctly hitting <code class="language-plaintext highlighter-rouge">silver_commodity_ohlcv</code> via the certified SQL example) and <code class="language-plaintext highlighter-rouge">query_news_sentiment</code> for headlines and scores. Before the Genie tuning, this same question returned a stale March price and a single line counting news categories. After tuning, it returns a grounded, multi-signal market view.</p>

<p><a href="/blog/assets/images/blog_images/building-an-ai-energy-analyst-langgraph-databricks-genie-for-multi-domain-market-research/agent_question_and_reply.png" target="_blank"><img src="/blog/assets/images/blog_images/building-an-ai-energy-analyst-langgraph-databricks-genie-for-multi-domain-market-research/agent_question_and_reply.png" alt="Agent Question and Response" /></a></p>

<p>Here is what the MlFlow trace looks like as well:</p>

<p><a href="/blog/assets/images/blog_images/building-an-ai-energy-analyst-langgraph-databricks-genie-for-multi-domain-market-research/agent_trace.png" target="_blank"><img src="/blog/assets/images/blog_images/building-an-ai-energy-analyst-langgraph-databricks-genie-for-multi-domain-market-research/agent_trace.png" alt="Agent Trace" /></a></p>

<h3 id="genie-tuning-principles">Genie Tuning Principles</h3>

<ol>
  <li><strong>Start with instructions for JOIN and output format issues.</strong> They’re easy to write and usually fix Genie picking the wrong path when the correct answer requires combining two tables.</li>
  <li><strong>Use certified SQL examples for table selection issues.</strong> When Genie consistently chooses the wrong table despite instructions, <code class="language-plaintext highlighter-rouge">example_question_sqls</code> is the reliable fix. It bypasses Genie’s SQL generation entirely for known question patterns.</li>
  <li><strong>Clarify column semantics.</strong> Series codes like <code class="language-plaintext highlighter-rouge">WCRSTUS1</code> are opaque. Spelling out what they mean (crude stocks, production, imports) helps Genie pick the right series.</li>
  <li><strong>Write instructions that address known failure modes directly.</strong> “Never return counts when the user asks about specific events” is more actionable than “be detailed”.</li>
</ol>

<p>This investment in Genie quality pays compounding dividends. Every query that flows through the agent benefits, whether it comes from a user in the Genie UI, from the LangGraph agent, or from a scheduled notebook. Time spent tuning the sub-agents is time saved debugging the orchestration layer.</p>

<hr />

<h2 id="what-makes-this-pattern-powerful">What Makes This Pattern Powerful</h2>

<p>A few things stand out about this approach compared to alternatives:</p>

<p><strong>Genie as a sub-agent is genuinely useful.</strong> Genie handles all the SQL complexity, schema understanding, joins, aggregations, formatting. The LangGraph agent just needs to ask it a question in plain English and get back a structured result. You get the flexibility of natural language querying without exposing raw database access to the agent.</p>

<p><strong>Scoped Genie spaces dramatically improve accuracy.</strong> I have found a single Genie covering all tables writes worse SQL on domain-specific questions because the schema is too broad. Three tightly-scoped Genie spaces write better SQL on their respective domains, and the agent router compensates for the added complexity.</p>

<p><strong>Tool descriptions drive routing quality.</strong> The LLM’s routing decisions are entirely based on the tool docstrings. Writing precise, example-rich descriptions, what the tool covers, what it doesn’t, when to use it, is the highest-leverage thing you can do to improve agent behaviour. No fine-tuning required.</p>

<p><strong>MLflow Tracing makes debugging tractable.</strong> Without tracing, a wrong or incomplete answer is hard to diagnose. Was it the routing? The Genie SQL? The LLM synthesis? With traces, you can see exactly where things went wrong.</p>

<hr />

<h2 id="whats-next">What’s Next</h2>

<p>This was a lot of fun to build and I think as I continue bringing this data into the lakehouse there will continue to be interesting analytical and AI opportunities.</p>

<p>A few natural extensions from here:</p>

<ul>
  <li><strong>Databricks App</strong>: wrap the endpoint in a Streamlit chat UI so the agent is accessible to non-technical users without an API client</li>
  <li><strong>Memory</strong>: use Lakebase to persist conversation history and user preferences across sessions, enabling multi-session research threads</li>
  <li><strong>Evaluation</strong>: use MLflow’s <code class="language-plaintext highlighter-rouge">genai.evaluate()</code> to score routing accuracy and answer quality against a golden set of questions</li>
</ul>

<p>Thanks for reading 😀!</p>]]></content><author><name>Conner Schiissler</name></author><category term="Energy" /><category term="Databricks" /><category term="Oil &amp; Gas" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">What the Spark Spread Tells Us About Texas Grid Stress</title><link href="https://schiiss.github.io/blog/energy/what-spark-spread-tells-you-about-texas-grid-stress/" rel="alternate" type="text/html" title="What the Spark Spread Tells Us About Texas Grid Stress" /><published>2026-03-28T07:00:00-07:00</published><updated>2026-03-28T07:00:00-07:00</updated><id>https://schiiss.github.io/blog/energy/what-spark-spread-tells-you-about-texas-grid-stress</id><content type="html" xml:base="https://schiiss.github.io/blog/energy/what-spark-spread-tells-you-about-texas-grid-stress/"><![CDATA[<p><img src="/blog/assets/images/blog_images/what-spark-spread-tells-you-about-texas-grid-stress/blog_image.png" alt="What the Spark Spread Tells Us About Texas Grid Stress" style="display:block; margin:0 auto;" /></p>

<blockquote>
  <p>Note: All prices and spark spreads shown here are derived from ERCOT HB_NORTH real-time settlement point prices (15-minute intervals), Henry Hub front-month proxy (NG=F continuous futures), and publicly available weather data, all ingested and modeled in Databricks for this project. Spark spreads are calculated using assumed heat rates (7.0 MMBtu/MWh for CCGT, 10.0 for peakers) and represent fuel-cost margins only, not total plant profitability. Values are based on my processed dataset and may differ from official ERCOT settlements, nodal prices, or plant-specific economics. HB_NORTH broadly represents the Dallas/Fort Worth region.</p>
</blockquote>

<p>Over the past few months, I’ve been intrigued by energy markets, especially since I’ve been working more with the trading team at Plains on use cases like crude blending. To get a better feel for the data energy traders analyze daily, I pulled in data from <a href="https://www.ercot.com/">ERCOT</a>, the <a href="https://www.eia.gov/">U.S. Energy Information Administration</a>, and other sources into Databricks, focusing on the Texas market. This also gave me a chance to get hands-on with <a href="https://learn.microsoft.com/en-us/azure/databricks/ldp/concepts">Lakeflow Spark Declarative Pipelines</a>. While I wasn’t the biggest fan of its predecessor, DLT, I’ve heard along with the new branding there are significant improvements to the product as well.</p>

<p>The idea I had to start was to look at these various data sources to analyze how extreme weather events impact gas and power prices.</p>

<p>On January 26, 2026, a cold front pushed into Dallas and temperatures plummeted more than 30 degrees below the seasonal average. Over the next three days, something remarkable happened to energy markets:</p>

<table>
  <thead>
    <tr>
      <th>Date</th>
      <th>Dallas Temp</th>
      <th>Gas Price</th>
      <th>Avg Power Price</th>
      <th>CCGT Spark Spread</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Jan 21</td>
      <td>57°F</td>
      <td>$4.88/MMBtu</td>
      <td>$22.44/MWh</td>
      <td>-$11.69</td>
    </tr>
    <tr>
      <td>Jan 26</td>
      <td>21°F</td>
      <td>$6.80/MMBtu</td>
      <td>$168/MWh</td>
      <td>+$119.91</td>
    </tr>
    <tr>
      <td>Jan 27</td>
      <td>30°F</td>
      <td>$6.95/MMBtu</td>
      <td>$116/MWh</td>
      <td>+$67.05</td>
    </tr>
    <tr>
      <td>Jan 28</td>
      <td>34°F</td>
      <td>$7.46/MMBtu</td>
      <td>$200/MWh</td>
      <td>+$147.67</td>
    </tr>
    <tr>
      <td>Jan 29</td>
      <td>42°F</td>
      <td>$3.92/MMBtu</td>
      <td>$27/MWh</td>
      <td>-$0.81</td>
    </tr>
  </tbody>
</table>

<p>Gas prices jumped sharply. Power prices more than quintupled. And then, as temperatures recovered on January 29, the spread turned negative and while January 30 showed some residual elevation as the grid normalized, the event was effectively over within two days.</p>

<blockquote>
  <p><strong>The spark spread captures the core economics in a single number.</strong></p>
</blockquote>

<div id="cold-snap-chart" style="width:100%;"></div>
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>

<script>
(function() {
  // Data from energyiq.market_data.gold_ercot_spark_spread + silver_weather_hdd_cdd
  // HB_NORTH, Jan 20 – Feb 5 2026. Weekends excluded (no ERCOT settlement data).
  var dates = [
    '2026-01-20','2026-01-21','2026-01-22','2026-01-23',
    '2026-01-26','2026-01-27','2026-01-28','2026-01-29','2026-01-30',
    '2026-02-02','2026-02-03','2026-02-04','2026-02-05'
  ];

  var powerPrice  = [43.51, 22.44, 37.36, 25.41, 167.51, 115.73, 199.89, 26.62, 65.95, 16.49, 19.63, 22.42, 18.41];
  var sparkSpread = [16.16, -11.69, 2.04, -11.52, 119.91, 67.05, 147.67, -0.81, 35.47, -6.17, -3.55, -1.83, -6.15];
  var temperature = [42.0, 56.6, 48.3, 47.0, 21.4, 29.7, 34.0, 41.5, 37.0, 49.7, 58.7, 47.4, 49.9];

  var annotation = {
    x: '2026-01-26', y: 167.51,
    xref: 'x', yref: 'y',
    text: 'Cold front arrives<br>Avg 21°F in Dallas',
    showarrow: true,
    arrowhead: 2,
    arrowcolor: '#fc8181',
    font: { color: '#fc8181', size: 11 },
    ax: 70, ay: -40
  };

  var tracePower = {
    x: dates, y: powerPrice,
    name: 'Avg Power Price ($/MWh)',
    type: 'scatter', mode: 'lines+markers',
    line: { color: '#63b3ed', width: 2 }, marker: { size: 5 },
    hovertemplate: '%{x}<br>Power: $%{y}/MWh<extra></extra>'
  };

  var traceSpread = {
    x: dates, y: sparkSpread,
    name: 'CCGT Spark Spread ($/MWh)',
    type: 'scatter', mode: 'lines+markers',
    line: { color: '#f6ad55', width: 2 }, marker: { size: 5 },
    hovertemplate: '%{x}<br>Spread: $%{y}/MWh<extra></extra>'
  };

  var traceTemp = {
    x: dates, y: temperature,
    name: 'Dallas Avg Temp (°F)',
    type: 'scatter', mode: 'lines+markers',
    line: { color: '#fc8181', width: 2 }, marker: { size: 5 },
    xaxis: 'x2', yaxis: 'y2',
    hovertemplate: '%{x}<br>Temp: %{y}°F<extra></extra>'
  };

  function buildLayout(mobile) {
    var ann = mobile
      ? Object.assign({}, annotation, { ax: -60, ay: -30, font: { color: '#fc8181', size: 10 } })
      : annotation;
    return {
      grid: { rows: 2, columns: 1, pattern: 'independent', roworder: 'top to bottom' },
      paper_bgcolor: '#1a1a2e',
      plot_bgcolor: '#1a1a2e',
      font: { color: '#e2e8f0', family: 'monospace', size: mobile ? 11 : 12 },
      margin: mobile
        ? { t: 20, b: 80, l: 50, r: 10 }
        : { t: 50, b: 40, l: 60, r: 20 },
      height: mobile ? 420 : 500,
      showlegend: true,
      legend: mobile
        ? { orientation: 'h', x: 0, y: -0.18, font: { size: 10 } }
        : { orientation: 'h', x: 0, y: 1.08, font: { size: 12 } },
      xaxis:  { gridcolor: '#2d3748', linecolor: '#4a5568', tickformat: '%b %d' },
      yaxis:  { title: '$/MWh', gridcolor: '#2d3748', linecolor: '#4a5568', zeroline: true, zerolinecolor: '#718096' },
      xaxis2: { gridcolor: '#2d3748', linecolor: '#4a5568', tickformat: '%b %d' },
      yaxis2: { title: '°F', gridcolor: '#2d3748', linecolor: '#4a5568' },
      annotations: [ann]
    };
  }

  var config = { responsive: true, displayModeBar: false };
  var isMobile = window.innerWidth < 600;

  Plotly.newPlot('cold-snap-chart', [tracePower, traceSpread, traceTemp], buildLayout(isMobile), config);

  window.addEventListener('resize', function() {
    var mobile = window.innerWidth < 600;
    if (mobile !== isMobile) {
      isMobile = mobile;
      Plotly.relayout('cold-snap-chart', buildLayout(mobile));
    }
  });
})();
</script>

<p>The two charts above tell the same story from different angles. The top chart tracks two lines: average power price (what electricity sold for at the HB_NORTH hub) and the CCGT spark spread (the profit margin for a gas plant generating that electricity). On January 21 both lines are flat and low, a mild day with no stress on the grid. Then the cold front arrives on January 26 and both lines spike sharply upward, the power price climbing to $168/MWh and the spread jumping to $120. They stay elevated through January 28 before collapsing back to near zero on January 29 the moment temperatures recovered.</p>

<p>The bottom chart shows why. Dallas average temperature is plotted across the same date range. The V-shaped dip, bottoming out at an average of 21°F on January 26, is the cold snap. The visual alignment between the temperature dip and the price spike in the top chart is the entire point: weather drove demand, demand drove prices, and prices drove the spread. The mechanism isn’t abstract, you can see it happen in real time across two charts stacked on top of each other.</p>

<p>The fact that both lines in the top chart move together matters too. In a summer heat wave, power prices spike but gas prices stay cheap, so the spread explodes far above the power price line. In this winter event, gas prices spiked alongside power prices. Gas infrastructure tightened as heating demand and power generation competed for the same fuel. The spread was still strongly positive, but it was more compressed than a typical August afternoon. That compression is the signature of a winter stress event versus a summer one.</p>

<hr />

<h2 id="what-is-the-spark-spread">What Is The Spark Spread?</h2>

<p><a href="https://www.eia.gov/todayinenergy/detail.php?id=9911">The spark spread is a common metric for estimating the profitability of natural gas-fired electric generators</a>. The spark spread is the profit margin a gas-fired power plant earns for generating one megawatt-hour of electricity. It answers one question: is it worth burning gas right now? They are calculated with the following equation:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Spark spread ($/MWh) = power price ($/MWh) – [natural gas price ($/mmBtu) × heat rate (mmBtu/MWh)]
</code></pre></div></div>

<p>Heat rate measures how many MMBtu of fuel a power plant must burn to generate 1 MWh of electricity. Lower heat rates mean higher efficiency.</p>

<ul>
  <li><strong>CCGT</strong> (combined cycle gas turbine, ~49% efficient): <code class="language-plaintext highlighter-rouge">7.0 MMBtu/MWh</code>: the workhorses that run most of the day.</li>
  <li><strong>Peaker</strong> (simple cycle, ~34% efficient): <code class="language-plaintext highlighter-rouge">10.0 MMBtu/MWh</code>: expensive to run, called only when the grid is stressed.</li>
</ul>

<p>A positive spread means the plant covers its fuel cost and makes money generating electricity. A negative spread means every megawatt-hour it produces loses money on fuel alone.</p>

<p>On January 28, with power at $200/MWh and gas at $7.46/MMBtu, the CCGT spread was $200 − (7.0 × $7.46) = $147.67. Many gas plants were earning extremely high margins. On January 21, with power at just $22.44 and a spread of -$11.69, few plants would choose to run purely for energy revenue.</p>

<p>This is the same plant, same fuel, a week apart.</p>

<hr />

<h2 id="where-spark-spread-falls-short">Where Spark Spread Falls Short</h2>

<p>The spark spread is widely used precisely because it is simple, but that simplicity comes with real limitations worth understanding before drawing conclusions from it.</p>

<p>In <em>Energy Trading &amp; Investing</em>, David W. Edwards notes that spark spread can overestimate actual plant profitability by <strong>20–30%</strong> once operational constraints are accounted for. The core criticism is that spark spread models are reactive. They assume a generator simply turns on when the spread is positive and turns off when it isn’t. Real plant dispatch doesn’t work that way.</p>

<p>A few of the key gaps:</p>

<ul>
  <li><strong>Ramp time</strong>: CCGTs take 1–4 hours to reach full output; peakers 10–30 minutes. A price spike that resolves before a cold plant can respond means the realized margin is lower than the spread implies. The optimal dispatch schedule has to be anticipated, not just observed.</li>
  <li><strong>Start-up costs</strong>: Lighting off a plant burns fuel and causes mechanical wear. A brief positive spread may not justify the cost of starting, especially if the window is short.</li>
  <li><strong>Fixed costs excluded</strong>: Spark spread covers fuel cost only. It does not account for pipeline costs, fuel-related finance charges, variable O&amp;M, taxes, or fixed expenses. A plant can show a positive spark spread and still lose money overall.</li>
  <li><strong>Hub vs. nodal prices</strong>: HB_NORTH is a trading hub. Actual plant economics depend on their nodal price, which can diverge significantly due to transmission congestion.</li>
</ul>

<p>None of this makes spark spread useless. It remains one of the most widely used signals for understanding generator economics and grid stress. But as the <a href="https://www.eia.gov/todayinenergy/detail.php?id=9911">EIA</a> puts it, spark spread is an indicator of market conditions and not necessarily an exact measure of profitability for any one specific generator. Read it as directional, not precise.</p>

<hr />

<h2 id="two-types-of-grid-stress-and-they-look-completely-different">Two Types of Grid Stress (And They Look Completely Different)</h2>

<p>The Texas grid experiences stress in two different ways and the spark spread behaves differently in each.</p>

<h3 id="summer-heat-waves-when-gas-plants-win">Summer Heat Waves: When Gas Plants Win</h3>

<p>August is often the best month to own a gas plant in Texas. August has the highest average CCGT spread at $19.82, making it the most profitable month to own a gas plant in Texas. The next highest months are May ($18.58) and January ($13.14), but August stands out as the clear leader. Temperatures regularly exceed 100°F, air conditioning load pushes demand to annual peaks, and gas prices stay low in summer because heating demand is absent. As a result, power prices spike while fuel costs stay cheap.</p>

<p>On August 20, 2024, the average power price at HB_NORTH hit $226/MWh with gas at $2.20/MMBtu. The CCGT spread that day was $210/MWh. The maximum interval price hit $4,853/MWh. In the entire month of August 2024, there was not a single day with a negative spark spread in my dataset, and according to the dataset I generated, CCGT plants were profitable 88% of all 15-minute intervals.</p>

<h3 id="shoulder-season-when-gas-plants-lose-money-every-day">Shoulder Season: When Gas Plants Lose Money Every Day</h3>

<p>Flip the calendar to late February or early March and the picture reverses completely. Temperatures are mild, demand is low, but gas prices remain elevated from winter. The spread collapses.</p>

<p>These are what we call the dead days, when the spread is so negative that not a single 15-minute interval is profitable:</p>

<table>
  <thead>
    <tr>
      <th>Date</th>
      <th>Avg Power</th>
      <th>Gas Price</th>
      <th>CCGT Spread</th>
      <th>% of Day Profitable</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Feb 26, 2025</td>
      <td>$8.64</td>
      <td>$3.91</td>
      <td>-$18.70</td>
      <td>0%</td>
    </tr>
    <tr>
      <td>Nov 28, 2025 (Thanksgiving)</td>
      <td>$17.53</td>
      <td>$4.85</td>
      <td>-$16.42</td>
      <td>0%</td>
    </tr>
    <tr>
      <td>Nov 11, 2025</td>
      <td>$15.17</td>
      <td>$4.57</td>
      <td>-$16.79</td>
      <td>6%</td>
    </tr>
    <tr>
      <td>Jan 2, 2026</td>
      <td>$10.92</td>
      <td>$3.62</td>
      <td>-$14.41</td>
      <td>0%</td>
    </tr>
  </tbody>
</table>

<p>On February 26, 2025, not one 15-minute interval across the entire day covered a CCGT plant’s fuel cost.</p>

<hr />

<h2 id="the-seasonal-calendar">The Seasonal Calendar</h2>

<p>The following statistics are calculated from my dataset covering Dec 2023–Mar 2026.</p>

<p>Zoom out, and the pattern holds true year over year. The table below shows multi-year averages at HB_NORTH (Dec 2023–Mar 2026):</p>

<table>
  <thead>
    <tr>
      <th>Month</th>
      <th>Avg CCGT Spread</th>
      <th>% of Day Profitable</th>
      <th>Negative Spread Days</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>August</td>
      <td>$19.82</td>
      <td>86%</td>
      <td>0</td>
    </tr>
    <tr>
      <td>May</td>
      <td>$18.58</td>
      <td>53%</td>
      <td>9</td>
    </tr>
    <tr>
      <td>March</td>
      <td>$1.70</td>
      <td>40%</td>
      <td>33</td>
    </tr>
    <tr>
      <td>November</td>
      <td>$2.22</td>
      <td>40%</td>
      <td>22</td>
    </tr>
    <tr>
      <td>December</td>
      <td>$2.28</td>
      <td>42%</td>
      <td>30</td>
    </tr>
  </tbody>
</table>

<p>March has the most negative-spread days of any month, as winter gas prices linger while mild temperatures keep power demand low. December is close behind, cold enough to spike gas, but often not cold enough to push power prices high enough to compensate. The rare exception: when a proper freeze arrives, like January 2026.</p>

<div id="seasonal-chart" style="width:100%;"></div>
<script>
(function() {
  // Data from energyiq.market_data.gold_ercot_spark_spread
  // HB_NORTH, grouped by month (Jan–Dec 2025)
  var monthLabels = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'];

  var ccgtSpread   = [13.14, 4.04, 1.70, 10.05, 18.58, 7.70, 9.37, 19.82, 8.84, 6.07, 2.22, 2.28];
  var peakerSpread = [2.47, -4.57, -7.29, 2.28, 9.80, -2.01, 1.06, 12.39, 0.69, -2.82, -8.89, -8.41];
  var pctProfitable = [53.9, 43.4, 40.0, 52.3, 52.7, 57.9, 70.2, 85.9, 71.5, 53.3, 40.2, 42.0];

  var traceCCGT = {
    x: monthLabels, y: ccgtSpread,
    name: 'CCGT Spread ($/MWh)',
    type: 'bar', marker: { color: '#63b3ed' },
    hovertemplate: '%{x}<br>CCGT: $%{y}/MWh<extra></extra>'
  };

  var tracePeaker = {
    x: monthLabels, y: peakerSpread,
    name: 'Peaker Spread ($/MWh)',
    type: 'bar', marker: { color: '#f6ad55' },
    hovertemplate: '%{x}<br>Peaker: $%{y}/MWh<extra></extra>'
  };

  var tracePct = {
    x: monthLabels, y: pctProfitable,
    name: '% of Day Profitable',
    type: 'bar',
    marker: { color: '#68d391' },
    xaxis: 'x2', yaxis: 'y2',
    hovertemplate: '%{x}<br>%{y}% of intervals profitable<extra></extra>'
  };

  function buildLayout(mobile) {
    return {
      grid: { rows: 2, columns: 1, pattern: 'independent', roworder: 'top to bottom' },
      paper_bgcolor: '#1a1a2e',
      plot_bgcolor: '#1a1a2e',
      font: { color: '#e2e8f0', family: 'monospace', size: mobile ? 11 : 12 },
      margin: mobile
        ? { t: 20, b: 80, l: 50, r: 10 }
        : { t: 50, b: 40, l: 60, r: 20 },
      height: mobile ? 420 : 500,
      barmode: 'group',
      showlegend: true,
      legend: mobile
        ? { orientation: 'h', x: 0, y: -0.18, font: { size: 10 } }
        : { orientation: 'h', x: 0, y: 1.08, font: { size: 12 } },
      xaxis:  { gridcolor: '#2d3748', linecolor: '#4a5568' },
      yaxis:  { title: '$/MWh', gridcolor: '#2d3748', linecolor: '#4a5568', zeroline: true, zerolinecolor: '#718096' },
      xaxis2: { gridcolor: '#2d3748', linecolor: '#4a5568' },
      yaxis2: { title: '% Profitable', gridcolor: '#2d3748', linecolor: '#4a5568', range: [0, 100] }
    };
  }

  var config = { responsive: true, displayModeBar: false };
  var isMobile = window.innerWidth < 600;

  Plotly.newPlot('seasonal-chart', [traceCCGT, tracePeaker, tracePct], buildLayout(isMobile), config);

  window.addEventListener('resize', function() {
    var mobile = window.innerWidth < 600;
    if (mobile !== isMobile) {
      isMobile = mobile;
      Plotly.relayout('seasonal-chart', buildLayout(mobile));
    }
  });
})();
</script>

<p>The charts show the economics: August dominates with $20/MWh CCGT spreads, 86% profitable intervals. Shoulder months dip negative. Peakers struggle everywhere except summer peaks.</p>

<p>Peakers burn 43% more gas per MWh, so they need much higher prices to break even. They exist because August profits cover the dead months. The spark spread tells you exactly when the grid is stressed and how much plants are making (or losing).</p>

<hr />

<h2 id="the-demand-side-mirror-what-4cp-has-to-do-with-it">The Demand Side Mirror: What 4CP Has to Do With It</h2>

<p>Gas plant operators aren’t the only ones watching the spark spread. Every large industrial customer in Texas, such as refineries, petrochemical plants, data centers, and aluminum smelters, has someone whose entire job in summer is monitoring ERCOT load in real time.</p>

<p>The reason is something called <strong><a href="https://medium.com/industrial-sun-insights/understanding-ercots-4cp-demand-charge-759c02034120">Four Coincident Peaks (4CP)</a></strong>. Each year, ERCOT identifies the 4 highest 15-minute demand intervals during June through September. Those four moments determine each large customer’s share of transmission charges for the following 12 months. For a major industrial facility, the bill can run into millions of dollars. Getting caught at full load during a 4CP event or curtailing prematurely on a false alarm has significant financial consequences.</p>

<p>In a <a href="https://medium.com/industrial-sun-insights/understanding-ercots-4cp-demand-charge-759c02034120">blog</a> I read, they highlight just how extreme 4CP costs can be. For example, a CenterPoint customer facing a $56.51/kW 4CP rate would pay roughly <strong>$2.8M</strong> annually for a 50 MW load. This is a cost determined entirely by their demand during just four 15-minute intervals across the summer. In effect, a single hour of peak demand can drive millions in yearly charges.</p>

<p>In one of the tables we have generated we have a column called <code class="language-plaintext highlighter-rouge">economic_ratio_ccgt</code> that represents the ratio of intervals where the spark spread for a CCGT (Combined Cycle Gas Turbine) plant is economically positive compared to the total intervals in a day. On the hottest summer days in our dataset, that ratio hits <strong>1.0</strong>, meaning all 96 intervals were profitable, power prices stayed elevated all day, and the grid was running tight from open to close. Those are the days when 4CP risk is highest.</p>

<p>The same signal that tells a gas plant operator “run hard all day” tells an industrial energy manager “this might be the one.” Generators are sprinting. Industrials are watching every interval. Both are responding to the same underlying condition and that is scarcity on the Texas grid</p>

<hr />

<h2 id="building-the-data-platform-on-databricks">Building the Data Platform on Databricks</h2>

<p>Analyzing spark spreads across 700+ settlement nodes, 96 intervals per day, and multiple years of history requires more than a few CSVs and notebooks. To make this analysis repeatable, I built a small data platform to continuously ingest, clean, and model power, gas, and weather data.</p>

<p>The platform is built on <a href="https://www.databricks.com/">Databricks</a> using Lakeflow Spark Declarative Pipelines (SDP). At a high level, it follows a medallion-style architecture:</p>

<ul>
  <li><strong>Bronze</strong>: Raw ingestion of ERCOT hub prices, Henry Hub gas data, and weather feeds</li>
  <li><strong>Silver</strong>: Cleaned and aligned time series across all sources</li>
  <li><strong>Gold</strong>: Derived metrics like spark spreads, profitability ratios, and daily aggregates</li>
</ul>

<p>This structure makes it easy to move from raw market data to something analytically useful without constantly rewriting transformation logic.</p>

<p>SDP (formerly Delta Live Tables) handles much of the pipeline orchestration and data quality enforcement. One area where it stands out is how naturally it supports things like slowly changing dimensions and built-in expectations (for example, dropping or flagging bad data at ingestion time). You can read more about the concepts <a href="https://learn.microsoft.com/en-us/azure/databricks/ldp/concepts">here</a> and expectations <a href="https://learn.microsoft.com/en-us/azure/databricks/ldp/expectations">here</a>.</p>

<p>That said, I still prefer the flexibility of writing custom PySpark when I need tighter control or easier local testing.</p>

<p>Overall, the goal wasn’t to build a perfect platform, but a fast, flexible one that makes it easy to explore questions like:</p>

<ul>
  <li>How do spark spreads behave during extreme weather?</li>
  <li>When do gas prices compress generator margins?</li>
  <li>What does “grid stress” actually look like in the data?</li>
</ul>

<p>The diagram below shows how the pieces fit together. From raw ERCOT prices and gas data to a daily spark spread signal:</p>

<style>
  .sd-wrap * { box-sizing: border-box; }

  .sd-wrap {
    color: #e6edf3;
    font-family: 'Segoe UI', system-ui, -apple-system, sans-serif;
    display: flex;
    flex-direction: column;
    align-items: center;
    padding: 28px 20px 20px;
    width: 100%;
  }

  /* ── Layout ── */
  .sd-wrap .sd-diagram {
    display: flex;
    align-items: center;
    gap: 0;
    width: 100%;
    max-width: 1060px;
  }

  .sd-wrap .zone {
    display: flex;
    flex-direction: column;
    align-items: center;
    flex: 1;
    border-radius: 14px;
    padding: 22px 18px 26px;
    position: relative;
    align-self: stretch;
  }

  .sd-wrap .zone.sources  { border: 1px solid #1f3a5c; background: rgba(31,74,122,0.06); }
  .sd-wrap .zone.pipeline { border: 1px solid #4a2b8a; background: rgba(74,43,138,0.06); }
  .sd-wrap .zone.output   { border: 1px solid #7a3b1a; background: rgba(122,59,26,0.06); }

  .sd-wrap .zone-label {
    font-size: 0.64rem;
    font-weight: 700;
    letter-spacing: 0.14em;
    text-transform: uppercase;
    margin-bottom: 20px;
    padding: 5px 14px;
    border-radius: 20px;
    white-space: nowrap;
  }

  .sd-wrap .zone.sources  .zone-label { background: #0d1f35; color: #58a6ff; border: 1px solid #1f4c8a; }
  .sd-wrap .zone.pipeline .zone-label { background: #1a0d35; color: #a371f7; border: 1px solid #4a2b8a; }
  .sd-wrap .zone.output   .zone-label { background: #301508; color: #f0883e; border: 1px solid #7a3b1a; }

  /* ── Connector ── */
  .sd-wrap .connector-wrap {
    display: flex;
    align-items: center;
    justify-content: center;
    width: 72px;
    flex-shrink: 0;
  }

  .sd-wrap svg.connector { overflow: visible; width: 72px; height: 200px; }

  .sd-wrap .flow-line {
    fill: none;
    stroke-width: 1.5;
    stroke-dasharray: 4 4;
    animation: sdDashFlow 1s linear infinite;
  }

  .sd-wrap .flow-line.ercot   { stroke: #38bdf8; }
  .sd-wrap .flow-line.finance { stroke: #fb923c; }
  .sd-wrap .flow-line.weather { stroke: #2dd4bf; }
  .sd-wrap .flow-line.gold    { stroke: #ffffff; opacity: 0.25; }

  @keyframes sdDashFlow { to { stroke-dashoffset: -8; } }

  /* ── Source cards ── */
  .sd-wrap .source-cards { display: flex; flex-direction: column; gap: 16px; width: 100%; }

  .sd-wrap .source-card {
    background: #161b22;
    border-radius: 10px;
    padding: 15px 16px;
    border: 1px solid #21262d;
    position: relative;
    transition: border-color 0.3s;
  }

  .sd-wrap .source-card:hover { border-color: #30363d; }

  .sd-wrap .badge {
    font-size: 0.58rem;
    font-weight: 700;
    letter-spacing: 0.1em;
    text-transform: uppercase;
    padding: 2px 8px;
    border-radius: 4px;
    margin-bottom: 8px;
    display: inline-block;
  }

  .sd-wrap .badge.ercot   { background: #1a3a4a; color: #38bdf8; }
  .sd-wrap .badge.finance { background: #2a1a1a; color: #fb923c; }
  .sd-wrap .badge.weather { background: #0f2a1f; color: #2dd4bf; }

  .sd-wrap .card-title  { font-size: 0.82rem; font-weight: 600; margin-bottom: 6px; color: #e6edf3; }
  .sd-wrap .card-detail { font-size: 0.66rem; color: #8b949e; line-height: 1.6; }

  .sd-wrap .source-dot {
    position: absolute;
    right: -6px; top: 50%;
    transform: translateY(-50%);
    width: 10px; height: 10px;
    border-radius: 50%;
    border: 2px solid #0d1117;
    z-index: 2;
  }

  .sd-wrap .dot-ercot   { background: #38bdf8; box-shadow: 0 0 7px #38bdf8; }
  .sd-wrap .dot-finance { background: #fb923c; box-shadow: 0 0 7px #fb923c; }
  .sd-wrap .dot-weather { background: #2dd4bf; box-shadow: 0 0 7px #2dd4bf; }

  /* ── Pipeline layers ── */
  .sd-wrap .pipeline-brand {
    font-size: 0.58rem;
    font-weight: 700;
    letter-spacing: 0.12em;
    text-transform: uppercase;
    color: #a371f7;
    text-align: center;
    margin-bottom: 16px;
    padding-bottom: 10px;
    border-bottom: 1px solid #21262d;
    width: 100%;
  }

  .sd-wrap .layer {
    border-radius: 8px;
    padding: 12px 14px;
    margin-bottom: 6px;
    width: 100%;
  }

  .sd-wrap .layer.bronze { background: #1e1510; border: 1px solid #5c3d1a; }
  .sd-wrap .layer.silver { background: #111820; border: 1px solid #1f4c8a; }
  .sd-wrap .layer.gold   {
    background: #1e1a08;
    border: 1px solid #b8860b;
    animation: sdGoldPulse 3s ease-in-out infinite;
  }

  @keyframes sdGoldPulse {
    0%, 100% { box-shadow: 0 0 14px rgba(251,191,36,0.10); }
    50%       { box-shadow: 0 0 30px rgba(251,191,36,0.26); }
  }

  .sd-wrap .layer-badge {
    font-size: 0.57rem;
    font-weight: 700;
    letter-spacing: 0.1em;
    text-transform: uppercase;
    margin-bottom: 7px;
    display: flex;
    align-items: center;
    gap: 5px;
  }

  .sd-wrap .sd-circle { width: 7px; height: 7px; border-radius: 50%; flex-shrink: 0; }

  .sd-wrap .layer.bronze .layer-badge { color: #cd7f32; }
  .sd-wrap .layer.bronze .sd-circle   { background: #cd7f32; }
  .sd-wrap .layer.silver .layer-badge { color: #94a3b8; }
  .sd-wrap .layer.silver .sd-circle   { background: #94a3b8; }
  .sd-wrap .layer.gold   .layer-badge { color: #fbbf24; }
  .sd-wrap .layer.gold   .sd-circle   { background: #fbbf24; box-shadow: 0 0 5px #fbbf24; }

  .sd-wrap .layer .tables {
    font-size: 0.64rem;
    color: #8b949e;
    line-height: 1.8;
    font-family: 'Courier New', monospace;
  }

  .sd-wrap .layer.gold .tables { color: #c9961a; }

  .sd-wrap .formula-box {
    margin-top: 10px;
    background: #0d1117;
    border: 1px dashed #b8860b;
    border-radius: 6px;
    padding: 8px 10px;
    font-size: 0.62rem;
    color: #fbbf24;
    font-family: 'Courier New', monospace;
    text-align: center;
    line-height: 1.6;
  }

  .sd-wrap .layer-arrow {
    text-align: center;
    margin: 3px 0;
    display: flex;
    flex-direction: column;
    align-items: center;
    gap: 2px;
  }

  .sd-wrap .arrow-track {
    width: 2px;
    height: 20px;
    background: linear-gradient(to bottom, #30363d, #a371f7);
    border-radius: 1px;
    position: relative;
    overflow: hidden;
  }

  .sd-wrap .arrow-track::after {
    content: '';
    position: absolute;
    top: -100%;
    left: 0;
    width: 100%;
    height: 50%;
    background: linear-gradient(to bottom, transparent, #a371f7, transparent);
    animation: sdArrowFlow 1.4s ease-in-out infinite;
  }

  @keyframes sdArrowFlow {
    0%   { top: -100%; }
    100% { top: 100%;  }
  }

  .sd-wrap .arrow-head {
    width: 0; height: 0;
    border-left: 4px solid transparent;
    border-right: 4px solid transparent;
    border-top: 6px solid #a371f7;
    animation: sdArrowPulse 1.4s ease-in-out infinite;
  }

  @keyframes sdArrowPulse {
    0%, 100% { opacity: 0.4; }
    50%       { opacity: 1; }
  }

  /* ── Output cards ── */
  .sd-wrap .output-cards { display: flex; flex-direction: column; gap: 14px; width: 100%; }

  .sd-wrap .output-card {
    background: #161b22;
    border-radius: 10px;
    padding: 14px 15px;
    border: 1px solid #21262d;
    position: relative;
  }

  .sd-wrap .output-dot {
    position: absolute;
    left: -6px; top: 50%;
    transform: translateY(-50%);
    width: 10px; height: 10px;
    border-radius: 50%;
    background: #fbbf24;
    border: 2px solid #0d1117;
    box-shadow: 0 0 7px #fbbf24;
    z-index: 2;
  }

  .sd-wrap .output-card .icon       { font-size: 1.15rem; margin-bottom: 6px; }
  .sd-wrap .output-card .card-title { font-size: 0.79rem; font-weight: 600; color: #e6edf3; margin-bottom: 5px; }
  .sd-wrap .output-card .card-detail {
    font-size: 0.63rem;
    color: #8b949e;
    line-height: 1.6;
    font-family: 'Courier New', monospace;
  }

  /* ── Preview wrapper ── */
  #spark-diagram {
    margin: 0 0 24px;
    width: 100%;
    cursor: zoom-in;
    overflow: hidden;
  }

  .sd-wrap {
    width: 1060px;
    transform-origin: top left;
  }

  .sd-expand-hint {
    display: flex;
    justify-content: flex-end;
    align-items: center;
    gap: 4px;
    font-size: 0.6rem;
    color: #484f58;
    letter-spacing: 0.06em;
    text-transform: uppercase;
    margin-bottom: 8px;
    cursor: zoom-in;
  }

  .sd-expand-hint svg {
    width: 11px;
    height: 11px;
    opacity: 0.5;
  }

  /* ── Modal ── */
  #sd-modal {
    display: none;
    position: fixed;
    inset: 0;
    background: rgba(0, 0, 0, 0.85);
    z-index: 9999;
    align-items: center;
    justify-content: center;
    cursor: zoom-out;
  }

  #sd-modal.sd-open { display: flex; }

  #sd-modal-inner {
    background: #0d1117;
    border-radius: 14px;
    max-width: 95vw;
    max-height: 92vh;
    overflow: auto;
    cursor: default;
    display: flex;
    flex-direction: column;
  }

  #sd-modal-close {
    position: fixed;
    top: 16px;
    right: 16px;
    background: #21262d;
    border: 1px solid #30363d;
    border-radius: 6px;
    color: #8b949e;
    font-size: 1.3rem;
    cursor: pointer;
    line-height: 1;
    padding: 4px 8px;
    z-index: 10000;
  }

  #sd-modal-close:hover { color: #e6edf3; background: #30363d; }
</style>

<div class="sd-expand-hint" id="sd-expand-hint">
  <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M15 3h6v6M9 21H3v-6M21 3l-7 7M3 21l7-7" /></svg>
  click to expand
</div>

<div id="spark-diagram">
  <div class="sd-wrap">
    <div class="sd-diagram">

      <!-- ══ ZONE 1: DATA SOURCES ══ -->
      <div class="zone sources">
        <div class="zone-label">Data Sources</div>
        <div class="source-cards">

          <div class="source-card">
            <span class="badge ercot">ERCOT</span>
            <div class="card-title">Texas Grid Hub Prices</div>
            <div class="card-detail">
              15-min Settlement Point Prices<br />
              HB_NORTH &middot; HB_HOUSTON<br />
              HB_WEST &middot; HB_SOUTH<br />
              <br />
              Public API &middot; No key required
            </div>
            <div class="source-dot dot-ercot"></div>
          </div>

          <div class="source-card">
            <span class="badge finance">Yahoo Finance</span>
            <div class="card-title">Henry Hub Gas Futures</div>
            <div class="card-detail">
              NG=F front-month contract<br />
              Daily closing price ($/MMBtu)<br />
              <br />
              via yfinance &middot; No key required
            </div>
            <div class="source-dot dot-finance"></div>
          </div>

          <div class="source-card">
            <span class="badge weather">Open-Meteo</span>
            <div class="card-title">Weather &amp; Temperature</div>
            <div class="card-detail">
              Daily min/max/avg temp (°F)<br />
              HDD &middot; CDD by ERCOT load zone<br />
              <br />
              Free archive API &middot; No key required
            </div>
            <div class="source-dot dot-weather"></div>
          </div>

        </div>
      </div>

      <!-- CONNECTOR 1 -->
      <div class="connector-wrap">
        <svg class="connector" viewBox="0 0 72 260" preserveAspectRatio="none">
          <path id="sd-p-ercot" class="flow-line ercot" d="M0,55  C36,55  36,130 72,130" />
          <path id="sd-p-finance" class="flow-line finance" d="M0,130 C36,130 36,130 72,130" />
          <path id="sd-p-weather" class="flow-line weather" d="M0,205 C36,205 36,130 72,130" />
          <circle r="3" fill="#38bdf8" filter="drop-shadow(0 0 3px #38bdf8)">
            <animateMotion dur="2.4s" repeatCount="indefinite" begin="0s"><mpath href="#sd-p-ercot" /></animateMotion>
          </circle>
          <circle r="3" fill="#fb923c" filter="drop-shadow(0 0 3px #fb923c)">
            <animateMotion dur="2.4s" repeatCount="indefinite" begin="0.8s"><mpath href="#sd-p-finance" /></animateMotion>
          </circle>
          <circle r="3" fill="#2dd4bf" filter="drop-shadow(0 0 3px #2dd4bf)">
            <animateMotion dur="2.4s" repeatCount="indefinite" begin="1.6s"><mpath href="#sd-p-weather" /></animateMotion>
          </circle>
        </svg>
      </div>

      <!-- ══ ZONE 2: PIPELINE ══ -->
      <div class="zone pipeline">
        <div class="zone-label">Lakeflow Spark Declarative Pipelines</div>
        <div class="pipeline-brand">Databricks Lakeflow &middot; Serverless &middot; Unity Catalog</div>

        <div class="layer bronze">
          <div class="layer-badge"><div class="sd-circle"></div>Bronze &mdash; Raw Ingestion</div>
          <div class="tables">
            bronze_ercot_spp_hubs<br />
            bronze_commodity_prices<br />
            bronze_weather_daily
          </div>
        </div>

        <div class="layer-arrow">
          <div class="arrow-track"></div>
          <div class="arrow-head"></div>
        </div>

        <div class="layer silver">
          <div class="layer-badge"><div class="sd-circle"></div>Silver &mdash; Cleaned &amp; Validated</div>
          <div class="tables">
            silver_ercot_hub_prices<br />
            silver_commodity_ohlcv<br />
            silver_weather_hdd_cdd
          </div>
        </div>

        <div class="layer-arrow">
          <div class="arrow-track"></div>
          <div class="arrow-head"></div>
        </div>

        <div class="layer gold">
          <div class="layer-badge"><div class="sd-circle"></div>Gold &mdash; Spark Spread</div>
          <div class="tables">gold_ercot_spark_spread</div>
          <div class="formula-box">
            Spark Spread = Power Price ($/MWh)<br />
            &minus; [ Gas Price ($/MMBtu) &times; Heat Rate ]<br />
            <br />
            CCGT heat rate: 7.0 MMBtu/MWh<br />
            Peaker heat rate: 10.0 MMBtu/MWh
          </div>
        </div>

      </div>

      <!-- CONNECTOR 2 -->
      <div class="connector-wrap">
        <svg class="connector" viewBox="0 0 72 200" preserveAspectRatio="none">
          <path id="sd-o-dash" class="flow-line gold" d="M0,100 C36,100 36,70  72,70" />
          <path id="sd-o-genie" class="flow-line gold" d="M0,100 C36,100 36,130 72,130" />
          <circle r="3" fill="#a371f7" filter="drop-shadow(0 0 3px #a371f7)">
            <animateMotion dur="2.4s" repeatCount="indefinite" begin="0s"><mpath href="#sd-o-dash" /></animateMotion>
          </circle>
          <circle r="3" fill="#a371f7" filter="drop-shadow(0 0 3px #a371f7)">
            <animateMotion dur="2.4s" repeatCount="indefinite" begin="1.2s"><mpath href="#sd-o-dash" /></animateMotion>
          </circle>
          <circle r="3" fill="#a371f7" filter="drop-shadow(0 0 3px #a371f7)">
            <animateMotion dur="2.4s" repeatCount="indefinite" begin="0.6s"><mpath href="#sd-o-genie" /></animateMotion>
          </circle>
          <circle r="3" fill="#a371f7" filter="drop-shadow(0 0 3px #a371f7)">
            <animateMotion dur="2.4s" repeatCount="indefinite" begin="1.8s"><mpath href="#sd-o-genie" /></animateMotion>
          </circle>
        </svg>
      </div>

      <!-- ══ ZONE 3: OUTPUTS ══ -->
      <div class="zone output">
        <div class="zone-label">Analytics &amp; Insights</div>
        <div class="output-cards">

          <div class="output-card">
            <div class="output-dot"></div>
            <div class="icon">&#128202;</div>
            <div class="card-title">AI/BI Dashboard</div>
            <div class="card-detail">
              Visual exploration of spark spread,<br />
              temperature overlay, and seasonal patterns
            </div>
          </div>

          <div class="output-card">
            <div class="output-dot"></div>
            <div class="icon">&#129302;</div>
            <div class="card-title">Genie Space</div>
            <div class="card-detail">
              Natural language Q&amp;A<br />
              on spark spread data
            </div>
          </div>

        </div>
      </div>

    </div>
  </div>
</div>

<div id="sd-modal">
  <button id="sd-modal-close" aria-label="Close">&#x2715;</button>
  <div id="sd-modal-inner">
  </div>
</div>

<script>
  (function () {
    var container = document.getElementById('spark-diagram');
    var preview = container && container.querySelector('.sd-wrap');
    var modal = document.getElementById('sd-modal');
    var modalInner = document.getElementById('sd-modal-inner');
    var closeBtn = document.getElementById('sd-modal-close');
    if (!preview || !modal) return;

    // Scale the preview to fit the content column
    var NATURAL_W = 1060;
    function fit() {
      // Reset transform to measure natural height
      preview.style.transform = '';
      container.style.height = '';
      var naturalH = preview.offsetHeight;
      var scale = Math.min(1, container.clientWidth / NATURAL_W);
      if (scale < 1) {
        preview.style.transform = 'scale(' + scale + ')';
        container.style.height = (naturalH * scale) + 'px';
      }
    }
    fit();
    window.addEventListener('resize', fit);

    function openModal() {
      var existing = modalInner.querySelector('.sd-wrap');
      if (existing) existing.remove();
      var clone = preview.cloneNode(true);
      clone.style.transform = '';
      clone.style.width = '';
      modalInner.appendChild(clone); // appends after toolbar
      modal.classList.add('sd-open');
      document.body.style.overflow = 'hidden';
    }

    function closeModal() {
      modal.classList.remove('sd-open');
      document.body.style.overflow = '';
    }

    container.addEventListener('click', openModal);
    var hint = document.getElementById('sd-expand-hint');
    if (hint) hint.addEventListener('click', openModal);
    closeBtn.addEventListener('click', function (e) { e.stopPropagation(); closeModal(); });
    modal.addEventListener('click', function (e) { if (e.target === modal) closeModal(); });
    document.addEventListener('keydown', function (e) { if (e.key === 'Escape') closeModal(); });
  })();
</script>

<blockquote>
  <p>Note: I’m experimenting with animated diagrams here, let me know what you think!</p>
</blockquote>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>It’s cool how much insight can come from combining a simple market concept with the right data infrastructure. The spark spread is a straightforward calculation, but when you view it across thousands of settlement points, weather events, fuel markets, and seasonal demand patterns, it becomes a powerful lens into how the Texas grid behaves under stress.</p>

<p>This project also turned into a great excuse to experiment with a modern data stack. Using Databricks SDP, I was able to ingest and model data from multiple sources including ERCOT market data, natural gas prices, weather feeds, and EIA datasets. Bringing these together in a medallion-style architecture made it easy to iterate quickly, test new ideas, and explore the relationships between weather, fuel costs, and electricity prices.</p>

<p>Another surprisingly helpful part of this journey has been working with Claude Code and a few MCP servers. Having an AI assistant available while exploring a new domain made it much easier to understand unfamiliar energy market concepts, locate useful datasets, and quickly prototype analysis pipelines. It has felt less like searching for information and more like collaborating with a research assistant while building the platform.</p>

<p>This is still very much the beginning of the exploration. I’m currently experimenting with bringing in additional datasets and expanding the platform to look at other ERCOT signals such as nodal congestion, real-time price spikes, and transmission constraints. There is a lot of interesting structure in these markets, and the combination of modern data platforms and AI-assisted development makes it easier than ever to dig into it.</p>

<p>Thanks for reading 😊!</p>]]></content><author><name>Conner Schiissler</name></author><category term="Energy" /><category term="Databricks" /><category term="Oil &amp; Gas" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">🗺️ Mapping the Permian with Claude and Databricks</title><link href="https://schiiss.github.io/blog/genai/mapping-the-permian-with-claude-and-databricks/" rel="alternate" type="text/html" title="🗺️ Mapping the Permian with Claude and Databricks" /><published>2026-03-21T07:00:00-07:00</published><updated>2026-03-21T07:00:00-07:00</updated><id>https://schiiss.github.io/blog/genai/mapping-the-permian-with-claude-and-databricks</id><content type="html" xml:base="https://schiiss.github.io/blog/genai/mapping-the-permian-with-claude-and-databricks/"><![CDATA[<p align="center">
  <img src="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/blog_image.png" alt="blog image" />
</p>

<blockquote>
  <p>Working with the folks over at <a href="https://collide.io/">Collide</a> in Texas inspired me to build this project to learn the concepts and tooling they were leveraging. They are doing lots of amazing work in the oil and gas space with AI and process automation. If you’re interested in learning more about the business of oil and gas and how tech intersects with the energy industry, check out their podcast: <a href="https://www.youtube.com/@collide_io">Collide Podcast on YouTube</a>.</p>
</blockquote>

<h2 id="introduction">Introduction</h2>

<p>I have been having a blast over the last few weekends, vibe‑coding and leveraging <strong>Claude Code</strong> with <strong>Sonnet 4.6</strong> and the <a href="https://github.com/databricks-solutions/ai-dev-kit">ai-dev-kit</a> from Databricks to build some interesting applications geared towards <strong>midstream oil and gas</strong>.</p>

<p>I had been relatively skeptical of vibe coding up until a few weeks ago, and, self‑admittedly, have not done a ton of it. I have been a bit underwhelmed by tools like GitHub Copilot, and I don’t seem to be the only one. Scrolling through Reddit, I found <a href="https://www.reddit.com/r/GithubCopilot/comments/1ok1btl/copilots_code_quality_has_dropped_claude_sonnet">a post</a> as recent as 4 months ago. The author claimed that Copilot’s code quality had “significantly dropped” over the past few months, even when using Claude Sonnet 4.5 inside VS Code.</p>

<p>And that sentiment isn’t isolated. There are multiple threads where developers describe Copilot as excellent for boilerplate and quick inline suggestions, but <strong>inconsistent when it comes to deeper reasoning, multi-file refactors, or nuanced debugging</strong>.</p>

<p>But my skepticism around vibe coding largely stemmed from those early Copilot experiences. The suggestions often felt like autocomplete on steroids rather than a true thinking partner. If that’s your baseline experience, it’s easy to dismiss the whole category.</p>

<p>At Plains, we’re collaborating with commercial teams to unlock value from our data lake datasets. I wanted to prototype something visually compelling. A <strong>geospatial app</strong> that would demonstrate the power of combining structured data, unstructured contracts, and maps. And who doesn’t love geospatial data?!</p>

<p>I thought to leverage Claude Code and the new Databricks ai-dev-kit to generalize what we have started building at Plains for the purpose of this blog post. I also wanted to take the opportunity to test out ai-dev-kit and generate synthetic data for features and functionality that are both aspirational and that we have actually built out internally to demonstrate what we can do once we have all the data landed and modeled within our data lake. I appreciate the synthetic data is not perfectly accurate to what we would see in the field, but it is not bad given it was generated using Claude and the Databricks MCP server.</p>

<p>We have many of our datasets landed in the data lake, so although the data itself is proprietary, the patterns and tooling we are building are broadly applicable. For this post I generalized it into a <strong>synthetic Permian Basin dataset</strong> for things like wells, crude and gas pipelines, acreage dedications, processing plants, and compressor/pump stations and built a full interactive mapping application on top of it, all running on Databricks.</p>

<hr />

<h2 id="the-use-case">The Use Case</h2>

<p>We are seeing a large amount of data being locked away in <strong>unstructured contracts</strong>. Like I mentioned in a <a href="https://www.linkedin.com/posts/conner-schiissler_dataengineering-genai-oilandgas-activity-7342899284138184708-Yaof?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACXEibYBngZiCRvQiwlsg8p1A85--baPNfw">previous post</a>, the ‘oil and gas industry is burdened by complex contracts that are often lengthy, unstructured, and scattered across various formats’, and we have built several capabilities internally at Plains to OCR, extract structured information, and query said data at scale and in one recent example exceeding <strong>25,000 contracts</strong>. Contract data in isolation is useful for compliance and record keeping. <strong>Contract data connected to a geospatial context becomes a strategic asset</strong>, especially when you pair it with other geospatial datasets to visualize pipelines, wells, and in the unstructured contract example, acreage dedications which are inherently spatial. They define where volumes are contractually committed across counties, basins, and lease blocks.</p>

<p>A commercial analyst may know things like the contract start and end date, the operator’s name, and the <strong>minimum volume commitment (MVC)</strong>, but other questions might be a bit more difficult. Things like:</p>

<ul>
  <li>Where does that acreage physically sit relative to our pipelines?</li>
  <li>Is production in that block trending up or declining?</li>
</ul>

<p>The goal was to generate synthetic data like data sources I have seen in the energy industry like <strong><a href="https://www.enverus.com/solutions/energy-analytics/land/drillinginfo-and-rigdata/?utm_source=google&amp;utm_medium=cpc&amp;utm_campaign=&amp;utm_content=&amp;utm_term=enverus%20rig%20data&amp;gad_source=1&amp;gad_campaignid=20044252526&amp;gbraid=0AAAAAD_nCTMvciplpcYhimHtCTpVnsZvJ&amp;gclid=CjwKCAiAqprNBhB6EiwAMe3yhrTaAMtzTAGyv5icSo9RZf7WNdOe9sVaGueOg1PtwLWr_fLYog2Z3BoCp0cQAvD_BwE">Enverus</a></strong>, pipeline data modelled in <strong><a href="https://pods.org/data-models/">PODS</a></strong>, acreage dedication contracts etc. and demonstrate how we geospatially visualize this data in an easy to use and understand map interface, allowing the commercial team to answer tough questions like:</p>

<blockquote>
  <p><em>‘What does this contract mean spatially, operationally, and competitively?’</em></p>
</blockquote>

<hr />

<h2 id="what-was-built">What Was Built</h2>

<p>The app was built to directly answer those use case questions. An analyst can pull up the map, click on an acreage dedication to see its spatial footprint relative to pipelines, check whether nearby wells are producing strongly or declining, open the linked contract PDF, and query across all of it in plain English, all without leaving Databricks. Specifically:</p>

<ul>
  <li><strong>“Where does that acreage sit relative to our pipelines?”</strong> → acreage dedication and pipeline layers are overlaid on the same map, making spatial relationships immediately visible.</li>
  <li><strong>“Is production trending up or declining?”</strong> → clicking any well surfaces a <strong>BBL/MCF time-series chart</strong> pulled live from Unity Catalog.</li>
  <li><strong>“What does this contract actually say?”</strong> → clicking an acreage block opens the linked PDF contract directly in the app.</li>
</ul>

<p><a href="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/screenshot_of_app_home.png" target="_blank"><img src="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/screenshot_of_app_home.png" alt="Web App Home Page" /></a></p>

<p>A lot of the boilerplate and foundational software components were in place before vibe coding began. I had already decided on a language and framework (React + Node.js), a mapping library (Mapbox GL), where I was going to host it (Databricks Apps), and how the app would authenticate and query data (Unity Catalog via a SQL warehouse). Having those decisions locked in upfront meant Claude wasn’t being asked to make architectural choices and instead it was being asked to build features within a well-defined structure. That’s a meaningful distinction. Vibe coding works best when the AI is filling in the what, not deciding the how.</p>

<p>The app is a <strong>React + <a href="https://docs.mapbox.com/mapbox-gl-js/guides/">Mapbox GL</a></strong> frontend served by an <strong>Express.js</strong> backend, deployed as a <strong>Databricks App</strong>. All the operational data lives in <strong>Unity Catalog</strong> and is queried at runtime through a SQL warehouse. The data being represented on the map is coming from a mix of structured and unstructured datasets. I have learned over the past 6 months that people really seem to like viewing data on a map. Geospatial data can be incredibly powerful and can help tell a cool story.</p>

<p>The choice of React + Mapbox GL was deliberate. Mapbox is the closest thing to an industry standard for interactive geospatial work outside of enterprise tools like <a href="https://www.esri.com/en-us/home">Esri</a>. It handles GeoJSON layers, dynamic filters, and camera animations natively, and react-map-gl gives you a clean component wrapper around it.</p>

<p>Express.js was chosen because it lets you serve both the React single-page app and all the API routes from a single Node.js process, which maps cleanly to how Databricks Apps works. One command, one process, no separate infra.</p>

<p>Bringing the app to the data (rather than extracting data to an external host) was a conscious decision: the Express server runs inside Databricks, which means it authenticates to Unity Catalog natively using the injected OAuth token with no data leaving the platform.</p>

<p>Every layer of the map can be interacted with, including:</p>

<ul>
  <li>Wells</li>
  <li>Crude and gas pipelines</li>
  <li>Acreage dedications</li>
  <li>Compressor/pump stations</li>
  <li>Processing plants</li>
</ul>

<p>Results are cached in-memory for 5 minutes so navigating around the map doesn’t re-query the warehouse on every interaction.</p>

<p><a href="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/map_layers.png" target="_blank"><img src="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/map_layers.png" alt="Map Layers" /></a></p>

<p>Similar to a dataset like Enverus, we can view <strong>well production history</strong> in a time-series chart to get an idea for <strong>BBL</strong> and <strong>MCF</strong> being produced from a given well.</p>

<p><a href="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/well_production_stats.png" target="_blank"><img src="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/well_production_stats.png" alt="Well Production Stats" /></a></p>

<p>We also integrated the geospatial experience with <strong><a href="https://learn.microsoft.com/en-us/azure/databricks/genie/">Genie</a></strong>, where a user can ask questions like:</p>

<blockquote>
  <p><em>‘Which acreage blocks have no active contracts, show their locations?’</em></p>
</blockquote>

<p>Genie will interact with the tables in Unity Catalog and once a response is returned, the corresponding acreages will be highlighted on the map for exploration:</p>

<p><a href="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/search_acreage_dedications.png" target="_blank"><img src="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/search_acreage_dedications.png" alt="Search Acreage Dedications" /></a></p>

<p>Getting the results from Genie to highlight on the map was a fun challenge. Genie generates SQL against Unity Catalog, the server re-executes that query via the warehouse, extracts the matching well or acreage IDs from the result, and passes them back to the frontend so Mapbox can highlight just those features while dimming everything else on the map.</p>

<p>Another cool query I was playing around with:</p>

<blockquote>
  <p><em>‘Show me active oil wells in Reeves or Loving County with an IP30 above 500 BOPD that have no active dedication and production numbers look strong. Include well name, operator, formation, IP30, and cumulative oil’</em></p>
</blockquote>

<p><a href="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/complex_genie_query.png" target="_blank"><img src="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/complex_genie_query.png" alt="Complex Genie Query" /></a></p>

<p>I leveraged the <strong>ai-dev-kit</strong> (i.e. Databricks MCP server) heavily to create objects within Databricks. Things like Catalogs, Schema, and Tables were all created and populated using natural language in Claude Code. I asked Claude Code to do things like:</p>

<blockquote>
  <p><em>‘Create some dummy PDF contracts, upload them to a Databricks volume in the Permian catalog, and link to them in the app when selecting an acreage dedication, and open the PDF within the app’</em></p>
</blockquote>

<p>…which allowed me to quickly generate dummy data and link it directly into the interface I was building:</p>

<p><a href="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/dummy_pdf_data.png" target="_blank"><img src="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/dummy_pdf_data.png" alt="Dummy PDF Data" /></a></p>

<p>Opening one of those associated contracts reveals some pretty convincing dummy data:</p>

<p><a href="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/dummy_pdf_data_opened.png" target="_blank"><img src="/blog/assets/images/blog_images/mapping-the-permian-with-claude-and-databricks/dummy_pdf_data_opened.png" alt="Dummy PDF Data Opened" /></a></p>

<p>The dummy data generated is not perfect and you will notice in the above screenshots there are things like overlapping acreage dedications, but I think overall Claude and its integration with the Databricks MCP server did a solid job.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>My eyes have been opened leveraging tools like <strong>Claude Code</strong> and the new <strong>ai-dev-kit</strong> from Databricks to quickly prototype applications. So long as you understand how the Databricks platform works and the sequence of steps you need to orchestrate, you can guide Claude to build some cool stuff on Databricks.</p>

<h3 id="lessons-learned">Lessons Learned</h3>

<p>I attempted to leverage Claude to deploy my JavaScript application to Databricks Apps and it did not work very well.</p>

<p>The deployment failure was the most instructive moment of the whole project. I asked Claude to deploy the Node.js app to Databricks Apps before I fully understood the steps myself. After several failed attempts I had to step in, manually work out the correct deployment sequence, and only then delegate the <em>automation</em> of that process back to Claude to create a <strong>GitHub Actions pipeline</strong> to automate the process of deploying the app going forward. That pattern generalized well: <strong>Claude is a force multiplier once you understand the problem space, but it will struggle when you don’t understand the platform well enough to verify its output.</strong></p>

<p>A few other things I’d keep in mind for next time:</p>

<ul>
  <li><strong>The MCP server excels at data operations</strong>: Creating catalogs, schemas, tables, uploading data to volumes, but is less reliable for platform infrastructure tasks like app deployment, at least right now.</li>
  <li><strong>Decide on your stack before vibe coding begins</strong>: Having already chosen React + Mapbox GL + Express freed Claude to focus on building features rather than making architectural decisions.</li>
  <li><strong>Synthetic data is awesome for prototyping</strong>: Being able to ask Claude to generate realistic contracts and well data and load them directly into Unity Catalog via MCP collapsed what would have been days of data prep into minutes.</li>
</ul>

<h3 id="whats-next">What’s Next</h3>

<p>The synthetic Permian dataset served its purpose for this post, but the real value unlocks when this runs on actual data. As more of our datasets land in the data lake, swapping out the synthetic layer is a straightforward schema change. A few things are on the roadmap beyond that:</p>

<ul>
  <li><strong>Pass-through authentication</strong>: so the app respects Unity Catalog row-level permissions, a user should only see the contracts and acreage dedications they’re entitled to.</li>
  <li><strong>Extending to other basins</strong>: the Permian is the proof of concept, but the same architecture applies anywhere we have pipeline and contract data.</li>
  <li><strong>Tighter Genie integration</strong>: to support more complex, multi-step analytical workflows directly from the map.</li>
  <li><strong>Contracts tab</strong>: to quickly view and search contract details and filter by operator, county, and AI‑generated contract summaries.</li>
</ul>

<p>I think this application is a great example of the power of <strong>Databricks Apps</strong> running on the Databricks platform. There is huge value in bringing the apps to the data, instead of the data to the apps.</p>

<p>Thanks for reading 😀!</p>]]></content><author><name>Conner Schiissler</name></author><category term="GenAI" /><category term="Databricks" /><category term="MCP" /><category term="Oil &amp; Gas" /><category term="GIS" /><category term="Geospatial" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">🛢️🤖 Inferring Pipeline State: Using Hidden Markov Models to Reduce False Leak Alarms</title><link href="https://schiiss.github.io/blog/data%20science/inferring-pipeline-state-using-hidden-markov-models-to-reduce-false-leak-alarms/" rel="alternate" type="text/html" title="🛢️🤖 Inferring Pipeline State: Using Hidden Markov Models to Reduce False Leak Alarms" /><published>2026-02-19T07:00:00-07:00</published><updated>2026-02-19T07:00:00-07:00</updated><id>https://schiiss.github.io/blog/data%20science/inferring-pipeline-state-using-hidden-markov-models-to-reduce-false-leak-alarms</id><content type="html" xml:base="https://schiiss.github.io/blog/data%20science/inferring-pipeline-state-using-hidden-markov-models-to-reduce-false-leak-alarms/"><![CDATA[<blockquote>
  <p><strong>Disclaimer:</strong><br />
Hidden Markov Models were new territory for me, and understanding the underlying math required building the forward, backward, and Viterbi procedures from scratch in NumPy. While I’ve developed a much stronger intuition for how the parameters λ = (A, B, π) drive inference, I’m still refining my understanding. This post reflects what worked in our environment and what I’ve learned so far.</p>
</blockquote>

<p align="center">
  <img src="/blog/assets/images/blog_images/inferring-pipeline-state-using-hidden-markov-models-to-reduce-false-leak-alarms/blog_image.jpg" alt="blog image" />
</p>

<p>This post is a follow-up to <a href="https://www.linkedin.com/posts/conner-schiissler_oilandgas-datascience-anomalydetection-activity-7396547817726595072--5fV?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACXEibYBngZiCRvQiwlsg8p1A85--baPNfw">🛢️🤖 Why Detecting O&amp;G Pipeline Anomalies Is So Hard</a>, where I discussed the technical complexity we encountered while building a capability to ingest PI historian data into Databricks for further processing.</p>

<hr />

<h2 id="use-case">Use Case</h2>

<p>The technical work is driven by one core objective: <strong>reduce unnecessary pipeline shutdowns caused by false leak alarms</strong>. In many cases, these shutdowns are not caused by true failures, but by conservative safety mechanisms reacting to limited visibility into the true state of the pipeline and incomplete or uncertain system context. A common pain point is how operational slack is assessed. Controllers often must manually estimate slack conditions in real time, without a fully observable view of packing or unpacking behavior across the line. It’s error-prone and easy to miss during high-pressure operations. When slack is misjudged (or missed entirely), monitoring systems can trigger false leak or deviation alarms, leading to avoidable shutdowns and unnecessary investigations.</p>

<p>To address these challenges, we developed the <strong>Deviation Counter Tool</strong> to work hand-in-hand with our pipeline state detection. When the ML model infers that the pipeline has entered a specific state, such as unpacking (i.e., shutting down), the deviation counter logic is automatically activated. This tool continuously monitors for deviations, removing the need for manual slack estimation and reducing the risk of false alarms.</p>

<h3 id="key-terms-slack-pack-and-unpack">Key Terms: Slack, Pack, and Unpack</h3>

<ul>
  <li><strong>Slack:</strong> The presence of unpressurized or low-pressure sections in a pipeline, often due to imbalances between inflow and outflow. Slack can lead to inaccurate flow measurements and complicate leak detection.</li>
  <li><strong>Pack (Packing):</strong> The process of increasing pressure in the pipeline by introducing more product (inflow exceeds outflow), typically during startup or ramp-up operations.</li>
  <li><strong>Unpack (Unpacking):</strong> The process of decreasing pressure in the pipeline by reducing inflow or increasing outflow (outflow exceeds inflow), often during shutdown or ramp-down operations.</li>
</ul>

<h2 id="data-ingestion">Data Ingestion</h2>

<p>Everything starts with data engineering. None of the reporting, analytics, or ML happens without it. We are leveraging the <a href="https://docs.aveva.com/bundle/pi-web-api-reference/page/help.html">PI Web API</a> to pull a subset of PI tags from our meters. Specifically:</p>

<ul>
  <li><strong>Inflow</strong> (product entering a meter)</li>
  <li><strong>Outflow</strong> (product leaving a meter)</li>
  <li><strong>Over/short</strong> (outflow minus inflow, where a negative value indicates potential loss or leak)</li>
</ul>

<p>We’re ingesting 1-minute data every 3–5 minutes into our data lake and exposing the tables in Unity Catalog for dashboarding and machine learning.</p>

<p>A lot of effort has gone into hardening this ingestion pipeline.</p>

<p>We’re also running a data quality engine during ingestion. For example, we explicitly check for stale SCADA records. If a DQ issue is detected, we flag it and ignore the data in our ML model for state prediction. We don’t want to generate predictions off poor-quality historian data.</p>

<hr />

<h2 id="machine-learning-approach">Machine Learning Approach</h2>

<p>In the previous blog, I talked about potentially approaching this use case as a classification problem. We are still considering that approach, potentially leveraging something like a <a href="https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors">nearest neighbor</a> algorithm, and modeling different pipeline states (i.e., normal, shutdown, leak). However, we have had some recent success using a <strong>Hidden Markov Model (HMM)</strong>. In fact, there’s relevant academic work that inspired this direction from the College of Science, Engineering, and Technology in Houston. In <a href="https://arxiv.org/abs/2009.14589">‘Hidden Markov Models for Pipeline Damage Detection Using Piezoelectric Transducers’</a>, researchers applied an HMM-based method to detect pipeline leaks and crack conditions by mapping different damage conditions to distinct Markov states and using statistical signal features as the observable emissions. Their method showed that a Gaussian mixture model HMM (GMM-HMM) could successfully recognize whether a pipeline had a leak, and in some cases, locate it under time-varying conditions, despite noisy measurements.</p>

<p>HMMs can be useful for handling sequential data where the underlying process isn’t directly visible, but you can infer it from patterns in the observations. They typically excel in noisy, time-series environments because they account for uncertainty and temporal dependencies. The HMM is based on a Markov chain, which is a model that tells us something about the probabilities of sequences of random variables (states), each of which can take on values from some set. Hidden Markov Models are designed specifically for sequential data where the true state is not directly observable. That description fits pipeline operations almost perfectly.</p>

<hr />

<h2 id="from-weather-to-pipelines-a-markov-perspective">From Weather to Pipelines: A Markov Perspective</h2>

<p>The classic example for HMMs is the <a href="https://www.geeksforgeeks.org/machine-learning/hidden-markov-model-in-machine-learning/">“weather example”</a>:</p>

<p align="center">
  <a href="/blog/assets/images/blog_images/inferring-pipeline-state-using-hidden-markov-models-to-reduce-false-leak-alarms/hmm_weather_example.png">
    <img src="/blog/assets/images/blog_images/inferring-pipeline-state-using-hidden-markov-models-to-reduce-false-leak-alarms/hmm_weather_example.png" alt="HMM Weather Example" />
  </a>
</p>

<blockquote>
  <p>In this diagram:</p>
  <ul>
    <li>Hidden states: Weather conditions (Rainy, Cloudy, Sunny)</li>
    <li>Observations: Emotions (Happy, Neutral, Sad)</li>
    <li>Green arrows: Transition probabilities, the likelihood the weather changes from one state to another each day</li>
    <li>Red arrows: Emission probabilities, the likelihood of observing a particular emotion given the current weather</li>
  </ul>
</blockquote>

<p>We only see the emotions (observations), not the weather (hidden states). The HMM helps infer the most likely sequence of hidden states behind those observations.</p>

<p>The same idea applies to pipelines.</p>

<p><strong>Observable data (what we measure):</strong></p>

<ul>
  <li>Inflow</li>
  <li>Outflow</li>
  <li>Over/short</li>
</ul>

<p><strong>Hidden states (what we actually care about):</strong></p>

<ul>
  <li>Normal operation</li>
  <li>Line packing (starting up)</li>
  <li>Line unpacking (shutting down)</li>
  <li>Leak</li>
  <li>Shutdown</li>
  <li>Transition states</li>
</ul>

<p>An HMM asks a more realistic question:</p>
<blockquote>
  <p>“Given everything we’ve observed up to now, what state are we most likely in, and how likely is it to transition to another state?”</p>
</blockquote>

<p>That temporal dependency is critical because pipeline states persist. A leak doesn’t appear and disappear randomly minute to minute. A line doesn’t pack instantly. Modeling persistence reduces noise-driven false positives.</p>

<p>This model is a multivariate Gaussian HMM (one Gaussian per state) with 6 hidden states and diagonal covariance, trained unsupervised on 9 standardized, flow-derived features (inflow, outflow, over/short, ratios, deltas, and operational flags).</p>

<p>The HMM was trained and versioned within Unity Catalog, and it took nine iterations to align the transition structure and feature engineering with operational reality. I’m sure many more iterations are ahead as we collect more data, observe edge cases, and continue refining how the model reflects real pipeline behavior.</p>

<p>We leveraged tools like <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">StandardScaler</a> to account for unit variance, and engineered features for flow relationships (e.g., inflow/outflow ratio), temporal dynamics (e.g., delta inflows &amp; outflows based on a lag window), and operational flags (e.g., binary flag for inflow &lt; 5.0 (shutdown detection)).</p>

<hr />

<h2 id="orchestration--deployment">Orchestration &amp; Deployment</h2>

<p>All our artifacts in Databricks from ingestion of the PI data to predicting pipeline state via batch inference leveraging the HMM, are deployed via asset bundles and orchestrated in workflows.</p>

<p>We are also leaning heavily on Unity Catalog to version our HMM, so we have version history for training iterations.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>One challenge we’ve run into is <strong>temporal resolution</strong>.</p>

<p>We’re currently ingesting 1-minute data every 3–5 minutes. That sounds reasonable until you see something like this:</p>

<table>
  <thead>
    <tr>
      <th>Time</th>
      <th>Inflow</th>
      <th>Outflow</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1:00</td>
      <td>236</td>
      <td>343</td>
    </tr>
    <tr>
      <td>1:01</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>At first glance, everything looks within range… and then it drops to zero.</p>

<p>We pulled second-by-second data for that same window to see what happened in those 60 seconds.</p>

<p>Apparently, a lot can happen in 60 seconds in pipeline operations.</p>

<p>We may need to increase ingestion frequency because pulling at a minute granularity can miss critical context. When you’re modeling state transitions, that missing context matters.</p>

<p>There is still a lot to learn, but we’re getting closer to something that reflects operational reality rather than just reacting to noisy signals.</p>

<p>Big shoutout to <a href="https://www.linkedin.com/in/shawnom/">Shawn</a> for helping sharpen how we framed the operational problem, and to <a href="https://www.linkedin.com/in/mark-van-der-linden-30798811/">Mark</a> for leading the data engineering effort that makes any of this possible.</p>

<p>Thanks for reading! 😀</p>]]></content><author><name>Conner Schiissler</name></author><category term="Data Science" /><category term="Data Engineering" /><category term="Oil &amp; Gas" /><summary type="html"><![CDATA[Disclaimer: Hidden Markov Models were new territory for me, and understanding the underlying math required building the forward, backward, and Viterbi procedures from scratch in NumPy. While I’ve developed a much stronger intuition for how the parameters λ = (A, B, π) drive inference, I’m still refining my understanding. This post reflects what worked in our environment and what I’ve learned so far.]]></summary></entry><entry><title type="html">🎨🧑‍🎨 The Art of Keeping Things Simple: Data Quality</title><link href="https://schiiss.github.io/blog/data%20quality/the-art-of-keeping-things-simple-data-quality/" rel="alternate" type="text/html" title="🎨🧑‍🎨 The Art of Keeping Things Simple: Data Quality" /><published>2026-02-07T07:00:00-07:00</published><updated>2026-02-07T07:00:00-07:00</updated><id>https://schiiss.github.io/blog/data%20quality/the-art-of-keeping-things-simple-data-quality</id><content type="html" xml:base="https://schiiss.github.io/blog/data%20quality/the-art-of-keeping-things-simple-data-quality/"><![CDATA[<p align="center">
  <img src="/blog/assets/images/blog_images/the-art-of-keeping-things-simple-data-quality/blog_image.png" alt="" />
</p>

<hr />

<p>Who doesn’t love talking about data quality?</p>

<p>It may not be the most exciting topic right now (or ever), but it certainly is very important. I am sure we have all seen the meme below circling around online, but it is so true and it demonstrates the importance of high-quality data.</p>

<p align="center">
  <a href="/blog/assets/images/blog_images/the-art-of-keeping-things-simple-data-quality/garbage_in_garbage_out.png">
    <img src="/blog/assets/images/blog_images/the-art-of-keeping-things-simple-data-quality/garbage_in_garbage_out.png" alt="Garbage In Garbage Out" />
  </a>
</p>

<p><a href="https://www.linkedin.com/in/mark-van-der-linden-30798811/">Mark van der Linden</a> and I have had many conversations about this and he always says, ‘never use data unless you understand its quality’, and that goes for the full stack. Reporting, analytics, and especially all kinds of AI.</p>

<p>Looking back, I’ve jumped headfirst into use cases more times than I’d like to admit without truly grasping the data’s quality upfront. Had I prioritized that early, I would’ve avoided a ton of rework, set more realistic expectations with stakeholders, and delivered better outcomes faster.</p>

<p>Data quality issues often tell a compelling story. When you surface gaps or inconsistencies to business owners and collaborate on root-cause fixes, like process changes that improve capture at the source, it turns a technical problem into a shared opportunity for improvement. Stakeholders get bought in, and the data gets better.</p>

<blockquote>
  <p>Full disclosure, I haven’t gone deep into dedicated data quality tooling yet, but I’ve always been intrigued by enterprise solutions like Informatica Data Quality.</p>
</blockquote>

<p>That said, this ties directly into the philosophy Mark and I explored in our recent post  <a href="https://www.linkedin.com/posts/conner-schiissler_dataengineering-platformengineering-databricks-activity-7312458737800069120-1YMf?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACXEibYBngZiCRvQiwlsg8p1A85--baPNfw">The Art of Keeping Things Simple</a>. Heavy tools bring specialized skills, complex licensing, and yet another vendor to manage, exactly the kind of over-engineering we argued against.</p>

<p>So, why not keep things simple? We’re experimenting with embedding a lightweight, SQL-based data quality engine directly into Data Weaver on Databricks. The goal: stay within a single, unified framework (Databricks + SQL/Python) for ingestion, transformation, orchestration, and quality checks, no extra layers, no context-switching.</p>

<hr />

<h2 id="use-case">Use Case</h2>

<p>I always like to ground technical solutions in business value. It’s easy for us as technologists to forget why these systems exist in the first place. Let’s talk about the use case we are looking at to build some DQ rules.</p>

<p>In a recent post, I talked a bit about <a href="https://www.linkedin.com/posts/conner-schiissler_mainframe-cobol-zos-activity-7413942435958980608-pHk7?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACXEibYBngZiCRvQiwlsg8p1A85--baPNfw">getting data out of mainframe</a> for a series of commercial focused use cases we are working on. Mostly, we’re dealing with volumetric receipt and delivery data at the contract level (i.e., what are our contractual volumetric commitments). We also pull in data from our scheduling system to verify if those committed volumes have been scheduled (i.e., ensuring logistics are in place for timely and accurate scheduling of oil and gas deliveries), as well as our crude oil accounting system for the volumetric actuals.</p>

<p>To expand this further, we’re integrating third-party data from <a href="https://www.enverus.com/solutions/trade-hedge-risk/?utm_source=google&amp;utm_medium=paid-search&amp;utm_term=pwr&amp;utm_content=solution-page&amp;utm_campaign=mosaic&amp;utm_source=google&amp;utm_medium=cpc&amp;utm_campaign=&amp;utm_content=&amp;utm_term=enverus%20energy%20analytics&amp;gad_source=1&amp;gad_campaignid=21128427730&amp;gbraid=0AAAAAD_nCTPPAUjF5_PS5UsQlpXweRz7J&amp;gclid=Cj0KCQiAnJHMBhDAARIsABr7b85oaprPHV47PTKDaLmFguvbx1RhT-G4bXtaWWPc8fefUFDHf5sw9BoaAgHUEALw_wcB">Enverus</a>, a leading provider of energy analytics. Enverus supplies granular insights into production forecasts, well performance, and market trends, which we layer onto our internal datasets. This helps us validate contractual commitments against external benchmarks, like regional production averages or competitor activity. For instance, we can cross-check our volumetric data against Enverus’ well-level production estimates to spot discrepancies early, such as over- or under-estimated reserves that might impact contract viability.</p>

<p>A key area where this shines is in handling acreage dedications. In the oil and gas space, acreage dedications lock in specific land areas (often measured in acres or sections) to a midstream contract, committing all production from that acreage to a particular pipeline, gatherer, or processor. Our data quality engine scrutinizes these dedications by validating spatial and volumetric alignments:</p>

<ul>
  <li>Are the dedicated acres accurately mapped to contracts?</li>
  <li>Do production actuals from those acres match the dedicated commitments?</li>
</ul>

<p>We flag issues like overlapping dedications (e.g., the same acreage tied to multiple contracts) or mismatches between dedicated volumes and historical output, which could lead to legal or operational risks.</p>

<p>This ties directly into right-sizing contracts based on historical actuals. By analyzing trends in actual deliveries from our accounting system, spanning months or years, we can recommend adjustments to contractual minimum volume commitments (MVCs) or take-or-pay clauses. For example, if historical actuals show consistent under-delivery on a dedicated acreage block (perhaps due to declining well productivity, as corroborated by Enverus data), we can simulate scenarios to “right-size” the contract:</p>

<ul>
  <li>Reduce MVCs to avoid penalties</li>
  <li>Renegotiate terms for flexibility</li>
  <li>Even release underperforming acreage from dedications</li>
</ul>

<p>This not only optimizes costs but also aligns commitments with real-world performance, reducing exposure to shortfall fees.</p>

<blockquote>
  <p>Full disclosure: much of the above is still aspirational. We’ve built pieces of this today, but not the entire end-to-end vision yet. What matters most at this stage isn’t the completeness of the technology, it’s deeply understanding the business of scheduling, marketing, and contract management. Once you truly understand how volumes flow, how commitments are made, and where real-world constraints live, it becomes much clearer what is and isn’t possible from a technical standpoint. The technology follows the business, not the other way around.</p>
</blockquote>

<hr />

<h2 id="sql-based-data-quality-engine">SQL Based Data Quality Engine</h2>

<p>Now let me put my propellor hat back on and talk about some tech!</p>

<p>Rather than introducing a standalone data quality platform, we opted for a lightweight, SQL-first data quality engine embedded directly into our Databricks workflows. The guiding principle is simple: data quality should live where the data already lives, using tools the team already understands.</p>

<p>At its core, the engine follows a three-table design inspired by dimensional modeling patterns. The goal is to separate what we are checking from how often issues occur and where they occur, enabling both high-level monitoring and deep operational investigation.</p>

<p><strong>Three-table design:</strong></p>

<ul>
  <li><strong>dq_dim_rule</strong> (Dimension Table)
    <ul>
      <li>Defines each data quality rule with metadata</li>
      <li>Fields: <code class="language-plaintext highlighter-rouge">RuleId</code>, <code class="language-plaintext highlighter-rouge">Area</code>, <code class="language-plaintext highlighter-rouge">ShortName</code>, <code class="language-plaintext highlighter-rouge">LongName</code></li>
    </ul>
  </li>
  <li><strong>dq_fact_rule_summary</strong> (Fact Table - Aggregated)
    <ul>
      <li>Stores daily aggregate counts per rule</li>
      <li>Fields: <code class="language-plaintext highlighter-rouge">Id</code> (auto-increment), <code class="language-plaintext highlighter-rouge">RuleId</code>, <code class="language-plaintext highlighter-rouge">Count</code>, <code class="language-plaintext highlighter-rouge">Date</code></li>
      <li>Provides high-level metrics for dashboards and trending</li>
    </ul>
  </li>
  <li><strong>dq_fact_rule_detail</strong> (Fact Table - Granular)
    <ul>
      <li>Stores individual DQ violations per rule</li>
      <li>Fields: <code class="language-plaintext highlighter-rouge">Id</code> (auto-increment), <code class="language-plaintext highlighter-rouge">RuleId</code>, <code class="language-plaintext highlighter-rouge">ContractId</code>, <code class="language-plaintext highlighter-rouge">Date</code></li>
      <li>Enables drill-down investigation of specific problem data.</li>
    </ul>
  </li>
</ul>

<p>The Databricks notebook runs daily to:</p>

<ul>
  <li>Delete existing detail records for each rule and current date</li>
  <li>Calculate violations by querying source tables (<code class="language-plaintext highlighter-rouge">receipt_data</code>, <code class="language-plaintext highlighter-rouge">delivery_data</code>, <code class="language-plaintext highlighter-rouge">scheduling_data</code>)</li>
  <li>Insert both summary counts and detailed contract IDs for each rule</li>
  <li>Track data quality trends over time with the date dimension</li>
</ul>

<p>This is where data engineers, analysts, and business users go when they want to understand which contracts are problematic and why.</p>

<hr />

<h2 id="execution-pattern">Execution Pattern</h2>

<p>A daily notebook in Databricks orchestrates all checks and follows a consistent, idempotent pattern:</p>

<ol>
  <li>Remove existing records for each rule and the current processing date</li>
  <li>Execute SQL logic to identify violations across source systems:
    <ul>
      <li>Receipt data</li>
      <li>Delivery data</li>
      <li>Scheduling data</li>
    </ul>
  </li>
  <li>Insert:
    <ul>
      <li>Aggregated violation counts into the summary table</li>
      <li>Individual contract violations into the detail table</li>
    </ul>
  </li>
  <li>Persist results with a date dimension to enable historical trending</li>
</ol>

<p>This approach ensures the job can be safely re-run, supports incremental rule additions, and keeps the entire framework transparent and auditable.</p>

<p>We’ve implemented several foundational rules so far, such as:</p>

<ul>
  <li>Contracts missing receipt or delivery volumes</li>
  <li>Spot deals with incomplete or missing date ranges</li>
  <li>Mismatches between scheduled and actual volumes</li>
</ul>

<p>Today, these results are surfaced through Databricks AI/BI dashboards, giving stakeholders real-time visibility into how data quality evolves over time, rather than treating it as a one-off validation exercise.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>What this approach has reinforced for us is that meaningful data quality work doesn’t require heavyweight platforms, complex rule engines, or yet another vendor to manage. In many cases, it simply requires:</p>

<ul>
  <li>Clear business context</li>
  <li>Well-understood data</li>
  <li>And a disciplined, repeatable way to surface issues</li>
</ul>

<p>By embedding data quality checks directly into our Databricks pipelines using SQL and Python, we’ve kept quality close to the data, close to the engineers, and most importantly close to the business questions that matter.</p>

<p>This approach won’t replace enterprise data quality platforms in every scenario, nor is it meant to. But for teams already operating in a modern lakehouse environment, it offers a pragmatic, low-friction way to start treating data quality as a first-class citizen, without over-engineering the solution.</p>

<p>In a future blog we will talk about data profiling, specifically a tool in Data Weaver that provides the ability to rapidly profile data to quickly get a feel for the quality of the data.</p>

<p>Thanks for reading 😀</p>]]></content><author><name>Conner Schiissler</name></author><category term="Data Quality" /><category term="SQL" /><category term="Databricks" /><category term="Data Engineering" /><category term="Oil &amp; Gas" /><category term="Data Weaver" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Unlocking Mainframe Data: From Legacy Systems to Analytics in Databricks</title><link href="https://schiiss.github.io/blog/mainframe/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/" rel="alternate" type="text/html" title="Unlocking Mainframe Data: From Legacy Systems to Analytics in Databricks" /><published>2026-01-02T07:00:00-07:00</published><updated>2026-01-02T07:00:00-07:00</updated><id>https://schiiss.github.io/blog/mainframe/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks</id><content type="html" xml:base="https://schiiss.github.io/blog/mainframe/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/"><![CDATA[<p><img src="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/blog_image.png" alt="" /></p>

<p>Mainframe technology has always been intriguing but has largely been a mystery to me how it worked or what ‘mainframe’ even meant.</p>

<p>Working at Plains, many of our commercial applications for crude oil accounting, scheduling, volumetric data (noms and actuals) and contract capture all run largely on <strong>the mainframe</strong>. Over the past few months, I have had the opportunity to learn from and work with a few mainframe developers and SMEs on the commercial side of our business.</p>

<p>As mentioned above, my biggest challenge learning all of this was that I did not really understand what <strong>mainframe</strong> meant. So, to get my hands dirty, I leveraged some open-source technology to spin up an extremely barebones mainframe on my computer to get a feel for various mainframe concepts.</p>

<p>In this post, I’ll walk through what a mainframe is, how I emulated one, and how data moves from mainframe to Databricks. I hope this post demystifies some mainframe concepts for you!</p>

<hr />

<h2 id="what-is-mainframe">What is Mainframe?</h2>

<p>Talking with friends in other industries, <strong>mainframes</strong> are still very much the backbone of enterprise computing for industries like banking, airlines, and energy, where transaction integrity and speed are critical. Several IBM studies indicate the continued use of mainframes is widespread and will continue well into the future.</p>

<blockquote>
  <p><a href="https://www.ibm.com/downloads/documents/us-en/10a99803fa2fdb20">Mainframes handle almost 70% of the world’s production IT workloads</a>, according to one of the studies.</p>
</blockquote>

<p>Mainframes are <strong>enterprise data servers</strong> engineered to handle massive, mission-critical workloads, and it is not uncommon for them to process transactions in the trillions per second. They are built to run the world’s most important data systems without interruption, even under extreme load.</p>

<p>The term <strong>mainframe</strong> initially referred to the large cabinet or ‘main frame’ that held the central processing unit (CPU) of early computer systems.</p>

<hr />

<h2 id="relevance-to-plains">Relevance to Plains</h2>

<p>The role I have at Plains has inspired me to get hands-on with mainframe technologies since many of our commercial applications run on an <strong>IBM z/Architecture</strong> (or something very similar). With my role focused on data engineering/ML, getting data out of mainframe into our Lakehouse is critical.</p>

<p>To have more informed conversations with that team, I wanted to get hands-on with the technology. I am a big believer in learning by doing.</p>

<p>After a bunch of research, I found an open-source software implementation of the mainframe System/370 and ESA/390 architectures, in addition to the latest 64-bit z/Architecture. While infinitely less complex than the mainframe environment we have at Plains, it has still allowed me to get my hands dirty with some mainframe concepts. I also validated the environment with our internal mainframe team to make sure I was on the right track with my learning.</p>

<p>The software is called <a href="https://github.com/hercules-390/hyperion"><code class="language-plaintext highlighter-rouge">Hercules</code></a>, and it has been incredibly helpful to wrap my brain around mainframe technology.</p>

<hr />

<h2 id="getting-hands-on-with-mainframe">Getting Hands-On with Mainframe</h2>

<p>To be honest, I had not even really seen what a mainframe looked like prior to experimenting with this emulator, so this was a fun experience. After logging in, you are greeted with what is called an <strong><a href="https://www.ibm.com/docs/en/zos-basic-skills?topic=interfaces-what-is-ispf">Interactive System Productivity Facility (ISPF)</a></strong> menu that provides a way to interact with the mainframe.</p>

<p><a href="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_1.png"><img src="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_1.png" alt="Mainframe Step 1" /></a></p>

<p>There are a wide variety of options here, most of which I have not touched. Most of my time has been spent in the <strong>UTILITIES</strong> option. Selecting option <code class="language-plaintext highlighter-rouge">3</code> will take us there.</p>

<p><a href="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_2.png"><img src="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_2.png" alt="Mainframe Step 2" /></a></p>

<p>From there, we have several more options to choose from. I really wanted to get hands-on with some <strong>COBOL programming</strong>, so I spent most of my time there.</p>

<p>Once inside the Utilities menu, I quickly discovered that everything in the mainframe world revolves around <strong>datasets</strong>. These are essentially the mainframe’s version of files, but they’re far more structured and rigid.</p>

<ul>
  <li>Some datasets behave like simple text files.</li>
  <li>Others act more like folders, holding multiple “members” such as JCL scripts, COBOL programs, or configuration files.</li>
</ul>

<p><strong><a href="https://www.ibm.com/docs/en/zos-basic-skills?topic=sdsf-what-is-jcl">JCL (Job Control Language)</a></strong> was cool to learn about, and it is basically an instruction sheet that tells the mainframe what work to do. A JCL script defines:</p>

<ul>
  <li>Which datasets to read from</li>
  <li>Which programs to run</li>
  <li>Where to put the output</li>
  <li>What system resources to allocate</li>
</ul>

<p>When you submit a job, the mainframe reads your JCL, locates the datasets you reference, runs the COBOL program you point to, and routes the output into spool datasets that you can view through <strong><a href="https://www.ibm.com/docs/en/zos/3.1.0?topic=guide-introduction-sdsf">SDSF (System Display and Search Facility)</a></strong> or another similar job log viewer.</p>

<p>Selecting option <code class="language-plaintext highlighter-rouge">4</code> or <strong>DSLIST</strong> allows me to search for the dataset I want. In this case, I am searching <code class="language-plaintext highlighter-rouge">SYS2</code>, which is used for datasets that belong to the operating system, utilities, or shared components, not to individual users.</p>

<p><a href="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_3.png"><img src="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_3.png" alt="Mainframe Step 3" /></a></p>

<p>This emulator I have loaded has lots of prebuilt COBOL scripts that I can experiment with to get a feel for how it works. All those examples are contained within the dataset <code class="language-plaintext highlighter-rouge">SYS2.JCLLIB</code>. Pressing enter brings up the following screen where I can select my dataset by using the arrow keys and entering <code class="language-plaintext highlighter-rouge">V</code> to view the contents.</p>

<p><a href="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_4.png"><img src="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_4.png" alt="Mainframe Step 4" /></a></p>

<p>After pressing enter, we are brought to another screen where we can see a variety of different things, COBOL scripts being one of them. I stumbled upon a basic ‘Hello World’ application in COBOL to familiarize myself with the syntax and get a feel for submitting jobs on mainframe.</p>

<p><a href="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_5.png"><img src="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_5.png" alt="Mainframe Step 5" /></a></p>

<p>Scrolling down to that file and entering <code class="language-plaintext highlighter-rouge">E</code> allows us to view the syntax of that COBOL application.</p>

<p><a href="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_6.png"><img src="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_6.png" alt="Mainframe Step 6" /></a></p>

<p>This COBOL script just prints <code class="language-plaintext highlighter-rouge">Hello World</code>. We can submit this job by entering the <code class="language-plaintext highlighter-rouge">submit</code> command in the Command prompt.</p>

<p>Doing so returns this:</p>

<p><a href="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_7.png"><img src="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_7.png" alt="Mainframe Step 7" /></a></p>

<p>I was trying to identify where I could see the output of this job, but could not figure it out 😂</p>

<p>Talking with the mainframe team at Plains, they leverage a tool called <strong>System Display and Search Facility (SDSF)</strong> to monitor, manage, and view system activity, job queues, and job output in real-time. I do not have access to that on my emulator, and when I tried to view the logs in the <strong>OUTLIST</strong> utility, which basically displays JOB output, I could not find it.</p>

<p>I can see there is something called <code class="language-plaintext highlighter-rouge">SYSOUT</code> where printed records are stored, but I could not figure out how to access that. In theory, I should have been able to view the ‘Hello World’ output there.</p>

<p><a href="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_8.png"><img src="/blog/assets/images/blog_images/unlocking-mainframe-data-from-legacy-systems-to-analytics-in-databricks/mainframe_8.png" alt="Mainframe Step 8" /></a></p>

<h2 id="extracting-data-from-mainframe">Extracting Data from Mainframe</h2>

<p>One of the most important aspects of mainframe systems is their use of specialized databases. Unlike modern relational databases, many mainframe databases such as IBM’s IMS (Information Management System) are <strong>hierarchical</strong> in structure. This means data is organized in a tree-like format, with parent and child records, rather than tables with rows and columns. I have yet to dig into these databases at Plains but I hope to in the coming months.</p>

<p>As things stand now (and as far as I understand), we are leveraging scheduled jobs via JCL that call mostly COBOL code on mainframe. These jobs export data from various commercial apps that run on mainframe, into a tilde separated format. These tilde separated files are often times SFTP’d between file shares for consumption.</p>

<p>There are also .NET client applications built for deal capture that leverage a SQL database where data is copied from mainframe into these client SQL databases on a scheduled basis.</p>

<p>We are tying into a mix of the tilde separated files and the client app SQL servers to pull volumetric, deal information, and lease supply data and using Data Weaver to ingest and process this data in Databricks.</p>

<p>While we are not connecting directly into the mainframe IMS databases today, depending on our end users requirements, we may need to explore that to pull that data on a more regular interval.</p>

<p>I suspect this will not be a problem for now since the jobs on mainframe to extract the data are scheduled to run in accordance with <strong>crude oil accounting requirements and operational workdays</strong>. In a midstream context, a “workday” is not a calendar day, but a contract-defined 24-hour operating period used for measurement, allocation, inventory reconciliation, and settlement.</p>

<p>Aligning extract jobs to these workday boundaries ensures that volumetric data, deal activity, and lease-level movements remain consistent with how the business measures and settles crude. As long as downstream ingestion and processing remain in sync with these accounting cutoffs, the data we land in Databricks should reconcile cleanly with commercial and accounting systems, even if the underlying source systems and technologies differ significantly.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>I had fun trying to get all this running on my laptop and I learned a ton in the process. What the mainframe has done and is doing at Plains is amazing, and I am excited to see what data and information we can unlock out of the commercial apps that run on the IBM mainframe using some cutting edge technology in Databricks.</p>

<p>I hope those that have not interacted with a mainframe (like me) found this blog helpful. I am excited to continue working with the team at Plains to learn more about this technology.</p>

<p>Thanks for reading! 😀</p>]]></content><author><name>Conner Schiissler</name></author><category term="Mainframe" /><category term="COBOL" /><category term="z/OS" /><category term="Legacy Systems" /><category term="Enterprise Computing" /><category term="Modernization" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">The Pain of Managing Data Lakes</title><link href="https://schiiss.github.io/blog/data/the-pain-of-managing-datalakes/" rel="alternate" type="text/html" title="The Pain of Managing Data Lakes" /><published>2025-12-11T07:00:00-07:00</published><updated>2025-12-11T07:00:00-07:00</updated><id>https://schiiss.github.io/blog/data/the-pain-of-managing-datalakes</id><content type="html" xml:base="https://schiiss.github.io/blog/data/the-pain-of-managing-datalakes/"><![CDATA[<p><img src="/blog/assets/images/blog_images/the-pain-of-managing-datalakes/blog_image.png" alt="" /></p>

<p>Managing data lakes, especially before modern tooling, was a painful, time-consuming undertaking. There are many aspects to managing data lakes, but I wanted to focus this blog on the data security aspect.</p>

<p>Data security is the bane of my existence and is a consistent issue with ETL workloads as you are effectively copying data from a source system with its own access control into another system, probably Azure Data Lake Storage (ADLS), that has its own set of permissions and access controls. I have never seen security control mapping from source systems to data lakes done well. Most organizations unintentionally expose data far more broadly than they should.</p>

<p>I have worked in a few industries, each with varying degrees of security requirements. I have worked in airlines where we dealt with passenger records and PII information that needed special care from a security standpoint. I, however, have spent most of my time in energy and oil &amp; gas and I have found that keeping data security as simple as possible helps keep work moving quickly.</p>

<blockquote>
  <p>At Plains, we follow an <strong>‘open by default and closed by exception’</strong> policy where each data set that is landed in our lake is by default ‘open’ to all, and we restrict access to data sets that are identified as sensitive on ingestion. If you work in a more sensitive industry like banking, that may not be a reality for you.</p>
</blockquote>

<p>In this blog, I wanted to share a few stories and experiences I have had over the years securing data lakes and the pain of doing it.</p>

<p>As I was thinking about these previous experiences, I also want to talk about the amazing benefits Unity Catalog provides and how it could have saved me a lot of pain 5-6 years ago.</p>

<hr />

<h2 id="the-days-of-data-lake-gen1">The Days of Data Lake Gen1</h2>

<p>My first couple of years in the data space were purely focused on DevOps and integrations. Basically, making the release process for the data engineers more efficient and integrating all the many services we used.</p>

<p>It has been a while, but I believe we had an architecture like this:</p>

<p><a href="/blog/assets/images/blog_images/the-pain-of-managing-datalakes/gen1_data_architecture.png"><img src="/blog/assets/images/blog_images/the-pain-of-managing-datalakes/gen1_data_architecture.png" alt="Gen1 Data Architecture" /></a></p>

<p>I have annotated the above diagram with numbers to explain how the data moves between these services:</p>

<ol>
  <li><strong>ADF leveraging the integration runtime pulls data from on-premises systems</strong></li>
  <li><strong>Data is dropped into bronze as parquet</strong></li>
  <li><strong>Databricks picks up the data from bronze and converts it into delta parquet moving into silver</strong></li>
  <li><strong>ADF picks up the data from silver and puts it into ‘gold’ which is either analysis services or Synapse depending on the use case</strong>
    <ul>
      <li>Fact/dimension tables loaded from gold zone or Synapse</li>
      <li>Analysis services built the semantic model used by Power BI</li>
    </ul>
  </li>
  <li><strong>Power BI for reporting</strong></li>
</ol>

<p>As mentioned earlier, my sole job at the time was purely to manage integrations and the DevOps processes that supported the release of artifacts into production.</p>

<p>The industry has made leaps and bounds in progress to simplify this process but at the time this was incredibly complicated. There was no unified CI/CD pipeline across services, ADF’s JSON ARM templates were painful, AAS DevOps was extremely manual and the identity and access model were incredibly siloed across these services.</p>

<p>Each of these challenges could have individual blogs written about them, so to keep this blog focused, I want to move on to talking about the complexities around managing security in data lakes.</p>

<p>The above architecture references Data Lake Gen1, which I believe has been marked as out of support by Microsoft, but even Data Lake Gen2, I would argue, has the same technical challenges I am about to lay out.</p>

<hr />

<h2 id="the-pain-of-managing-data-lake-security">The Pain of Managing Data Lake Security</h2>

<p>If I knew what I know now back in 2019, I would have saved Conner back then a lot of pain. Now I want to talk about how managing data lake permissions back then was difficult and set up the blog to talk about how Unity Catalog addressed many of these challenges. Let’s start by talking about how POSIX-style permissions are limiting.</p>

<h3 id="posix-acl-limitations">POSIX ACL Limitations</h3>

<p>An Access Control List (ACL) is a list of permissions attached to an object (such as a file or folder) that specifies which users or groups can access the object and what operations they can perform.</p>

<p>This is a huge oversimplification of the technology, but Azure Data Lake Storage Gen1 and Gen2 implement a hierarchical file system with POSIX-style permissions and ACLs, similar to those found in Linux file systems. This means access control is primarily managed through three permission types: <strong>Read</strong>, <strong>Write</strong>, and <strong>Execute</strong>.</p>

<ul>
  <li><strong>Read (r):</strong>
    <ul>
      <li>For files: Allows viewing of the file’s contents.</li>
      <li>For directories: Allows listing the contents of the directory (e.g., using <code class="language-plaintext highlighter-rouge">ls</code>).</li>
    </ul>
  </li>
  <li><strong>Write (w):</strong>
    <ul>
      <li>For files: Allows modifying or deleting the file.</li>
      <li>For directories: Allows creating, deleting, or renaming files within the directory.</li>
    </ul>
  </li>
  <li><strong>Execute (x):</strong>
    <ul>
      <li>For files: Allows running the file as a program or script.</li>
      <li>For directories: Allows entering or accessing subdirectories within the directory.</li>
    </ul>
  </li>
</ul>

<blockquote>
  <p><strong>Note:</strong> Given the amount of <a href="https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/manage-privileges/privileges">permissions Unity Catalog exposes</a>, it becomes very clear that RWX permissions are incredibly limited.</p>
</blockquote>

<hr />

<h2 id="acl-inheritance-wasnt-reliable">ACL Inheritance Wasn’t Reliable</h2>

<p>There were lots of issues around setting permissions as well if you were not mindful in setting them up prior to ingesting data.</p>

<p>Let me explain this in an example, imagine you are streaming data in from a source. In my case, when I was helping maintain these services, we were bringing in PNRs and DCS data (passenger records and departure control records) from a large airline.</p>

<p>If you imagine all the flights taking off and landing, there was a high volume of data coming in. The integration pattern looked something like the below. We had an application publish messages to an event hub, and we had Databricks read from the event hub queues and drop the messages into parquet format in our data lake.</p>

<p><a href="/blog/assets/images/blog_images/the-pain-of-managing-datalakes/streaming_data.png"><img src="/blog/assets/images/blog_images/the-pain-of-managing-datalakes/streaming_data.png" alt="Streaming Data" /></a></p>

<p>It was exciting to see this data flowing into our lake after all the hard work building ARM templates and making sure everything was integrated nicely. However, I remember my excitement quickly dwindled when one of the data engineers came by and asked ‘Hey, can I get access to the PNR data please’?</p>

<p>I remember clicking through the UI in the Azure Portal and attempting to add permissions, but I made one bad assumption. I assumed there would be a way to recursively set the POSIX permissions on the PNR folder. Basically, I needed to set permissions on every single object (file and folder) within the PNR folder. Otherwise, the data engineer requesting this could get access denied reading a parquet file buried in a child folder somewhere.</p>

<blockquote>
  <p>You cannot <a href="https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-acl-azure-portal#apply-an-acl-recursively">recursively set permissions in the portal</a> and you need to write some code to do this.</p>
</blockquote>

<p>In our case, we wrote a .NET application to apply new users to these folders. Since the streaming data had been coming in for a few weeks, the number of files and folders in the PNR folder grew into the hundreds of thousands.</p>

<p>Running a script made this easier, but it was prone to failure due to network or timeout issues. For example, the script could get a good portion done assigning the permissions and it would fail, meaning some of the folders would not get the right permissions set. Data engineers would often report unauthorized error messages when trying to read those parquet files. Since the script had no form of state and could not pick up where it left off, we had no choice but to rerun the script from the beginning and continue running it until the data engineer no longer received that error.</p>

<p>As time went on, our code evolved to include things like ‘state/checkpointing’ so we could rerun the script from where it left off, but it really felt like we were building some pretty intricate software and it was just another thing to add to the long list of things we had to maintain.</p>

<hr />

<h2 id="acl-limits">ACL Limits</h2>

<p>Another oversight we did not realize at the time was that there were <a href="https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits?utm_source=chatgpt.com#azure-data-lake-storage-limits">hard limits</a> as to how many ACLs you can set on a file or folder. The maximum number of access ACLs per file or folder is <strong>32</strong>. This means only 32 objects can have access to a file/folder at a time. Our script was assigning individual users access to these folders 🤦‍♂️, which did not scale and we hit that limit very quickly. The obvious solution here (hindsight is always 20/20) was to assign Microsoft Entra ID groups access to the data lake objects. This was going to require our .NET software to become more sophisticated, creating groups and assigning them on demand.</p>

<hr />

<h2 id="the-solution">The ‘Solution’</h2>

<blockquote>
  <p>Full disclosure, all of this happened a number of years ago, so I am a bit fuzzy on the details.</p>
</blockquote>

<p>After all that pain, we learned some important lessons and were ready to revise our approach. We spent a bunch of time sitting with vendors and updating our .NET code to roll out a new strategy for managing security in our Data Lake. The cleanup of existing permissions is a whole other story, but we basically had to write more code to recursively remove all existing permissions and apply the new security pattern, while taking into account the disruption this would cause. It was a gong show!</p>

<h3 id="the-model-that-finally-worked-for-a-while">The Model That “Finally Worked”… for a While</h3>

<p>The model we settled on was straightforward. Every source system got a Microsoft Entra ID group for readers and another for writers, and we applied those groups directly to the parent folder in Azure Data Lake Storage Gen2 before any data landed. Because Azure Data Lake Storage uses inherited POSIX-style ACLs, anything that arrived beneath automatically picked up the same permissions. For the more complex relational systems, we expanded this pattern so we could secure both at the source level and at the individual table level. It still didn’t give us anything close to column-level or row-level access control, but compared to where we started, it was a massive improvement.</p>

<p>This setup actually worked well for a while. We weren’t constantly hunting down folders that lost inheritance or ACL entries that exceeded the 32-item limit. Permissions became something we only had to think about once during source onboarding. Users were simply added to the appropriate Entra ID groups and everything just flowed from there. For a moment, it felt like we had finally tamed the complexity of data lake security.</p>

<p><a href="/blog/assets/images/blog_images/the-pain-of-managing-datalakes/group_based_security_model.png"><img src="/blog/assets/images/blog_images/the-pain-of-managing-datalakes/group_based_security_model.png" alt="Group Based Security" /></a></p>

<p>Then we were blindsided by something we didn’t expect at all: <strong>Azure authentication itself.</strong></p>

<hr />

<h2 id="how-azure-authentication-works">How Azure Authentication Works</h2>

<p>This is where understanding how Azure handles authentication becomes important. When a user or service accesses something like Azure Data Lake Storage, Azure issues a JSON Web Token, or JWT. It’s just a small, signed package of information that gets passed along with each request. Inside that token is all the context Azure needs to authorize the user, who they are, which tenant they belong to, and, most importantly for us, which security groups they’re in. Azure Data Lake Storage checks the object IDs of those groups against the ACLs on the folders and files and decides whether the request should be allowed.</p>

<p>A simplified JWT looks like this:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"upn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user@company.com"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"oid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1234-5678"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"groups"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="s2">"guid1"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"guid2"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"guid3"</span><span class="p">,</span><span class="w">
    </span><span class="err">...</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>These group GUIDs are how Azure Data Lake Storage (and anything using POSIX ACLs) decides whether a user has read/write access to a file or folder.</p>

<p>The challenge is large enterprises have employees who belong to hundreds or even thousands of Microsoft Entra ID groups.</p>

<p>However, JWTs have size limits. When the group list becomes too large, Azure cannot fit them all into the token. So Azure switches to what’s called the “overage claim” mode.</p>

<p>Instead of including the group list, the token contains:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"_claim_names"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"groups"</span><span class="p">:</span><span class="w"> </span><span class="s2">"src1"</span><span class="w"> </span><span class="p">}</span><span class="err">,</span><span class="w">
</span><span class="nl">"_claim_sources"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nl">"src1"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"endpoint"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://graph.microsoft.com/v1.0/me/getMemberObjects"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This tells the resource:</p>

<blockquote>
  <p>“I can’t fit the groups in the token, go query Microsoft Entra ID to figure out what groups this user actually belongs to.”</p>
</blockquote>

<p>Authorization could now take hundreds of milliseconds or more since the groups were no longer embedded in the JWTs. We ended up seeing some pretty significant latencies to storage reads and writes.</p>

<p>Long story short, we had to engage several engineers across different product groups at Microsoft to try to get this resolved.</p>

<p>I believe a hotfix had to be released from one of the product groups to address this issue, but that was after weeks of sending/analyzing logs and lots of meetings.</p>

<hr />

<h2 id="enter-unity-catalog">Enter Unity Catalog</h2>

<p>After years of wrestling with the limitations of POSIX ACLs, recursive permission headaches, and Azure authentication quirks, Unity Catalog came along and seriously simplified data lake security. Unity Catalog centralizes data governance, enabling fine-grained access controls and simplified permission management, all natively integrated with Databricks. This modern approach addresses many of the pain points from legacy architectures, making it far easier to secure sensitive data, manage user access, and ensure compliance across the organization. Unity Catalog basically proxies access to your data lake:</p>

<p><a href="/blog/assets/images/blog_images/the-pain-of-managing-datalakes/unity_catalog.png"><img src="/blog/assets/images/blog_images/the-pain-of-managing-datalakes/unity_catalog.png" alt="Unity Catalog" /></a></p>

<p>With Unity Catalog, organizations gain the ability to define access policies at the table, row, and column levels, which extends far beyond the limitations of traditional RWX permissions. Permissions can be managed through a unified interface, making onboarding simpler and reducing the risk of misconfiguration. Additionally, Unity Catalog enables auditing of data access and changes, supporting compliance and security investigations. Integration with existing identity providers ensures seamless management of users and groups. Here is a side by side comparison of POSIX style permissions and Unity Catalog:</p>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>POSIX ACL (ADLS Gen1/Gen2)</th>
      <th>Unity Catalog (Databricks)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Permission Model</strong></td>
      <td>RWX (Read, Write, Execute)</td>
      <td>Fine-grained privileges (SELECT, MODIFY, CREATE, etc.)</td>
    </tr>
    <tr>
      <td><strong>Scope of Control</strong></td>
      <td>File and folder level</td>
      <td>Table, row, and column level</td>
    </tr>
    <tr>
      <td><strong>Inheritance</strong></td>
      <td>Hierarchical, but prone to issues</td>
      <td>Managed centrally with consistent propagation</td>
    </tr>
    <tr>
      <td><strong>Identity Integration</strong></td>
      <td>Microsoft Entra ID groups and users</td>
      <td>Integrated with identity providers (Entra ID, etc.)</td>
    </tr>
    <tr>
      <td><strong>Scalability</strong></td>
      <td>Limited (max 32 ACL entries per object)</td>
      <td>Highly scalable with role-based access control</td>
    </tr>
    <tr>
      <td><strong>Auditing &amp; Governance</strong></td>
      <td>Minimal, manual logging</td>
      <td>Built-in auditing and compliance reporting</td>
    </tr>
    <tr>
      <td><strong>Management Interface</strong></td>
      <td>Azure Portal or custom scripts</td>
      <td>Unified UI and APIs within Databricks</td>
    </tr>
    <tr>
      <td><strong>Advanced Security Features</strong></td>
      <td>No row/column-level security</td>
      <td>Supports row-level and column-level security</td>
    </tr>
    <tr>
      <td><strong>Automation &amp; CI/CD</strong></td>
      <td>Manual or custom scripts</td>
      <td>Native integration with Databricks workflows</td>
    </tr>
  </tbody>
</table>

<p>With Unity Catalog, so many folks will be able to avoid the grief of managing ACLs in data lakes.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Managing data lakes has historically been a complex and error-prone process, fraught with technical limitations and operational headaches. 2019 Conner really wishes he had Unity Catalog 😂 Unity Catalog represents a significant leap forward, offering centralized, fine-grained, and auditable data governance that simplifies security management and helps organizations meet their compliance goals. By adopting modern tools like Unity Catalog, teams can spend less time fighting with permissions and more time unlocking the value of their data.</p>

<p>Thanks for reading! 😀</p>]]></content><author><name>Conner Schiissler</name></author><category term="Data" /><category term="Data Engineering" /><category term="Data Security" /><category term="DevOps" /><category term="Data Lakes" /><category term="Unity Catalog" /><category term="Databricks" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">🛢️🤖 Why Detecting O&amp;amp;G Pipeline Anomalies Is So Hard</title><link href="https://schiiss.github.io/blog/data%20science/why-detecting-o&g-pipeline-anomalies-is-so-hard/" rel="alternate" type="text/html" title="🛢️🤖 Why Detecting O&amp;amp;G Pipeline Anomalies Is So Hard" /><published>2025-11-16T07:00:00-07:00</published><updated>2025-11-16T07:00:00-07:00</updated><id>https://schiiss.github.io/blog/data%20science/why-detecting-o&amp;g-pipeline-anomalies-is-so-hard</id><content type="html" xml:base="https://schiiss.github.io/blog/data%20science/why-detecting-o&amp;g-pipeline-anomalies-is-so-hard/"><![CDATA[<blockquote>
  <p><strong>Disclaimer:</strong><br />
PI data, pipeline sensor data, and the operational use cases around them are areas I’m actively learning. There are experts at Plains and across the industry who know this inside out. I’m approaching this from a technical perspective, and it’s a complex topic. This blog is meant to capture ideas and lessons from others while helping me solidify my own understanding. I’ve done my best to ensure accuracy, but please treat this as a learning perspective rather than authoritative guidance.</p>
</blockquote>

<p><img src="/blog/assets/images/blog_images/why-detecting-o&amp;g-pipeline-anomalies-is-so-hard/blog_image.png" alt="" /></p>

<hr />

<h2 id="-introduction">👋 Introduction</h2>

<p>Over the past few months, I’ve been diving into the world of pipeline operations and the rich data that comes from them. I have grown a huge appreciation for how complex these systems are. The sensors along these pipelines generate massive amounts of data, tracking flow, pressure, temperature, vibration, and more, every second of every day. Interpreting this data is far from straightforward and understanding how to extract meaningful insights is both fascinating and challenging.</p>

<p>I recently had some great conversations with <a href="https://www.linkedin.com/in/scotthmckean/">Scott McKean</a> on the complexities of operational anomaly detection in pipelines. We also have an incredibly knowledgeable OT team at Plains who have been patient in explaining topics like <strong>over/short</strong>, <strong>inhibits</strong>, <strong>triggers</strong>, and more. It has been and will continue to be a humbling learning experience.</p>

<p>This space sits at the intersection of engineering, operations, and data science. There are some interesting data and AI problems here and I’m excited to continue exploring this space and sharing insights along the way.</p>

<hr />

<h2 id="-anomaly-detection-complexity">🧩 Anomaly Detection Complexity</h2>

<p>There is a lot that happens on an oil and gas pipeline, and it is far more complicated than just monitoring a few pressure and flow readings. Pipeline data is never static, and the pipeline system is constantly responding to external factors like temperature and pressure changes, and various operational decisions. So, while anomaly detection in many industries looks for outliers, in pipelines the definition of “normal” is always shifting. Sifting through ‘noise’ (weak outliers) and ‘anomalies’ (strong outliers) can be very difficult.</p>

<p>From what I have learned thus far, here are a few of the major factors that make the task of detecting anomalies so difficult.</p>

<blockquote>
  <p><strong>Side note:</strong><br />
Scott has a great <a href="https://www.linkedin.com/pulse/anomaly-detection-scott-mckean-hi1tc">blog</a> on anomaly detection and is definitely worth checking out.</p>
</blockquote>

<p><a href="/blog/assets/images/blog_images/why-detecting-o&amp;g-pipeline-anomalies-is-so-hard/anomaly_detection_outliers.jpg"><img src="/blog/assets/images/blog_images/why-detecting-o&amp;g-pipeline-anomalies-is-so-hard/anomaly_detection_outliers.jpg" alt="Anomaly Detection Outliers" /></a></p>

<h3 id="️-the-pipeline-system-is-always-changing">🌡️ The Pipeline System is Always Changing</h3>

<p>Hydrocarbons inside a pipeline expand and contract with temperature and pressure, and this stored compression is known as <strong>linepack</strong>. When a pipeline shuts down, pumps stop, pressure bleeds off, and the product settles. The measured volume can suddenly look different from what the system says should be there. This temporary difference is often called an <strong>over/short imbalance</strong>, and it can look almost identical to other operational issues like leaks or metering errors. Without accounting for linepack and how pressure and temperature shifts affect volume, even a good anomaly detection model will throw constant false positives.</p>

<p>This is why basic volume imbalance checks are unreliable the moment temperature swings, shutdowns, or other transient events happen.</p>

<h3 id="-why-linepack-and-transients-break-simple-imbalance-detectors">📘 Why Linepack and Transients Break Simple Imbalance Detectors</h3>

<p>Pipelines do not behave like static tanks. They act more like long, flexible storage vessels. As pressure increases, the pipe walls and the product inside compress, storing additional volume as <strong>linepack</strong>. During real operations, this is always changing as pumps ramp, temperatures shift, valves move, and product batches travel down the line. Because of this, the actual volume inside a pipeline at any moment is not fixed and can’t be inferred from flow meters alone.</p>

<p>Transient events make this even more obvious. Startup, shutdown, pump trips, valve changes, and temperature swings all shift linepack faster than simple models expect. A basic imbalance detector that just looks at flow in versus flow out ends up interpreting these changes as apparent losses or gains. To a naive model, those look like leaks. In reality, they are perfectly normal behaviors. This is why modern leak detection systems and anomaly detection approaches need to understand operating context, pressure and temperature behavior, and the physical dynamics of the pipeline rather than relying on static thresholds.</p>

<h3 id="-multiple-pumps-and-split-flows-add-signal-complexity">🔀 Multiple Pumps and Split Flows Add Signal Complexity</h3>

<p>Pipelines are not linear “point A to point B” systems, and they often include multiple pumping stations, branch lines that split the product into different paths and potential recombination points downstream.</p>

<p>Each pump also introduces changes in the pipeline like pressure, flow, vibrations and transient behavior like shutdowns.</p>

<blockquote>
  <p>If flow in ≠ flow out → anomaly will generate a false positive every time.<br />
These pipeline systems are incredibly complex.</p>
</blockquote>

<h3 id="-triggers-and-inhibits-make-the-operational-context-dynamic">🔔 Triggers and Inhibits Make the Operational Context Dynamic</h3>

<p>Pipeline safety systems rely on automated logic and there are a few terms I have learned about in the past few weeks:</p>

<ul>
  <li><strong>Triggers:</strong> Conditions that automatically generate alerts to the control center for operator review and decision-making (e.g., high pressure)</li>
  <li><strong>Inhibits:</strong> Temporary overrides that prevent those alerts when abnormal readings are expected (e.g., during startup or maintenance)</li>
</ul>

<p>During startup, shutdown, pigging, and maintenance, inhibits are often active and from an operations standpoint we are saying ‘Yes, conditions look abnormal, but we expect that, so don’t respond.’</p>

<p>For anomaly detection models, this is incredibly hard. The model must also know when to ignore anomalies (i.e., the noise). This means the key is not just analyzing the sensor values but understanding the operating state at that moment.</p>

<h3 id="-false-positives-can-be-expensive">💸 False Positives Can be Expensive</h3>

<p>If an anomaly detection system signals a leak incorrectly and triggers or pressures teams to shut down, and if the system cries wolf too often, it simply gets ignored, which is worse than having no system at all. An incorrect flag could lead teams to shut down a pipeline segment:</p>

<ul>
  <li>Product stops moving across large distances</li>
  <li>Field crews may be dispatched to remote terrain</li>
  <li>Downstream contracts and storage logistics are interrupted</li>
  <li>The cost can escalate into hundreds of thousands of dollars per hour</li>
</ul>

<hr />

<h2 id="️-our-initial-approach">🛠️ Our Initial Approach</h2>

<p>Before having a good understanding of all the above factors, we originally started tackling this as an unsupervised learning problem, specifically leveraging <strong>isolation forest</strong> and this seemed very appealing since we did not need to label events, and it did a decent job at detecting anomalies. We ran into some stark limitations right off the bat. Isolation forest only tells you that an anomaly occurred, not what changed or what the anomaly could be. It produces an anomaly score, but it doesn’t reveal which signal, feature, or condition triggered that score.</p>

<p>Isolation forest is not great since all you get is if an anomaly is detected or not and not what was detected. Being able to explain to the operational teams ‘why’ an alert was triggered is important to build trust and continuously improve the model through the practice of MLOps.</p>

<hr />

<h2 id="-recognizing-the-multi-dimensional-nature-of-the-problem">📐 Recognizing the Multi-Dimensional Nature of the Problem</h2>

<p>The next thing I learned came from taking a step back and looking at the data from a physical standpoint.</p>

<p>A single flow rate or pressure signal cannot tell the whole story. Pipelines are multi-dimensional systems that are influenced by many variables including things like flow rate (volume movement), pressure (linepack, pump load, blockages, restrictions), vibration (pump condition and mechanical signatures), amongst other things. The behavior of any one signal only makes sense in context with the others. After understanding this, we moved away from isolation forest to an ensemble model to help us tackle the multi-dimensional nature of this problem.</p>

<hr />

<h2 id="-shifting-to-classification-models-eg-xgboost">📊 Shifting to Classification Models (e.g., XGBoost)</h2>

<p>Once we understood the operating patterns, we moved toward supervised and semi-supervised classification, where the model learns to identify operating states and anomaly types, not just differences. This is where classification models (like XGBoost) became a much better fit.</p>

<p>Instead of asking, “Is this point different from normal?” (Isolation Forest), we instead ask, does this data pattern match one of the following:</p>

<ul>
  <li>normal operation,</li>
  <li>shutdown conditions, or</li>
  <li>a potential leak?</li>
</ul>

<p>This framing is much more aligned with how pipeline operators think.</p>

<hr />

<h2 id="-using-synthetic-data-to-train-the-classifier">🧪 Using Synthetic Data to Train the Classifier</h2>

<p>To wrap my head around how to technically do this, I thought of generating some synthetic pipeline data to learn how to do this.</p>

<table>
  <thead>
    <tr>
      <th>State</th>
      <th>What’s Happening Physically</th>
      <th>Why It Matters for Modeling</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Normal Operation</td>
      <td>Pumps running steady, linepack stable</td>
      <td>Establishes the baseline “normal” pattern the model must learn</td>
    </tr>
    <tr>
      <td>Shutdown Conditions</td>
      <td>Flow and pressure bleed down, pumps slow, valves close</td>
      <td>A naturally abnormal-looking state that should not be treated as a threat</td>
    </tr>
    <tr>
      <td>Leak Scenario</td>
      <td>Pressure drops while flow and vibration increase</td>
      <td>The critical anomaly we want the model to reliably detect</td>
    </tr>
  </tbody>
</table>

<p>We generated these states by sampling from realistic ranges of flow rate, pressure, temperature, vibration, pump speed, and valve position, each representing a different part of pipeline behavior.</p>

<p>We validated how well (or not) the feature engineering went by leveraging EDA techniques to understand the relationship between the features. As I understand it, the less overlap between features, the better. In the below example, you can see some plot overlaps between normal and shutdown which could lead to false positives. Also, in retrospect, I wish I did not display balanced (i.e., normal) flow in red 😂.</p>

<p><a href="/blog/assets/images/blog_images/why-detecting-o&amp;g-pipeline-anomalies-is-so-hard/eda_xgboost.png"><img src="/blog/assets/images/blog_images/why-detecting-o&amp;g-pipeline-anomalies-is-so-hard/eda_xgboost.png" alt="EDA XGBoost" /></a></p>

<hr />

<h2 id="️-next-steps">⏭️ Next Steps</h2>

<p>The ML model is just one piece of broader operational architecture. The output needs to be trusted, repeatable, and actionable across hundreds of pipeline systems with different equipment, environmental conditions, and product characteristics.</p>

<p>We’re approaching the next stage in three main areas:</p>

<h3 id="-pairing-ml-models-with-a-rule-based-engine">🤝 Pairing ML Models with a Rule-Based Engine</h3>

<p>While the classification model helps interpret multi-sensor patterns, pipeline operations still rely on deterministic safety logic that cannot be replaced. So instead of choosing between ML or rules, the goal is to combine them:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Role in the System</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Rule-Based Engine</td>
      <td>Handles explicit operating constraints and known alarm conditions (e.g., high-pressure trip, pump vibration alarm). These rules already exist and are trusted by operations.</td>
    </tr>
    <tr>
      <td>Classification Model (e.g., XGBoost)</td>
      <td>Looks across multiple signals to interpret the broader operating state and detect subtle or emerging anomalies that rules do not capture.</td>
    </tr>
    <tr>
      <td>Operating Context Layer</td>
      <td>Ensures alerts are only considered during meaningful states (e.g., ignore anomaly-like signals during startup or pigging because inhibits and transitions are expected).</td>
    </tr>
  </tbody>
</table>

<p>This approach reduces false positives while maintaining operational trust, something purely ML-driven anomaly detection rarely gets right on the first try.</p>

<h3 id="️-model-governance-and-lifecycle-management">🗂️ Model Governance and Lifecycle Management</h3>

<p>To move beyond proof-of-concept, we need consistent versioning, reproducibility, and centralized governance:</p>

<table>
  <thead>
    <tr>
      <th>Tool</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MLflow</td>
      <td>Track training runs, feature configurations, and performance metrics. Ensures we know what changed when improving models.</td>
    </tr>
    <tr>
      <td>Unity Catalog</td>
      <td>Provides versioning, permission control, and lineage for both models and features. This is key to auditability and cross-team adoption.</td>
    </tr>
    <tr>
      <td>Feature Store / Feature Engineering Layer</td>
      <td>Ensures features like “pressure delta over 10 minutes” are computed once, consistently, everywhere.</td>
    </tr>
  </tbody>
</table>

<h3 id="-data--pipeline-orchestration-at-scale">🚦 Data &amp; Pipeline Orchestration at Scale</h3>

<p>The data pipeline itself matters as much as the model:</p>

<ul>
  <li>We ingest PI tag data (and related historian telemetry).</li>
  <li>We cleanse, align timestamps, and compute features.</li>
  <li>We classify operating state + evaluate for anomalies.</li>
  <li>We push outputs to dashboards and alerting systems.</li>
</ul>

<p>To manage this at operational scale, we are evaluating <strong>Lakeflow Declarative Pipelines</strong> to orchestrate:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ingestion → Tag Aggregation → Feature Computation → Model Inference → Alert Routing
</code></pre></div></div>

<p>And last, but certainly not least, continue working with our OT teams to encode operational playbooks into the context layer.</p>

<p>Thanks for reading 😊</p>]]></content><author><name>Conner Schiissler</name></author><category term="Data Science" /><category term="Data Engineering" /><category term="Oil &amp; Gas" /><summary type="html"><![CDATA[Disclaimer: PI data, pipeline sensor data, and the operational use cases around them are areas I’m actively learning. There are experts at Plains and across the industry who know this inside out. I’m approaching this from a technical perspective, and it’s a complex topic. This blog is meant to capture ideas and lessons from others while helping me solidify my own understanding. I’ve done my best to ensure accuracy, but please treat this as a learning perspective rather than authoritative guidance.]]></summary></entry><entry><title type="html">🤖 Databricks MCP Catalog: The Missing Operational Layer for Enterprise Agents</title><link href="https://schiiss.github.io/blog/genai/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/" rel="alternate" type="text/html" title="🤖 Databricks MCP Catalog: The Missing Operational Layer for Enterprise Agents" /><published>2025-11-09T07:00:00-07:00</published><updated>2025-11-09T07:00:00-07:00</updated><id>https://schiiss.github.io/blog/genai/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents</id><content type="html" xml:base="https://schiiss.github.io/blog/genai/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/"><![CDATA[<p><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/blog_image.png" alt="" /></p>

<p>The amount of adoption and hype around <strong>Model Context Protocol (MCP)</strong> has pleasantly surprised me lately. While I was skeptical at the start regarding its value proposition and how widely (or not) it would be adopted, I’m now seeing how it’s coming together.</p>

<p>The most important aspect of agents are their <strong>tools</strong>, as they give them access to your enterprise context as well as the outside world. I’m starting to see how MCP will enable more seamless agent integration with the outside world.</p>

<p>At Plains, we’ve fully embraced the <strong>Lakehouse architecture</strong>, and the possibilities expand dramatically when you have data from your critical applications centralized in one place. This presents some fascinating opportunities for MCP, which I’ll explore in detail below.</p>

<p>Databricks has been releasing lots of exciting features during their ‘Week of Agents’ campaign and I wanted to cover their recent release around the <a href="https://www.databricks.com/blog/accelerate-ai-development-databricks-discover-govern-and-build-mcp-and-agent-bricks?utm_source=bambu&amp;utm_medium=social&amp;utm_campaign=advocacy&amp;blaid=8065278">MCP catalog and marketplace</a>.
I wanted to put together a short blog post on how to get started and where I see some opportunities.</p>

<hr />

<h2 id="-announcements">📢 Announcements</h2>

<p>There were a few exciting announcements around MCP during this release, but it really boils down to discovery, governance, and actionability of MCP endpoints. On the discovery side, you can now explore external MCP servers via the marketplace:</p>

<p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/extrernal_mcp_marketplace.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/extrernal_mcp_marketplace.png" alt="External MCP Catalog" /></a></p>

<p>There are some neat MCP integrations to S&amp;P and Nasdaq and there are a few free ones to get started with. I will cover the easiest way to explore external MCP shortly.</p>

<p>You can also more easily discover MCP servers hosted within your Databricks environment. If you think in terms of data products, discoverability is an important attribute of them, and I see MCP servers becoming products within your data and AI platform that will start to take on a lifecycle of their own and enable agents within your environment to interact with your enterprise data.</p>

<p>You can now explore MCP servers within your environment in the ‘Agents’ tab:</p>

<p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/discover_mcp_servers.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/discover_mcp_servers.png" alt="Discover MCP Servers" /></a></p>

<p>Lastly, the Multi-Agent supervisor now supports connecting to external MCP servers, giving agents access to AI-ready external data. I fully expect the offerings in this catalog to expand in the future.</p>

<p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_external_mcp.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_external_mcp.png" alt="MAS External MCP" /></a></p>

<hr />

<h2 id="-getting-started">🚀 Getting Started</h2>

<p>I would say the easiest way to get started with testing some of these new features is to select one of the free external MCP servers. I suggest the Tavily MCP that allows agents to hook into the web.</p>

<ol>
  <li>Navigate to <a href="https://app.tavily.com/">Tavily</a> and create an account. You get 1000 free API calls a month to test out the functionality!</li>
  <li>
    <p>Next, navigate to the Databricks Marketplace and find the Tavily MCP server and select ‘Install’.</p>

    <p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_1.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_1.png" alt="Tavily Marketplace Setup Step 1" /></a></p>
  </li>
  <li>
    <p>Finally, place your API key from <a href="https://app.tavily.com/">Tavily</a> into the bearer token input and click ‘Install’.</p>

    <p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_2.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_2.png" alt="Tavily Marketplace Setup Step 2" /></a></p>
  </li>
  <li>
    <p>You should now be able to see your newly created Tavily MCP server in the catalog:</p>

    <p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_3.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_3.png" alt="Tavily Marketplace Setup Step 3" /></a></p>
  </li>
  <li>
    <p>Now the fun part begins and we can create a multi-agent supervisor and give it access to the new MCP server:</p>

    <p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_4.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_4.png" alt="Tavily Marketplace Setup Step 4" /></a></p>
  </li>
  <li>
    <p>Asking the agent in the playground, <code class="language-plaintext highlighter-rouge">what tools do you have access to?</code> is a good first test to make sure the agent has access to be able to interact with the Tavily MCP server:</p>

    <p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_5.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/tavily_marketplace_step_5.png" alt="Tavily Marketplace Setup Step 5" /></a></p>
  </li>
  <li>
    <p>Let’s give the agent a run for its money and ask a tough question:</p>

    <blockquote>
      <p>For companies in the midstream oil &amp; gas sector, what are the top three real-world use cases where multi-agent systems have reached production? Include provider names, architecture patterns, and ROI details if available.</p>
    </blockquote>

    <p>The agent seems to spin for about a minute, but no answer is returned:</p>

    <p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_first_question.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_first_question.png" alt="First MAS Question" /></a></p>

    <p>The trace also does not seem to indicate that the Tavily MCP server was called; however, I can see the calls on my Tavily subscription.</p>

    <p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_first_question_trace.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_first_question_trace.png" alt="First MAS Question Trace" /></a></p>
  </li>
  <li>
    <p>Let’s try another question!</p>

    <blockquote>
      <p>How does the Model Context Protocol (MCP) differ from OpenAI Function Calling?</p>
    </blockquote>

    <p>This one worked really well. I can see in the MLflow trace that the Tavily MCP server was called multiple times to generate a file response:</p>

    <p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_second_question_trace_1.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_second_question_trace_1.png" alt="Second MAS Question Trace 1" /></a></p>

    <p><a href="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_second_question_trace_2.png"><img src="/blog/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas_second_question_trace_2.png" alt="Second MAS Question Trace 2" /></a></p>

    <p>I am also impressed with the answer. For reference, you can view the <a href="https://github.com/Schiiss/blog/blob/master/assets/images/blog_images/databricks-mcp-catalog-the-missing-operational-layer-for-enterprise-agents/mas-mcp-vs-openai-functions-response.md">agents response</a> to my second question and see how it did.</p>
  </li>
</ol>

<blockquote>
  <p><strong>Note</strong>: I have converted the response to markdown for better readability.</p>
</blockquote>

<hr />

<h2 id="-use-cases">💡 Use Case(s)</h2>

<p>I’ll admit, some of the MCP use cases in the market right now still feel a bit aspirational. But I think that’s the point. The value comes before the use case. Getting ahead of the curve on how systems will be accessed by agents and ensuring your data is structured, cataloged, and centralized is what positions you to move fast when the patterns become clear.</p>

<p>Instead of waiting for the “perfect” GenAI use case to fall into our lap, the strategy is to get the prerequisites in place now:</p>

<ul>
  <li>Data centralized</li>
  <li>Metadata and governance</li>
  <li>Services exposed consistently through something like MCP</li>
</ul>

<p>Then iterate on use cases as the opportunities show up.</p>

<p>One area where this does feel tangible today is with the sensor data we collect from our pipelines. We are already well underway on a build out of an anomaly detection layer (both a rules-based engine and an ML classifier, e.g., XGBoost). But once an anomaly is flagged, an operator still has to go and validate whether it is real or explainable. This is a highly manual workflow that often requires switching between SCADA, historian data, Maximo, and inspection/maintenance logs.</p>

<p>An MCP-connected agent could replicate the exact steps an experienced operator takes to triage the anomaly but do it in seconds instead of minutes or hours.</p>

<h3 id="the-agents-workflow">The Agent’s Workflow</h3>

<p>Once an anomaly is detected, the agent would:</p>

<ol>
  <li><strong>Automatically trigger from the alert</strong>
    <ul>
      <li>Input comes from the classifier/rule engine</li>
      <li>Pass along signal metadata (location, severity, timestamp, sensor type)</li>
    </ul>
  </li>
  <li><strong>Check maintenance history via Maximo</strong>
    <ul>
      <li>“Is there a work order already open for the affected equipment?”</li>
      <li>If yes → annotate the alert as operationally explained</li>
    </ul>
  </li>
  <li><strong>Query SCADA / historian for recent operating patterns</strong>
    <ul>
      <li>Compare pressures/flows/temperatures before and after anomaly</li>
      <li>Detect step changes vs gradual drift vs sensor noise patterns</li>
    </ul>
  </li>
  <li><strong>Look at nearby related assets</strong>
    <ul>
      <li>Sometimes upstream equipment explains downstream anomalies</li>
      <li>Check if other components in the same block also changed state</li>
    </ul>
  </li>
  <li><strong>Retrieve relevant inspection or ILI records</strong>
    <ul>
      <li>Determine if this area has prior corrosion, fatigue, or coating damage history</li>
    </ul>
  </li>
  <li>
    <p><strong>Return a structured human friendly report</strong></p>

    <p>Example output:</p>

    <blockquote>
      <p>Anomaly at Station 12 Compressor Train B.
No active work order. Upstream suction pressure dropped 8% at the same timestamp, possibly due to upstream supply swings. Last ILI shows no integrity red flags. Recommend monitoring; no immediate action required.</p>
    </blockquote>
  </li>
</ol>

<p>The cool part with this potential use case is MCP standardizes how agents access enterprise systems, and with MCP servers discoverable via the MCP catalog, other agent builders can tie into existing functionality.</p>

<hr />

<h2 id="-conclusion">🎯 Conclusion</h2>

<p>The introduction of the MCP catalog and marketplace represents a significant step forward in making enterprise AI agents more accessible and operational. By providing a centralized hub for discovering, managing, and utilizing MCP servers, Databricks is addressing key challenges in enterprise agent development. As the ecosystem grows and more organizations contribute their MCP servers, we’ll likely see an acceleration in the adoption of agent-based solutions across industries.</p>

<p>It would be really neat if eventually these MCP servers were also exposed in Databricks One for the business to discover agents and the tools they can access.</p>

<p>These features are still in beta, so it is fully expected to run into the issue we saw in the ‘Getting Started’ section where the agent did not reply at all. I am excited to see where this goes next.</p>

<p>Thanks for reading! 🙂</p>]]></content><author><name>Conner Schiissler</name></author><category term="GenAI" /><category term="Databricks" /><category term="MCP" /><category term="Enterprise AI" /><category term="Data Platforms" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">📈🚀 Where I’m Getting the Most Value from Generative AI Today: Batch Inference at Scale</title><link href="https://schiiss.github.io/blog/genai/where-im-getting-the-most-value-from-generative-ai-today-batch-inference-at-scale/" rel="alternate" type="text/html" title="📈🚀 Where I’m Getting the Most Value from Generative AI Today: Batch Inference at Scale" /><published>2025-10-17T07:00:00-07:00</published><updated>2025-10-17T07:00:00-07:00</updated><id>https://schiiss.github.io/blog/genai/where-im-getting-the-most-value-from-generative-ai-today-batch-inference-at-scale</id><content type="html" xml:base="https://schiiss.github.io/blog/genai/where-im-getting-the-most-value-from-generative-ai-today-batch-inference-at-scale/"><![CDATA[<p><img src="/blog/assets/images/blog_images/where-im-getting-the-most-value-from-generative-ai-today-batch-inference-at-scale/blog_image.png" alt="" /></p>

<p>I have found myself in a rollercoaster of emotions over the past two years during this AI Hype cycle. There is a visual model called the ‘Gartner Hype Cycle’ that has been catching on, namely in the AI community. It articulates the various stages of hype as a given technology matures.</p>

<p><a href="/blog/assets/images/blog_images/where-i’m-getting-the-most-value-from-generative-ai-today-batch-inference-at-scale/gartner_hype_cycle.jpg"><img src="/blog/assets/images/blog_images/where-im-getting-the-most-value-from-generative-ai-today-batch-inference-at-scale/gartner_hype_cycle.jpg" alt="Gartner Hype Cycle" /></a></p>

<p>I have been guilty in the past of riding the hype cycle and inflating my expectations of what the technology can do.</p>

<p>I have found myself in the ‘trough of disillusionment’ a few times over the past year, especially as I have tried to roll out customer-facing agents and have run into a whole host of issues getting them into production.</p>

<p>Is it because the technology is limited? Or is it because I have not figured out the best practices to follow to implement this in an optimal way?</p>

<p>I think as we near the ‘Plateau of Productivity’ the industry will start figuring out the ‘optimal way’ to leverage this technology going forward.</p>

<p>In this blog, I wanted to share <strong>where I am deriving the most value from Generative AI</strong> and the use cases we have been working on.</p>

<h2 id="️-the-generative-ai-value-pyramid">🏔️ The Generative AI Value Pyramid</h2>

<p>Databricks has a great visual to articulate the value pyramid for generative AI:</p>

<p><a href="/blog/assets/images/blog_images/where-i’m-getting-the-most-value-from-generative-ai-today-batch-inference-at-scale/generative_ai_value_pyramid.png"><img src="/blog/assets/images/blog_images/where-im-getting-the-most-value-from-generative-ai-today-batch-inference-at-scale/generative_ai_value_pyramid.png" alt="Generative AI Value Pyramid" /></a></p>

<p>Many teams (and I am totally guilty of this too!), start by building fancy UI’s and chatbots first to start leveraging generative AI.</p>

<p>I have made the joke in the past to friends and fellow colleagues that if I have to build one more chatbot, I may go crazy 🤪.</p>

<p>You will notice in the pyramid visual above, Databricks mentions that <strong>‘batch inference delivers real value, fast!’</strong></p>

<p>But what is batch inference?</p>

<hr />

<h2 id="️-what-is-batch-inference">🤷‍♂️ What is Batch Inference?</h2>

<p><strong>Batch inference</strong> is when you run an AI (LLM) or machine learning model on a large set of data all at once instead of processing it piece by piece. This usually happens on a schedule, like a nightly job, using automated workflows or pipelines. The model processes the data, and the results, such as predictions, categories, or summaries, are saved back into a data system where they can be used for reports, applications, or decision making.</p>

<p>In practice, this means you’re enhancing your existing ETL or data engineering pipelines by adding AI as another transformation step. So instead of just cleaning, joining, and reshaping data, you can now enrich it with intelligence, like classifying documents, detecting anomalies, or generating insights at scale.</p>

<hr />

<h2 id="-how-does-batch-inference-work-in-databricks">🧱 How Does Batch Inference Work in Databricks?</h2>

<p>Databricks has the concept of <a href="https://learn.microsoft.com/en-us/azure/databricks/large-language-models/ai-functions"><strong>AI Functions</strong></a>, and AI Functions allow you to <em>‘apply AI, like text translation or sentiment analysis, on your data that is stored on Databricks’</em>.</p>

<p>The cool part here is that these functions are invokable via code and are easily integrated into your pipelines.</p>

<p>We will talk about some of the use cases I am seeing below, but let’s talk about concepts a bit more to make sure we have a good understanding of how they work.</p>

<p>Databricks exposes a number of <a href="https://learn.microsoft.com/en-us/azure/databricks/large-language-models/ai-functions#-task-specific-ai-functions"><strong>task-specific AI functions</strong></a>, that can do a variety of things like analyze sentiment, classify input text according to labels you provide, amongst other things.</p>

<p>There is also a function called <code class="language-plaintext highlighter-rouge">ai_query()</code> which enables you to invoke models hosted in the <a href="https://learn.microsoft.com/en-us/azure/databricks/machine-learning/model-serving/"><strong>Mosaic Model Serving</strong></a> platform. It basically calls LLM or ML models via their REST API’s.</p>

<p>This is incredibly powerful and we have used this to invoke foundational models, custom agents, and agents created via <a href="https://learn.microsoft.com/en-us/azure/databricks/generative-ai/agent-bricks/">Agent Bricks</a>.</p>

<p>Let’s talk about a few use cases we are working on to cement these concepts.</p>

<hr />

<h2 id="-use-cases">🤖 Use Cases</h2>

<p>There are a few use cases we are working on, some are in production, while others are still in development.</p>

<h3 id="️-classification">🏷️ Classification</h3>

<p>We are pulling incidents out of our <strong>EAM (Enterprise Asset Management System)</strong> into our Lakehouse. Inside the incidents table is a free text description field where operators in the Permian Basin manually record details whenever an incident occurs. These incidents can range from minor events, like a worker tripping on site, to more serious safety or equipment issues.</p>

<p>We applied the <code class="language-plaintext highlighter-rouge">ai_classify()</code> function to that free text description field to enable an LLM to categorize it. Our SQL function looked similar to the example below:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
    <span class="n">description</span><span class="p">,</span>
    <span class="n">ai_classify</span><span class="p">(</span>
        <span class="n">description</span><span class="p">,</span>
        <span class="n">ARRAY</span><span class="p">(</span>
            <span class="s1">'personal safety'</span><span class="p">,</span>
            <span class="s1">'equipment/mechanical failure'</span><span class="p">,</span>
            <span class="s1">'process/operational upset'</span><span class="p">,</span>
            <span class="s1">'environmental'</span><span class="p">,</span>
            <span class="s1">'vehicle/transportation'</span><span class="p">,</span>
            <span class="s1">'fire/explosion'</span><span class="p">,</span>
            <span class="s1">'regulatory/compliance'</span>
        <span class="p">)</span>
    <span class="p">)</span> <span class="k">AS</span> <span class="n">category</span>
<span class="k">FROM</span>
    <span class="n">incidents</span>
<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

<p>We then created a new column called <code class="language-plaintext highlighter-rouge">ai_generated_category</code> and worked with the end users to define what ‘good’ looked like.</p>

<p>This data was presented in a PowerBI report and provided efficiency gains when generating reports on incident trends at our facilities.</p>

<h3 id="-summarization">📝 Summarization</h3>

<p>Plains Midstream is going through a <a href="https://finance.yahoo.com/news/plains-american-paa-flexibility-enhances-055251476.html?guccounter=1&amp;guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&amp;guce_referrer_sig=AQAAAMq-JoCc5nl--zx8H9wyAV7ekDI1Sge5rwu3yPOPeQhrH2wcAdxHo86WEafhylYGCQUJVEBJWVr3qOLFuMdw6mfo-4yfy1NKy2KHxLDTtEJbY4UUoNaX6yS2FQFeFt0tpmbQeam3ku1aVkFaD-8XwEiP8_V9xGXviX7POdI4-eip"><strong>large divestiture</strong></a> effort right now and there is a lot of data that needs to be analyzed in an incredibly short window of time.</p>

<p>Amongst other things, contracts dating back to the 80’s must be reviewed and bucketed into categories of <em>‘review’</em>, <em>‘divest’</em>, or <em>‘retain’</em>.</p>

<p>There are several data points that need to be evaluated to make an informed choice as to what bucket to put that contract into.</p>

<p>For example:</p>

<ul>
  <li>Does that contract contain a sold or retained asset?</li>
  <li>What legal entity is mentioned in the contract?</li>
  <li>Are there any clauses like right of first refusal or change of control defined in the contract?</li>
</ul>

<p>To help with the analysis and indexing of <strong>20,000+ contracts</strong>, we decided to leverage Agent Bricks <a href="https://learn.microsoft.com/en-us/azure/databricks/generative-ai/agent-bricks/custom-llm#model-specialization"><strong>Custom LLM</strong></a> agent.</p>

<p>Leveraging that agent allows us to guide it by specifying fields like:</p>

<ul>
  <li><em>‘Identify ROFR Clause if it exists. Make sure to identify the section where it exists.’</em></li>
  <li><em>‘Identify if any sold assets are mentioned. Leverage the list below to identify the sold assets:’</em></li>
  <li><em>‘Identify legal entity if it exists. Make sure to identify the section where it exists.’</em></li>
</ul>

<p>Once the agent is created, we can invoke this ‘at scale’ across all <strong>20,000+ contracts</strong> via a SQL statement like so:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
  <span class="n">file_name</span><span class="p">,</span>
  <span class="n">ocr_text</span><span class="p">,</span>
  <span class="n">ai_query</span><span class="p">(</span>
    <span class="s1">'t2t-1234-endpoint'</span><span class="p">,</span>
    <span class="n">ocr_text</span><span class="p">,</span>
    <span class="n">failOnError</span> <span class="o">=&gt;</span> <span class="k">false</span>
  <span class="p">).</span><span class="k">result</span> <span class="k">AS</span> <span class="n">ai_summary</span>
<span class="k">FROM</span> <span class="k">catalog</span><span class="p">.</span><span class="k">schema</span><span class="p">.</span><span class="k">table</span>
</code></pre></div></div>

<p>This enables us to expose a summary of the contract so the lawyers can easily glance at the salient information for the divestiture.</p>

<p>We then leveraged Databricks Apps to create a custom front-end application where the lawyers could select the original PDF and contrast it with the AI-generated summary to ensure we are providing the source file that the LLM used to generate the summary.</p>

<h3 id="-structuring-data">🧩 Structuring Data</h3>

<p>The last use case I want to talk about is adding structure to unstructured data. We leverage the <a href="https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/ai_extract">AI extract function</a> called <code class="language-plaintext highlighter-rouge">ai_extract()</code>. Providing OCR’d text to this function lets you extract structured fields from it.</p>

<p>We parsed this structured data downstream, wrote it to a delta table, and then set up a <a href="https://learn.microsoft.com/en-us/azure/databricks/genie/">Genie Space</a> where an LLM can write SQL queries for you based on natural language input.</p>

<hr />

<h2 id="-conclusion">🥳 Conclusion</h2>

<p>Generative AI is still evolving, and while the hype is real, the most tangible value I’ve seen so far comes from integrating AI into existing data workflows, especially through batch inference. Whether it’s classifying incidents, summarizing contracts, or structuring messy data, the impact is clear when AI is applied thoughtfully and with the end user in mind. There’s still plenty to learn, but focusing on practical use cases is where the real progress happens.</p>

<p>Thanks for reading! 😀</p>]]></content><author><name>Conner Schiissler</name></author><category term="GenAI" /><category term="Databricks" /><summary type="html"><![CDATA[]]></summary></entry></feed>