Павло Щербуха

Logo

Персональна освітня сорінка

Розробка на Node.js, VUE.js, Python, IBM Integration Bus (App Connect Ent) , ORACLE PL/SQL
27 January 2026

From PL/SQL to PySpark: How to Build an On-Premice Lakehouse with Delta Lake

by Pavlo Shcherbukha

1. Introduction

For over two decades, I’ve lived in the world of relational databases. I’m used to the comfort of ACID transactions, the rigor of DDL, and the reliability of Oracle. But the data landscape is changing. When “Big Data” first arrived, it felt like a step backward into a “Data Swamp” of messy files. Then came Delta Lake.

In this article, I want to show you how a DB professional can build a modern Lakehouse architecture in their data center. And to get you started, how to build a modern Lakehouse architecture right on your laptop using Docker, PySpark, and Delta Lake. It’s not just about files; it’s about bringing the maturity and experience of a developer from the DB layer to the scalable world of Spark.

But first, I’ll briefly tell you how I got here.

Despite the fact that I continue to work with the ORACLE database, the focus of my attention has constantly shifted towards cloud tools. It was not a one-step decision and it was not a one-step solution. It was a long journey of trial and error, sometimes stopping or rolling back, and then repeated attempts to learn tools, studying architectural approaches and patterns. There was an attempt to initiate and build an On-premise Cloud and it turned out to be successful, in my opinion. I tried to study different platforms of different cloud providers. Recently, I have focused on AZURE, because I liked the functionality of their cloud the most. But over time, I realized that no matter how much I study clouds and campaign for clouds, there are cases when this or that client will not go to public clouds due to various restrictions. Most often regulatory ones. Therefore, while getting acquainted with the AZure Cloud Product Microsoft Fabric, I asked myself two questions:

Very briefly about Microsoft Fabric

There is an enterprise product in the Azure cloud called Microsoft Fabric. It is a product that can automate many aspects of a corporation’s work. You can read more about it at the link: Get started with Microsoft Fabric.

The basis of this product is::

In addition, there are a large number of other Azure products that allow you to combine integrations, reporting, and access control. But all of this revolves around the Spark compute cluster and Delta Lake structures.

As a developer who has designed and developed more than one application in ORACLE Database, it is very difficult for me to change the design logic to Delta Lake friendly. Here I will try to explain why developers who are used to designing applications in an enterprise database environment: ORACLE, MSSQL…. should take SPARK and LakeHouse with Delta Lake seriously when designing applications. Personally, I have a lot of internal conflicts that prevent me from taking this product seriously.

My first internal conflict: Don’t get lost in the bright marketing of Microsoft Fabric. The various connectors (Data Factory, Copy Tool, Pipelines) are just “hoses” through which water flows (i.e. data is pumped). But if you haven’t built the right “reservoir” (Lakehouse structure), then these hoses will simply create chaos and your lake will turn into a swamp. Therefore, the first thing you need to do is learn how to build the right data structures in Lakehouse.

My second internal conflict: to change the logic of thoughts and algorithm construction, which we are accustomed to for large corporate databases such as ORACLE, to the logic of Delta Lake files.

In fact, these “internal conflicts” are a general conflict between the Tool-oriented and Architecture-oriented approaches.

Next, I will try to prove why developers of enterprise products should take Spark and LakeHouse with Delta Lake seriously, even if you are not ready to use Microsoft Fabric or give your data to the cloud at all. And I will explain a little about the LakeHouse with Delta Lake data architecture.

2. The Architectural Pillars of Delta Lake

Here are the 8 pillars on which the logic of the structure design in Delta Lake rests:

  1. Medallion Architecture (Logical Zoning)

These are not just folder names, these are levels of quality and trust in the data. We should not design “tables”, but “data paths”:

  1. Partitioning (Physical Design)
  1. Idempotence and Keys

Since there are no Foreign Keys, we design the structure so that each transformation step is idempotent. That is, if we run the same JSON load 10 times, the table structure should remain intact. We design the MERGE logic as a foundation for stability.

At Oracle, we designed the structure to minimize duplicates (Normalization). At Lakehouse, we often design the structure to optimize reads (sometimes allowing for denormalization), because disk storage is cheap and Spark cluster runtime is expensive.

  1. There are no cursors that are familiar to ORACLE.

Traditional ORACLE developer conflict: In ORACLE I could open a cursor and loop through the records and validate each record, but in SPARK it’s an anti-pattern. Looping through the records in Spark via .collect() or using Python cursors is really an anti-pattern, as it kills distributed computing and turns our powerful cluster into one slow computer.

  1. The Transaction Log (The “Heart” of Delta)

Think of this as the Oracle Redo Log or Archive Log. Every single change is recorded in a JSON-based transaction log (_delta_log). This ensures ACID compliance: if a write fails halfway, the system knows exactly how to roll back, leaving your “Data Lake” clean and consistent.

  1. Schema Enforcement & Evolution

In a traditional Data Lake, you can accidentally drop a “corrupted” file into a folder and break every downstream report. Delta Lake acts as a gatekeeper. It validates data types during write (Enforcement) but also allows for graceful schema updates (Evolution) when business requirements change.

  1. Time Travel (Data Versioning)

This is a “killer feature” for any DBA. Since every transaction is logged, you can query the state of your table at any specific point in time or version. Need to see the balance as of last Monday? Just use VERSION AS OF. It’s like having a built-in Flashback Query.

  1. Unified Batch and Streaming

Delta Lake blurs the line between historical data processing and real-time streams. You can treat a Delta table as both a static source for batch jobs and a continuous sink for streaming data, ensuring that your analytics are always up to date.

3. A little excursion into history and comparisons

I had developed reporting for commercial banks in the NBU for about 10 years, and even on FOXPRO 2.0. Then we tried to migrate all this to ORACLE (it was version 10). There we encountered problems when materialized VIEW aggregated and non-aggregated worked slower than FoxPro, which merges all the necessary data onto the user’s machine, sequentially scans the tables and prepares aggregated data, and then returns them to network dbf files. Moreover, the most reliable option in ORACLE was PL/SQL coding with a number of Oracle OLAP tools and parallelism in queries. But there is a separate problem there. There, the queries are kilometer long.

Why did FoxPro seem faster than Oracle?

FoxPro (as in Clipper/dBase) had a “navigational” access architecture. Data was fed to the client and the local processor “chewed” it sequentially. This worked quickly at small to medium volumes because there was no overhead of managing transactions, locking, and the SQL server network protocol.

Oracle is a “heavy” server. When we did Materialized View, Oracle tried to be too smart: checking consistency, updating indexes, monitoring Undo/Redo logs. And kilometer-long PL/SQL queries appeared because we tried to cram complex procedural logic into declarative SQL.

Why is Lakehouse (Fabric/Spark) “FoxPro on steroids”?

The Lakehouse architecture is ideologically closer to FoxPro than to Oracle, but with the power of thousands of servers:

Scans instead of indexes: Like FoxPro, Spark loves to “pour” large data sets through itself (Full Table Scan). But where FoxPro did it in a single thread, Spark does it in parallel.

Separation of compute and storage: In Oracle, disk and processor are “married.” In Lakehouse, the data resides in Delta/Parquet (like our good old .dbf), and the Spark cluster is a temporary “processor” that jumps on these files, processes them, and shuts down.

Logic instead of server magic: In Oracle, we relied on the Optimizer. In Spark, we write the plan ourselves (like in FoxPro): “Read this, filter that, do a Join.” This gives us the same control we had in the 90s, but at petabytes.

How does Spark solve the problem of “mile-long queries”?

In Oracle, we wrote one giant SELECT because each intermediate step (table) was expensive (disk I/O). In Spark, we use DataFrames. This allows us to break a “mile-long query” into 10 small, understandable steps. Each step is stored in memory (Lazy Evaluation), and Spark itself assembles them into the final execution plan.

4. Consider a typical business case

There is a core of the banking system, which is a “thing in itself” based on ORACLE. There is a prefix to this system on another ORACLE database, called DWH or GeneralLdge. Its function is:

The problem is that we can’t do with just aggregation (SELECT sum(), count() FROM …). The load and calculation process runs from the service user. There are several parallelization threads and parallelization at the query level. There are a bunch of cursors and thousands of PL/SQL lines. But the problem is that it eats up all the processors available to the ORACLE server. That is, there are still some on the server, but the database doesn’t see more than a certain number of them. And there is a significant disk load, such that you can cook a lot of food on the disk rack for all admins. And there is also a network load, since you have to transfer gigabytes from one database to another.

Now let’s look at it all from another side.

Questions to think about:

If we don’t need such expensive computing power 24*7 and the ORACLE database does not perform its main functions (meaning servicing transactions from concurent users), then it might make sense to think about making all this cheaper, namely:

Of course, it is quite easy to raise a Spark cluster in the cloud: Azure, IBM, AWS, and each of these providers has S3-compatible Object Storage for storing Delta files.

But, since I touched on the banking sector, a cloud from hyperscalers is probably not an option, so let’s consider the option of building “your own cloud” in your own bank data center.

5. Data storage: Your own cloud in a data center

For Spark to work, you need to install the Spark cluster itself somewhere. Since everyone has already heard of Kubernetes and Openshift, this is the best option for a bank that already uses containerization.

  1. Spark on Kubernetes/OpenShift (The Modern approach)

How ​​it works: Spark is launched as a set of pods. When a task comes in, Kubernetes creates a Spark Driver and the required number of Executors. Once the calculations are complete, the resources are released.

Pros: Efficient use of hardware, resource isolation, easy scaling. OpenShift adds enterprise security and a user-friendly interface.

  1. Spark on virtual machines (Standalone Mode)

This is a simpler option if you don’t have Kubernetes. You just allocate a few servers (or VMs), put Java and Spark on them. One server becomes the Master, the others are Workers.

Pros: Minimum “magic”, everything is clear to system administrators.

Cons: Less flexible resource management compared to K8s.

  1. Storage (Where to keep Delta files?)

In the cloud it is S3 or ADLS. Locally - you will need analogues:

In order for Spark to work with Delta files locally, we need a storage that supports the S3 protocol. Although MinIO is popular, for banking systems it is worth considering alternatives with more loyal licenses:

  1. Use PostgreSQL like Hive Metastore (the place where Spark stores a list of your silver_ledger, gold_balance tables).

This allows you to build a Data Lakehouse that does not physically leave the bank’s perimeter, but works with the flexibility of cloud solutions.

6. What I would personally choose for a bank that can’t go to the cloud but wants to get the speed of Spark:

  1. SeaweedFS — Object Storage (S3/OneLake Replacement)

This is the foundation. Spark doesn’t work with “just folders” on disk as efficiently as it does with object storage.

Why SeaweedFS: It’s incredibly fast for small files (Delta metadata), easier to configure than Ceph, and has a “pure” Apache license.

Role: This is where our Delta files (Bronze, Silver, Gold) reside. Spark accesses it via the S3 protoc

  1. PostgreSQL — Metadata catalog (Hive Metastore)

Spark is a memoryless “brain”. It needs somewhere to store table definitions: where they are located and what their structure is.

Role: Performs the Hive Metastore function. When we write spark.table(“silver_ledger”) in the code, Spark goes to Postgres, learns the path to the files in SeaweedFS in a millisecond and starts the calculation.

Why Postgres: It is the most reliable database for storing metadata, which is available in every bank.

  1. Spark на Kubernetes (K8s) — Computing core

This is our engine, which we have taken outside of Oracle.

Role: Does the heavy lifting (Joins, Aggregations, Merges).

Elasticity: At the time of nightly closing of the day, K8s allocates, for example, 500 GB of RAM and 100 cores to the Spark cluster. As soon as the balance is restored, the resources are instantly returned to the other systems of the bank.

How it works together (Interaction diagram)

This is the perfect tech stack for a “Private Lakehouse” inside a bank. It is completely Open Source, does not require cloud licenses, and yet provides Enterprise-level reliability. For those who can’t go to the cloud but want the speed of Spark, we’re putting together a stack of proven components.

Як це працює разом (Схема взаємодії)

It should be noted here that we do not transfer everything every day. Our Notebook knows the ‘High Watermark’ (last processed date) and only requests new records from Oracle. This minimizes network traffic and load on Redo Logs.

Why is this stack “mature”?

This is an architecture that grows with the bank, rather than requiring you to replace the server every time the number of transactions doubles.

7. ETL based on Jupyter Notebooks

Given that calculations are done in Jupyter Notebooks, it would be logical to use this tool for ETL (ELT) processes as well. This idea is not new at all, although I first encountered it in Microsoft Fabric, but then I found many products that use the same approach.

We abandon disparate scripts and closed ETL platforms. Using Jupyter Notebooks (based on Papermill or Airflow or Mage.ai) as an engine for unloading allows:

As options:

  1. Direct JDBC upload via Spark (The “Simple” Way)

Spark can connect to Oracle as a source. You simply create a DataFrame from an Oracle SQL query and store it as Parquet in your repository.

Pros: It’s as simple as possible and doesn’t require additional licenses

  1. Orchestration Tools (The “Enterprise” Way)

There are products that allow you to manage these notebooks (run them sequentially, track errors):

*Why Mage.ai?** This is the next generation of ETL tools. Unlike Airflow, Mage is built around the idea of ​​’Data as a First-class Citizen’. It visualizes the data flow between code blocks, allowing you to see the result of the transformation at each step, which makes debugging banking algorithms many times faster.

At the Extraction stage, we do not burden Oracle with complex procedures. We use a lightweight Notebook that does ‘Select’ by index (e.g. by date) and immediately converts the data into a compressed Parquet format. This is faster than any standard CSV export, because Spark can unload data from Oracle in parallel (via multiple JDBC connections at the same time).

Separately, we should mention CDC (Change Data Capture) products. This is when we do not do Select from the database, but read Oracle logs (Redo Logs) to collect changes in real time without any load on the database CPU. For example Debezium.

7.1. Transparency and Manageability: A Manager’s Perspective

For IT management, the transition to Spark and Mage.ai is primarily a transition from “chaos of scripts” to “data factory.”

pic-1 Data Pipeline Management

Instead of thousands of PL/SQL lines scattered throughout the database, we get visual control over each step of the transformation. This allows the bank to respond faster to changes in legislation or NBU requirements, without risking “breaking” the main database.

8. Appeal to developers

Here I will provide a diagram that reflects the ideal architecture that I was trying to convey to the readers.

graph TD
    subgraph "The Sources (Perimeter)"
        A[Oracle ERP/Core] -- "JDBC / CDC Incremental" --> B[Notebook ELT]
    end

    subgraph "Infrastructure Layer (On-Premise Stack)"
        B --> C{SeaweedFS / Ceph}
        D[(PostgreSQL)] -- "Hive Metastore: Metadata & Schema" --> E[Spark Cluster]
    end

    subgraph "Medallion Architecture (Data Path)"
        C -- "Raw Files" --> F((BRONZE))
        F -- "Validation & Normalization" --> G((SILVER))
        G -- "Aggregations & Business Logic" --> H((GOLD))
        
        E -.-> F
        E -.-> G
        E -.-> H
    end

    subgraph "The Consumers"
        H --> I[Power BI / Reporting]
        H --> J[National or Central Bank Reports]
        H -- "Summary Delta" --> K[Oracle DWH / GL]
    end

    style A fill:#f96,stroke:#333
    style E fill:#69f,stroke:#333,stroke-width:4px
    style D fill:#4db,stroke:#333
    style H fill:#ffd700,stroke:#333

Description of key nodes:

Explanation of architectural nodes

How are Idempotence and Reliability implemented here?

For Oracle developers accustomed to transactions, these points are critical:

to summarize: We get a system where the code is separated from the data. We can completely reinstall the Spark cluster or change the Python version, but our data in SeaweedFS and its descriptions in Postgres remain intact. This is the same “architectural maturity” that we strive for.

An Oracle developer does not cease to be a developer. He simply expands his horizons with other modern tools, carrying out calculations that are not very inherent in databases on other tools. This is, I would say, the opportunity for a very sharp transition, which gives the opportunity to modernize your skills and make them quite universal.

9. Where to start for an ORACLE developer

To start learning Spark and everything related to it, you don’t need to deploy this entire infrastructure. One or three containers with Spark and jupyter lab are enough for a postman. Moreover, I am somehow sure that once you get used to these tools, developers can transfer their routine to Jupyter Notebook (Jupyter Lab).

To get started, you can use the Portable Jupyter Lab Winpython. It doesn’t have Spark, but it’s a very convenient platform to learn Python coding and Jupyter Notebook.

As an integration tool for some ETL processes, you can use a plugin for Jupyter Lab called elyra and in Docker elyra in Docker.

How to accelerate and configure is described at the link on github: sh-jpylab-spark. Several examples of using Spark are also built there.

tags: