Personal educational page
by Pavlo Shcherbukha
For over two decades, I’ve lived in the world of relational databases. I’m used to the comfort of ACID transactions, the rigor of DDL, and the reliability of Oracle. But the data landscape is changing. When “Big Data” first arrived, it felt like a step backward into a “Data Swamp” of messy files. Then came Delta Lake.
In this article, I want to show how a DB professional can build a modern Lakehouse architecture in their own data center, and, to get you started, how to build one right on your laptop using Docker, PySpark, and Delta Lake. It’s not just about files; it’s about bringing the maturity and experience of a database-layer developer to the scalable world of Spark.
But first, I’ll briefly tell you how I got here.
Although I continue to work with the Oracle database, my attention has steadily shifted toward cloud tools. It was not a one-step decision and it did not happen overnight: it was a long journey of trial and error, of stopping or rolling back and then trying again, of learning tools and studying architectural approaches and patterns. There was an attempt to initiate and build an on-premise cloud, and it turned out, in my opinion, to be successful. I tried platforms from different cloud providers and recently focused on Azure, because I liked the functionality of that cloud the most. Over time, however, I realized that no matter how much I study clouds and advocate for them, there are cases when a client will not move to a public cloud due to various restrictions, most often regulatory ones. So, while getting acquainted with the Azure cloud product Microsoft Fabric, I asked myself two questions:
Very briefly about Microsoft Fabric
There is an enterprise product in the Azure cloud called Microsoft Fabric. It is a product that can automate many aspects of a corporation’s work. You can read more about it at the link: Get started with Microsoft Fabric.
The basis of this product is:
In addition, there are a large number of other Azure products that allow you to combine integrations, reporting, and access control. But all of this revolves around the Spark compute cluster and Delta Lake structures.
As a developer who has designed and developed more than one application on the Oracle Database, I find it very difficult to change my design logic to be Delta Lake friendly. Here I will try to explain why developers who are used to designing applications in an enterprise database environment (Oracle, MSSQL, …) should take Spark and a Lakehouse with Delta Lake seriously when designing applications. Personally, I have a lot of internal conflicts that prevent me from taking this product seriously.
My first internal conflict: Don’t get lost in the bright marketing of Microsoft Fabric. The various connectors (Data Factory, Copy Tool, Pipelines) are just “hoses” through which water flows (i.e. data is pumped). But if you haven’t built the right “reservoir” (Lakehouse structure), then these hoses will simply create chaos and your lake will turn into a swamp. Therefore, the first thing you need to do is learn how to build the right data structures in Lakehouse.
My second internal conflict: changing the way of thinking and building algorithms that we are used to with large corporate databases such as Oracle to the logic of Delta Lake files.
In fact, these “internal conflicts” are a general conflict between the Tool-oriented and Architecture-oriented approaches.
Next, I will try to prove why developers of enterprise products should take Spark and LakeHouse with Delta Lake seriously, even if you are not ready to use Microsoft Fabric or give your data to the cloud at all. And I will explain a little about the LakeHouse with Delta Lake data architecture.
Here are the 8 pillars on which the logic of the structure design in Delta Lake rests:
These are not just folder names, these are levels of quality and trust in the data. We should not design “tables”, but “data paths”:
Bronze: The data should look exactly as it does in the source (even if the JSON is broken). This is our insurance.
Silver: This is where relational cleanup takes place. Duplicate removal, type casting, Master-Detail checking.
Gold: Here the data is aggregated for a specific report.
Since there are no Foreign Keys, we design the structure so that each transformation step is idempotent. That is, if we run the same JSON load 10 times, the table contents should end up the same as after a single run. We design the MERGE logic as a foundation for stability.
In Oracle, we designed the structure to minimize duplication (normalization). In a Lakehouse, we often design the structure to optimize reads (sometimes allowing denormalization), because disk storage is cheap and Spark cluster runtime is expensive.
A traditional Oracle developer’s conflict: in Oracle I could open a cursor, loop through the records, and validate each one, but in Spark that is an anti-pattern. Looping through records in Spark via .collect() or Python-style cursors kills distributed computing and turns our powerful cluster into one slow computer.
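To make the contrast concrete, here is a minimal sketch of the same validation done the “cursor way” versus the distributed way. The paths and column names are purely illustrative, and the session is assumed to have Delta Lake support configured:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake support is configured

df = spark.read.format("delta").load("s3a://lake/bronze/payments")  # illustrative path

# Anti-pattern: collect() drags every row to the driver and validates it in one Python loop,
# while the rest of the cluster sits idle
# for row in df.collect():
#     validate(row)

# Distributed alternative: express the checks as column expressions,
# so every executor validates its own partition in parallel
validated = (
    df.withColumn("amount_ok", F.col("amount") > 0)
      .withColumn("currency_ok", F.col("currency").isin("UAH", "USD", "EUR"))
      .filter(F.col("amount_ok") & F.col("currency_ok"))
)
validated.write.format("delta").mode("overwrite").save("s3a://lake/silver/payments")
```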
Think of this as the Oracle Redo Log or Archive Log. Every single change is recorded in a JSON-based transaction log (_delta_log). This ensures ACID compliance: if a write fails halfway, the system knows exactly how to roll back, leaving your “Data Lake” clean and consistent.
In a traditional Data Lake, you can accidentally drop a “corrupted” file into a folder and break every downstream report. Delta Lake acts as a gatekeeper. It validates data types during write (Enforcement) but also allows for graceful schema updates (Evolution) when business requirements change.
This is a “killer feature” for any DBA. Since every transaction is logged, you can query the state of your table at any specific point in time or version. Need to see the balance as of last Monday? Just use VERSION AS OF. It’s like having a built-in Flashback Query.
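A minimal sketch of what this looks like in practice, assuming a Delta-enabled SparkSession named `spark` and a table registered in the metastore as `silver_ledger` (both the path and the table name are illustrative):

```python
# Read the table as it was at a specific version (or use "timestampAsOf" for a date)
df_v42 = (
    spark.read.format("delta")
         .option("versionAsOf", 42)
         .load("s3a://lake/silver/ledger")
)

# SQL flavour, closest to the Oracle Flashback Query habit
spark.sql("""
    SELECT account_id, SUM(amount) AS balance
    FROM silver_ledger VERSION AS OF 42
    GROUP BY account_id
""").show()
```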
Delta Lake blurs the line between historical data processing and real-time streams. You can treat a Delta table as both a static source for batch jobs and a continuous sink for streaming data, ensuring that your analytics are always up to date.
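As a hedged sketch of that duality (paths and the checkpoint location are illustrative, and a Delta-enabled `spark` session is assumed), the same table format serves both modes:

```python
# Batch: read the Delta table as a static snapshot
batch_df = spark.read.format("delta").load("s3a://lake/silver/ledger")

# Streaming: treat another Delta table as a continuous source and sink
(
    spark.readStream.format("delta")
         .load("s3a://lake/bronze/payments")
         .writeStream.format("delta")
         .option("checkpointLocation", "s3a://lake/_checkpoints/payments")
         .outputMode("append")
         .start("s3a://lake/silver/payments")
)
```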
For about 10 years I developed reporting for commercial banks for the NBU, starting back on FoxPro 2.0. Then we tried to migrate all of it to Oracle (version 10 at the time). There we ran into problems: materialized views, both aggregated and non-aggregated, worked slower than FoxPro, which pulled all the necessary data onto the user’s machine, sequentially scanned the tables, prepared the aggregates, and wrote them back to network DBF files. In the end, the most reliable option in Oracle was PL/SQL coding with a set of Oracle OLAP tools and query-level parallelism. But that has a problem of its own: the queries become kilometer-long.
Why did FoxPro seem faster than Oracle?
FoxPro (like Clipper/dBase) had a “navigational” access architecture. Data was pulled to the client, and the local processor “chewed” through it sequentially. This worked quickly at small to medium volumes because there was no overhead of transaction management, locking, or the SQL server network protocol.
Oracle is a “heavy” server. When we did Materialized View, Oracle tried to be too smart: checking consistency, updating indexes, monitoring Undo/Redo logs. And kilometer-long PL/SQL queries appeared because we tried to cram complex procedural logic into declarative SQL.
Why is Lakehouse (Fabric/Spark) “FoxPro on steroids”?
The Lakehouse architecture is ideologically closer to FoxPro than to Oracle, but with the power of thousands of servers:
Scans instead of indexes: Like FoxPro, Spark loves to “pour” large data sets through itself (Full Table Scan). But where FoxPro did it in a single thread, Spark does it in parallel.
Separation of compute and storage: In Oracle, disk and processor are “married.” In Lakehouse, the data resides in Delta/Parquet (like our good old .dbf), and the Spark cluster is a temporary “processor” that jumps on these files, processes them, and shuts down.
Logic instead of server magic: In Oracle, we relied on the Optimizer. In Spark, we write the plan ourselves (like in FoxPro): “Read this, filter that, do a Join.” This gives us the same control we had in the 90s, but at petabytes.
How does Spark solve the problem of “kilometer-long queries”?
In Oracle, we wrote one giant SELECT because each intermediate step (a table) was expensive (disk I/O). In Spark, we use DataFrames. This allows us to break a “kilometer-long query” into 10 small, understandable steps. Each step is only a transformation description: thanks to Lazy Evaluation, nothing is executed until the end, and Spark assembles all the steps into a single optimized execution plan.
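Here is a sketch of that style, assuming a Delta-enabled `spark` session; the table paths and columns are made up for illustration:

```python
from pyspark.sql import functions as F

# Step 1: read the cleaned ledger entries
entries = spark.read.format("delta").load("s3a://lake/silver/ledger")

# Step 2: keep only the reporting period
period = entries.filter(F.col("op_date").between("2024-01-01", "2024-01-31"))

# Step 3: join the chart of accounts
accounts = spark.read.format("delta").load("s3a://lake/silver/accounts")
enriched = period.join(accounts, "account_id", "left")

# Step 4: aggregate to balances; only the final write makes Spark assemble
# all the steps above into a single physical plan and execute it
balances = (
    enriched.groupBy("account_id", "currency")
            .agg(F.sum("amount").alias("balance"))
)
balances.write.format("delta").mode("overwrite").save("s3a://lake/gold/balances")
```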
There is the core banking system, which is a “thing in itself” built on Oracle. Alongside it, on another Oracle database, sits an add-on called the DWH or General Ledger. Its function is:
The problem is that simple aggregation (SELECT sum(), count() FROM …) is not enough. The load and calculation process runs under a service user. There are several parallel threads plus parallelism at the query level, a bunch of cursors, and thousands of PL/SQL lines. The trouble is that it eats up all the processors available to the Oracle server: the machine may have more cores, but the database does not see more than a certain number of them. There is also significant disk load, enough to fry breakfast for all the admins on the disk shelf, and network load on top of that, since gigabytes have to be transferred from one database to another.
Now let’s look at it all from another side.
Questions to think about:
If we don’t need such expensive computing power 24/7, and the Oracle database is not performing its main function during these calculations (servicing transactions from concurrent users), then it may make sense to think about making all of this cheaper, namely:
Extraction (Bronze): Oracle simply unloads flat files or something like that — this is a minimal load on the database.
Transformation (Silver): You replace all that PL/SQL code that “eats” processors with your Spark code. It runs in parallel, is faster, and does not interfere with Oracle.
Aggregation & Reporting (Gold): You generate the reporting in the Delta Lake.
Feedback Loop: Only the final aggregates (ready postings for the GL) are written back into Oracle or the DWH, as shown in the sketch below.
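A minimal sketch of that feedback loop, assuming a Delta-enabled `spark` session; the JDBC URL, schema and table names are placeholders:

```python
# Gold layer: the small, final aggregate that goes back to Oracle/DWH
gl_postings = spark.read.format("delta").load("s3a://lake/gold/gl_postings")

(
    gl_postings.write.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")
        .option("dbtable", "GL_STAGE.POSTINGS_IN")
        .option("user", "etl_user")
        .option("password", "*****")
        .option("driver", "oracle.jdbc.OracleDriver")
        .mode("append")
        .save()
)
```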
Of course, it is quite easy to raise a Spark cluster in the cloud: Azure, IBM, AWS, and each of these providers has S3-compatible Object Storage for storing Delta files.
But, since I touched on the banking sector, a cloud from hyperscalers is probably not an option, so let’s consider the option of building “your own cloud” in your own bank data center.
For Spark to work, you need to install the Spark cluster itself somewhere. Since everyone has already heard of Kubernetes and OpenShift, this is the best option for a bank that already uses containerization.
How it works: Spark is launched as a set of pods. When a task comes in, Kubernetes creates a Spark Driver and the required number of Executors. Once the calculations are complete, the resources are released.
Pros: Efficient use of hardware, resource isolation, easy scaling. OpenShift adds enterprise security and a user-friendly interface.
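As a rough sketch of what launching such a job can look like in client mode; the API server URL, image, namespace, and resource sizes are placeholders for your own cluster:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kube-apiserver.bank.local:6443")
    .appName("nightly-gl-recalc")
    .config("spark.kubernetes.container.image", "registry.bank.local/spark:3.5.0")
    .config("spark.kubernetes.namespace", "lakehouse")
    # Kubernetes creates this many executor pods and tears them down when the job ends
    .config("spark.executor.instances", "10")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```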
This is a simpler option if you don’t have Kubernetes. You just allocate a few servers (or VMs), put Java and Spark on them. One server becomes the Master, the others are Workers.
Pros: Minimum “magic”, everything is clear to system administrators.
Cons: Less flexible resource management compared to K8s.
In the cloud it is S3 or ADLS. Locally, you will need analogues:
In order for Spark to work with Delta files locally, we need storage that supports the S3 protocol. Although MinIO is popular, for banking systems it is worth considering alternatives with more permissive licenses:
Ceph Object Gateway: if Enterprise-level scalability is required.
SeaweedFS: if speed of metadata access and ease of maintenance are the priority.
This allows you to build a Data Lakehouse that does not physically leave the bank’s perimeter, but works with the flexibility of cloud solutions.
This is the foundation. Spark doesn’t work with “just folders” on disk as efficiently as it does with object storage.
Why SeaweedFS: It’s incredibly fast for small files (Delta metadata), easier to configure than Ceph, and has a “pure” Apache license.
Role: This is where our Delta files (Bronze, Silver, Gold) reside. Spark accesses it via the S3 protocol.
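A sketch of pointing Spark’s S3A connector at such a local endpoint; the endpoint address, credentials and bucket are placeholders for your own installation:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-local-s3")
    # S3A settings redirected to the on-premise object storage instead of AWS
    .config("spark.hadoop.fs.s3a.endpoint", "http://seaweedfs-s3.bank.local:8333")
    .config("spark.hadoop.fs.s3a.access.key", "lake_user")
    .config("spark.hadoop.fs.s3a.secret.key", "*****")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

df = spark.read.format("delta").load("s3a://bronze/payments")  # illustrative bucket/path
```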
Spark itself is a “brain” without long-term memory. It needs somewhere to store table definitions: where the tables are located and what their structure is.
Role: Performs the Hive Metastore function. When we write spark.table(“silver_ledger”) in the code, Spark goes to Postgres, learns the path to the files in SeaweedFS in a millisecond and starts the calculation.
Why Postgres: It is the most reliable database for storing metadata, which is available in every bank.
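A sketch of a session that keeps its catalog in a Hive Metastore backed by PostgreSQL; the JDBC URL, credentials and warehouse path are placeholders, and the same properties can alternatively live in hive-site.xml:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-metastore")
    .config("spark.sql.warehouse.dir", "s3a://warehouse/")
    # Metastore metadata lives in PostgreSQL
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:postgresql://postgres.bank.local:5432/metastore")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "*****")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables registered in the metastore now resolve to Delta files in SeaweedFS
df = spark.table("silver_ledger")
```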
This is our engine, which we have taken outside of Oracle.
Role: Does the heavy lifting (Joins, Aggregations, Merges).
Elasticity: During the nightly close of the business day, K8s allocates, for example, 500 GB of RAM and 100 cores to the Spark cluster. As soon as the balances are drawn up, the resources are immediately returned to the bank’s other systems.
How it works together (Interaction diagram)
Extraction: The script loads data from Oracle into SeaweedFS (as JSON/CSV).
Trigger: A Spark job is launched in Kubernetes.
Metadata: Spark contacts PostgreSQL to find the table descriptions.
Processing: Spark reads data from SeaweedFS, recalculates balances in memory, and writes the result back to SeaweedFS in Delta format.
Access: Analysts connect via the Spark Thrift Server (which resolves table metadata in PostgreSQL) or read the final Gold result directly from the files for BI.
This is the perfect tech stack for a “Private Lakehouse” inside a bank. It is completely Open Source, does not require cloud licenses, and yet provides Enterprise-level reliability. For those who can’t go to the cloud but want the speed of Spark, we’re putting together a stack of proven components.
It should be noted here that we do not transfer everything every day. Our Notebook knows the ‘High Watermark’ (last processed date) and only requests new records from Oracle. This minimizes network traffic and load on Redo Logs.
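A minimal sketch of such an incremental load, assuming a Delta-enabled `spark` session; the table, column and connection names are illustrative:

```python
from pyspark.sql import functions as F

# 1. Find the last date that already made it into the Bronze layer
bronze = spark.read.format("delta").load("s3a://bronze/gl_entries")
high_watermark = bronze.agg(F.max("op_date")).collect()[0][0] or "1900-01-01"

# 2. Ask Oracle only for the newer rows
incremental_query = f"(SELECT * FROM GL.ENTRIES WHERE OP_DATE > DATE '{high_watermark}') src"
new_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")
    .option("dbtable", incremental_query)
    .option("user", "etl_user")
    .option("password", "*****")
    .load()
)

# 3. Append only the increment to Bronze
new_rows.write.format("delta").mode("append").save("s3a://bronze/gl_entries")
```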
Why is this stack “mature”?
Linear scalability: Need it faster? Just add two more servers to your Kubernetes cluster.
No license fees: You pay for the hardware and support, not the number of cores in your DBMS license.
Open formats: Even if you decide to remove Spark, your data in SeaweedFS will remain in Parquet format, which any modern system (such as Trino or Presto) will read.
This is an architecture that grows with the bank, rather than requiring you to replace the server every time the number of transactions doubles.
Given that calculations are done in Jupyter Notebooks, it is logical to use this tool for ETL (ELT) processes as well. This idea is not new at all; I first encountered it in Microsoft Fabric, but then found many products that use the same approach.
We abandon scattered scripts and closed ETL platforms. Using Jupyter Notebooks (driven by Papermill, Airflow, or Mage.ai) as the engine for data extraction allows:
As options:
Spark can connect to Oracle as a source. You simply create a DataFrame from an Oracle SQL query and store it as Parquet in your repository.
Pros: It’s as simple as possible and doesn’t require additional licenses
There are products that allow you to manage these notebooks (run them sequentially, track errors):
Apache Airflow / Dagster: These are “conductors”. They can run your Notebooks on a schedule.
Papermill: This is an interesting tool that allows you to parameterize and execute Jupyter Notebooks from the command line. You can pass the reporting date to the notebook and it will extract the data for that day (a short sketch follows after this list).
**Why Mage.ai?** This is the next generation of ETL tools. Unlike Airflow, Mage is built around the idea of “Data as a First-class Citizen”. It visualizes the data flow between code blocks, allowing you to see the result of the transformation at each step, which makes debugging banking algorithms many times faster.
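For example, a Papermill run parameterized by reporting date might look like this; the notebook names and the parameter are illustrative:

```python
import papermill as pm

pm.execute_notebook(
    "extract_gl_from_oracle.ipynb",        # template notebook with a "parameters" cell
    "runs/extract_gl_2024-01-31.ipynb",    # executed copy with outputs, kept for audit
    parameters={"reporting_date": "2024-01-31"},
)
```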
At the Extraction stage, we do not burden Oracle with complex procedures. We use a lightweight Notebook that does ‘Select’ by index (e.g. by date) and immediately converts the data into a compressed Parquet format. This is faster than any standard CSV export, because Spark can unload data from Oracle in parallel (via multiple JDBC connections at the same time).
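Here is a sketch of such a parallel extraction, assuming a Delta-enabled `spark` session; the connection details, table and partitioning column are placeholders:

```python
raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")
    .option("dbtable", "GL.ENTRIES")
    .option("user", "etl_user")
    .option("password", "*****")
    # Spark opens numPartitions JDBC connections, each reading its own ID range
    .option("partitionColumn", "ENTRY_ID")
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "8")
    .load()
)

# Land the raw extract as compressed Parquet in the Bronze zone
raw.write.mode("overwrite").parquet("s3a://bronze/gl_entries_raw/2024-01-31/")
```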
Separately, we should mention CDC (Change Data Capture) products. This is when we do not do Select from the database, but read Oracle logs (Redo Logs) to collect changes in real time without any load on the database CPU. For example Debezium.
For IT management, the transition to Spark and Mage.ai is primarily a transition from “chaos of scripts” to “data factory.”

pic-1 Data Pipeline Management
Instead of thousands of PL/SQL lines scattered throughout the database, we get visual control over each step of the transformation. This allows the bank to respond faster to changes in legislation or NBU requirements, without risking “breaking” the main database.
Here is a diagram that reflects the target architecture I have been describing.
graph TD
subgraph "The Sources (Perimeter)"
A[Oracle ERP/Core] -- "JDBC / CDC Incremental" --> B[Notebook ELT]
end
subgraph "Infrastructure Layer (On-Premise Stack)"
B --> C{SeaweedFS / Ceph}
D[(PostgreSQL)] -- "Hive Metastore: Metadata & Schema" --> E[Spark Cluster]
end
subgraph "Medallion Architecture (Data Path)"
C -- "Raw Files" --> F((BRONZE))
F -- "Validation & Normalization" --> G((SILVER))
G -- "Aggregations & Business Logic" --> H((GOLD))
E -.-> F
E -.-> G
E -.-> H
end
subgraph "The Consumers"
H --> I[Power BI / Reporting]
H --> J[National or Central Bank Reports]
H -- "Summary Delta" --> K[Oracle DWH / GL]
end
style A fill:#f96,stroke:#333
style E fill:#69f,stroke:#333,stroke-width:4px
style D fill:#4db,stroke:#333
style H fill:#ffd700,stroke:#333
Explanation of the key architectural nodes:
Oracle ERP/Core (The Source): The source of truth. We use Incremental Load, taking only the changed records per day to minimize the I/O load on the productive database.
Notebook ELT (The Logic): A software node based on Jupyter Notebook. It contains the logic for connecting to Oracle and the initial structuring. This avoids the “black boxes” of proprietary ETL tools.
SeaweedFS/Ceph (The Private Cloud Storage): A local object storage. This is the “bottom” of our lake, where data is stored in the open Delta/Parquet format. This ensures independence from cloud providers and compliance with bank security requirements.
Spark Cluster (The Compute Core): A distributed “engine” based on Kubernetes. It dynamically allocates resources for heavy Join operations and aggregations, offloading Oracle processors.
PostgreSQL (The Brain/Metastore): Stores table schemas and file paths. Spark looks here to understand how to interpret a set of files in SeaweedFS as a relational table.
How are Idempotence and Reliability implemented here?
For Oracle developers accustomed to transactions, these points are critical:
Bronze (Raw Data): Each ELT run creates a new record (Append). If the download fails, we simply delete the corrupted files and run again. This is our “rollback point”.
Silver (The Merge Point): This is where the UPSERT (Update + Insert) logic works. Even if we mistakenly run the same day’s processing twice, the MERGE command in Delta Lake simply updates the existing records with the new values, without creating duplicates (a sketch follows after this list).
Schema Enforcement: Unlike older Data Lakes, Delta Lake will not allow “garbage” to be written to the Silver layer if the data structure does not match the one defined in the Hive Metastore (PostgreSQL).
Time Travel: Since Delta stores transaction logs, we can always look at the balances “as of yesterday” or “before that unfortunate notebook run”, simply by specifying the data version.
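A minimal sketch of the Silver-layer MERGE mentioned above, assuming a Delta-enabled `spark` session; the paths, key and column names are illustrative:

```python
from delta.tables import DeltaTable

# Today's increment from Bronze
updates = spark.read.format("delta").load("s3a://bronze/gl_entries")

silver = DeltaTable.forPath(spark, "s3a://silver/gl_entries")

(
    silver.alias("t")
        .merge(updates.alias("s"), "t.entry_id = s.entry_id")
        .whenMatchedUpdateAll()       # re-running the same day rewrites the same values
        .whenNotMatchedInsertAll()    # genuinely new entries are inserted exactly once
        .execute()
)
```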
To summarize: we get a system where the code is separated from the data. We can completely reinstall the Spark cluster or change the Python version, but our data in SeaweedFS and its descriptions in Postgres remain intact. This is the “architectural maturity” we strive for.
An Oracle developer does not stop being a developer. They simply broaden their horizons with other modern tools, moving the calculations that do not really belong in a database onto tools better suited for them. This, I would say, is an opportunity for a transition that lets you modernize your skills and make them quite universal.
To start learning Spark and everything related to it, you don’t need to deploy this entire infrastructure. One to three containers with Spark and JupyterLab are enough to get started. Moreover, I am quite sure that once developers get used to these tools, they will move much of their routine work into Jupyter Notebook (JupyterLab).
To get started, you can use the portable JupyterLab distribution WinPython. It doesn’t include Spark, but it is a very convenient platform for learning Python coding and Jupyter Notebook.
As an integration tool for some ETL processes, you can use the JupyterLab extension elyra, which is also available as a container image (elyra in Docker).
How to spin it up and configure it is described at the link on GitHub: sh-jpylab-spark. Several examples of using Spark are also included there.