<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Chasing Insights by Corey Satnick]]></title><description><![CDATA[Chasing Insights by Corey Satnick]]></description><link>https://chasinginsights.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 14:36:00 GMT</lastBuildDate><atom:link href="https://chasinginsights.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Optimizing Delta Tables in the Silver Layer]]></title><description><![CDATA[Congrats, you’ve built your first medallion architecture. The good news is you’ve finally gotten through all the business’s lines of questioning, validation, and intense engineering workloads. The issue is now your processes are slowly getting slower and slow...]]></description><link>https://chasinginsights.com/optimizing-delta-tables-in-the-silver-layer</link><guid isPermaLink="true">https://chasinginsights.com/optimizing-delta-tables-in-the-silver-layer</guid><dc:creator><![CDATA[Corey Satnick]]></dc:creator><pubDate>Mon, 29 Dec 2025 16:24:13 GMT</pubDate><content:encoded><![CDATA[<p>Congrats, you’ve built your first medallion architecture. The good news is you’ve finally gotten through all the business’s lines of questioning, validation, and intense engineering workloads. The problem is that your processes are now getting slower and slower. Why? How do you fix that? Why is your capacity spiking?</p>
<p>In this blog post I’ll talk about how to optimize your Silver layer. And no, I’m not talking about the easy stuff like only bringing in the columns you need or using proper data types. We’re going to dive deep into Spark properties and setting up your environment.</p>
<p>To keep performance fast as data grows and changes, Fabric relies on Delta Lake optimizations—not Parquet optimizations. Because Delta Tables support updates, deletes, merges, and streaming workloads, the transaction log can grow, files can fragment, and query performance can degrade over time. The good news: Fabric includes several intelligent features that mitigate this. Below is an overview of the major Delta optimization mechanisms available in Fabric today, why they matter, and when to use them.</p>
<p>But first, let’s break down what Parquet files really are. If you already know this (or simply don’t care) you can skip this part. But I know some people like to know how the sausage gets made.</p>
<h2 id="heading-parquet-what-is-it-and-why-you-should-care">Parquet: What Is It and Why Should You Care?</h2>
<p>Parquet gets talked about like it’s some magical file format that automatically makes your analytics fast just by existing. And to be fair… it kind of is. But only if you treat it right.</p>
<p>Parquet is an open-source columnar storage format designed specifically for analytics.</p>
<p>At a high level, Parquet stores data by column instead of by row, which is what unlocks most of its benefits:</p>
<ul>
<li><p>Better compression (similar values compress really well together)</p>
</li>
<li><p>Faster queries because engines only read the columns they need</p>
</li>
<li><p>Predicate pushdown meaning filters get applied before all the data is read</p>
</li>
<li><p>Schema evolution so your data doesn’t explode the first time someone adds a column</p>
</li>
</ul>
<p>That’s why Parquet is the default for basically every modern analytics engine. It’s efficient, flexible, and plays nicely with distributed systems.</p>
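<p>To make “read only the columns you need” concrete, here’s a toy pure-Python sketch (not a real Parquet reader) of the two layouts and what a single-column question has to touch in each:</p>

```python
# Toy illustration of row vs columnar layout for the same 4-row table,
# answering "how many orders were delayed?" in each.

# Row storage: every field of every row travels together, so even a
# single-column question drags order_id and amount through the scan.
rows = [
    (1, "shipped", 40.0),
    (2, "delayed", 15.5),
    (3, "shipped", 99.9),
    (4, "delayed", 12.0),
]

# Columnar storage: each column is its own contiguous list.
columns = {
    "order_id": [1, 2, 3, 4],
    "status":   ["shipped", "delayed", "shipped", "delayed"],
    "amount":   [40.0, 15.5, 99.9, 12.0],
}

# Same answer either way, but the columnar scan reads one list of strings
# instead of every value in the table.
row_scan = sum(1 for r in rows if r[1] == "delayed")            # touches 12 values
col_scan = sum(1 for s in columns["status"] if s == "delayed")  # touches 4 values

print(row_scan, col_scan)  # 2 2
```

Real Parquet adds compression and statistics per column chunk on top of this, but the pruning idea is the same.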
<p><code>Quick note on faster queries: I recently had a customer try to convince me that they needed a table with 100 columns in it for analytics. It doesn’t matter what file format you use; a SELECT * against that many columns will be slow. No amount of optimization can save poor data modeling. Okay, first rant over.</code></p>
<p><strong>Okay, But What Does “Columnar” Actually Mean?</strong></p>
<p>Well, to oversimplify: columnar means all of your data is stored column by column, not row by row. Look at the visual below.</p>
<p><img src="https://res.cloudinary.com/dgzgmhbah/image/upload/v1607008073/Hittly/BigQuery_Columnar_Storage_ybqm5s.png" alt="BigQuery Columnar Storage" /></p>
<p>When you run a report, you’re usually asking questions like:</p>
<ul>
<li><p>“What was total revenue last month?”</p>
</li>
<li><p>“How many orders were flagged as delayed?”</p>
</li>
<li><p>“Show me counts by status”</p>
</li>
</ul>
<p>You don’t need every column. You need two or three per query, which means less data scanned, resulting in:</p>
<ul>
<li><p>Faster queries</p>
</li>
<li><p>Lower compute usage</p>
</li>
<li><p>Happier capacity admins and business owners who get to spend less money.</p>
</li>
</ul>
<h3 id="heading-what-about-the-negatives-of-parquet">What About the Negatives of Parquet?</h3>
<p>Up to this point, Parquet sounds perfect. And honestly, for analytics reads, it mostly is.<br />But here’s the part that usually gets glossed over:</p>
<p><strong>Parquet is immutable.</strong></p>
<p>Once a Parquet file is written, it can’t be updated in place. There’s no “go change row 37” operation. That design choice is what makes it fast for reads—but it also introduces some very real tradeoffs once you start changing data.</p>
<p>And in the Silver layer, you change data a lot.</p>
<p>So when you “update” or “delete” data, what’s really happening is:</p>
<ul>
<li><p>New Parquet files get written</p>
</li>
<li><p>Old data gets logically invalidated</p>
</li>
<li><p>The old files still exist, but Fabric is smart enough to point to the proper file(s).</p>
</li>
</ul>
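<p>The steps above can be sketched in a few lines of Python (file names and the log shape are made up for illustration; real Delta tracks this in its <code>_delta_log</code> transaction log):</p>

```python
# Toy sketch of copy-on-write over immutable files.
files = {"part-000.parquet": [("A", 1), ("B", 2)]}  # contents never change
log = {"live": {"part-000.parquet"}}                 # which files are current

def update_row(key, new_value):
    # An "update" copies every still-valid row plus the change into a NEW file...
    old, new = "part-000.parquet", "part-001.parquet"
    files[new] = [(k, new_value if k == key else v) for k, v in files[old]]
    # ...then swaps the pointer. The old file is logically invalid but still on disk.
    log["live"] = {new}

update_row("B", 99)
print(sorted(files))  # ['part-000.parquet', 'part-001.parquet'] - both exist
print(log["live"])    # {'part-001.parquet'} - only the rewrite is "live"
```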
<p>Over time, this creates:</p>
<ul>
<li><p>Lots of small files</p>
</li>
<li><p>Files full of data you no longer care about</p>
</li>
<li><p>Extra work for the engine every time you query</p>
</li>
</ul>
<p>Don’t believe me? Open a Spark notebook and run a <code>DESCRIBE DETAIL</code> on a table that has a lot of transactions on it.</p>
<p><code>%sql DESCRIBE DETAIL [table]</code></p>
<p>Let me know what the stats show on one of your non optimized tables.</p>
<p>I bet you’ll see a bunch of files, and a lot of them aren’t needed. If you’re from the Stone Age, you’re probably thinking to yourself, “That sounds like dead tuples,” and you’d be right. It’s data that is technically gone, but still hanging around until someone cleans it up.</p>
<p>Now that you understand Parquet, it’s important to recognize that Delta Tables are a transactional layer built on top of Parquet. Fabric stores Delta Tables as Parquet files plus a transaction log. Because of this extra metadata and transactional behavior, optimizing Delta Tables requires different techniques than simply optimizing Parquet.</p>
<h2 id="heading-delta-features-that-actually-matter-in-the-silver-layer"><strong>Delta Features That Actually Matter in the Silver Layer</strong></h2>
<h3 id="heading-deletion-vectors">Deletion Vectors</h3>
<p>Deletion vectors are a Delta Lake optimization that works around Parquet’s immutability, avoiding full file rewrites when only a handful of rows change.</p>
<p>By default, if a single row in a Parquet file needs to be deleted or updated, Delta Lake performs a copy-on-write, rewriting the whole file. For large Parquet files, that’s expensive, especially in the Silver layer where you’re continuously applying row-level CDC, merges, or backfills.</p>
<p>With deletion vectors enabled, Delta takes a smarter approach. Instead of rewriting the file immediately, it records which rows are no longer valid and stores that information separately. This is often referred to as Merge‑on‑Read. When the table is queried, Delta applies those deletion markers to exclude invalid rows at read time.</p>
<p>The result:</p>
<ul>
<li><p>Far fewer file rewrites</p>
</li>
<li><p>Less write amplification</p>
</li>
<li><p>Faster DELETE and MERGE operations</p>
</li>
</ul>
<p>In other words: you can invalidate one row out of 100,000 without paying the cost of rewriting the entire file.</p>
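<p>A minimal merge-on-read sketch in plain Python (the real deletion vector is a compressed bitmap stored alongside the file, not a Python set, but the mechanics are the same):</p>

```python
# Merge-on-read sketch: instead of rewriting the file to delete a row,
# record its position in a deletion vector and filter at read time.
data_file = ["alice", "bob", "carol", "dave"]  # immutable Parquet stand-in
deletion_vector = set()

def delete(position):
    deletion_vector.add(position)  # cheap marker, no file rewrite

def read():
    # Readers apply the vector on the fly, skipping invalidated positions.
    return [row for i, row in enumerate(data_file) if i not in deletion_vector]

delete(1)
print(read())  # ['alice', 'carol', 'dave'] - 'bob' is gone, file untouched
```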
<h3 id="heading-auto-compaction"><strong>Auto Compaction</strong></h3>
<p>Auto compaction combines small Parquet files within a Delta table’s partitions to reduce the classic small file problem. It’s triggered after a successful write and runs synchronously on the cluster that performed the write.</p>
<p>Rather than rewriting the entire table, auto compaction groups small files together and writes fewer, larger files. Files may be compacted multiple times as new data arrives, and only stop being considered once they reach the effective size threshold (for example, at least half of the target file size). This keeps ingest and merge pipelines fast while gradually improving file layout over time.</p>
<p><strong>Why it matters:</strong></p>
<ul>
<li><p>Reduce file sprawl</p>
</li>
<li><p>Improve read performance (fewer files = fewer metadata lookups)</p>
</li>
<li><p>Keep your table layout healthy without manual intervention. Auto compaction quietly maintains good file hygiene, so engineers don’t have to routinely schedule OPTIMIZE jobs or manage file layout by hand.</p>
</li>
</ul>
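<p>The binning logic can be sketched like this (the 128 MB target and 50% threshold are illustrative numbers, not Fabric’s exact tuning):</p>

```python
# Compaction sketch: greedily bin small files and rewrite each bin as one
# larger file. Files at or above half the target are left alone.
TARGET_MB = 128
sizes = [8, 12, 5, 70, 90, 3, 20]  # current file sizes in MB

small = [s for s in sizes if s < TARGET_MB / 2]   # candidates for compaction
keep = [s for s in sizes if s >= TARGET_MB / 2]   # already "good enough"

bins, current = [], []
for s in small:
    current.append(s)
    if sum(current) >= TARGET_MB / 2:  # bin is big enough to rewrite
        bins.append(sum(current))
        current = []
if current:
    # Leftover bin: still fewer files than before; future writes can
    # join it in a later compaction pass.
    bins.append(sum(current))

print(keep + bins)  # [70, 90, 48] - 7 files became 3
```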
<h3 id="heading-adaptive-target-file-size"><strong>Adaptive Target File Size</strong></h3>
<p>Hardcoding Parquet file sizes is a guessing game. At some point, everyone hears a rule like “128 MB or 256 MB files are ideal” and locks it into a Spark config. It works - until your data, usage patterns, or table size change.</p>
<p>Silver workloads are not static. Ingest volumes grow, merges become more frequent, backfills happen, and CDC patterns evolve. A file size that was perfect at 5 million rows quietly becomes a bottleneck at 500 million.</p>
<p>Adaptive target file sizing removes that guesswork. Instead of enforcing a fixed size forever, Fabric dynamically adjusts file sizes based on table-level metrics (such as overall table size) at write time. The engine determines an appropriate target size per write, rather than relying on a single hardcoded value in the table definition.</p>
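<p>Conceptually, it behaves like a lookup from table size to target file size (these thresholds are invented for illustration; they are not Fabric’s actual values):</p>

```python
# Illustrative only: pick a per-write target file size from overall table
# size instead of one hardcoded value. Small tables get small files
# (fast point reads); huge tables get big files (fewer metadata entries).
def target_file_size_mb(table_size_gb: float) -> int:
    if table_size_gb < 10:
        return 64
    if table_size_gb < 100:
        return 128
    if table_size_gb < 1000:
        return 256
    return 512

print(target_file_size_mb(5))    # 64  - the 5-million-row table
print(target_file_size_mb(500))  # 256 - the 500-million-row table
```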
<h3 id="heading-fast-optimize">Fast OPTIMIZE</h3>
<p>Traditional <code>OPTIMIZE</code> rewrites every bin of small files it finds - even ones that are already “good enough.” That wastes compute, drags out jobs, and doesn’t always move the performance needle. Fast OPTIMIZE thinks before it rewrites. It scans each bin and only compacts it when the result is expected to meet the minimum file size threshold or when there are enough small files to justify compaction.</p>
<p>The outcome is a tighter file layout, fewer tiny files, and more efficient reads. Query planning becomes cheaper, scans become faster, and metadata pressure drops. Capacity usage becomes far more predictable, because the work happens through Fabric’s native engine instead of large Spark shuffles.</p>
<p><strong>What Fast OPTIMIZE Actually Does</strong></p>
<ul>
<li><p>Looks at each group of files (bins) before rewriting anything.</p>
</li>
<li><p>Skips bins where compaction won’t improve things.</p>
</li>
<li><p>Focuses compaction work only where it materially improves the resulting file layout.</p>
</li>
</ul>
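<p>The per-bin decision can be sketched as a simple predicate (thresholds made up for illustration):</p>

```python
# Fast OPTIMIZE decision sketch: rewrite a bin only if the resulting file
# would meet the minimum size, or there are enough tiny files to justify it.
MIN_FILE_MB = 64
MIN_SMALL_FILES = 4

def should_compact(bin_sizes_mb):
    return (sum(bin_sizes_mb) >= MIN_FILE_MB
            or len(bin_sizes_mb) >= MIN_SMALL_FILES)

print(should_compact([40, 30]))      # True: result is a healthy 70 MB file
print(should_compact([2, 3]))        # False: 5 MB result isn't worth a rewrite
print(should_compact([1, 1, 1, 1]))  # True: enough tiny files to clean up
```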
<h3 id="heading-file-level-compaction-targets">File Level Compaction Targets</h3>
<p>File-level compaction targets define the file size thresholds used to guide compaction decisions. They determine when combining small files is worthwhile and when a file is considered large enough to stop being compacted. By providing clear size boundaries, they prevent unnecessary rewrites and keep compaction work focused on changes that meaningfully improve file layout.</p>
<h3 id="heading-optimize-write-and-v-order">Optimize Write and V-Order</h3>
<p>Optimize Write acts as pre-write compaction. It introduces additional shuffle and coordination during writes to avoid producing small files in the first place. This can be extremely valuable for trickle or micro-batch workloads (for example, structured streaming jobs or frequent small inserts) where each write would otherwise generate undersized files. In those cases, Optimize Write can reduce downstream compaction work and improve overall efficiency.</p>
<p>However, for many workloads Optimize Write is often a poor trade. Paying extra write-time cost to produce “perfect” files is inefficient when those files are likely to be invalidated, merged, or compacted again shortly after. In these scenarios, pre-write compaction simply shifts work earlier without reducing total work.</p>
<p>V-Order has a similar trade-off. It optimizes file layout for analytical read patterns, which is highly effective when data is relatively stable. Optimizing read layout in a layer where data is constantly changing often results in wasted write effort with little lasting benefit.</p>
<h2 id="heading-closing-thoughts">Closing Thoughts</h2>
<p>The Silver mindset should be cheap, resilient writes first - smart cleanup later. Enable Optimize Write only when the write pattern would otherwise generate excessive small files. Let deletion vectors absorb row-level churn, rely on auto compaction and adaptive target file sizing to manage file sprawl over time, and use Fast OPTIMIZE to focus heavy rewrite work where it actually matters. Save aggressive read-path optimizations like V-Order for Gold when using Direct Lake.</p>
<p>Additional Sources:</p>
<ul>
<li><p><a target="_blank" href="https://docs.delta.io/delta-deletion-vectors/">What are deletion vectors? | Delta Lake</a></p>
</li>
<li><p><a target="_blank" href="https://milescole.dev/data-engineering/2024/11/04/Deletion-Vectors.html">Unlock Faster Writes in Delta Lake with Deletion Vectors | Miles Cole</a></p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/databricks/delta/deletion-vectors">What are deletion vectors? - Azure Databricks | Microsoft Learn</a></p>
</li>
<li><p><a target="_blank" href="https://milescole.dev/data-engineering/2025/02/26/The-Art-and-Science-of-Table-Compaction.html">Mastering Spark: The Art and Science of Table Compaction | Miles Cole</a></p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size">Configure Delta Lake to control data file size - Azure Databricks | Microsoft Learn</a></p>
</li>
</ul>
<p>Additional Notes:</p>
<p>With the new release of Runtime 2.0 there are some exciting things to look for (Spark 4.0 and Delta Lake 4.0 to name a few). There are some limitations though since this is experimental preview. One of the most notable to me is:</p>
<ul>
<li>You can read and write to the Lakehouse with Delta Lake 4.0, but some advanced features like V-order, native Parquet writing, autocompaction, optimize write, low-shuffle merge, merge, schema evolution, and time travel aren't included in this early release.</li>
</ul>
<p>For more information check out: <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-engineering/runtime-2-0">Runtime 2.0 in Fabric - Microsoft Fabric | Microsoft Learn</a></p>
]]></content:encoded></item><item><title><![CDATA[Dimensional Modeling 101]]></title><description><![CDATA[In my last post, we talked about the medallion architecture - how to layer your data into Bronze, Silver, and Gold so things are cleaner, faster, and easier to manage.
But here’s the catch: you can build the optimal pipelines, a performant Lakehouse,...]]></description><link>https://chasinginsights.com/dimensional-modeling-101</link><guid isPermaLink="true">https://chasinginsights.com/dimensional-modeling-101</guid><category><![CDATA[data-modeling]]></category><category><![CDATA[dimensional data modeling]]></category><dc:creator><![CDATA[Corey Satnick]]></dc:creator><pubDate>Sun, 07 Sep 2025 16:16:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757262938230/bc442f87-55cc-499e-9a89-86bdc05fed43.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my last post, we talked about the medallion architecture - how to layer your data into Bronze, Silver, and Gold so things are cleaner, faster, and easier to manage.</p>
<p>But here’s the catch: you can build the optimal pipelines, a performant Lakehouse, and governance rules tighter than Fort Knox… and your reports can still suck if the data model is wrong.</p>
<p>Slow reports. High capacity usage. Dashboards timing out. Most people point fingers at Power BI - but nine times out of ten, the real culprit is a poorly designed data model.</p>
<p>So how do we fix it? How do we build something that performs, scales, and doesn’t drive users back to their old friend Excel?</p>
<h2 id="heading-start-with-kimball"><strong>Start with Kimball</strong></h2>
<p><strong>Ralph Kimball</strong> is widely regarded as the <strong>godfather of modern data modeling for analytics</strong> - not because he invented databases, but because he understood something many people still overlook:</p>
<p><em>A good data model isn’t just technically correct. It has to make sense to the people using it.</em></p>
<p>Before Kimball, most models followed the Inmon/Third Normal Form approach. Great for transactional systems (where you care about writes), not so great for analytics (where you care about reads). Kimball flipped the script and said: keep it simple, keep it intuitive, and the whole system runs better.</p>
<h2 id="heading-the-heart-of-the-dimensional-model-the-star-schema">The Heart of the Dimensional Model: The Star Schema</h2>
<h3 id="heading-fact-tables-the-core-of-measurement">Fact Tables: The Core of Measurement</h3>
<p>Fact Tables are where your aggregations live. <strong>Sales, revenue, units sold, discounts, returns</strong> - if your business measures it, it belongs here.</p>
<p>Each row = a real-world event at its lowest grain. That could be a single transaction at the register, a return at the service desk, or an online order being shipped.</p>
<p>A fact table always includes foreign keys to the dimensions around it - <strong>customer, product, date, store, promotion</strong> - so you can answer questions like:</p>
<ul>
<li><p>Which products sell the most by store?</p>
</li>
<li><p>Which customers are most profitable?</p>
</li>
<li><p>Which promotions actually drove traffic instead of just margin loss?</p>
</li>
</ul>
<p>And because fact tables are massive, they must be centralized. If every region or department builds their own version of “sales,” you’ll end up with different numbers for the same metric - and good luck explaining that to the CFO when the dashboards don’t match.</p>
<p>Not all facts are created equal, though. Here’s the quick rundown:</p>
<ul>
<li><p><strong>Additive</strong>: Can be summed across any dimension (sales amount, quantity sold).</p>
</li>
<li><p><strong>Semi-additive</strong>: Can be summed across <em>some</em> dimensions but not all (inventory balances add up across products, not across time).</p>
</li>
<li><p><strong>Non-additive</strong>: Ratios and percentages (gross margin %, conversion rate) that need to be calculated, often in the BI layer, from their additive building blocks.</p>
</li>
</ul>
<p>Keeping these distinctions straight will save you from half the reporting headaches that developers run into when the “same metric” looks different in finance vs. operations.</p>
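<p>The non-additive case is where people get burned most often. Here’s a quick sketch (with made-up numbers) of why you recompute ratios from their additive parts instead of averaging them:</p>

```python
# Non-additive pitfall: you can't average margin %; recompute the ratio
# from its additive building blocks (revenue and cost) at the reporting grain.
facts = [
    {"revenue": 100.0, "cost": 50.0},  # 50% margin
    {"revenue": 10.0,  "cost": 9.0},   # 10% margin
]

# Wrong: averaging the row-level percentages ignores the rows' sizes.
wrong = sum((f["revenue"] - f["cost"]) / f["revenue"] for f in facts) / len(facts)

# Right: sum the additive parts first, then take the ratio once.
total_rev = sum(f["revenue"] for f in facts)
total_cost = sum(f["cost"] for f in facts)
right = (total_rev - total_cost) / total_rev

print(round(wrong, 3))  # 0.3   - average of 50% and 10%
print(round(right, 3))  # 0.464 - the actual margin: 51 / 110
```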
<h2 id="heading-the-three-flavors-of-fact-tables">The Three Flavors of Fact Tables</h2>
<h3 id="heading-1-transaction-fact-tables">1. Transaction Fact Tables</h3>
<ul>
<li><p>Grain = the individual sale or return.</p>
</li>
<li><p>One row = one receipt line.</p>
</li>
<li><p>Perfect for <strong>basket analysis, customer journeys, or SKU-level profitability.</strong></p>
</li>
</ul>
<h3 id="heading-2-periodic-snapshot-fact-tables">2. Periodic Snapshot Fact Tables</h3>
<ul>
<li><p>Grain = time period (day, week, month).</p>
</li>
<li><p>One row = “what did sales or inventory look like on this day?”</p>
</li>
<li><p>Perfect for <strong>trending KPIs, tracking store comps, or monitoring daily inventory.</strong></p>
</li>
<li><p>Even if no sales happen in a store, you still log a row - otherwise your dashboards show gaps.</p>
</li>
</ul>
<h3 id="heading-3-accumulating-snapshot-fact-tables">3. Accumulating Snapshot Fact Tables</h3>
<ul>
<li><p>Grain = process with a defined start and finish (e.g., order fulfillment, supply chain, loyalty enrollment).</p>
</li>
<li><p>One row, updated as milestones are hit - order placed, shipped, delivered, returned.</p>
</li>
<li><p>Unique because it’s <em>updated</em> instead of just appended.</p>
</li>
<li><p>Perfect for <strong>tracking online orders, click-to-delivery times, or end-to-end supply chain visibility.</strong></p>
</li>
</ul>
<p>Together, these three types cover almost every analytic need - from <strong>SKU-level margin analysis</strong> to <strong>enterprise-wide sales trends</strong>.</p>
<h3 id="heading-quick-note-on-nulls-and-a-personal-pet-peeve">Quick Note on Nulls (and a personal pet peeve)</h3>
<p>Fact table measures can handle nulls just fine - SUM, COUNT, AVG all behave. But foreign keys? That’s a hard no. Nulls there break referential integrity.</p>
<p>The fix: create an “Unknown” row in your dimension with a surrogate key (often <code>-1</code>). Use <code>COALESCE(key, -1)</code> on load. That way, if something breaks upstream, your report shows “Unknown” instead of silently dropping rows. Plus, it’s a giant neon sign to the data engineer: something went sideways.</p>
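<p>A minimal sketch of the pattern (dimension contents are hypothetical):</p>

```python
# "Unknown member" pattern: missing foreign keys map to surrogate key -1,
# so broken upstream rows show up as "Unknown" instead of vanishing from joins.
UNKNOWN_KEY = -1
dim_customer = {UNKNOWN_KEY: "Unknown", 101: "Alice", 102: "Bob"}

def load_fact_row(customer_key, amount):
    # COALESCE-style fallback: any key the dimension doesn't know becomes -1.
    key = customer_key if customer_key in dim_customer else UNKNOWN_KEY
    return {"customer_key": key, "amount": amount}

rows = [load_fact_row(101, 50.0), load_fact_row(None, 25.0)]
print([dim_customer[r["customer_key"]] for r in rows])  # ['Alice', 'Unknown']
```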
<h2 id="heading-dimensions-giving-numbers-their-meaning">Dimensions: Giving Numbers Their Meaning</h2>
<p>If fact tables are where the aggregations live, <strong>dimension tables are what make those numbers make sense.</strong> They turn “$10,392.57” into “Total Net Sales on Sept 1st, 2025 was $10,392.57.”</p>
<p>Every dimension has a single <strong>primary key</strong>, which shows up as a foreign key in your fact table. That’s how you join a row of sales data to its <strong>customer, product, store, date, or marketing campaign</strong>.</p>
<p>Unlike fact tables (which are tall and skinny), dimension tables are usually <strong>wide and flat</strong> - packed with descriptive attributes that people actually filter and group by. Things like:</p>
<ul>
<li><p>Product name, brand, category</p>
</li>
<li><p>Store region, format (mall kiosk vs. superstore)</p>
</li>
<li><p>Promotion type (BOGO, clearance, loyalty points, the Labor Day sale that somehow keeps getting extended)</p>
</li>
<li><p>Customer demographics or segments</p>
</li>
</ul>
<h3 id="heading-dealing-with-codes-and-flags">Dealing with Codes and Flags</h3>
<p>Notice what’s not helpful? <strong>Cryptic codes and flags.</strong> A “PromoType = C3” column won’t mean much to your business users. Instead, dimensions should spell it out: “PromoType = Clearance.” If your source system insists on giving you codes, expand them into human-friendly descriptions in your dimension table. Keep just the keys in your fact table to keep it nice and tight.</p>
<h3 id="heading-drilling-down-and-hierarchies">Drilling Down and Hierarchies</h3>
<p>One of the main reasons dimensions exist is to make analysis intuitive. <a target="_blank" href="https://www.sqlbi.com/articles/introducing-the-3-30-300-rule-for-better-reports/">Kurt Buhler wrote a fantastic blog about the 3, 30, 300 rule I highly recommend reading.</a> Business users should be able to drill down to their desired dataset in under 30 seconds, and get to their detailed information in under 300 seconds.</p>
<ul>
<li><p>Sales by month → week → day</p>
</li>
<li><p>Revenue by region → store → aisle</p>
</li>
<li><p>Inventory by category → brand → SKU</p>
</li>
</ul>
<p>Good dimension design makes this seamless. You don’t need to hardcode every path; as long as the attributes are there, users can explore naturally.</p>
<p>Most dimensions also support <strong>more than one hierarchy</strong>. Examples:</p>
<ul>
<li><p><strong>Date</strong>: Day → Week → Fiscal Period, or Day → Month → Year</p>
</li>
<li><p><strong>Product</strong>: SKU → Category → Department, or SKU → Brand → Line → Group</p>
</li>
</ul>
<p>The takeaway? <strong>Don’t over-engineer the drill path.</strong> Put the attributes in the dimension table, and let users choose the path that matches their business question.</p>
<h3 id="heading-the-calendar-date-dimension">The Calendar Date Dimension</h3>
<p>Almost every fact table connects to a <strong>date dimension</strong> - it’s how you move through time in your analysis. Without it, you’re stuck writing messy SQL to figure out things like fiscal periods or holidays (and trust me, you don’t want to compute Easter yourself).</p>
<p>A solid date dimension comes preloaded with all the attributes people care about:</p>
<ul>
<li><p>Day, week, month, quarter, year</p>
</li>
<li><p>Fiscal periods (because finance always has its own calendar)</p>
</li>
<li><p>Holidays and special events (Black Friday, Cyber Monday, Easter, etc.)</p>
</li>
<li><p>Week numbers and month names for easy grouping</p>
</li>
</ul>
<p>To make partitioning easier, the primary key is often a <strong>smart integer</strong> like <code>20250905</code> (YYYYMMDD). But here’s the important part: business users shouldn’t rely on that key. They should slice and filter using the attributes - month name, fiscal week, holiday flag - because that’s what actually makes sense to them.</p>
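<p>Here’s a minimal sketch of a date-dimension row built around the YYYYMMDD smart key, trimmed to a handful of attributes (a real one also carries fiscal periods, holiday flags, and more):</p>

```python
from datetime import date

def date_dim_row(d: date) -> dict:
    # Smart integer key: 2025-09-05 -> 20250905. Great for partitioning;
    # users should still filter on the descriptive attributes, not the key.
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,
        "day_name": d.strftime("%A"),
        "month_name": d.strftime("%B"),
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
    }

row = date_dim_row(date(2025, 9, 5))
print(row["date_key"], row["day_name"], row["quarter"])  # 20250905 Friday 3
```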
<p>And don’t forget the edge cases:</p>
<ul>
<li><p>You’ll need a special row for <strong>“Unknown”</strong> or <strong>“TBD”</strong> dates.</p>
</li>
<li><p>If you need more precision, like <em>time of day</em>, you can add a separate <strong>time-of-day dimension</strong> (shifts, day parts, hours).</p>
</li>
<li><p>For detailed timestamps (like order created at 10:42:17 AM), just keep the raw datetime column in the fact table - no need to overcomplicate it.</p>
</li>
</ul>
<hr />
<p>In retail, one of the most common calendar setups is the <strong>4-5-4 model</strong>. Instead of neat calendar months, the year is broken into quarters where the first month has 4 weeks, the second has 5 weeks, and the third has 4 weeks. This design keeps weeks aligned to the same day of the week year over year—so a Saturday in week 32 this year is still a Saturday in week 32 next year—making it much easier to compare sales, traffic, and promotions across periods. It also ensures holidays and seasonal events line up consistently, which is critical for planning, reporting, and year-over-year comp analysis in retail. Here is an example below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757095286222/d5ecc478-42a2-47ef-8bb5-73c2f854cbde.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://nrf.com/resources/4-5-4-calendar">4-5-4 Calendar | NRF</a></p>
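<p>The arithmetic behind the pattern is simple enough to sketch (the fiscal-month numbering here is illustrative; real 4-5-4 calendars also handle the occasional 53-week year):</p>

```python
# 4-5-4 sketch: 4 + 5 + 4 = 13 weeks per quarter, 52 weeks per year,
# which is why week 32 lands on the same weekday every year.
PATTERN = [4, 5, 4]
weeks_per_quarter = sum(PATTERN)        # 13
weeks_per_year = weeks_per_quarter * 4  # 52

def fiscal_month(week):
    """Map a fiscal week number (1-52) to its fiscal month (1-12)."""
    q, w = divmod(week - 1, weeks_per_quarter)
    for m, size in enumerate(PATTERN):
        if w < size:
            return q * 3 + m + 1
        w -= size

print(weeks_per_year)    # 52
print(fiscal_month(32))  # 8 - week 32 falls in the 5-week month of Q3
```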
<hr />
<h2 id="heading-role-playing-dimensions-same-table-different-hats">Role-Playing Dimensions: Same Table, Different Hats</h2>
<p>Sometimes the same dimension needs to show up more than once in a fact table - just playing a different role. That’s where <strong>role-playing dimensions</strong> come in.</p>
<p>The best example is the <strong>date dimension</strong>. A single transaction might reference multiple important dates:</p>
<ul>
<li><p><strong>Order Date</strong> → when the customer placed it</p>
</li>
<li><p><strong>Ship Date</strong> → when it left the warehouse</p>
</li>
<li><p><strong>Delivery Date</strong> → when it arrived at their doorstep</p>
</li>
<li><p><strong>Return Date</strong> → when it came back to the store</p>
</li>
</ul>
<p>All of those link back to the <em>same</em> date dimension table - but each plays a different role. Instead of building four separate date tables, you just reuse the one dimension and give each instance an alias (OrderDateKey, ShipDateKey, etc.).</p>
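<p>A tiny sketch of the idea (keys and dates made up): one dimension table, several foreign keys, each resolved under a different role:</p>

```python
# Role-playing sketch: one date dimension, referenced three times by one fact row.
date_dim = {20250901: "Mon Sep 1", 20250903: "Wed Sep 3", 20250905: "Fri Sep 5"}

fact = {
    "order_date_key": 20250901,     # when the customer placed it
    "ship_date_key": 20250903,      # when it left the warehouse
    "delivery_date_key": 20250905,  # when it arrived
}

# Every key resolves against the SAME table, just playing a different role.
resolved = {role: date_dim[key] for role, key in fact.items()}
print(resolved["ship_date_key"])  # Wed Sep 3
```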
<p>Role-playing isn’t limited to dates, either. You might use the same <strong>employee dimension</strong> to represent both the cashier who rang up a sale and the manager who approved a discount. Or the same <strong>store dimension</strong> to represent both the selling location and the return location.</p>
<h2 id="heading-wrapping-it-up">Wrapping It Up</h2>
<p>Dimensional modeling is more than just a way to organize tables - it’s the <strong>foundation of an enterprise data model</strong>. Get this part right, and everything else - pipelines, governance, dashboards, even AI - sits on top of a structure that’s consistent, scalable, and built for the business.</p>
<p>A strong dimensional model doesn’t just deliver faster queries and cleaner reports. It also sets you up for <strong>optimized AI insights</strong> - because models trained on clean, business-ready data actually produce results you can trust.</p>
<p>And most importantly, it keeps your end users where they belong: <strong>in your BI environment, not back in Excel</strong>. When reports run fast, metrics stay consistent, and the data model “just makes sense,” people stop exporting and start exploring. That’s when your data platform shifts from being a cost center to becoming a competitive advantage.</p>
]]></content:encoded></item><item><title><![CDATA[The Medallion Architecture (Batch)]]></title><description><![CDATA[Let’s be honest— “medallion architecture” sounds like something cooked up to make a slide deck sound cooler than it actually is. But I promise you this one’s actually worth taking the time to learn and implement.
At its core the medallion architectur...]]></description><link>https://chasinginsights.com/the-medallion-architecture-batch</link><guid isPermaLink="true">https://chasinginsights.com/the-medallion-architecture-batch</guid><category><![CDATA[medallion architecture]]></category><category><![CDATA[microsoft fabric]]></category><dc:creator><![CDATA[Corey Satnick]]></dc:creator><pubDate>Tue, 05 Aug 2025 22:06:32 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754928241412/1aaf6059-3e90-4652-8b32-68a492e3c32d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let’s be honest— “medallion architecture” sounds like something cooked up to make a slide deck sound cooler than it actually is. But I promise you this one’s actually worth taking the time to learn and implement.</p>
<p>At its core, the medallion architecture is a simple, scalable way to organize your data into three layers: <strong>Bronze, Silver, and Gold</strong>. Yes, like Olympic medals. And no, you don’t get a podium, but you will get recognition if you implement it correctly. Here’s why.</p>
<p>It’s not just about being neat. This structure helps you:</p>
<ul>
<li><p>Cut down on redundant processing</p>
</li>
<li><p>Improve data quality and traceability</p>
</li>
<li><p>Apply smarter governance and access controls</p>
</li>
<li><p>Use your capacity more efficiently (which saves money and will get you that recognition I was talking about above)</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754070067689/3a693b60-32cd-4ddb-a6ad-152988bda8e5.png" alt class="image--center mx-auto" /></p>
<p>Before we talk about how it makes your life easier, let’s walk through what each layer is actually for—and why they matter.</p>
<p><strong>Bronze: The Raw Landing Zone</strong></p>
<p>Welcome to the starting line (well for analytics at least).</p>
<p>The Bronze layer is where <strong>all your data lands first</strong>, untouched and unfiltered. Think of it as your inbox before any rules are applied—where everything piles up before you sort through it. It’s raw, messy, and that’s exactly the point. It’s a replica of your transactional system, Excel files of every version of next fiscal year’s budget, and that random JSON file someone needs for a data science project.</p>
<p>In Fabric, this usually lives in a <strong>Lakehouse</strong>, because it handles both structured data (Delta tables) and raw files (CSV, Parquet, Excel).</p>
<p>So why keep things raw? Because Bronze is your <strong>system of record</strong>—a snapshot of your data exactly as it arrived. No filters. No transformations. Just full fidelity, ready for anything you might need later (like debugging that error that shows up once a month, or proving to your coworker that you were right and it’s the source system’s fault).</p>
<p><strong>Here’s what makes Bronze Bronze:</strong></p>
<ul>
<li><p><strong>Raw by design</strong> – Stores whatever shows up: JSON, CSV, Parquet, you name it. No changes.</p>
</li>
<li><p><strong>Append-only</strong> – New records are added over time. Think of it as a historical log that you can always replay if something goes sideways.</p>
</li>
<li><p><strong>Not for analysis</strong> – This is not the layer your analysts should be querying. You <em>should use it for validation though</em>.</p>
</li>
<li><p><strong>Great for traceability</strong> – You’re keeping the original structure, which helps when you need to trace an issue back to the source.</p>
</li>
<li><p><strong>Flexible ingestion</strong> – Works with both batch and streaming sources—ADLS, S3, Kafka, Event Hubs, you get the idea.</p>
</li>
</ul>
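<p>To make the append-only idea concrete, here’s a minimal sketch in plain Python (in Fabric you’d typically do this in a PySpark notebook with <code>df.write.format("delta").mode("append")</code>). The function and metadata column names (<code>_ingested_at</code>, <code>_source_file</code>) are hypothetical illustrations, not Fabric APIs—the point is that Bronze only ever adds records, stamped with lineage metadata, and never overwrites them:</p>

```python
from datetime import datetime, timezone

def ingest_to_bronze(bronze_log, records, source_file):
    """Append raw records untouched, stamped with lineage metadata.

    Bronze never updates or deletes: every batch is added to the end,
    so the full history can always be replayed or traced to its source.
    (Illustrative sketch -- in Fabric this would be a Delta append.)
    """
    batch_ts = datetime.now(timezone.utc).isoformat()
    for rec in records:
        bronze_log.append({
            **rec,                       # original fields, unmodified
            "_ingested_at": batch_ts,    # when the batch landed
            "_source_file": source_file  # where it came from
        })
    return bronze_log

bronze = []
ingest_to_bronze(bronze, [{"id": 1, "amt": "10"}], "sales_2025_01.csv")
ingest_to_bronze(bronze, [{"id": 1, "amt": "12"}], "sales_2025_02.csv")
# Both versions of record id=1 are kept -- nothing is overwritten.
```
<p>Notice the second batch “corrects” the first but doesn’t replace it—that’s what lets you prove three months later what the source actually sent.</p>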
<p>In short, Bronze is the foundation. It’s the layer that lets you confidently say, “Yes, we have the original data—no, we didn’t accidentally overwrite it three months ago.”</p>
<p>Next up? We take that messy inbox and start cleaning it up.</p>
<p><strong>Silver: Cleaned and Modeled</strong></p>
<p>This is where things start to get interesting—and where, in my experience, most of the work actually happens.</p>
<p>Silver is the layer where you take all that raw, messy data from Bronze and start making sense of it. You clean it up, apply structure, and turn it into something the business can actually use. Think of it as the <strong>translation layer</strong>—you're taking “data” and turning it into “information.”</p>
<p>In Fabric, that usually means using <strong>Notebooks</strong>, <strong>Data Pipelines</strong>, or a mix of both to apply your business logic. Maybe you’re flattening nested JSON. Maybe you’re fixing timestamp formats. Maybe you're adding logic that finance swears is <em>critical</em>—kind of like how WeWork swore their “community-adjusted EBITDA” was a thing.</p>
<p>Here’s what Silver is all about:</p>
<ul>
<li><p><strong>Transformation starts here</strong> – Filtering, joining, standardizing, deduplicating—this is where raw data starts becoming analysis-ready.</p>
</li>
<li><p><strong>Business logic lives here</strong> – Whether it’s calculating revenue, flagging status fields, or prepping a clean dimension table, this is your playground.</p>
</li>
<li><p><strong>Not quite final</strong> – It’s not ready for dashboards yet, but it’s miles ahead of where it started. Think: clean ingredients, not the final dish.</p>
</li>
<li><p><strong>Reduces pain later</strong> – Validating and standardizing early keeps things clean downstream—especially when people start building reports on top of it.</p>
</li>
</ul>
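<p>Here’s what those bullets look like as code. This is a plain-Python sketch of the standardize-and-deduplicate step (in a real Fabric Silver notebook you’d use PySpark—<code>dropDuplicates</code>, <code>cast</code>, or a Delta <code>MERGE</code>); the column names and the “latest timestamp wins” rule are illustrative assumptions:</p>

```python
from datetime import datetime

def bronze_to_silver(rows):
    """Standardize types and deduplicate, keeping the latest row per key.

    Illustrative sketch of typical Silver logic: fix data types,
    normalize timestamps, and resolve duplicates from Bronze.
    """
    latest = {}
    for row in rows:
        rec = {
            "order_id": int(row["order_id"]),                     # string -> int
            "amount": float(row["amount"]),                       # string -> numeric
            "order_ts": datetime.fromisoformat(row["order_ts"]),  # standardize timestamp
        }
        key = rec["order_id"]
        # Deduplicate: the most recent version of each order wins.
        if key not in latest or rec["order_ts"] > latest[key]["order_ts"]:
            latest[key] = rec
    return list(latest.values())

raw = [
    {"order_id": "1", "amount": "10.5", "order_ts": "2025-08-01T09:00:00"},
    {"order_id": "1", "amount": "12.0", "order_ts": "2025-08-02T09:00:00"},  # correction
    {"order_id": "2", "amount": "7.25", "order_ts": "2025-08-01T10:00:00"},
]
silver = bronze_to_silver(raw)
```
<p>Both versions of order 1 still exist in Bronze; Silver is where you decide which one the business should see.</p>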
<p>If Bronze is your raw transactional system duplicate, <strong>Silver is where the analytic layer starts to take form</strong>. It’s where you bring structure and logic into place.</p>
<p>But here’s the thing—your exec team probably doesn’t care about Silver. They want polished dashboards, KPIs, and numbers that “just make sense.” They don’t want to know how the hotdog is made – they just want the finished product. That’s where <strong>Gold</strong> comes in.</p>
<p><strong>Gold: Business-Ready Insights</strong></p>
<p>The <strong>Gold layer</strong> is your polished, analytics-ready data—this is the stuff your executives, analysts, and self-service heroes actually see. It’s where you take everything you’ve cleaned and modeled in Silver, and serve it up in a way that’s fast, clear, and business-friendly.</p>
<p>In Fabric, this usually means:</p>
<ul>
<li><p>Loading into <strong>Warehouse tables</strong></p>
</li>
<li><p>Building <strong>views with clean, readable column names</strong></p>
</li>
<li><p>Designing <strong>semantic models</strong> in Power BI</p>
</li>
<li><p>Applying <strong>row-level security</strong> to make sure the right people see the right numbers (and only those numbers)</p>
</li>
</ul>
<p>Unlike the granular detail in Bronze or Silver, <strong>Gold is typically aggregated</strong>—daily, weekly, monthly—whatever fits the business question.</p>
<p>This data is optimized for reporting and querying. Think of it as serving data on a platter for your end users to consume—there’s no additional work left for them to do. You can attach AI models to it, create suites of reporting, and even give power users access for self-service reporting.</p>
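<p>As a concrete sketch of that Silver-to-Gold roll-up: the plain-Python function below aggregates cleaned order rows to one row per day with business-friendly column names (in Fabric you’d do this with a PySpark <code>groupBy</code> or a SQL view over the Warehouse; the column names here are hypothetical):</p>

```python
from collections import defaultdict
from datetime import datetime

def silver_to_gold_daily(orders):
    """Roll cleaned Silver order rows up to one Gold row per day.

    Illustrative sketch: aggregate to the grain the business asks for
    (daily here) and expose clean, readable column names.
    """
    daily = defaultdict(lambda: {"Total Revenue": 0.0, "Order Count": 0})
    for o in orders:
        day = o["order_ts"].date().isoformat()
        daily[day]["Total Revenue"] += o["amount"]
        daily[day]["Order Count"] += 1
    # One friendly row per day, ready for a semantic model or dashboard.
    return [{"Order Date": d, **m} for d, m in sorted(daily.items())]

silver = [
    {"order_ts": datetime(2025, 8, 1, 9),  "amount": 12.0},
    {"order_ts": datetime(2025, 8, 1, 14), "amount": 7.25},
    {"order_ts": datetime(2025, 8, 2, 9),  "amount": 3.0},
]
gold = silver_to_gold_daily(silver)
```
<p>The grain (daily, weekly, monthly) is a business decision, not a technical one—pick whatever matches the question the dashboard answers.</p>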
<p>Because the Gold layer models a business domain, some teams create multiple Gold layers to meet different business needs, such as HR, finance, and operations.</p>
<p>Use cases:</p>
<ul>
<li><p>Semantic Models</p>
</li>
<li><p>Reports &amp; Dashboards</p>
</li>
<li><p>LLMs Built on Clean Datasets</p>
</li>
</ul>
<p><strong>Final Thoughts (and a Quick Reality Check)</strong></p>
<p>No architecture is a silver bullet. You still need good data practices, solid governance, and a team that understands the business. But Medallion gives you a head start. It gives your data a home, a purpose, and a path forward.</p>
]]></content:encoded></item></channel></rss>