Hybrid Columnar Compression on Exadata: A Back-To-Basics Guide

Hybrid Columnar Compression on Exadata: A Back-To-Basics Guide

A recent LinkedIn post by Alex Blyth, Exadata Product Manager, got me thinking again about Hybrid Columnar Compression. He shared an infographic off the back of a conversation he had with a colleague about one of their customers, and whether HCC on Exadata should be used for their older data.

That’s a deceptively simple question. Because the honest answer to “should I use HCC here?” is almost always “it depends on your data and how you access it.” Older, rarely-touched data does happen to be one of the spots where HCC really shines. So it felt like the perfect excuse to take HCC back to basics: what it actually is, and just as importantly, where you should and shouldn’t use it.

This is one of those features that gets people the most excited when they move onto an Exadata, and also gets them into the most trouble when they don’t fully understand it: Hybrid Columnar Compression, or HCC.

HCC is built to use both the database and the storage capabilities on Exadata to give you tremendous space savings andhigh performance at the same time. And it can. I’ve seen tables shrink by 99% and queries get faster in the process. But like most things on Exadata, the magic only shows up when you understand what’s actually happening under the hood and when you use it for the right kind of data.

So before we get to HCC, let’s start from the ground up.

First, what compression do we already have?

Before HCC ever enters the picture, Oracle gives you a couple of compression options that have nothing to do with Exadata. It’s worth understanding them.

Basic Compression — Here the compression unit is a single Oracle block. The important thing to know is that it only compresses data on direct path loads. Any modification, or any insert that doesn’t use the direct path load mechanism, forces the data to be stored uncompressed. That behavior makes it a poor fit for OLTP databases. The syntax is simple:

CREATE TABLE ... COMPRESS;

OLTP Compression — This one is part of the licensable Advanced Compression option. Unlike Basic, it allows data to be compressed for all operations, not just direct path loads. It tries to plan ahead for future updates by leaving roughly 10% of free space in each block through the PCTFREE setting, and once a block becomes “full” it gets compressed:

CREATE TABLE ... COMPRESS FOR OLTP;

Keep that 10% behavior in your back pocket, as it explains a result we’ll see later.

So what is HCC?

HCC is only available for tables stored on Exadata storage. That’s the first thing to remember.

The second thing, and this is the one that bites people, is that just like Basic compression, data is only stored in HCC format when it’s loaded using direct path loads. Any regular inserts or updates against an HCC table will cause those records to be stored in OLTP compressed format instead, not HCC.

In other words, HCC is at its best for data you load and then mostly read (think data warehouses and archival data), not for tables getting hammered with single-row DML all day long. If you point HCC at a busy OLTP table and start updating rows, you’ll slowly erode the very compression you were after.

The four Types of HCC

  • Query Low — LZO algorithm. Lowest compression ratios but least CPU. Optimized for speed. COMPRESS FOR QUERY LOW
  • Query High — ZLIB (gzip). The one I’d point a data warehouse toward when the focus is on saving space. COMPRESS FOR QUERY HIGH
  • Archive Low — ZLIB (gzip) at a higher level than Query High. For archival data where load time is still a factor. COMPRESS FOR ARCHIVE LOW
  • Archive High — Bzip2. Highest compression, most CPU-intensive by far. For archival data where maximum space saving is the goal. COMPRESS FOR ARCHIVE HIGH

Now that we have the theory,let’s test it

I built two tables, both narrow with 12 columns, differing only in data variety. The first has ~20 million rows with a very HIGH number of distinct values (NDV), so lots of unique data and very little repetition. The second has ~19 million rows with a very LOW NDV, meaning the same values repeated over and over. From these I created 12 more (one per method, for each table) and measured space, build time, and query time.

Result #1 — Space saved:

Compression Method High NDV (MB) Reduction Low NDV (MB) Reduction
No Compression 3,109.00 n/a 3,008.00 n/a
Basic 2,769.00 10.90% 246.88 91.80%
OLTP 3,080.60 0.90% 280.94 90.70%
Query Low 2,466.90 20.70% 113.25 96.20%
Query High 1,637.40 47.30% 5.31 99.80%
Archive Low 1,632.80 47.50% 5.31 99.80%
Archive High 1,546.10 50.30% 5.31 99.80%

On low-NDV data, 3 GB collapses to ~5 MB (99.8%). On high-NDV data, the air comes out of the balloon: the best HCC manages is 50.3%, and OLTP barely moves at 0.9% (that’s the 10% PCTFREE showing up). The lesson: your ratio depends far more on your data than on the algorithm.

Result #2 — Load time cost:

Compression Method High NDV Low NDV
Basic 00:13.7 00:06.6
OLTP 00:13.0 00:05.7
Query Low 00:10.4 00:07.0
Query High 00:26.4 00:13.7
Archive Low 00:36.8 00:13.9
Archive High 02:05.1 00:42.8

Archive High took over two minutes vs ~10 seconds for Query Low on high-NDV, roughly 12x the CPU for a marginally smaller segment. That’s why the level names map to use cases.

Result #3 — Query time:

Compression Method High NDV Low NDV
No Compression 00:01.6 00:04.0
Basic 00:01.7 00:02.3
OLTP 00:01.8 00:01.4
Query Low 00:01.3 00:01.1
Query High 00:01.6 00:01.0
Archive Low 00:01.4 00:00.7
Archive High 00:01.8 00:00.9

Compression didn’t punish reads, it helped. You’re trading CPU for I/O, and a 5 MB segment is far less to drag off disk than 3 GB.

So when should you actually use HCC?

Repetitive (low-cardinality) data is where HCC earns its reputation; match the level to the workload (Query Low for loading fast, Query High for warehouse tables, Archive levels for cold data); and respect the direct-path-load rule, or Oracle quietly demotes your rows to OLTP compression.

Conclusion

HCC genuinely lives up to the hype, but only once you understand the paradigm. It rewards repetitive, load-once-read-many data, trades CPU for I/O in a way that often makes queries faster, and punishes you quietly if you ignore the direct-path-load rule. As always in this series: don’t memorize a “best” setting. Understand what’s happening, measure your data, and choose what fits.

Tags:
Rene Antunez
antunez.rene@gmail.com
No Comments

Leave a Reply