Columnar and Relational Database Structures

v0.1.0

Jamie Clark
Dimensional Insight

Dimensional Insight

business analytics and
integrated business intelligence solutions

Diver Platform

Supply Chain, Manufacturing,
Healthcare, Beverage Alcohol, ...

> spectre dive basic.cbase \
    -q -d Direction -c Value 
Direction       Value
East       2,937.7687
North      2,520.4438
South      1,641.8041
West       2,924.6749

Jamie Clark

15+ years building business intelligence tools
author of DivePort
Spectre team lead
first poultry and egg conference

james.clark@dimins.com

Agenda

☑ storage geometry:
row-oriented vs. column-oriented

☑ database types:
MOLAP vs. in-memory ROLAP

Tables

(usually) two-dimensional

but they must be stored
on one-dimensional media
so a decision must be made...

row-oriented

Title Year Gross
Bolt 2008 114M
Juno 2007 143M
Thor 2011 181M

Bolt 2008 114M Juno 2007 143M Thor 2011 181M

column-oriented

Title Year Gross
Bolt 2008 114M
Juno 2007 143M
Thor 2011 181M

Bolt Juno Thor 2008 2007 2011 114M 143M 181M

so what's the difference?

locality of reference

disk is very far away

data: Jonas Bonér

row-oriented

Bolt 2008 114M Juno 2007 143M Thor 2011 181M

column-oriented

Bolt Juno Thor 2008 2007 2011 114M 143M 181M

row-oriented

  • add new rows quickly
  • fast access to many columns on few rows

column-oriented

  • add/remove columns quickly
  • fast access to many rows on few columns

analytical tasks prefer column-oriented transactional tasks prefer row-oriented

Business Analytics

(OLAP: online analytical processing)

it's possible to do analysis
in a transactional database
using SQL

a business intelligence solution
makes it easier and faster

let's look at two structures
for busines analytics...

MOLAP and ROLAP

MOLAP

multidimensional OLAP

8 x 7 = ?

8 x 7 = 56

MOLAP

pre-computed query results

MOLAP

to answer a query: identify the right table, load it

this is very fast

(when the answer was pre-computed)

this works on weak hardware

this takes time to prepare

and cannot be in-memory

cannot change summary functions at run time

must limit number of dimensions

MOLAP is fast at run-time
but inflexible

ROLAP

relational OLAP

102 x 11 = ?

102 x 11 = 1,122

ROLAP

no pre-computed results
just the detail table

ROLAP

queries are run directly against detail table

this is very flexible

and takes much less space

but it requires strong hardware

even then, it can be slow

there's a balance between speed and flexibility

but it's complicated...
if ROLAP fits in memory
and we have multiple processors
and the questions are complex
and the data changes frequently
and the engine can optimize the query
and nobody is leaning against the server
and the moon is waxing
and your favorite show is on
...

ROLAP is very flexible,
and was considered slow at run-time
but with modern hardware
and enough memory
in-memory ROLAP is looking good

MOLAP

  • fast query time
  • especially on poor hardware

in-memory ROLAP

  • fast access time
  • fast build time
  • scales better
  • more flexible

 

let's bring it back to storage geometry

MOLAP "data cubes" and ROLAP data tables
could be row- or column-oriented

for MOLAP it doesn't matter much

for ROLAP it matters (among many other factors)

transactional databases are usually row-oriented

"optimize" the data by moving it
to a column-oriented database

Changes

Dimensional Insight started in 1989
with row-major MOLAP

in 2014 we introduced Spectre,
column-oriented in-memory ROLAP

why?

ROLAP has a lot of advantages

and two disadvantages
(performance and hardware requirements)

the disadvantages were impossible to overcome

but things change

memory is very cheap

data: hblok.net

CPU cores are becoming plentiful

we've ridden Moore's law to a tipping point
where cheap processors and abundant memory
permit fast, dynamic, in-memory analysis

Conclusions

things change
buy lots of memory
know how your data is stored
modern hardware opens up possibilities

columns under Gaudí's Parc Güell, Barcelona