Figure 1: Shows a fact table with 2 dimension tables for Customers and SalesPerson
My response to the image above 2 years ago as an excel user: “What the heck is a FACT and DIM table?”
One of the common questions that anyone using Powerpivot has is… How should I model my Data in Powerpivot? and it’s not rocket science… it’s easy to learn and hopefully you’ll understand more about it after you finish reading this post.
If you are new to Powerpivot and new to database modeling then the diagram above makes little to no sense but just think about them as lookup tables that you’ll later be doing some VLOOKUPS, MATCH/INDEX and all those formulas that we, as excel users, know about.
Let’s learn more about what you should consider when building your Data Model!
Why is this important?
This is probably one of the most crucial parts if not the most important one. Why? think about it… If you happen to have a bad data model and your data is all over the place or not stored in the correct way then it’ll be much more complex to develop your pivot tables, DAX formulas and, in the end, the actual storage space that the tables are taking out of your memory could possibly be lowering the performance when going through the Powerpivot Grid and making new calculated columns.
Above all, what we really want is to keep things as simple as we can and as optimized as they can possibly be for reporting purposes.
In order to do this we need to understand the following concepts:
- Type of Tables and their Attributes
- Primary Keys and Foreign Keys
- Normalization vs Denormalization
- Star Schema vs Snowflake
Type of Tables and their Attributes
First, we need to understand that Powerpivot only handles Tables…within a table you’ll find field (columns) and within those fields you’ll find rows. A record is the combination of all of the fields or in plain English it’s just a complete row, so for example:
Customer Table (a Dim Table) | |||
Customerid | Customer Name | Attribute1 | Attribute2 |
1 | John Doe | Value | Value |
3 | Jonas Doe | Value | Value |
4 | Jane Doe | Value | Value |
5 | Phillip Doe | Value | Value |
6 | Jack Doe | Value | Value |
the row where we see the customer id 1, customer name John Doe, Attribute1 as Value and Attribute2 as Value it’s a record. More important, this record has an id that it’s defined in the Customerid column.
Behind this, we need to understand why some tables are called DIM and some are called FACT tables. Here’s the answer
DIM = Dimensions a.k.a basically a lookup table where you’ll have groups, subgroups or even just categories in order to better slice and filter the information that you have. A dimension it’s a way of seeing things…someone could be looking at sales by Year, but some other person would like to see the sales by Customer and if there’s a customer group they might want to see it that way also…these are called dimensions because they give another set of insights or views to the end user.
FACT = as the name states, it’s a table with all the facts that you have in an operation or process. For Example, my fact table has all the sales that I’ve made. Another type of FACT table could be a ticketing system where all the tickets are being stored and all of the status are being changed. It’s the primary table and the base of any schema as from here is that you’ll calculate your measures (in most cases).
Primary Keys and Foreign Keys
As we talked about the types of tables that we will encounter during our Powerpivot journey, we also need to speak about the type of fields or columns that will relate each table.
Sales Table (a Fact Table) | |||
Order# | SalesPersonid | Customerid | Sales Total |
1 | 4 | 1 | 133 |
2 | 1 | 3 | 63 |
3 | 1 | 4 | 75 |
4 | 1 | 4 | 135 |
5 | 3 | 5 | 18 |
6 | 1 | 1 | 88 |
7 | 4 | 5 | 117 |
8 | 2 | 1 | 36 |
9 | 4 | 4 | 70 |
10 | 2 | 4 | 41 |
11 | 2 | 1 | 54 |
12 | 5 | 6 | 51 |
On the table Above we can group the fields as
Primary Key: Order#
Foreign Keys: Salespersonid, Customerid
Field for Aggregations: Sales Total
ok, cool..we are grouping now but…what does Primary and Foreign really mean?
Primary Key = The column ID for the table you’re in. It’s an unique value that makes a distinction between each row for the specified table
Foreign Key = it’s a Primary Key that belongs to another table but it’s in the table above because we use it to relate the tables. Therefore we are creating the relationships between the Customers and SalesPersons table using their Primary Keys as foreign keys in our FACT table.
As excel users you were already doing this with INDEX/MATCH, SUMPRODUCT, VLOOKUP, HLOOKUP and other functions that were able to relate fields between 2 tables. What changes now is that you’re able to do so by just creating a relationship instead of creating an specific formula for it. (Note that we still have the LOOKUPVALUE formula to perform like a VLOOKUP FALSE function)
Note that the relationships are only made between a unique values table to one that has n amount of duplicated values. (one-to-many relationship)
Normalization vs Denormalization
First, let’s define both concepts:
Normalization = optimize by reducing the redundancy of the data between the tables
Denormalization = attempting to optimize queries and read processes by adding redundant data or by grouping data
Now, let’s look at a real example:
Sales Table (a Fact Table) | |||||
Order# | SalesPersonid | Customerid | Sales Total | Attribute1 | Attribute2 |
1 | 4 | 1 | 133 | Value | Value |
2 | 1 | 3 | 63 | Value | Value |
3 | 1 | 4 | 75 | Value | Value |
4 | 1 | 4 | 135 | Value | Value |
5 | 3 | 5 | 18 | Value | Value |
6 | 1 | 1 | 88 | Value | Value |
7 | 4 | 5 | 117 | Value | Value |
8 | 2 | 1 | 36 | Value | Value |
9 | 4 | 4 | 70 | Value | Value |
10 | 2 | 4 | 41 | Value | Value |
11 | 2 | 1 | 54 | Value | Value |
12 | 5 | 6 | 51 | Value | Value |
take closer look at the color of the columns, you’ll notice that we have the Attribute1 and Attribute2 columns from the Customers Table now as part of the FACT Table… can you guess if this Table is Normalized or Denormalized?
The answer is Denormalized…why? because we already have a relationship to the DIM Customers table using the Customerid so we’re creating a redundancy on the table by having these 2 columns which translates into more memory consumption (which is bad).
So which one should I aim for?
Normalization but to a certain extent…let’s talk more about it on the type of schemas that we commonly have.
Star schema vs Snowflake
The main difference from the Star Schema and the snowflake one is that in the Star schema we have somewhat less tables but within those tables we have a more compact way of storing the data. In other words, if we have a main table like the customers table but within that same table we have a column/field by the name of Customers Group…this field could be related to another table and subsequently that related table could have another related table with more attributes to define it
Now, if we try to define every field to the most granular level in a separate table we’ll encounter with a scenario where we could have a high amount of tables in our data model. One of the suggestions of the Star Schema is that some information can be kept in a shared table which is something like the combination of the last 2 tables
The idea of having less tables and a few more columns is often the best approach but it depends on your specific data model to see if it fits your needs.
More Rows is always a better option than more columns
More Columns is often a better option than more Tables if the amount of rows is significantly small
Here you’ll find a better post on how evil the snowflake can be at times and how cool the stars are:
http://www.powerpivotpro.com/2013/01/stars-and-snowflakes-and-bears-oh-my/
For more information about Data modeling check out these posts:
http://support.gateway.com/s/tutorials/Tu_837599.shtml
if you have questions shoot me an email (about section) or leave a comment below!
Well explained ..
Thank you