Understanding Dataset

Introduction

In Holistics, a Dataset is a "container" holding several data models together so they can be explored together, and dictating which join path to be used in a particular analytics use case.

In other words, Dataset is like a mini data mart that enables two things:

  • Data Exploration: Dataset can be shared to Explorers (non-technical users) to do self-service exploration of the data.
  • Creating Charts: All Charts in Holistics have to be created from a dataset. This is done either by the Analyst or the Explorer

UI of a dataset exploration

Using Dataset

Creating a dataset

To create a Dataset, you can go to Datasets section under Reporting. From here you can either create datasets directly or create a folder to organize your datasets.

First, start with selecting all the Data Models that you want to add to your Dataset. Then, activate or deactivate relationships that enable the join paths you need.

In this example, since model Users, Orders and Order Items have relationships with each other, if you choose Order Items as root model, you will be able to select Users and Orders as peripheral models.

You also have the option to create a Dataset directly from Data Models to explore it

Explore a dataset

With Dataset, business users can ask and answer questions themselves without relying on the analyst to write SQL at every step in their exploration.

If you want to start fresh, simply go to the Dataset section on Reporting page and click on a Dataset. You will be presented with the Explore screen where you can start dragging in fields, measures, apply conditions, and tweak the visualizations.

When you are satisfied with the exploration result, you can save it as a widget in a dashboard.

You can also explore a report / dashboard widget's result by clicking on it and choose "Explore Data". For more details, please refer to Explore Data

Share a Dataset

Just like with your reports, dashboard or KPI metric sheets, you can share Datasets to specific users or groups. Sharing a folder will automatically share all items inside.

Relationships in Dataset

Combining data from different models

In an exploration, users can drag fields from different data models in a single explore. Based on the relationships metadata, Holistics figures out the correct JOINs statements to apply to the SQL query.

Example: Given a dataset created from the following models: orders, users, merchants, cities, countries

You want to calculate: "Total orders broken down by countries". You will need access to countries.name and orders.total_orders fields.

Simply drag in Country Name and Total Orders field (from 2 different models):

The generated query will be:

SELECT
T3."name" AS "name",
count(1) AS "total_orders"
FROM
"ecommerce"."orders" T0
LEFT JOIN "ecommerce"."users" T1 ON T0."user_id" = T1."id"
LEFT JOIN "ecommerce"."cities" T2 ON T1."city_id" = T2."id"
LEFT JOIN "ecommerce"."countries" T3 ON T2."country_code" = T3."code"
GROUP BY 1
ORDER BY 2 DESC

Ambiguity in join paths

Sometimes, based on the relationships metadata, there are more-than-once possible paths to combine fields between two models.

Example: you have some models with relationships as below

To answer the question "Total order values placed by countries", you drag in Country Name from countries and Orders Count from orders. As you can see, there are two ways to go from countries to orders:

  • A: countries โ†’ cities โ†’ users โ†’ orders โ†’ order items
  • B: countries โ†’ cities โ†’ merchants โ†’ products โ†’ order items

Each JOIN path will produce a different result with different meanings (total values by users' countries vs. by merchant's origin countries.)

To prevent this, when creating datasets, Holistics automatically detects and disable relationships that introduce ambiguity in the JOIN paths. However, you could also browse the Relationship list and adjust your active relationship as your needs.

Duplicating into 2 different models: In these situation, it is also recommended that you duplicate Cities and Countries models into dedicated models for Users and Merchants. This makes your dataset clearer to the end-user:

Which JOIN types (left, right, full) does Holistics use?

The JOINs are constructed following some simple rules:

  • If the relationship between A and B is n - 1, then valid JOIN path is from n to 1 (A LEFT JOIN B or B RIGHT JOIN A)
  • If the relationship is 1 - 1, the path is valid both ways (A LEFT JOIN B or B LEFT JOIN A are OK) so we apply FULL JOIN in this case.

Why do I sometimes see 'SELECT DISTINCT' in the SQL?

In some cases, you will see we apply SELECT DISTINCT in the underlying query for two reasons.

First, this is part of our mechanism to avoid fan-out issue. For more information, please refer to our document about fan-out issue

Second, we see no value to display all records (even duplicated records) so we have applied SELECT DISTINCT to prevent showing redundant data. Our aim is to provide dataset's explorers an overview of their data.

Show items with no data

By default, Holistics will not present items with no data when exploring dataset.

For example, you have 2 models users and orders, the relationship between them is 1 to many. If you want to understand how many orders each user has, you will add in 2 fields: users.name and count(orders.id). However, we will exclude users that have not made any purchases.

In order to include items with no data, you can toggle on the option Include empty children rows under STYLE > Others in our Visualization Settings. Please refer to the video below: