Skip to main content

How Holistics Handles Joins

Introduction

For business users that use drag-and-drop report interface day by day, it may not be that important to understand what is happening behind the scenes as long as “it works”.

However, for analysts, it is important to understand fully Holistics’ underlying mechanism for joining to ensure the results are calculated correctly.

How we generate SQL based on different join types

Many-to-one relationship

In Many-to-one relationship, Holistics uses LEFT JOIN (also known as LEFT OUTER JOIN) from ‘many’ table to ‘one’ table.

Consider an example where we want to count total revenue generated by different genders of the users.

Using Data Exploration, the result may look like this:

The final SQL will look like this:

SELECT
U.gender,
SUM(O.revenue) as total_orders
FROM orders O
LEFT JOIN users U ON O.user_id = U.id
GROUP BY 1
Why we do LEFT JOIN from ‘many’ table to ‘one’ table?
  • We do this to handle cases where make sure records from ‘many’ table without a corresponding row in ‘one’ table is counted properly. This is called Referential Integrity.
  • In the above example, doing orders LEFT JOIN users will ensure sales records whose user_id is not present in users table will also be counted.

One-to-one relationship

In One to One relationship, Holistics uses FULL OUTER JOIN to keep the blank value on either side.

Consider case where you have customers and email_addresses on a one-to-one relationship (each customer has a corresponding record in email_addresses). We want to retrieve the full information for each customer.

The final SQL may look like this:

SELECT C.name, E.email
FROM customers C
FULL OUTER JOIN email_addresses E ON C.id = E.customer_id

In the above example, customers FULL OUTER JOIN email_addresses will select all customers and addresses, regardless of whether these customers/ emails have their corresponding entries in the other table or not.

Potential fan-out issues with one-to-many relationship

Fan-out happens when you LEFT JOIN two data models with one-to-many relationship, but the table with one column is on the left and the table with many column on the right.

For example, you are working on an ecommerce dataset that contains two data models Order Items and Orders.

Let's say when exploring this set of Data, if you use any field from order_items, for example, order_items.order_id then delivery_attempts will be duplicated and normally sum on the number of delivery will be wrong.

Since our current mechanism will find a join path (LEFT JOIN by default) from the data model that has aggregations (measures) to the data model that does not, the fan-out issue occurs.

In the aforementioned case, you use measure field SUM(orders."delivery_attempts") of orders data model and non-measure field order_items.order_id of order_items data model. Since the relationship between orders and order_items is one-to-many, the underlying query will be generated as below

SELECT
T1."order_id" AS "order_id",
SUM(T0."delivery_attempts") AS "delivery_attempts"
FROM
"ecommerce"."orders" T0
LEFT JOIN "ecommerce"."order_items" T1 ON T0."id" = T1."order_id"
WHERE
T0."id" < 2200
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

What have we done to solve the issue

Note:

Holistics would auto-resolve this fan-out issue for the Aggregations and Business Calculation in our Dataset exploration UI or measure defined using AML (only in 4.0). However, we do not support solving the fan-out problems when using measures created in the 3.0 Data Model layer.

In order to solve this issue, Holistics has modified the underlying query to get the distinct value when the fan-out issue happens.

Specifically, when the measure/aggregation is on the dimension model instead of the fact model, we will calculate the measure first inside the dimension model before executing the join.

For example, let take the example above, when we sum the delivery_attempts, this query will run first inside order model to calculate total delivery_attempts for each order_id. Let's call this Set A

SELECT
T0."id" AS "order_id",
SUM(T0."delivery_attempts") AS "total_delivery_attempts"
FROM
"ecommerce"."orders" T0
WHERE
T0."id" < 2200
GROUP BY 1
ORDER BY 2 DESC

After that, we will execute the join from order_items to orders as normal and select Distinct order_id from order_items model. Let's call this Set B

SELECT
distinct T0."order_id" AS "order_id"
FROM
"ecommerce"."order_items" T0
LEFT JOIN "ecommerce"."orders" T1 ON T0."order_id" = T1."id"
WHERE
T0."id" < 2200

Finally, we will execute Inner Join Set A and Set B, select order_id from Set B and total_delivery_attempts from Set A. By doing this, the order_id will be unique and fan-out issue no longer exists.

The final query will be:

With total_delivery_attemps as (
SELECT
T0."order_id" AS "order_id",
SUM(T0."delivery_attempts") AS "total_delivery_attempts"
FROM
"ecommerce"."orders" T0
WHERE
T0."id" < 2200
GROUP BY 1
ORDER BY 2 DESC
)
, distinct_order_id as (
SELECT
distinct T0."order_id" AS "order_id"
FROM
"ecommerce"."order_items" T0
LEFT JOIN
"ecommerce"."orders" T1 ON T0."order_id" = T1."id"
WHERE
T0."id" < 2200
)

SELECT
T1."order_id"
T0."total_delivery_attempts"
FROM
total_delivery_attemps T0
INNER JOIN
distinct_order_id T1 on T0."order_id" = T1."order_id"

How to avoid fan-out issue in your report

In Holistics, the fan-out warning should only appear when you're using a custom measure, because we only auto-handle fan-out for measures with an explicit aggregation type. If your measure is created inside modeling layer using SQL, there will be high possibility that we cannot guess our aggregation function thus causing the fan-out issues.

There are some solutions to avoid the error in this situation:

  • Use Aggregation or Business calculation on the Dataset Exploration
  • Using AML syntax (only in 4.0)

You could find more detail here

Handling Referential Integrity Violations

Take example of (Orders, Users) above. Consider scenario where there are orders records of user ID = 5, but in the users model, no corresponding record of user ID 5 found. This is a violation of the referential integrity rule between orders and users (many-to-one relationship).

If we use INNER JOIN for the above query, the result set would eliminate orders rows with unfounded users. This is dangerous and will underreport the sales results.

Therefore, using INNER JOIN does not resolve referential integrity violation correctly.

-- IMPORTANT: This is not what Holistics does
-- Using INNER JOIN
SELECT
U.gender,
SUM(O.revenue) as total_orders
FROM orders O
INNER JOIN users U ON O.user_id = U.id
GROUP BY 1

Generally speaking, Holistics will opt to use OUTER JOIN when dealing with referential integrity issues. In some situation, this will add query performance overhead, but it ensures that all records will be accounted for and not fall into referential integrity violation traps.


Let us know what you think about this document :)