
Best practices when working with datasets


A dataset is the product of the data builder, and it is the interface through which end-users analyze data. Therefore, you must take extra care when building it.

What makes a good dataset

Understanding the basics of building a dataset is easy — "just join your data models together by setting up the relationships and paths!"

But there are three things you should be optimizing for when building your dataset:

  1. Correctness: Do reports built on top of this dataset return correct data? This is the very lowest bar for design, as Holistics makes it fairly easy to get correct results as long as your join conditions are correct.
  2. Performance: While there may be multiple ways to get Holistics to write a query, there will typically be one way that has the optimal performance. Designing your dataset with performance in mind can be the difference between 15-second and 15-minute dashboard load times.
  3. Usability: When a user who is not an expert in your data opens your dataset to explore, it should be immediately clear how to get the answers they're looking for.

There is no hard-and-fast rule to develop datasets effectively in Holistics, but here are some recommendations:

Best Practices

Organizing Datasets in a logical manner

  • Here are some typical ways of organizing datasets:

    • Master Datasets: group by entity type (customers, inventory, products). The goal is completeness.

    • Department Datasets: group by department. The goal is relevance. Ask yourself these questions before building a dataset:

      • "If I'm from this department, how can I find my data quickly?"
      • "Is there a dataset folder that my department will access frequently?"
  • Don't worry about duplicating datasets (there's no harm in exposing the same data model fields to different teams) - focus on accessibility and the purpose each dataset is built for.

  • Make it clear when datasets are not meant to be explored (group them into a separate folder).

Start small then expand the dataset

There are two common use cases:

  • Building "small" datasets with only a few models that allow users to deep dive into one aspect of the business
  • Building "large" datasets that can be used for complex exploration

The best advice we can give is to keep multiple small datasets for "low-requirement" business users, and also create large datasets for more analytics-savvy users to perform complex analyses and ad-hoc exploration.

Start out with small datasets which answer specific questions

  • Design each dataset to answer a specific set of questions. Anticipate your end users’ questions, and build your dataset around that. Only include the models relevant to the questions.
  • It is advisable to have 4-5 models joined together per small dataset, and 6-7 is the absolute upper limit. If you find yourself needing more, you probably need to model the underlying data more effectively.
  • This is because:
    • Too many models will confuse end-users
    • Too many models are more likely to generate complicated, non-optimal joins

Create large datasets for more complex requirements or ad-hoc exploration

  • As the company grows, its datasets should grow too in order to answer more questions from business users. Datasets will naturally expand as the business asks more questions, and as business users become more confident with data exploration you can include more models in a single dataset.
  • One typical approach is to create a large dataset for each business activity ('inventory', 'sales', 'marketing', 'risk'), joining 10-15 tables (or more) in each dataset. Navigation stays simple: sales users know where to find sales data, and marketing users know where to find marketing data.
  • However, there is a trade-off between dataset flexibility and performance, and large datasets often contain many incompatible fields, which makes it frustrating to figure out which fields can be used together. A few ways to keep these large datasets under control:
    • Self-explanatory naming for data models and fields
    • Enrich data models and fields with metadata

Avoid complex join paths

  • Build your dataset as a "star schema": one central table (typically a fact table) that contains the necessary measures, surrounded by dimension tables.
  • Avoid "snowflake-schema-like" datasets. In a snowflake schema, additional information about an entity/dimension is split into separate tables, which means more models to include in the dataset and more joins to get complete information about that entity.
  • Ideally, set up your models so that the dimensions and measures you need are only one JOIN away. A join chain that is too long will not be performant.
  • One good way to reduce the joins generated at dataset run time is to prejoin/denormalize your data as much as possible, as sketched below.
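
To make that last point concrete, here is a minimal sketch of a query model that flattens a snowflaked customer dimension into a single dimension model. The `customers` and `customer_addresses` tables and their columns are hypothetical; the point is that the fact table then only ever needs one join to reach every customer attribute.

```sql
-- Hypothetical snowflaked tables: customers and customer_addresses.
-- Flattening them into one dimension model keeps the fact table
-- (e.g. orders) exactly one JOIN away from every customer attribute.
SELECT
  c.id          AS customer_id,
  c.full_name,
  c.segment,
  a.city,
  a.country
FROM customers c
LEFT JOIN customer_addresses a ON a.customer_id = c.id
```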

Performance Tips

Related resources

See the central page of Performance Guides.

Pre-aggregate your data

Instead of letting users explore large raw tables directly, it is better to write Query Models that filter the data relevant to a particular question, pre-aggregate it to a grain that can satisfy most of your end-users' data questions, and persist the result back to the database for a smoother exploration experience.
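
For example, a Query Model along these lines pre-aggregates the data to one row per user per day. The raw `events` table and its columns are hypothetical, and the date-truncation syntax varies by database; persisting the result lets explorations hit the much smaller aggregated table instead of the raw log.

```sql
-- Hypothetical raw `events` table pre-aggregated to a daily, per-user grain.
-- Persist this model so dashboards query the small aggregated table
-- instead of scanning the raw event log.
SELECT
  user_id,
  DATE_TRUNC('day', created_at) AS event_date,
  COUNT(*)                      AS event_count,
  COUNT(DISTINCT session_id)    AS session_count
FROM events
GROUP BY 1, 2
```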

Avoid exposing large tables in explorable datasets

It is not advisable to include large tables in an end-user facing dataset. Models created from large tables (for example, events or log tables) should only serve as a base for you to build other derived models upon.

Large tables often have indexes that are well-known to analysts and engineers, but end-users of datasets may not know about them and can generate poorly performing queries while exploring.
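
As an illustration (the `page_view_logs` table, its columns, and the interval syntax are all hypothetical and database-dependent), a derived model like the one below keeps only the columns and time range end-users actually need; it is that derived model, rather than the raw log table, that you would expose in the dataset.

```sql
-- Hypothetical derived model on top of a raw `page_view_logs` table:
-- keep only the needed columns and a bounded time range, then expose
-- this model (not the raw log) in the end-user facing dataset.
SELECT
  user_id,
  page_path,
  created_at
FROM page_view_logs
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
```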

Beware of potential fan-out when creating relationships

When you set relationships between models (many-to-one or one-to-one), Holistics assumes that the field at the "one" end is already unique.

If the field is not unique, a fan-out will happen when you drag in fields from those two models: the result set becomes a Cartesian product of the matching rows and may "explode" into millions of rows.
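
A quick way to see the problem, using hypothetical `orders` and `payments` tables: suppose a relationship was declared as if `payments.order_id` had one row per order, but it actually has several. Summing an order-side measure across that join silently inflates the result, and a simple uniqueness check on the join key catches this before you define the relationship.

```sql
-- Fan-out illustration: each order row is repeated once per matching
-- payment row, so SUM(o.amount) is inflated.
SELECT SUM(o.amount) AS inflated_order_total
FROM orders o
LEFT JOIN payments p ON p.order_id = o.id;

-- Sanity check: the "one" side of the relationship must be unique
-- on the join key. Any row returned here means a fan-out will occur.
SELECT order_id, COUNT(*) AS row_count
FROM payments
GROUP BY order_id
HAVING COUNT(*) > 1;
```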

Use Holistics's modeling syntax in Query Models

When writing Query Models, instead of using the usual model_alias.field_name notation to select fields from tables, use {{ #model_alias.field_name }} so Holistics will only select the fields that actually need to be included in the generated queries. Learn more
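
A small sketch of the difference is below. The `orders` model alias is assumed for illustration, and the exact FROM-clause form may differ slightly in your Holistics version; the field-reference syntax is the one described above.

```sql
-- Plain SQL notation: Holistics cannot tell which columns are needed,
-- so the generated query may pull in more fields than necessary.
--   SELECT orders.id, orders.status FROM orders

-- Holistics modeling syntax: only the referenced fields end up in the
-- generated query (model alias `orders` is assumed here).
SELECT
  {{ #orders.id }},
  {{ #orders.status }}
FROM {{ #orders }}
```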

Other Notes

  • You can create any number of datasets, because Holistics datasets have two benefits:
    • No need to worry about report sprawl. All definitions are centralized within Holistics.
    • All Holistics datasets are virtual and they don't take up physical storage space.
  • Common Challenges:
    • Data exploration offers flexibility for business users, but there is the risk of dragging in incompatible dimensions and metrics, or adding non-additive metrics.
    • Beware of fan-out issues.
