
Set up DataHub integration

Early access

This feature is currently in development and not yet available. Contact Holistics to sign up for early access.

Prerequisites

Before setting up the integration, make sure you have:

  • Holistics CLI installed and available in your PATH. See the CLI documentation for installation instructions.
  • DataHub instance running and accessible. This can be a local instance or DataHub Cloud.
  • Access to your Holistics AML project - either as a local directory or a git repository.
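The first prerequisite can be checked from a script before running anything else. The following is a minimal sketch, not part of the connector: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the only assumption is that the two commands are called `holistics` and `datahub`, as they are elsewhere on this page.

```python
import shutil


def missing_tools(names):
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# "holistics" is the CLI listed in the prerequisites; "datahub" is the
# command used later on this page to run the recipe.
REQUIRED_TOOLS = ["holistics", "datahub"]
```

Calling `missing_tools(REQUIRED_TOOLS)` returns an empty list when both commands are installed and visible on your PATH; any name it returns still needs to be installed or added to PATH.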

Installation

Install the DataHub Holistics connector using pip:

pip install datahub-holistics

This package registers itself as a DataHub ingestion source plugin, so you can use type: holistics in your ingestion recipes.

The connector expects a recent Holistics CLI that exposes the canonical lineage graph via:

holistics aml lineage .

Configuration

The connector is configured through a YAML recipe file, following DataHub's standard ingestion format.

Basic setup with local directory

If your AML project is on your local machine, use the base_folder option:

source:
  type: holistics
  config:
    base_folder: /path/to/your/holistics-aml-project

    connection_to_platform_map:
      bigquery_prod:
        platform: bigquery
        env: PROD

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080

Git-based setup

For production use, you'll typically want the connector to clone your AML project from git. This ensures you're always ingesting from the latest committed state:

source:
  type: holistics
  config:
    git_info:
      repo: https://github.com/your-company/holistics-project
      branch: main
      deploy_key_file: /path/to/deploy_key # Optional, for private repos

    connection_to_platform_map:
      bigquery_prod:
        platform: bigquery
        env: PROD

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080

Connection mapping

Connection mapping is essential for establishing lineage from your Holistics models to the underlying database tables. Without it, the connector won't know which DataHub platform corresponds to each Holistics data source.

Each entry maps a Holistics data_source_name (as defined in your AML files) to a DataHub platform:

connection_to_platform_map:
  # Simple mapping - just specify the platform
  bigquery_prod:
    platform: bigquery
    env: PROD

  # Detailed mapping - useful for platforms that need more context
  postgres_analytics:
    platform: postgres
    platform_instance: analytics-db
    database: analytics
    schema: public
    env: PROD

  # Snowflake example
  snowflake_warehouse:
    platform: snowflake
    platform_instance: my-snowflake
    env: PROD

The connector uses this mapping to construct proper DataHub URNs for source tables, enabling end-to-end lineage from dashboards down to database tables.
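To make the role of each mapping field concrete, here is a minimal sketch of how a DataHub dataset URN is laid out. It follows DataHub's general URN format, in which the platform instance (when set) is prefixed to the dataset name. How the connector itself composes the table name from `database` and `schema` is an assumption here, and `dataset_urn` and the table names are hypothetical.

```python
def dataset_urn(platform, table_name, env="PROD", platform_instance=None):
    """Build a DataHub dataset URN string.

    platform and env correspond to the fields in connection_to_platform_map.
    When platform_instance is given, it is prefixed to the dataset name.
    """
    name = f"{platform_instance}.{table_name}" if platform_instance else table_name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical table names, for illustration only.
print(dataset_urn("bigquery", "my-project.sales.orders"))
print(dataset_urn("postgres", "analytics.public.orders",
                  platform_instance="analytics-db"))
```

If the platform, instance, or env in the mapping differs from what your warehouse ingestion used, the resulting URNs will not match the existing table entities and lineage will point at separate, empty datasets.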

Feature flags

You can control what metadata gets extracted:

source:
  type: holistics
  config:
    base_folder: /path/to/project

    # Feature flags (all default to true)
    extract_owners: true # Extract owner information from AML
    extract_lineage: true # Build lineage relationships
    extract_descriptions: true # Include descriptions from AML
    include_hidden_fields: false # Include fields marked as hidden

    # Platform identification
    platform_instance: production-holistics
    env: PROD

    connection_to_platform_map:
      # ... your mappings

Filtering

Use regex patterns to control which entities get ingested:

source:
  type: holistics
  config:
    base_folder: /path/to/project

    # Only ingest specific entities
    model_pattern:
      allow:
        - ".*"
      deny:
        - "tmp_.*" # Skip temporary models
        - "test_.*" # Skip test models

    dataset_pattern:
      allow:
        - ".*"

    dashboard_pattern:
      allow:
        - ".*"

    connection_to_platform_map:
      # ... your mappings
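The sketch below illustrates how allow and deny lists are typically evaluated. It assumes the connector's `*_pattern` options follow DataHub's usual allow/deny behavior: deny takes precedence, patterns are matched from the start of the name, and matching is case-insensitive by default. `is_allowed` is a hypothetical helper, not the connector's code.

```python
import re


def is_allowed(name, allow=(".*",), deny=(), ignore_case=True):
    """Return True if name matches an allow pattern and no deny pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    # Deny wins: any matching deny pattern rejects the name outright.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    # re.match anchors at the start of the string, not anywhere inside it.
    return any(re.match(pattern, name, flags) for pattern in allow)


deny = ["tmp_.*", "test_.*"]
print(is_allowed("tmp_orders", deny=deny))     # False
print(is_allowed("orders", deny=deny))         # True
print(is_allowed("my_tmp_orders", deny=deny))  # True
```

Under these assumptions, `my_tmp_orders` is still ingested because the pattern has to match from the first character. To exclude it as well, a pattern such as `.*tmp_.*` would be needed.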

Stateful ingestion

Enable stateful ingestion so that entities deleted from your AML project are detected on the next run and the corresponding stale entities are removed from DataHub automatically:

source:
  type: holistics
  config:
    base_folder: /path/to/project

    stateful_ingestion:
      enabled: true

    connection_to_platform_map:
      # ... your mappings

Running the ingestion

Save your recipe to a file (e.g., holistics_recipe.yaml) and run:

datahub ingest -c holistics_recipe.yaml

The connector will output progress information showing how many models, datasets, dashboards, and charts were processed.

Internally, the connector calls the Holistics CLI and reconstructs DataHub entities from AML-native graph nodes and edges.
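If you schedule the ingestion from a script rather than typing the command by hand, a thin wrapper around the same command is usually enough. This is a minimal sketch: `ingest_command` and `run_ingestion` are hypothetical helpers, and treating a non-zero exit code as a failed run is an assumption you should confirm against your DataHub CLI version.

```python
import subprocess


def ingest_command(recipe_path):
    """Build the same command shown above, as an argument list."""
    return ["datahub", "ingest", "-c", recipe_path]


def run_ingestion(recipe_path):
    """Run the recipe and return the CLI's exit code."""
    completed = subprocess.run(ingest_command(recipe_path), check=False)
    return completed.returncode
```

For example, `run_ingestion("holistics_recipe.yaml")` runs the recipe saved earlier and returns the exit code, which a scheduler can use to decide whether to alert or retry.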

Verification

After the ingestion completes:

  1. Check DataHub UI - Navigate to your DataHub instance and search for "holistics". You should see your models, datasets, and dashboards.

  2. Verify lineage - Open a dashboard and check the Lineage tab. You should see connections to charts, which connect to models, which connect to source tables.

    The canonical AML graph may contain additional concepts such as filter blocks or other non-viz dashboard blocks. These are preserved in the CLI output but are not currently emitted as DataHub chart entities.

  3. Check schema - Open a model and look at the Schema tab. Dimensions and measures should appear as fields with appropriate tags.

Troubleshooting

CLI not found: Ensure the Holistics CLI is installed and in your PATH. Test by running holistics --version.

Git clone fails: For private repositories, make sure your deploy key has read access and the path in deploy_key_file is correct.

No lineage to source tables: Verify your connection_to_platform_map entries match the data_source_name values in your AML models.

Entities missing: Check the ingestion report for filtered or errored entities. Adjust your *_pattern settings if needed.
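For the "No lineage to source tables" case, comparing names programmatically is often faster than reading through the files. The sketch below assumes you have already collected the data_source_name values from your AML models by some means; `unmapped_sources` and the example names are hypothetical.

```python
def unmapped_sources(data_source_names, connection_to_platform_map):
    """Return data source names that have no entry in the mapping."""
    mapped = set(connection_to_platform_map)
    return sorted(set(data_source_names) - mapped)


# Hypothetical values, for illustration only.
mapping = {
    "bigquery_prod": {"platform": "bigquery", "env": "PROD"},
    "postgres_analytics": {"platform": "postgres", "env": "PROD"},
}
names_in_aml = ["bigquery_prod", "Postgres_Analytics"]

print(unmapped_sources(names_in_aml, mapping))  # ['Postgres_Analytics']
```

The comparison in this sketch is exact and case-sensitive, so it also surfaces small differences in spelling or capitalization between the AML files and the recipe.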

