Set up DataHub integration
This feature is currently in development and not yet available. Contact Holistics to sign up for early access.
Prerequisites
Before setting up the integration, make sure you have:
- Holistics CLI installed and available in your PATH. See the CLI documentation for installation instructions.
- DataHub instance running and accessible. This can be a local instance or DataHub Cloud.
- Access to your Holistics AML project - either as a local directory or a git repository.
Installation
Install the DataHub Holistics connector using pip:
pip install datahub-holistics
This package registers itself as a DataHub ingestion source plugin, so you can use type: holistics in your ingestion recipes.
The connector expects a recent Holistics CLI that exposes the canonical lineage graph via:
holistics aml lineage .
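Before running an ingestion, it can help to confirm that the CLI is reachable from the environment where the connector runs. The following is a minimal sketch, not part of the connector: the function name `holistics_cli_version` is hypothetical, and it only checks that `holistics --version` responds. It does not verify that the installed version supports `holistics aml lineage`.

```python
import shutil
import subprocess
from typing import Optional


def holistics_cli_version(executable: str = "holistics") -> Optional[str]:
    """Return the output of `<executable> --version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        return None  # executable is not on PATH
    try:
        result = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            timeout=30,
            check=True,  # treat a non-zero exit code as "unavailable"
        )
    except (subprocess.SubprocessError, OSError):
        return None
    # Some CLIs print the version to stderr, so fall back to it.
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    version = holistics_cli_version()
    print(version if version else "Holistics CLI not found on PATH")
```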
Configuration
The connector is configured through a YAML recipe file, following DataHub's standard ingestion format.
Basic setup with local directory
If your AML project is on your local machine, use the base_folder option:
source:
  type: holistics
  config:
    base_folder: /path/to/your/holistics-aml-project
    connection_to_platform_map:
      bigquery_prod:
        platform: bigquery
        env: PROD
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
Git-based setup
For production use, you'll typically want the connector to clone your AML project from git. This ensures you're always ingesting from the latest committed state:
source:
  type: holistics
  config:
    git_info:
      repo: https://github.com/your-company/holistics-project
      branch: main
      deploy_key_file: /path/to/deploy_key  # Optional, for private repos
    connection_to_platform_map:
      bigquery_prod:
        platform: bigquery
        env: PROD
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
Connection mapping
Connection mapping is essential for establishing lineage from your Holistics models to the underlying database tables. Without it, the connector won't know which DataHub platform corresponds to each Holistics data source.
Each entry maps a Holistics data_source_name (as defined in your AML files) to a DataHub platform:
connection_to_platform_map:
  # Minimal mapping - platform and environment only
  bigquery_prod:
    platform: bigquery
    env: PROD
  # Detailed mapping - useful for platforms that need more context
  postgres_analytics:
    platform: postgres
    platform_instance: analytics-db
    database: analytics
    schema: public
    env: PROD
  # Snowflake example
  snowflake_warehouse:
    platform: snowflake
    platform_instance: my-snowflake
    env: PROD
The connector uses this mapping to construct proper DataHub URNs for source tables, enabling end-to-end lineage from dashboards down to database tables.
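To make the URN construction concrete, here is a rough sketch of how a mapping entry could translate into a DataHub dataset URN. The URN layout follows DataHub's standard dataset URN format; the helper names `dataset_urn` and `urn_for_table`, and the way `database` and `schema` are joined into the table name, are assumptions for illustration and may differ from what the connector actually does.

```python
from typing import Dict, Optional


def dataset_urn(platform: str, name: str, env: str = "PROD",
                platform_instance: Optional[str] = None) -> str:
    """Build a dataset URN in DataHub's standard format."""
    if platform_instance:
        # DataHub prefixes the dataset name with the platform instance.
        name = f"{platform_instance}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def urn_for_table(mapping: Dict[str, str], table: str) -> str:
    """Assumed behaviour: qualify the table with database and schema if given."""
    parts = [mapping.get("database"), mapping.get("schema"), table]
    qualified = ".".join(p for p in parts if p)
    return dataset_urn(
        mapping["platform"],
        qualified,
        mapping.get("env", "PROD"),
        mapping.get("platform_instance"),
    )


postgres_analytics = {
    "platform": "postgres",
    "platform_instance": "analytics-db",
    "database": "analytics",
    "schema": "public",
    "env": "PROD",
}
print(urn_for_table(postgres_analytics, "orders"))
# urn:li:dataset:(urn:li:dataPlatform:postgres,analytics-db.analytics.public.orders,PROD)
```

If the URN produced for a source table does not match the URN that your warehouse connector (for example the Postgres or BigQuery source) emits for the same table, lineage will point at a separate, empty dataset entity instead of the real one.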
Feature flags
You can control what metadata gets extracted:
source:
  type: holistics
  config:
    base_folder: /path/to/project
    # Feature flags (the extract_* flags default to true)
    extract_owners: true          # Extract owner information from AML
    extract_lineage: true         # Build lineage relationships
    extract_descriptions: true    # Include descriptions from AML
    include_hidden_fields: false  # Set to true to include fields marked as hidden
    # Platform identification
    platform_instance: production-holistics
    env: PROD
    connection_to_platform_map:
      # ... your mappings
Filtering
Use regex patterns to control which entities get ingested:
source:
  type: holistics
  config:
    base_folder: /path/to/project
    # Only ingest specific entities
    model_pattern:
      allow:
        - ".*"
      deny:
        - "tmp_.*"   # Skip temporary models
        - "test_.*"  # Skip test models
    dataset_pattern:
      allow:
        - ".*"
    dashboard_pattern:
      allow:
        - ".*"
    connection_to_platform_map:
      # ... your mappings
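The sketch below shows how allow/deny patterns like these are typically evaluated. It assumes the connector follows the semantics of DataHub's standard allow/deny pattern handling: deny rules win over allow rules, patterns are matched from the start of the name, and matching is case-insensitive by default. The function name `is_allowed` and the model names are made up for illustration.

```python
import re
from typing import Iterable


def is_allowed(name: str,
               allow: Iterable[str] = (".*",),
               deny: Iterable[str] = ()) -> bool:
    """Return True if `name` passes the allow/deny filter."""
    # A matching deny pattern always excludes the entity.
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    # Otherwise at least one allow pattern must match.
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)


models = ["orders", "tmp_orders", "test_users", "users_tmp_backup"]
kept = [m for m in models if is_allowed(m, deny=["tmp_.*", "test_.*"])]
print(kept)  # ['orders', 'users_tmp_backup']
```

Note that `users_tmp_backup` is kept: because the match is anchored at the start of the name, `tmp_.*` only excludes names that begin with `tmp_`. Use a pattern such as `.*tmp_.*` to exclude names containing it anywhere.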
Stateful ingestion
Enable stateful ingestion so that entities deleted from your AML project are automatically detected as stale and removed from DataHub:
source:
  type: holistics
  config:
    base_folder: /path/to/project
    stateful_ingestion:
      enabled: true
    connection_to_platform_map:
      # ... your mappings
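Conceptually, stale-entity detection compares the entities seen in the previous run with those seen in the current run. The following sketch illustrates that idea only; it is not DataHub's implementation, and the function name `stale_urns` and the example URNs are hypothetical.

```python
from typing import Iterable, Set


def stale_urns(previous_run: Iterable[str], current_run: Iterable[str]) -> Set[str]:
    """Entities emitted last time but not this time are considered stale."""
    return set(previous_run) - set(current_run)


# Illustrative URNs; the exact URNs emitted by the connector may differ.
previous = {
    "urn:li:dashboard:(holistics,sales_overview)",
    "urn:li:dashboard:(holistics,old_marketing_report)",
}
current = {
    "urn:li:dashboard:(holistics,sales_overview)",
}
print(stale_urns(previous, current))
# {'urn:li:dashboard:(holistics,old_marketing_report)'}
```

Because the comparison depends on state saved from the previous run, DataHub's stateful ingestion generally also expects a stable top-level `pipeline_name` in the recipe so that successive runs can be matched to each other.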
Running the ingestion
Save your recipe to a file (e.g., holistics_recipe.yaml) and run:
datahub ingest -c holistics_recipe.yaml
The connector will output progress information showing how many models, datasets, dashboards, and charts were processed.
Internally, the connector calls the Holistics CLI and reconstructs DataHub entities from AML-native graph nodes and edges.
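If you prefer to trigger the ingestion from Python (for example from a scheduler), DataHub's `Pipeline` class accepts the same recipe as a dictionary. The sketch below mirrors the basic local-directory recipe; `build_recipe` and `run_ingestion` are hypothetical helper names, and it assumes both `acryl-datahub` and the Holistics connector are installed in the same environment.

```python
from typing import Any, Dict


def build_recipe(base_folder: str,
                 server: str = "http://localhost:8080") -> Dict[str, Any]:
    """Build the recipe as a dict, equivalent to the YAML recipe file."""
    return {
        "source": {
            "type": "holistics",
            "config": {
                "base_folder": base_folder,
                "connection_to_platform_map": {
                    "bigquery_prod": {"platform": "bigquery", "env": "PROD"},
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def run_ingestion(recipe: Dict[str, Any]) -> None:
    """Run the recipe through DataHub's ingestion pipeline."""
    # Imported lazily so build_recipe can be used without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()  # raise if the run reported failures


# Example usage (requires a reachable DataHub instance):
# run_ingestion(build_recipe("/path/to/your/holistics-aml-project"))
```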
Verification
After the ingestion completes:
- Check DataHub UI - Navigate to your DataHub instance and search for "holistics". You should see your models, datasets, and dashboards.
- Verify lineage - Open a dashboard and check the Lineage tab. You should see connections to charts, which connect to models, which connect to source tables.
  Note: The canonical AML graph may contain additional concepts such as filter blocks or other non-viz dashboard blocks. These are preserved in the CLI output but are not currently emitted as DataHub chart entities.
- Check schema - Open a model and look at the Schema tab. Dimensions and measures should appear as fields with appropriate tags.
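The lineage chain described in the lineage check (dashboard to charts to models to source tables) can be pictured as a walk over upstream edges. The sketch below uses a small hand-written edge map to show what a complete chain looks like; the entity names and the `upstream_closure` helper are hypothetical and do not query DataHub.

```python
from typing import Dict, List, Set

# Hypothetical upstream edges: each entity maps to what it reads from.
UPSTREAMS: Dict[str, List[str]] = {
    "dashboard:sales_overview": ["chart:revenue_by_month"],
    "chart:revenue_by_month": ["model:orders"],
    "model:orders": ["table:analytics.public.orders"],
}


def upstream_closure(entity: str, upstreams: Dict[str, List[str]]) -> Set[str]:
    """Collect every entity reachable by following upstream edges."""
    seen: Set[str] = set()
    stack = [entity]
    while stack:
        node = stack.pop()
        for parent in upstreams.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


reachable = upstream_closure("dashboard:sales_overview", UPSTREAMS)
# A healthy chain ends at a source table.
print(any(name.startswith("table:") for name in reachable))  # True
```

If the walk from a dashboard stops at a model and never reaches a table, the usual cause is a missing or mismatched entry in connection_to_platform_map.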
Troubleshooting
CLI not found: Ensure the Holistics CLI is installed and in your PATH. Test by running holistics --version.
Git clone fails: For private repositories, make sure your deploy key has read access and the path in deploy_key_file is correct.
No lineage to source tables: Verify your connection_to_platform_map entries match the data_source_name values in your AML models.
Entities missing: Check the ingestion report for filtered or errored entities. Adjust your *_pattern settings if needed.
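For the "No lineage to source tables" case, a quick way to spot mismatches is to compare the data source names used in your AML models against the keys of connection_to_platform_map. This is a minimal sketch that assumes you have collected the data_source_name values yourself; the helper name `unmapped_data_sources` and the sample names are hypothetical.

```python
from typing import Dict, Iterable, List


def unmapped_data_sources(data_source_names: Iterable[str],
                          connection_to_platform_map: Dict[str, dict]) -> List[str]:
    """Return data source names that have no entry in the mapping."""
    return sorted(set(data_source_names) - set(connection_to_platform_map))


# Names as they appear in your AML models (collected by hand for this example).
aml_data_sources = ["bigquery_prod", "postgres_analytics", "snowflake_wh"]
mapping = {
    "bigquery_prod": {"platform": "bigquery", "env": "PROD"},
    "postgres_analytics": {"platform": "postgres", "env": "PROD"},
    "snowflake_warehouse": {"platform": "snowflake", "env": "PROD"},
}
print(unmapped_data_sources(aml_data_sources, mapping))  # ['snowflake_wh']
```

Here `snowflake_wh` is reported because the mapping key is `snowflake_warehouse`; names must match exactly.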