Turn backup data into AI-ready datasets

Cohesity Gaia Catalog lets you discover, curate, and publish time-series backup data directly into your AI and analytics platforms — through a secure, read-only S3-compatible endpoint — across cloud and fully on-prem environments.

 

Integrations with Databricks and Microsoft Fabric are coming soon, with more on the way.

Register for Gaia Catalog Early Access


Spotlight integration: Databricks

Stop duplicating petabytes into Databricks. Query only what you actually need. 

Eliminate the full-pipeline duplication problem

Use Gaia Catalog to curate and expose only the data relevant for your AI/ML use case – without triggering a full downstream copy of your entire data estate. 

Cut the pipeline rebuild cycle

Every ingestion pipeline your team builds adds maintenance overhead, governance reconfiguration, and weeks of engineering time before data is usable. Gaia Catalog eliminates this cycle – curated datasets are approved, registered, and queryable in Databricks without rebuilding your access controls or classification from scratch. 

Access governed enterprise data natively in Databricks

Once a dataset is curated and approved, it appears inside Databricks as an external data source readable via an S3-compatible endpoint. Your team queries it directly; the data stays in Cohesity. No permission rebuilds. No migration project. No pipeline engineering required.

Governance that doesn't break when data crosses platforms

RBAC, immutability, and auditability are inherited from the Cohesity Data Cloud — not rebuilt after exposure. Sensitive data is identified and tagged before the dataset is ever shared downstream. Every access through the endpoint is authenticated, logged, and policy-enforced. 

How it works

From backup data to an AI-ready dataset in three steps

Discover and curate

Search protected backups using attributes like file type, path, ownership, permissions, and time range. Build governed datasets across historical versions. 
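
For illustration only, the attribute-based search described above can be pictured as a filter over file-version records. This is a minimal sketch: the `BackupFile` record, its fields, and the `curate` helper are hypothetical names invented for this example and are not part of any Cohesity API.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import PurePosixPath


@dataclass(frozen=True)
class BackupFile:
    # Hypothetical record for one file version held in a protected backup.
    path: str
    owner: str
    modified: datetime


def curate(files, suffix, path_prefix, owner, start, end):
    """Return the file versions that match every attribute filter."""
    return [
        f for f in files
        if PurePosixPath(f.path).suffix == suffix   # file type
        and f.path.startswith(path_prefix)          # path
        and f.owner == owner                        # ownership
        and start <= f.modified < end               # time range (end exclusive)
    ]


files = [
    BackupFile("/finance/q1/report.pdf", "alice", datetime(2024, 3, 31)),
    BackupFile("/finance/q2/report.pdf", "alice", datetime(2024, 6, 30)),
    BackupFile("/finance/q2/notes.txt", "alice", datetime(2024, 6, 30)),
    BackupFile("/hr/q2/report.pdf", "bob", datetime(2024, 6, 30)),
]

dataset = curate(
    files,
    suffix=".pdf",
    path_prefix="/finance/",
    owner="alice",
    start=datetime(2024, 1, 1),
    end=datetime(2025, 1, 1),
)
```

Here `dataset` holds the two finance PDFs owned by alice, one per quarter, which is the sense in which a dataset can span historical versions.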

Enrich and classify

Apply intelligent classification models to tag and contextualize unstructured data – helping identify high-value datasets for AI and analytics use cases. 

Publish without duplicating data

Expose curated datasets through a secure, read-only S3-compatible endpoint directly on top of protected data. No duplication. No new ETL pipelines. No permission rebuilds. 

Overview

Governance that travels with your data 

Role-based access controls, inherited from the source

Role-based access controls (RBAC) are carried forward from the Cohesity Data Cloud into every exposed dataset. When a dataset is registered and approved for access, permissions travel with it – no manual rebuild, no policy reconfiguration required on the receiving platform. 

Data integrity

Datasets are read from immutable backup data – the underlying files cannot be altered through the Gaia Catalog layer. Every access through the S3-compatible endpoint is authenticated and logged, giving your security and compliance teams a full, auditable trail of who accessed what and when. 
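
As a rough picture of what "read-only, authenticated, and logged" means in practice, the sketch below gates requests by operation name and records every attempt. The operation names are standard S3 API actions; the `authorize` function, the `audit_log` list, and the set of authenticated users are assumptions made for this example, not the product's actual enforcement mechanism.

```python
from datetime import datetime, timezone

# Standard S3 read operations; anything else is treated as a write and refused.
READ_ONLY_OPERATIONS = frozenset({"GetObject", "HeadObject", "ListObjectsV2"})

# Hypothetical in-memory audit trail: who accessed what, and when.
audit_log = []


def authorize(user, operation, key, authenticated_users):
    """Allow only authenticated read operations, and log every attempt."""
    allowed = user in authenticated_users and operation in READ_ONLY_OPERATIONS
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "operation": operation,
        "key": key,
        "allowed": allowed,
    })
    return allowed


users = {"alice"}
read_ok = authorize("alice", "GetObject", "finance/q1/report.pdf", users)
write_ok = authorize("alice", "PutObject", "finance/q1/report.pdf", users)
stranger_ok = authorize("mallory", "GetObject", "finance/q1/report.pdf", users)
```

Denied attempts are logged alongside successful ones, so the trail covers refused writes and unauthenticated reads as well as normal access.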

Sensitive data protection

Gaia Catalog applies Data Security Posture Management (DSPM) scanning during the enrichment step – before any dataset is published downstream. Sensitive data is identified, tagged, and flagged at the source. Your analytics platform receives a dataset that’s already been assessed, not one that needs to be re-scanned after it arrives. 
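
To make "tagged before publication" concrete, here is a deliberately simple stand-in that tags documents with regular expressions. Real DSPM scanning is far more sophisticated; the patterns, the `tag_sensitive` and `publishable` helpers, and the sample documents are all assumptions for this sketch and do not describe how Gaia Catalog classifies data.

```python
import re

# Hypothetical patterns: a basic email matcher and a US SSN-shaped number.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def tag_sensitive(text):
    """Return the sorted list of sensitive-data tags found in the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


def publishable(documents):
    """Map each document name to its tags before anything is published."""
    return {name: tag_sensitive(body) for name, body in documents.items()}


tags = publishable({
    "memo.txt": "Contact jane.doe@example.com about SSN 123-45-6789.",
    "readme.txt": "Quarterly totals are attached.",
})
```

The downstream platform would then receive the dataset together with `tags`, rather than having to scan the content again on arrival.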

Activate the data you already protect

70-90% of enterprise data is unstructured data
100% of it already exists as protected backup data
0 new ETL pipelines