01 / 10 Data Infrastructure · Lakehouse Engineering
Lexidy

Nexus

HubSpot CDC Ingestion Engine
for Lakehouse Architectures

Extract, track, and deliver HubSpot data in a history-aware, schema-evolving, lakehouse-compatible format — without orchestrators, streaming frameworks, or vendor lock-in.

pip install . | 190+ tests · 7 phases · production-hardened

By Lexidy

02 / 10
The Data Ingestion Gap

HubSpot is API-based, schema-flexible, and operational —
not analytics-ready.

Your CRM produces mission-critical data every second. But when you try to query it alongside product, finance, or support data, you hit a wall.

🔌

API-Based, Not CDC

HubSpot exposes REST + Search APIs, not a change data log. No native streaming, no transaction log — you are polling timestamps, not events.

🧩

Schema-Flexible (Chaotic)

5,788 properties across 25 object schemas. Contacts alone: 1,746 properties, 1,358 custom. Schema is fluid — fields appear, vanish, and change types without notice.

📉

Analytics Blind Spot

No built-in time travel, no history tracking, no export format compatible with Iceberg, Trino, or BigQuery. HubSpot is an operational island.

Scale Ceilings

Search API caps at 10K results per query. 129K contacts means you need pagination, windowing, and cursor safety — or you lose data silently.

⏩ Organizations need reliable incremental ingestion, historical correctness, schema evolution, and lakehouse-native output.

03 / 10
The Product

CDC Emulation, Not ETL.
Bronze-First Extraction.

Nexus bridges HubSpot's operational API to your lakehouse by emulating Change Data Capture through timestamp-based polling — no Kafka, no Airflow, no vendor lock-in.

HubSpot
REST + Search API
Nexus Engine
CDC · Schema · State
Bronze Layer
Parquet · JSONL
Silver / Iceberg
Queryable Tables
🕒

Incremental Sync

Resumable pagination via lastmodifieddate. Batch-level recovery — interrupt and resume; zero data loss.

📜

Time Travel Built In

Append-only JSONL + Parquet with per-record _nexus_synced_at. Point-in-time queries, change diffs, or latest-state snapshots with jq.

🧬

Schema Metadata CDC

Deterministic schema hashing. Immutable schema versions. Separate metadata stream under _metadata/schemas/ — schema and data evolve independently.

04 / 10
Architecture

Minimal by Design.
Pluggable by Convention.

Two registries, one protocol, zero orchestrator dependency. Every component is independently testable, swappable, and debuggable.

# SOURCE_REGISTRY — pluggable extractors
class SourceProtocol(Protocol):
    def fetch(...) → AsyncIterator[Record]
    def get_state(...) → State

# Registered sources
SOURCE_REGISTRY = {"hubspot", "stub"}
# DESTINATION_REGISTRY — pluggable storage
build_destination("local") → LocalStorage
build_destination("s3") → S3Storage
build_destination("gcs") → GCSStorage
# State backends
"file" → nexus.state.json
"postgres" → psql + Alembic migrations

# Output formats
"json" → append-only JSONL
"parquet" → narrow/wide Bronze layout

# Bronze layout
├── snapshots/dt=YYYY-MM-DD/*.parquet
├── property_events/dt=YYYY-MM-DD/*
└── _metadata/schemas/{type}/
    ├── current.json
    ├── index.jsonl
    └── versions/{hash}.json

Key philosophy: Each phase is testable, deployable, and independently valuable. No big-bang. No premature optimization.

05 / 10
Production Hardening

Battle-Tested for Real HubSpot
at Production Scale.

168 passing tests, 3 skipped, zero failures. Every edge case observed in real HubSpot environments has been modeled, tested, and fixed.

190+
Tests
5.8K
Properties Managed
129K
First-Sync Contacts
25
Object Schemas

🛡 Idempotent Ingestion

Batch hash tracking in state backend suppresses duplicates on re-run. State advances only after durable write success.

⏱ Bounded Exponential Backoff

Transient error classification (timeouts, 429, 5xx) with jitter. Honors Retry-After headers with bounded fallback.

🗄 PostgreSQL State Backend

Atomic read/update, auto-migration via Alembic at startup. Per-object cursor property override map for nonstandard schemas.

🔍 Search API Cap Detection

10K-result ceiling raises a clear error instead of silently advancing state past unprocessed records. Recovery is a re-run with a smaller window.

06 / 10
Physical Design

Narrow Bronze Layout · Schema Metadata CDC

Production HubSpot has 5,788 total properties. Wide Parquet with all-contact-properties creates catastrophic metadata overhead. Nexus solves this with sparse physical modeling.

📦 Narrow Bronze Snapshot

# Stable envelope columns
hs_object_id → string (non-null)
_nexus_synced_at → timestamp
_nexus_batch_id → string
_nexus_object_type → string

# Sparse payload — one JSON blob
properties_json → string (JSON)

# Schema pointer
_nexus_schema_hash → string

Stable schema, small files, fast writes. Expansion happens downstream in Silver.

📋 Schema Metadata CDC

# Separate from data path
nexus metadata sync \   --all-object-types \   --dest s3 \   --output-base-path output
Deterministic hash
Same schema → same hash every time
Immutable versions
Write-once, never mutated
current.json
Latest known schema for fast lookup

Schema and data evolve independently. Iceberg handles physical evolution; Nexus tracks semantic evolution.

💡 Design insight: Schema metadata is source-system semantic metadata — separate from Iceberg physical table metadata. Nexus tracks what properties exist and their types; Iceberg manages how they are stored on disk.

07 / 10
Phase 6 · In Progress

The Silver Lakehouse Layer
Bronze → Silver Contract

Bronze remains immutable and append-only. Silver is a separate downstream transform that reads Bronze snapshots + schema metadata — and produces queryable, Iceberg-compatible tables.

Bronze
Immutable · Narrow
Silver Transform
Expand · Type · Dedup
Iceberg / Parquet
ACID · Evolve · Partition
🔬

Expand & Type

Explode properties_json into typed columns using schema metadata. Decode enums, cast dates, materialize nulls.

🪪

Idempotent Dedup

Source + object_type + object_id dedup with timestamp tiebreakers. Re-running the same Bronze range never duplicates rows.

🗂

Iceberg Native

IcebergSilverWriter behind optional pyiceberg dependency. ACID tables, hidden partitioning, schema evolution, time travel.

# CLI — silver sync (in progress)
nexus silver sync \   --source hubspot \   --object-types contacts,companies \   --start-date 2026-05-01 \   --mode incremental \   --writer iceberg
08 / 10
Performance & Scale

Built for Real HubSpot Tenants.

Validated against production HubSpot accounts. These are not synthetic benchmarks — these are real-world observations that shaped every design decision.

129K
Contacts — First Sync via List API
10K
Search API Result Cap — Detected & Guarded
1,746
Properties on a Single Object (Contacts)
3
Storage Backends — Local · S3 · GCS

📐 Physical Design Decisions Driven by Data

Narrow Bronze
Wide Parquet with 1,746 columns creates multi-MB metadata per file. Narrow layout keeps files small and writes fast.
Dual Cursor Strategy
Contacts use lastmodifieddate; companies/deals use hs_lastmodifieddate. Configurable override map for custom objects.
S3 Pagination
Discovered that S3 list_objects_v2 has a 1K-key ceiling. Fixed with paginated listing.

🔌 Integration Targets

Apache Iceberg
Silver layer with IcebergWriter, hidden partitioning, schema evolution
Trino / Presto
Bronze Parquet is query-ready. Trino reads directly.
BigQuery
GCS-backed Parquet can be external-tabled in BigQuery
Spark
Phase 7 plans for end-to-end Iceberg/Spark integration tests
09 / 10
Platform Play

Extensible by Design.
Multi-Source Ready.

Nexus is not a HubSpot-only tool. Phase 5 introduced pluggable source and destination registries — any data source can become a first-class extractor.

🔌 SOURCE_REGISTRY

class SalesforceSource:
    def fetch(self, ctx):
        ...

SOURCE_REGISTRY["salesforce"] = SalesforceSource

Implement SourceProtocol → register it → nexus extract salesforce. That's it. No core changes needed.

Built-in examples: HubSpotClient, StubSource (testing, CSV/file input, dry-run).

💾 DESTINATION_REGISTRY

# One flag switches the entire storage layer
--dest local → LocalStorage
--dest s3 → S3Storage
--dest gcs → GCSStorage

Per-backend flags are scoped and validated. Add a new backend by implementing StorageBackend — no extraction code changes.

Retry, backoff, and metrics apply uniformly across all backends.

The result: Nexus becomes your organization's ingestion control plane — not just a HubSpot tool. One protocol, any source, any storage.

10 / 10
Roadmap

What's Next
for Nexus.

Seven phases defined. Six complete. Phase 6 is in active development. Here's where we're taking the platform.

Phase 1–5 ✅
Core Extractor → Extensibility
HubSpot extraction, Parquet, multi-object, schema CDC, hardening, plugin architecture — all shipped and tested.
Phase 6 🔵 In Progress
Silver Lakehouse Layer
Bronze-to-Silver transform, Iceberg writer, idempotent syncs, backfill, partitioning strategy.
Phase 7 ⚫
Integration Readiness
Docker packaging, Apache Iceberg/Spark end-to-end testing, CI integration suite, production docs.

🎯 Immediate Priorities (Phase 6)

  • Ship nexus silver sync CLI with full/incremental modes
  • Iceberg writer for ACID, hidden partitioning, schema evolution
  • nexus silver history for per-object version tracking
  • Bronze → Iceberg validation with Spark local stack
  • CI-gated integration suite

📈 Beyond Phase 7

Multi-source ingestion (Salesforce, SQL, REST APIs), event-triggered extraction, webhooks, and full Lakehouse-native observability.

Every phase is independently valuable. No big-bang. No premature optimization.

By Lexidy

Lexidy

lexidy.com  ·  By Lexidy  ·  Nexus — HubSpot CDC Ingestion Engine for Lakehouse Architectures