HubSpot CDC Ingestion Engine
for Lakehouse Architectures
Extract, track, and deliver HubSpot data in a history-aware, schema-evolving, lakehouse-compatible format — without orchestrators, streaming frameworks, or vendor lock-in.
By Lexidy
Your CRM produces mission-critical data every second. But when you try to query it alongside product, finance, or support data, you hit a wall.
HubSpot exposes REST + Search APIs, not a change data log. No native streaming, no transaction log — you are polling timestamps, not events.
5,788 properties across 25 object schemas. Contacts alone: 1,746 properties, 1,358 custom. Schema is fluid — fields appear, vanish, and change types without notice.
No built-in time travel, no history tracking, no export format compatible with Iceberg, Trino, or BigQuery. HubSpot is an operational island.
Search API caps at 10K results per query. 129K contacts means you need pagination, windowing, and cursor safety — or you lose data silently.
⏩ Organizations need reliable incremental ingestion, historical correctness, schema evolution, and lakehouse-native output.
Nexus bridges HubSpot's operational API to your lakehouse by emulating Change Data Capture through timestamp-based polling — no Kafka, no Airflow, no vendor lock-in.
Resumable pagination via lastmodifieddate. Batch-level recovery — interrupt and resume; zero data loss.
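As a rough illustration of resumable, timestamp-based polling, the sketch below pages through records ordered by a compound cursor and hands back the cursor after every batch, so a run can be interrupted and restarted from the last persisted value. The `(lastmodifieddate, id)` compound cursor, `make_stub_source`, and `poll_changes` are assumptions made for this example; they are not taken from Nexus, and the stub stands in for a real search call.

```python
from typing import Callable, Iterator

Cursor = tuple[int, str]  # (lastmodifieddate in epoch ms, object id); an assumed cursor shape
Record = dict


def make_stub_source(records: list[Record]) -> Callable[[Cursor, int], list[Record]]:
    """Hypothetical stand-in for a search call that filters and sorts by the cursor fields."""
    ordered = sorted(records, key=lambda r: (r["lastmodifieddate"], r["id"]))

    def fetch_page(cursor: Cursor, limit: int) -> list[Record]:
        # Return only records strictly after the cursor, oldest first.
        newer = [r for r in ordered if (r["lastmodifieddate"], r["id"]) > cursor]
        return newer[:limit]

    return fetch_page


def poll_changes(
    fetch_page: Callable[[Cursor, int], list[Record]],
    cursor: Cursor,
    page_size: int = 100,
) -> Iterator[tuple[list[Record], Cursor]]:
    """Yield (batch, next_cursor) pairs until nothing newer than the cursor remains."""
    while True:
        batch = fetch_page(cursor, page_size)
        if not batch:
            return
        last = batch[-1]
        cursor = (last["lastmodifieddate"], last["id"])
        yield batch, cursor


# Usage: two records share each timestamp, which a timestamp-only cursor could skip.
records = [{"id": str(i), "lastmodifieddate": 1000 + i // 2} for i in range(5)]
fetch = make_stub_source(records)
for batch, next_cursor in poll_changes(fetch, (0, ""), page_size=2):
    print([r["id"] for r in batch], next_cursor)
```

Including the object id in the cursor is one way to avoid skipping records that share a timestamp across a page boundary; a timestamp-only cursor would need an inclusive filter plus downstream deduplication instead.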
Append-only JSONL + Parquet with per-record _nexus_synced_at. Point-in-time queries, change diffs, or latest-state snapshots with jq.
Deterministic schema hashing. Immutable schema versions. Separate metadata stream under _metadata/schemas/ — schema and data evolve independently.
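One plausible way to make a schema hash deterministic is to canonicalize the property list before hashing it. The sketch below assumes a schema is a list of `{"name", "type"}` entries and uses SHA-256 over sorted, compact JSON; the field selection and the choice of hash are assumptions, since the deck does not specify either.

```python
import hashlib
import json


def schema_hash(properties: list[dict]) -> str:
    """Hash only the fields that define the schema, in a canonical order."""
    canonical = sorted(
        ({"name": p["name"], "type": p["type"]} for p in properties),
        key=lambda p: p["name"],
    )
    # sort_keys and fixed separators keep the serialized bytes stable across runs.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


v1 = [{"name": "email", "type": "string"}, {"name": "amount", "type": "number"}]
v1_reordered = list(reversed(v1))
v2 = [{"name": "email", "type": "string"}, {"name": "amount", "type": "string"}]

print(schema_hash(v1) == schema_hash(v1_reordered))  # same properties, different order
print(schema_hash(v1) == schema_hash(v2))            # a type changed
```

With a hash like this, a new immutable schema version only needs to be written when the hash differs from the most recent one.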
Two registries, one protocol, zero orchestrator dependency. Every component is independently testable, swappable, and debuggable.
Key philosophy: Each phase is testable, deployable, and independently valuable. No big-bang. No premature optimization.
168 passing tests, 3 skipped, zero failures. Every edge case observed in real HubSpot environments has been modeled, tested, and fixed.
Batch hash tracking in state backend suppresses duplicates on re-run. State advances only after durable write success.
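The ordering described above (write first, then record the batch hash and advance the cursor) can be sketched as follows. `InMemoryState`, `batch_hash`, and `process_batch` are hypothetical names for this example; a real state backend would persist the hash and cursor in a single transaction.

```python
import hashlib
import json
from typing import Callable


class InMemoryState:
    """Hypothetical stand-in for a durable state backend."""

    def __init__(self) -> None:
        self.cursor = None
        self.seen_batches: set[str] = set()


def batch_hash(batch: list[dict]) -> str:
    payload = json.dumps(batch, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def process_batch(
    batch: list[dict],
    next_cursor,
    state: InMemoryState,
    write: Callable[[list[dict]], None],
) -> bool:
    """Write the batch unless it was already written; return True if a write happened."""
    digest = batch_hash(batch)
    if digest in state.seen_batches:
        state.cursor = next_cursor  # already durable, so it is safe to move on
        return False
    write(batch)  # if this raises, neither the hash nor the cursor is recorded
    state.seen_batches.add(digest)
    state.cursor = next_cursor
    return True


state = InMemoryState()
sink: list[dict] = []
batch = [{"id": "1", "email": "a@example.com"}]
print(process_batch(batch, (1000, "1"), state, sink.extend))  # first run writes
print(process_batch(batch, (1000, "1"), state, sink.extend))  # re-run is suppressed
```

A failed write leaves the state untouched, so the next run fetches and retries the same batch.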
Transient error classification (timeouts, 429, 5xx) with jitter. Honors Retry-After headers with bounded fallback.
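A minimal sketch of that retry policy, under stated assumptions: a missing status code represents a timeout, `Retry-After` is honored when it carries a number of seconds, and anything else falls back to capped exponential backoff with full jitter. The function names, the 60-second cap, and the backoff formula are illustrative choices, not the values Nexus uses.

```python
import math
import random
from typing import Callable, Optional


def is_transient(status: Optional[int]) -> bool:
    """Treat timeouts (no status), 429, and 5xx as retryable; everything else is final."""
    return status is None or status == 429 or 500 <= status <= 599


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            seconds = float(retry_after)
        except ValueError:
            seconds = math.nan  # HTTP-date form is not parsed in this sketch
        if math.isfinite(seconds):
            return min(max(seconds, 0.0), cap)  # honor the header, but stay bounded
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling  # full jitter: uniform in [0, ceiling)


print(is_transient(503), is_transient(404))
print(retry_delay(0, retry_after="5"))
print(retry_delay(3) <= 8.0)
```

Injecting `rng` keeps the jitter testable: passing a fixed function makes the delay deterministic.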
Atomic read/update, auto-migration via Alembic at startup. Per-object cursor property override map for nonstandard schemas.
10K-result ceiling raises a clear error instead of silently advancing state past unprocessed records. Recovery is a re-run with a smaller window.
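The fail-loud behaviour can be sketched as a guard that runs before any state is advanced. `SearchWindowOverflow`, `ensure_window_fits`, and `halve_window` are hypothetical names, and the halving helper is only one way to pick the smaller window for the re-run.

```python
SEARCH_RESULT_CEILING = 10_000  # per-query result cap described above


class SearchWindowOverflow(RuntimeError):
    """Raised when a time window matches more records than a single query can page through."""


def ensure_window_fits(total: int, start_ms: int, end_ms: int) -> None:
    """Refuse to continue if the window cannot be fully paginated."""
    if total > SEARCH_RESULT_CEILING:
        raise SearchWindowOverflow(
            f"window [{start_ms}, {end_ms}) matches {total} records, "
            f"above the {SEARCH_RESULT_CEILING}-result ceiling; re-run with a smaller window"
        )


def halve_window(start_ms: int, end_ms: int) -> list[tuple[int, int]]:
    """Split a half-open window into two adjacent halves."""
    mid = start_ms + (end_ms - start_ms) // 2
    return [(start_ms, mid), (mid, end_ms)]


try:
    ensure_window_fits(total=12_500, start_ms=0, end_ms=86_400_000)
except SearchWindowOverflow as exc:
    print(exc)
    print(halve_window(0, 86_400_000))
```

Raising before the cursor moves is what keeps the failure recoverable: the state still points at the start of the oversized window.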
Production HubSpot has 5,788 total properties. Wide Parquet with all-contact-properties creates catastrophic metadata overhead. Nexus solves this with sparse physical modeling.

Stable schema, small files, fast writes. Expansion happens downstream in Silver.
Schema and data evolve independently. Iceberg handles physical evolution; Nexus tracks semantic evolution.
💡 Design insight: Schema metadata is source-system semantic metadata — separate from Iceberg physical table metadata. Nexus tracks what properties exist and their types; Iceberg manages how they are stored on disk.
Bronze remains immutable and append-only. Silver is a separate downstream transform that reads Bronze snapshots + schema metadata — and produces queryable, Iceberg-compatible tables.
Explode properties_json into typed columns using schema metadata. Decode enums, cast dates, materialize nulls.
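To make the Bronze-to-Silver step concrete, the sketch below expands a `properties_json` string into typed columns driven by a schema mapping. The schema layout, the type names (`number`, `bool`, `datetime`, `enumeration`), the `object_id` key, and the assumption that datetimes arrive as ISO 8601 strings are all illustrative; only `properties_json` comes from the deck.

```python
import json
from datetime import datetime


def cast_value(raw, spec: dict):
    """Cast one raw string value according to its schema entry; empty values become None."""
    if raw is None or raw == "":
        return None
    kind = spec["type"]
    if kind == "number":
        return float(raw)
    if kind == "bool":
        return raw == "true"
    if kind == "datetime":
        # fromisoformat in older Python versions does not accept a trailing "Z".
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if kind == "enumeration":
        return spec.get("options", {}).get(raw, raw)  # decode to label when known
    return raw


def explode(row: dict, schema: dict) -> dict:
    """One typed column per schema property; properties missing from the row become None."""
    props = json.loads(row["properties_json"])
    typed = {"object_id": row["object_id"]}
    for name, spec in schema.items():
        typed[name] = cast_value(props.get(name), spec)
    return typed


schema = {
    "amount": {"type": "number"},
    "closedate": {"type": "datetime"},
    "dealstage": {"type": "enumeration", "options": {"closedwon": "Closed Won"}},
    "notes": {"type": "string"},
}
row = {
    "object_id": "42",
    "properties_json": json.dumps(
        {"amount": "1500.50", "closedate": "2024-01-02T03:04:05Z", "dealstage": "closedwon"}
    ),
}
print(explode(row, schema))
```

Iterating over the schema rather than the row is what materializes nulls: every known property gets a column even when the record never set it.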
Source + object_type + object_id dedup with timestamp tiebreakers. Re-running the same Bronze range never duplicates rows.
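The dedup rule can be sketched as a keyed reduction where the newest timestamps win. The key fields follow the text above; using `lastmodifieddate` first and `_nexus_synced_at` second as the tiebreaker order is an assumption for this example.

```python
def _rank(row: dict) -> tuple:
    """Ordering used to pick a winner among rows with the same key (assumed order)."""
    return (row["lastmodifieddate"], row["_nexus_synced_at"])


def dedupe(rows: list[dict]) -> list[dict]:
    """Keep one row per (source, object_type, object_id), preferring the highest rank."""
    best: dict[tuple, dict] = {}
    for row in rows:
        key = (row["source"], row["object_type"], row["object_id"])
        current = best.get(key)
        if current is None or _rank(row) > _rank(current):
            best[key] = row
    # Sort by key so the output does not depend on input order.
    return [best[key] for key in sorted(best)]


rows = [
    {"source": "hubspot", "object_type": "contacts", "object_id": "1",
     "lastmodifieddate": 1000, "_nexus_synced_at": "2024-01-01T00:00:00Z", "email": "old"},
    {"source": "hubspot", "object_type": "contacts", "object_id": "1",
     "lastmodifieddate": 2000, "_nexus_synced_at": "2024-01-02T00:00:00Z", "email": "new"},
    {"source": "hubspot", "object_type": "deals", "object_id": "1",
     "lastmodifieddate": 1500, "_nexus_synced_at": "2024-01-01T00:00:00Z", "email": None},
]

print(dedupe(rows) == dedupe(rows + rows))  # replaying the same range changes nothing
```

Because the result depends only on the set of rows and not on how many times each appears, re-processing a Bronze range is idempotent.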
IcebergSilverWriter behind optional pyiceberg dependency. ACID tables, hidden partitioning, schema evolution, time travel.
Validated against production HubSpot accounts. These are not synthetic benchmarks — these are real-world observations that shaped every design decision.
Contacts use lastmodifieddate; companies/deals use hs_lastmodifieddate. Configurable override map for custom objects.
Nexus is not a HubSpot-only tool. Phase 5 introduced pluggable source and destination registries, so any data source can become a first-class extractor.
Implement SourceProtocol → register it → nexus extract salesforce. That's it. No core changes needed.
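The deck does not show what `SourceProtocol` requires, so the following is only a guess at the shape of such a plug-in mechanism: a structural protocol with a single hypothetical `extract` method, a name-to-factory registry, and a dispatcher that a CLI command could call. `SOURCES`, `register_source`, `run_extract`, and this `StubSource` body are all invented for illustration.

```python
from typing import Callable, Iterator, Protocol


class SourceProtocol(Protocol):
    """Assumed contract: a source yields records for a given object type."""

    def extract(self, object_type: str) -> Iterator[dict]: ...


# Hypothetical registry mapping a source name to a zero-argument factory.
SOURCES: dict[str, Callable[[], SourceProtocol]] = {}


def register_source(name: str):
    """Class decorator that adds a source factory to the registry."""

    def decorator(factory):
        SOURCES[name] = factory
        return factory

    return decorator


@register_source("stub")
class StubSource:
    """Minimal source used for testing and dry runs in this sketch."""

    def extract(self, object_type: str) -> Iterator[dict]:
        yield {"object_type": object_type, "id": "1"}


def run_extract(source_name: str, object_type: str) -> list[dict]:
    """What a command such as `nexus extract <source>` might dispatch to."""
    if source_name not in SOURCES:
        raise KeyError(f"unknown source: {source_name}")
    return list(SOURCES[source_name]().extract(object_type))


print(run_extract("stub", "contacts"))
```

Since the protocol is structural, a new source does not need to inherit from anything in the core; registering a class with a matching `extract` method is enough.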
Built-in examples: HubSpotClient, StubSource (testing, CSV/file input, dry-run).
Per-backend flags are scoped and validated. Add a new backend by implementing StorageBackend — no extraction code changes.
Retry, backoff, and metrics apply uniformly across all backends.
The result: Nexus becomes your organization's ingestion control plane — not just a HubSpot tool. One protocol, any source, any storage.
Seven phases defined. Six complete. Phase 6 is in active development. Here's where we're taking the platform.
nexus silver sync CLI with full/incremental modes
nexus silver history for per-object version tracking
Multi-source ingestion (Salesforce, SQL, REST APIs), event-triggered extraction, webhooks, and full Lakehouse-native observability.
Every phase is independently valuable. No big-bang. No premature optimization.
lexidy.com