One of the largest real estate investment services firms in the United States struggled with manual data integration
Data Engineering & Cloud-Native Integration
AWS, Dagster, DuckDB, PostgreSQL, S3
The consequences were predictable: agents spent time on data management instead of sales, errors and duplicates crept into the CRM, updates were slow and infrequent, and as data volumes grew, the workflow simply couldn't keep up. Auditing was impossible because there was no automated tracking of where data came from or how it had been transformed.
The architecture was built around four core principles.

First, orchestration with full visibility: we implemented Dagster to manage end-to-end workflow dependencies, scheduling, and metadata capture. Every pipeline run is tracked, every transformation is logged, and the entire data lineage is auditable at any point.

Second, high-performance transformation: DuckDB handles data processing with deduplication and standardization rules that enforce consistency across all three data sources.

Third, scalable infrastructure: the pipeline runs on containerized applications deployed on AWS EKS, designed for resilience and horizontal scalability.

Fourth, decoupled storage and resilient loading: AWS S3 stores data across distinct stages for durability and auditability, while an SQS-driven asynchronous mechanism handles bulk loading into the client's Gemini PostgreSQL database efficiently.
Beyond the immediate gains, the architecture positioned the firm for growth. The AWS infrastructure can handle increased data loads and additional third-party sources without redesign. And for the first time, the firm has full data governance — clear lineage, processing history, and auditability that meets compliance requirements.
