Allows splitting the whole data processing pipeline into separate runs that can be started independently and re-run if needed.
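A minimal sketch of what independently startable, re-runnable pipeline stages might look like; the stage names, the registry, and the command-line launch are illustrative assumptions, not the tool's actual interface.

```python
# Hypothetical sketch: each pipeline stage is registered as a named run that
# can be started (or re-run) on its own, e.g. `python pipeline.py transform`.
import sys

RUNS = {}

def run(name):
    """Register a function as an independently startable pipeline run."""
    def decorator(fn):
        RUNS[name] = fn
        return fn
    return decorator

@run("extract")
def extract():
    print("extracting source files ...")

@run("transform")
def transform():
    print("transforming staged data ...")

@run("load")
def load():
    print("loading results ...")

if __name__ == "__main__":
    RUNS[sys.argv[1]]()  # start only the requested run; re-running is just calling it again
```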
Splits large data volumes into smaller batches that are processed in parallel, and executes multiple data processing tasks (DAG nodes) concurrently.
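A minimal sketch of batch-level parallelism, assuming each batch can be processed independently; the batch size and the placeholder transformation are assumptions for illustration only.

```python
# Hypothetical sketch: split rows into fixed-size batches and process them in
# parallel worker processes.
from concurrent.futures import ProcessPoolExecutor

def split_into_batches(rows, batch_size):
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def process_batch(batch):
    # placeholder per-batch transformation
    return [row.upper() for row in batch]

if __name__ == "__main__":
    rows = [f"record-{i}" for i in range(10_000)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_batch, split_into_batches(rows, 1_000)))
    print(sum(len(r) for r in results), "rows processed")
```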
Allows human validation of data at selected processing stages.
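A minimal sketch of a human validation gate between stages; the prompt-based approval below stands in for whatever review mechanism the tool actually provides, and the sample data is invented.

```python
# Hypothetical sketch: pause after a stage, show a sample of its output, and
# continue only if a human reviewer approves.
def human_validation_gate(stage_name, sample_rows):
    print(f"Stage '{stage_name}' produced {len(sample_rows)} sample rows:")
    for row in sample_rows[:5]:
        print("  ", row)
    answer = input("Approve and continue? [y/N] ")
    if answer.strip().lower() != "y":
        raise RuntimeError(f"Stage '{stage_name}' rejected by reviewer")

if __name__ == "__main__":
    staged = ["alice,42", "bob,17", "carol,99"]
    human_validation_gate("transform", staged)
    print("continuing with downstream stages ...")
```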
Survives transient connectivity issues with the underlying database as well as processing node failures.
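A minimal sketch of one common way to survive transient database failures, retrying with exponential backoff; the exception types, delays, and attempt count are assumptions, not the tool's actual policy.

```python
# Hypothetical sketch: wrap a database call so that transient failures are
# retried with exponential backoff before giving up.
import time

def with_retries(fn, attempts=5, base_delay=1.0,
                 retriable=(ConnectionError, TimeoutError)):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retriable as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_query():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("database temporarily unreachable")
        return "query result"

    print(with_retries(flaky_query))
```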
Consumes and produces delimited text files and uses database tables internally. Provides ETL/ELT capabilities. Implements a subset of the relational algebra.
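A minimal sketch of relational-algebra-style selection and projection over a delimited text input; the inline sample data, the pipe delimiter, and the column names are illustrative assumptions.

```python
# Hypothetical sketch: sigma (selection) and pi (projection) applied to rows
# read from a pipe-delimited source.
import csv
import io

SAMPLE = "id|name|age\n1|alice|42\n2|bob|17\n3|carol|99\n"

def select(rows, predicate):           # sigma: keep rows matching a predicate
    return (row for row in rows if predicate(row))

def project(rows, columns):            # pi: keep only the listed columns
    return ({c: row[c] for c in columns} for row in rows)

if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="|")
    adults = select(reader, lambda r: int(r["age"]) >= 18)
    for row in project(adults, ["id", "name"]):
        print(row)
```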
Processes large amounts of data within SLA time limits, efficiently utilizing computational resources (hardware, VMs, containers) and storage (Cassandra), with or without human monitoring, validation, or intervention.