Allows splitting the entire data processing pipeline into separate runs that can be initiated independently and re-run if needed.
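The source does not show how runs are made independently restartable, but one common way is a per-stage checkpoint file: a completed stage writes its result, and a re-run skips any stage whose checkpoint already exists. A minimal sketch (the `run_stage` helper and checkpoint layout are assumptions, not the system's actual API):

```python
import json
from pathlib import Path

def run_stage(name, func, checkpoint_dir=Path("checkpoints")):
    """Run one pipeline stage; on a re-run, skip it if its checkpoint exists."""
    checkpoint_dir.mkdir(exist_ok=True)
    marker = checkpoint_dir / f"{name}.json"
    if marker.exists():
        # Stage already completed in an earlier run: reuse its result.
        return json.loads(marker.read_text())
    result = func()                        # execute the stage body
    marker.write_text(json.dumps(result))  # persist so later runs can skip it
    return result
```

Because each stage is keyed by name, any stage can be re-run in isolation simply by deleting its checkpoint and invoking `run_stage` again.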
Splits large data volumes into smaller batches processed in parallel. Executes multiple data processing tasks (DAG nodes) simultaneously.
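The batch-splitting and parallel-execution behavior described above can be sketched with the standard library; the helper names and the thread-based executor are illustrative assumptions, not the system's real scheduler:

```python
from concurrent.futures import ThreadPoolExecutor

def split_batches(rows, batch_size):
    """Split a large input into fixed-size batches."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def process_batch(batch):
    """Placeholder transform for one batch (assumed stateless)."""
    return [x * 2 for x in batch]

def run_parallel(rows, batch_size=1000, workers=4):
    batches = split_batches(rows, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Batches are processed concurrently; map preserves input order.
        results = pool.map(process_batch, batches)
    return [x for batch in results for x in batch]
```

The same shape generalizes to DAG nodes: independent nodes are submitted to the pool together, and a node runs once all of its upstream results are available.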
Enables human data validation for selected data processing stages.
Consumes and produces CSV and Apache Parquet files, utilizes database tables internally, provides ETL/ELT capabilities, and implements a subset of relational algebra.
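The "subset of relational algebra" typically means at least selection, projection, and join over rows read from CSV/Parquet. A minimal in-memory sketch over rows as dicts (function names are assumptions for illustration; the real system evaluates these operators against its internal database tables):

```python
def select(rows, pred):
    """Relational selection: keep rows matching the predicate."""
    return [r for r in rows if pred(r)]

def project(rows, cols):
    """Relational projection: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]

def join(left, right, key):
    """Equi-join on a shared key column, via a hash index on the right side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]
```

Rows in this shape map directly onto `csv.DictReader` output or a Parquet row group, so the same operators apply regardless of the input format.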
Processes large data volumes within SLA time limits, efficiently utilizing computational resources (hardware, VMs, containers) and storage resources (Cassandra), with or without human monitoring, validation, or intervention.