Allows splitting the whole data processing pipeline into separate runs that can be started independently and re-run if needed.
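A minimal sketch of what independently startable, re-runnable pipeline stages might look like; the stage names, the registry, and the command-line launch are illustrative assumptions, not the tool's actual interface.

```python
# Hypothetical sketch: each pipeline stage is registered as a named run that
# can be started (or re-run) on its own, e.g. `python pipeline.py transform`.
import sys

RUNS = {}

def run(name):
    """Register a function as an independently startable pipeline run."""
    def decorator(fn):
        RUNS[name] = fn
        return fn
    return decorator

@run("extract")
def extract():
    print("extracting source files ...")

@run("transform")
def transform():
    print("transforming staged data ...")

@run("load")
def load():
    print("loading results ...")

if __name__ == "__main__":
    RUNS[sys.argv[1]]()  # start only the requested run; re-running is just calling it again
```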
Splits large data volumes into smaller batches that are processed in parallel, and executes multiple data processing tasks (DAG nodes) concurrently.
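A minimal sketch of batch-level parallelism, assuming each batch can be processed independently; the batch size and the placeholder transformation are assumptions for illustration only.

```python
# Hypothetical sketch: split rows into fixed-size batches and process them in
# parallel worker processes.
from concurrent.futures import ProcessPoolExecutor

def split_into_batches(rows, batch_size):
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def process_batch(batch):
    # placeholder per-batch transformation
    return [row.upper() for row in batch]

if __name__ == "__main__":
    rows = [f"record-{i}" for i in range(10_000)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_batch, split_into_batches(rows, 1_000)))
    print(sum(len(r) for r in results), "rows processed")
```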
Allows human validation of data at selected processing stages.
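A minimal sketch of a human validation gate between stages; the prompt-based approval below stands in for whatever review mechanism the tool actually provides, and the sample data is invented.

```python
# Hypothetical sketch: pause after a stage, show a sample of its output, and
# continue only if a human reviewer approves.
def human_validation_gate(stage_name, sample_rows):
    print(f"Stage '{stage_name}' produced {len(sample_rows)} sample rows:")
    for row in sample_rows[:5]:
        print("  ", row)
    answer = input("Approve and continue? [y/N] ")
    if answer.strip().lower() != "y":
        raise RuntimeError(f"Stage '{stage_name}' rejected by reviewer")

if __name__ == "__main__":
    staged = ["alice,42", "bob,17", "carol,99"]
    human_validation_gate("transform", staged)
    print("continuing with downstream stages ...")
```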
Survives transient connectivity issues with the underlying database as well as processing node failures.
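A minimal sketch of one common way to survive transient database failures, retrying with exponential backoff; the exception types, delays, and attempt count are assumptions, not the tool's actual policy.

```python
# Hypothetical sketch: wrap a database call so that transient failures are
# retried with exponential backoff before giving up.
import time

def with_retries(fn, attempts=5, base_delay=1.0,
                 retriable=(ConnectionError, TimeoutError)):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retriable as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_query():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("database temporarily unreachable")
        return "query result"

    print(with_retries(flaky_query))
```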
Consumes and produces delimited text files and uses database tables internally. Provides ETL/ELT capabilities. Implements a subset of the relational algebra.
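A minimal sketch of relational-algebra-style selection and projection over a delimited text input; the inline sample data, the pipe delimiter, and the column names are illustrative assumptions.

```python
# Hypothetical sketch: sigma (selection) and pi (projection) applied to rows
# read from a pipe-delimited source.
import csv
import io

SAMPLE = "id|name|age\n1|alice|42\n2|bob|17\n3|carol|99\n"

def select(rows, predicate):           # sigma: keep rows matching a predicate
    return (row for row in rows if predicate(row))

def project(rows, columns):            # pi: keep only the listed columns
    return ({c: row[c] for c in columns} for row in rows)

if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="|")
    adults = select(reader, lambda r: int(r["age"]) >= 18)
    for row in project(adults, ["id", "name"]):
        print(row)
```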
Processes large amounts of data within SLA time limits, efficiently utilizing computational resources (hardware, VMs, containers) and storage (Cassandra), with or without human monitoring, validation, or intervention.