Capillaries: distributed data processing platform focused on delivering enriched, customer-ready, production-quality data within SLA time limits

What are the use cases where Capillaries excels?

Predictable performance

Capillaries prioritizes predictable performance, which is critical for managing SLAs. Ideally, after a few test runs, data engineers should be able to provide reasonably accurate estimates for data transformation completion times in cases such as:
  • larger datasets of the same nature
  • larger deployments on the same cloud provider

Configuration management consistency

Capillaries enforces consistency through a declarative, structured configuration:

1. The data processing DAG and relational algebra operations are specified declaratively in a Capillaries script as JSON configuration.
2. Field transformations use one-liner Go expressions, minimizing complexity and errors.
3. Row-level Python formulas, while potentially complex, can be thoroughly covered with unit tests.
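
As an illustrative sketch only (the node and key names below are assumptions for illustration, not the exact Capillaries script schema), a DAG node pairing a relational operation with one-liner Go expressions might be declared along these lines:

```json
{
  "nodes": {
    "tag_big_orders": {
      "type": "table_table",
      "source_table": "orders",
      "target_table": "tagged_orders",
      "fields": {
        "order_id": {"expression": "r.order_id", "type": "int"},
        "is_big":   {"expression": "r.amount > 1000.0", "type": "bool"}
      }
    }
  }
}
```

Because the DAG edges, transforms, and field types all live in declarative configuration like this, most of a pipeline can be reviewed and versioned without reading any procedural code.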

Conservative cloud resource usage

Capillaries excels in data-heavy, compute-intensive workloads that run periodically (daily, weekly, quarterly). It is designed to work efficiently on both private and public VM or container infrastructure, which can be:

  • allocated and provisioned in minutes
  • quickly disposed of once transformations are complete

Operator interaction

Capillaries allows operator interaction at designated processing steps, enabling human validation and manual go/no-go decisions within automated pipelines.

Technical highlights

Parallel processing

1. Executes multiple data processing tasks (DAG nodes) concurrently.
2. Splits large datasets into smaller batches for parallel processing.

Fault tolerance

Designed to survive temporary database connectivity issues and individual worker node failures.
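
A common way to survive transient failures like dropped database connections is retry with exponential backoff; the sketch below shows the general pattern (not Capillaries' actual implementation):

```python
import time

def with_retries(op, max_attempts=3, base_delay=0.01):
    """Retry a flaky operation with exponential backoff.

    A generic pattern for riding out transient failures such as a
    temporarily unreachable database; illustrative only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** (attempt - 1))

# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection reset")
    return "row data"

print(with_retries(flaky_query))  # prints "row data" after two retries
```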

Incremental computing

The data processing pipeline can be broken into independent runs, allowing for flexible execution scheduling and easy re-runs of selected parts.

Q & A

Is Capillaries ETL or ELT?

Capillaries is much more about the "T" (Transform) than the "E" (Extract) or "L" (Load):

  • Simple filters and field-level transformations can occur during loading
  • More complex transformations are performed after the data is fully loaded
  • Data is only temporarily stored - just long enough to complete all transformations and output the results
Capillaries can probably best be described as "etlT".

Is Capillaries "low-code" or "no-code"?

Capillaries is definitely "some-code" because data transformation rules may involve Go expressions and/or complex Python formulas. The "code" part applies only to the business logic, while the "orchestration" part does not require any coding at all.

Why choose Capillaries over custom data pipelines?

Capillaries handles orchestration, scalability, and intermediate data storage, so you can focus entirely on your transformation logic.

Why choose Capillaries over other distributed processing systems?

  • it's free and open-source
  • fast to deploy on any VM/container environment (and easy to tear down)
  • it's better than no-code systems because it allows you to perform complex Python calculations at the row level
  • it's better than code-heavy systems because it doesn’t require a thorough knowledge of any programming language
  • with all intermediate data stored in Cassandra tables, data processing steps are extremely transparent, making troubleshooting easier

What do I need to run Capillaries?

To set up Capillaries, you will need:
  • a Cassandra cluster
  • a few VMs/containers running Capillaries workers
  • a VM/container running Capillaries Webapi and UI
  • a VM/container running RabbitMQ server
  • monitoring and logging infrastructure (optional, but recommended)
To run data processing jobs for a specific dataset, you need to provide configuration and data files served from an NFS drive, HTTP(S) server, or S3 bucket, plus a browser to use the Capillaries UI (or a REST API client to call Capillaries Webapi directly). After a Capillaries run is complete, files containing the transformed data are available on NFS or S3.

Do I need to know SQL or a similar query language to define Capillaries transforms?

No. Capillaries supports transformations inspired by relational algebra (e.g., lookups, grouping, and denormalization), but they are declared in a JSON script, not written as SQL.
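
As a hypothetical illustration (the key names are assumptions, not the exact Capillaries script schema), a denormalizing lookup that a SQL user would write as a JOIN of orders to customers might be declared roughly like:

```json
{
  "orders_with_customers": {
    "type": "table_lookup_table",
    "source_table": "orders",
    "lookup": {
      "table": "customers",
      "join_on": "r.customer_id == l.id"
    },
    "target_table": "orders_denormalized"
  }
}
```

The intent is the same as the SQL JOIN, but it is expressed as structured configuration that the engine can validate, parallelize, and re-run.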