Capillaries

Distributed data processing platform focused on delivering enriched, customer-ready, production-quality data within SLA time limits

What are the use cases where Capillaries excels?

Predictable performance

Capillaries prioritizes predictable performance, which is crucial for SLA management. Ideally, after a few test runs, data engineers should be able to give reasonably accurate predictions about data transformation completion times for:
  • larger datasets of the same nature
  • bigger deployments from the same cloud provider
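
For example, under a simple linear-scaling assumption (an illustration of the prediction workflow, not a Capillaries guarantee), a test run gives a rough completion-time estimate:

```python
# Rough completion-time estimate from a test run, assuming the workload
# scales roughly linearly with data volume and inversely with worker count.
# Illustrative only; not a Capillaries API.

def estimate_minutes(test_minutes: float, test_rows: int,
                     target_rows: int, test_workers: int,
                     target_workers: int) -> float:
    """Linear-scaling estimate: more rows -> longer, more workers -> shorter."""
    return test_minutes * (target_rows / test_rows) * (test_workers / target_workers)

# A 10-minute test run on 1M rows with 4 workers suggests a 100M-row
# run on 16 workers would take about 250 minutes.
print(estimate_minutes(10.0, 1_000_000, 100_000_000, 4, 16))  # 250.0
```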

Configuration management consistency

1. The data processing DAG and relational algebra operations are part of the Capillaries script and are specified declaratively as JSON configuration.
2. Go expressions used in field transformations are just one-liners, leaving little room for error.
3. While row-level Python formulas can be very complex, they can easily be covered with unit tests.
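
For instance, a row-level formula is an ordinary pure Python function, so it can be covered with plain unit-test assertions (the function name and fields here are hypothetical, not part of any actual Capillaries script):

```python
# Hypothetical row-level formula: a pure function of row fields,
# which makes it trivial to cover with unit tests.

def discounted_total(qty: int, unit_price: float, discount_pct: float) -> float:
    """Order total after a percentage discount, rounded to cents."""
    return round(qty * unit_price * (1.0 - discount_pct / 100.0), 2)

# Unit tests: plain assertions, no Capillaries runtime needed.
assert discounted_total(3, 9.99, 0.0) == 29.97
assert discounted_total(2, 50.0, 10.0) == 90.0
```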

Conservative cloud resource use

Capillaries shines when data processing is very calculation-heavy, data-heavy, and must be performed periodically (daily, weekly, quarterly). It can run on private or public VM or container infrastructure, which can be allocated and provisioned within minutes and disposed of immediately after all transformations are complete.

Operator interaction

Capillaries allows operators to validate data at selected processing steps and decide whether to proceed or not.

Technical highlights

Parallel processing

1. Executes multiple data processing tasks (DAG nodes) simultaneously.
2. Splits large data volumes into smaller batches for parallel processing.
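
The batching idea can be sketched in plain Python (a conceptual illustration only; Capillaries workers do this over Cassandra tables and RabbitMQ messages, not in-process lists):

```python
# Conceptual sketch of batch-parallel processing: split rows into
# fixed-size batches and process the batches concurrently.
from concurrent.futures import ThreadPoolExecutor

def split_into_batches(rows, batch_size):
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def process_batch(batch):
    # Stand-in for a real transformation applied to one batch.
    return [x * 2 for x in batch]

rows = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(process_batch, split_into_batches(rows, 3))

# Reassemble batch results; map() yields them in submission order.
flat = [x for batch in results for x in batch]
print(flat)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```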

Fault tolerance

Designed to withstand temporary database connectivity issues and worker node failures.
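
One common building block for this kind of resilience is retrying transient failures with exponential backoff (a generic sketch of the pattern, not Capillaries' actual implementation):

```python
# Generic retry-with-backoff sketch for transient failures, such as a
# temporary loss of database connectivity. Not Capillaries' actual code.
import time

def with_retries(op, attempts=5, base_delay=0.1):
    """Run op(); on ConnectionError, wait exponentially longer and retry."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}
def flaky():
    # Simulated operation that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky))  # ok
```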

Incremental computing

Allows splitting the entire data pipeline into separate runs that can be initiated independently and re-run if needed.

Q & A

Is Capillaries ETL or ELT?

Capillaries is much more about the "T" than the "E" or "L":

  • simple transformations and filtering can be performed when the data is being loaded, while complex transformations are performed after the data is loaded
  • the data is intended to be stored only until all transformations are complete and the result files are produced
Capillaries is probably best described as "etlT".

Is Capillaries "low-code" or "no-code"?

Capillaries is definitely "some-code" because data transformation rules may include Go expressions and/or complex Python formulas. The "code" part applies only to the business logic, while the "orchestration" part does not require coding at all.

Why should I prefer Capillaries over my custom data pipelines?

Capillaries handles orchestration, scalability, and intermediate data storage, so you can focus solely on the transformation logic.

Why should I prefer Capillaries over other distributed processing systems?

  • it's free and open-source
  • it can be quickly deployed on private or public VM or container infrastructure and disposed of when no longer needed
  • it's better than no-code systems because it allows you to perform complex Python calculations at the row level
  • it's better than code-heavy systems because it doesn’t require deep knowledge of any programming language
  • with intermediate data stored in Cassandra tables, all data processing steps are extremely transparent, making troubleshooting easier

What do I need to run Capillaries?

To set up a Capillaries environment, you need to provide:
  • a Cassandra cluster
  • a few VMs/containers running Capillaries workers
  • a VM/container running Capillaries Webapi and UI
  • a VM/container running RabbitMQ server
  • monitoring and logging infrastructure (optional, but recommended)
To run data processing for a specific dataset, you need to provide the input data in files served from an NFS drive, an HTTP(S) server, or an S3 bucket, plus a browser to use the Capillaries UI (or a REST API client to call Capillaries Webapi directly). After a Capillaries run is complete, you get a set of files (NFS or S3) containing the transformed data.

Do I need to know SQL or a similar query language to define Capillaries transforms?

No. Capillaries implements some transformations that use relational algebra concepts like lookups, grouping, and denormalization, but users specify these transformations declaratively in the script file.
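
To illustrate what those relational-algebra concepts mean (plain Python, not Capillaries script syntax; the field names are made up for this example): a lookup enriches each row with a matching row from another table, and grouping aggregates rows by a key.

```python
# Plain-Python illustration of two concepts a Capillaries script
# expresses declaratively: lookup and grouping.
from collections import defaultdict

orders = [
    {"order_id": 1, "cust_id": "a", "amount": 10.0},
    {"order_id": 2, "cust_id": "b", "amount": 5.0},
    {"order_id": 3, "cust_id": "a", "amount": 7.5},
]
customers = {"a": "Alice", "b": "Bob"}  # lookup table keyed by cust_id

# Lookup: enrich each order with the customer name.
enriched = [dict(o, cust_name=customers[o["cust_id"]]) for o in orders]

# Grouping: total amount per customer.
totals = defaultdict(float)
for o in enriched:
    totals[o["cust_name"]] += o["amount"]

print(dict(totals))  # {'Alice': 17.5, 'Bob': 5.0}
```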