Capillaries: distributed data processing platform focused on delivering enriched, customer-ready, production-quality data within SLA time limits

What are the use cases where Capillaries excels?

Predictable performance

Capillaries prioritizes predictable performance, which is critical for managing SLAs. Ideally, after a few test runs, data engineers should be able to provide reasonably accurate estimates for data transformation completion times in cases such as:
  • larger datasets of the same nature
  • larger deployments on the same cloud provider

Configuration management consistency

Capillaries enforces consistency through a declarative, structured configuration:

1. The data processing DAG and relational algebra operations are specified declaratively in a Capillaries script as JSON configuration.
2. Field transformations use one-liner Go expressions, minimizing complexity and errors.
3. Row-level Python formulas, while potentially complex, can be thoroughly covered with unit tests.
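
As an illustrative sketch only (the node and key names below are assumptions for illustration, not the exact Capillaries script schema), a DAG node pairing a relational operation with one-liner Go expressions might be declared along these lines:

```json
{
  "nodes": {
    "tag_big_orders": {
      "type": "table_table",
      "source_table": "orders",
      "target_table": "tagged_orders",
      "fields": {
        "order_id": {"expression": "r.order_id", "type": "int"},
        "is_big":   {"expression": "r.amount > 1000.0", "type": "bool"}
      }
    }
  }
}
```

Because the DAG edges, transforms, and field types all live in declarative configuration like this, most of a pipeline can be reviewed and versioned without reading any procedural code.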

Conservative cloud resource usage

Capillaries excels in data-heavy, compute-intensive workloads that run periodically (daily, weekly, quarterly). It is designed to work efficiently on both private and public VM or container infrastructure, which can be:

  • allocated and provisioned in minutes
  • quickly disposed of once transformations are complete

Operator interaction

Capillaries allows operator interaction at designated processing steps, enabling human validation and manual go/no-go decisions within automated pipelines.

Technical highlights

Parallel processing

1. Executes multiple data processing tasks (DAG nodes) concurrently.
2. Splits large datasets into smaller batches for parallel processing.

Fault tolerance

Designed to survive temporary database connectivity issues and individual worker node failures.
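
A common way to survive transient failures like dropped database connections is retry with exponential backoff; the sketch below shows the general pattern (not Capillaries' actual implementation):

```python
import time

def with_retries(op, max_attempts=3, base_delay=0.01):
    """Retry a flaky operation with exponential backoff.

    A generic pattern for riding out transient failures such as a
    temporarily unreachable database; illustrative only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** (attempt - 1))

# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection reset")
    return "row data"

print(with_retries(flaky_query))  # prints "row data" after two retries
```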

Incremental computing

The data processing pipeline can be broken into independent runs, allowing for flexible execution scheduling and easy re-runs of selected parts.

Q & A

Is Capillaries ETL or ELT?

Capillaries is much more about the "T" (Transform) than the "E" (Extract) or "L" (Load):

  • Simple filters and field-level transformations can occur during loading
  • More complex transformations are performed after the data is fully loaded
  • Data is only temporarily stored - just long enough to complete all transformations and output the results
Capillaries can probably best be described as "etlT".

Is Capillaries "low-code" or "no-code"?

Capillaries is definitely "some-code" because data transformation rules may involve Go expressions and/or complex Python formulas. The "code" part applies only to the business logic, while the "orchestration" part does not require any coding at all.

Why choose Capillaries over custom data pipelines?

Capillaries handles orchestration, scalability, and intermediate data storage, so you can focus entirely on your transformation logic.

Why choose Capillaries over other distributed processing systems?

  • it's free and open-source
  • fast to deploy on any VM/container environment (and easy to tear down)
  • it's better than no-code systems because it allows you to perform complex Python calculations at the row level
  • it's better than code-heavy systems because it doesn’t require a thorough knowledge of any programming language
  • with all intermediate data stored in Cassandra tables, data processing steps are extremely transparent, making troubleshooting easier

What do I need to run Capillaries?

To set up Capillaries, you will need:
  • a Cassandra cluster
  • a few VMs/containers running Capillaries workers
  • a VM/container running Capillaries Webapi and UI
  • a VM/container running RabbitMQ server
  • monitoring and logging infrastructure (optional, but recommended)
To run data processing jobs for a specific dataset, you need to provide configuration and data files served from an NFS drive, HTTP(S) server, or S3 bucket, plus a browser to use the Capillaries UI (or a REST API client to call Capillaries Webapi directly). After a Capillaries run is complete, files containing the transformed data are available on NFS or S3.

Do I need to know SQL or a similar query language to define Capillaries transforms?

No. Capillaries supports transformations inspired by relational algebra (e.g., lookups, grouping, and denormalization), but they are declared in a JSON script, not written as SQL.
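
As a hypothetical illustration (the key names are assumptions, not the exact Capillaries script schema), a denormalizing lookup that a SQL user would write as a JOIN of orders to customers might be declared roughly like:

```json
{
  "orders_with_customers": {
    "type": "table_lookup_table",
    "source_table": "orders",
    "lookup": {
      "table": "customers",
      "join_on": "r.customer_id == l.id"
    },
    "target_table": "orders_denormalized"
  }
}
```

The intent is the same as the SQL JOIN, but it is expressed as structured configuration that the engine can validate, parallelize, and re-run.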