Kolibri - The Execution Engine that loves E-Commerce Search
Kolibri is the German word for hummingbird. I picked it as the project name to reflect the general aim of doing many small things fast, and this still describes the batch processing logic quite well.
Built in Scala and based on the ZIO framework, Kolibri provides easy-to-use mechanisms to compose computing tasks, define how to batch them, and perform grouped aggregations, and it represents these aspects in such a way that jobs can easily be distributed among worker nodes. The nodes do not need to be tightly coupled in a cluster setting; they perform all necessary synchronization via the used storage. They also do not need to live in the same environment: whether it is purely local testing on one machine, testing with colleagues on a combination of local machines, cloud deployments, or a mixture thereof, the only thing that matters is that the nodes have read and write access to the storage. This can either be file-based storage, such as the local file system, s3, gcs, or any other cloud storage that simulates a file system, or a database such as redis. Right now only the local file system and s3 are implemented; more are expected to follow.
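As an illustration of this storage-centric setup, pointing all nodes to a shared s3 bucket might look similar to the following environment settings (the variable names here are illustrative assumptions, not the definitive configuration keys; consult the configuration documentation for the actual ones):

PERSISTENCE_MODE: "AWS"
AWS_S3_BUCKET: "my-kolibri-bucket"
AWS_S3_DIR_PATH: "kolibri"
AWS_S3_REGION: "EU_CENTRAL_1"

Any node started with read and write access to that bucket can then pick up and process batches, irrespective of the environment it runs in.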
Further, while the framework is usable for all kinds of computations, its main focus in terms of pre-defined jobs is on e-commerce search related functionality, easing the evaluation of search results. For this, certain tasks are already implemented, such as the metric computations listed below.
Why the … should you use this?
You might want to consider Kolibri if you want to compute:
- distributions of result attribute values, as in STRING_SEQUENCE_VALUE_OCCURRENCE_HISTOGRAM,
- comparisons of the results of two search systems, e.g. with a JACCARD metric,
- information retrieval metrics such as DCG, NDCG, PRECISION, RECALL, ERR,
- general metrics such as IDENTITY (when extracting single attributes, such as numDocs), FIRST_TRUE, FIRST_FALSE, TRUE_COUNT, FALSE_COUNT, BINARY_PRECISION_TRUE_AS_YES, BINARY_PRECISION_FALSE_AS_YES.
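To give a rough idea of how such metrics enter a job, the following sketch shows a hypothetical excerpt of a JSON job definition; the field names (metrics, type, k) are illustrative assumptions, not the exact schema, which the later sub-sections describe:

{
  "metrics": [
    {"type": "NDCG", "k": 10},
    {"type": "PRECISION", "k": 4},
    {"type": "JACCARD"}
  ]
}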
Schematic overview:
A short description of the individual libraries is given in the following. In the further sub-sections of this documentation you will find step-by-step descriptions of how to use Kolibri.
Kolibri Datatypes
This library contains basic datatypes to simplify common tasks in batch processing and asynchronous state keeping.
Kolibri Storage
This library contains the storage implementations, such as file-based access to the local disk and cloud-based storage such as AWS and GCP, and will likely contain non-file-based implementations such as redis at some point.
Kolibri Definitions
Contains the actual job definitions without the execution mechanism, providing the parts that can be utilized / processed in a respective service such as kolibri-fleet-zio.
Kolibri Fleet ZIO
provides a multi-node batch execution setup.
Batch definitions make use of ZIO, allowing the definition of flexible execution flows.
Results are aggregated per batch and, on demand, combined into an overall result.
Features include:
Use case job definitions include:
Kolibri-Fleet-ZIO on DockerHub
Kolibri Watch
provides a UI for the Kolibri project, allowing monitoring of job execution progress,
definition of executions, and their submission to the Kolibri backend for execution.
Response Juggler
is listed here only as a mock service for trying out Kolibri.
You will find an example configuration in the docker-compose file in the root of the kolibri-project.
It is used there to mimic a search service by sampling results that are then processed according
to the job definition posted via kolibri-fleet-zio.
This way you can try out, directly out of the box, whether the handling of a specific response format
is correct and/or execute benchmarks.
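A minimal sketch of such a compose entry is given below; the image tag and port mapping are illustrative assumptions, so refer to the docker-compose file in the kolibri-project root for the actual values:

response-juggler:
  # image tag and port mapping below are illustrative assumptions
  image: awagen/response-juggler:latest
  ports:
    - "80:80"
  # the environment section carries the template and sampler settings
  # described in the following
  environment:
    RESPONSE_FIELD_IDENT_STRING_VAL1: "{{STRING_VAL1}}"
    RESPONSE_FIELD_SAMPLER_TYPE_STRING_VAL1: "SINGLE"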
It allows definition of a main template that defines how a response is supposed to look,
and definitions of the placeholders (enclosed in {{ and }}), for example:
{
"response": {
"docs": {{DOCS}},
"numFound": {{NUM_FOUND}}
}
}
The placeholders need to have a sampling mechanism configured. For example, {{DOCS}} could
be configured to map to a partial file that defines the following JSON format:
{
"product_id": {{PID}},
"bool": {{BOOL}},
"string_val_1": {{STRING_VAL1}}
}
Since this last JSON snippet is already at the leaf level and no placeholder occurs below it, we can configure specific values for the placeholders instead of further JSON values containing other placeholders. A specific value sampling definition can look similar to this one (the keys on the left refer to environment variables to set, while the right side describes the value set for that variable):
RESPONSE_FIELD_IDENT_STRING_VAL1: "{{STRING_VAL1}}"
RESPONSE_FIELD_SAMPLER_TYPE_STRING_VAL1: "SINGLE"
RESPONSE_FIELD_SAMPLER_ELEMENT_CAST_STRING_VAL1: "STRING"
RESPONSE_FIELD_SAMPLER_SELECTION_STRING_VAL1: "p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19"
The above specifies that the placeholder {{STRING_VAL1}} corresponds to a single (SINGLE) value,
which is of type string (STRING) and is sampled out of a comma-separated selection
among the values p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19.
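Putting the pieces together: assuming analogous sampler definitions for {{PID}}, {{BOOL}}, {{NUM_FOUND}} and the list of documents filled into {{DOCS}}, a single sampled response could then look like this (all concrete values are illustrative):

{
  "response": {
    "docs": [
      {"product_id": "id-1", "bool": true, "string_val_1": "p3"},
      {"product_id": "id-2", "bool": false, "string_val_1": "p17"}
    ],
    "numFound": 2
  }
}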
For further information on usage, refer to the project page referenced below.