Getting it started

Let's first dive into how to get the project started on your machine. There are multiple configuration options available, which are detailed in the following.

Starting locally with docker-compose

Find the docker-compose file in the project root. If you're referencing an existing image, you don't need to build anything beforehand. In case you want to start a locally built version, make sure to package the jar, create a properly tagged docker image and reference it in the docker-compose file. The steps for this are (assuming you're in the project root folder):

  1. (Optional; the docker-compose file references a public image provided in the public dockerhub repo) Build the docker image:
    1. ./scripts/buildJar.sh
    2. docker build . -t kolibri-base:[versionTag] (e.g. versionTag = 0.1.0-rc0)
  2. set the correct version in the kolibri image setting of docker-compose.yml:
    1. awagen/kolibri-base:[versionTag] (for publicly available images)
    2. kolibri-base:[versionTag] (in case of a custom local build)
  3. (Optional) repeat the above steps for the kolibri-watch project and response-juggler (only needed in case you don't want to use the referenced public docker images)
  4. start up the kolibri cluster along with prometheus and grafana, kolibri-watch and response-juggler: docker-compose up
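Assuming bash and docker are available, the steps above can be sketched as a short script (the version tag is a placeholder; adjust it to the release you build, and edit docker-compose.yml by hand as described in step 2):

```shell
# Sketch of the local build + startup steps above (versionTag is a placeholder).
set -e
VERSION_TAG="0.1.0-rc0"

# step 1 (optional): package the jar and build a locally tagged image
./scripts/buildJar.sh
docker build . -t "kolibri-base:${VERSION_TAG}"

# step 2: point the kolibri image setting in docker-compose.yml at
#   kolibri-base:${VERSION_TAG}        (custom local build), or
#   awagen/kolibri-base:${VERSION_TAG} (public image)

# step 4: start the full setup (kolibri cluster, prometheus, grafana,
# kolibri-watch, response-juggler)
docker-compose up
```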

Note that this is a complete setup in which response-juggler merely creates fake responses to mock a real search system according to some sampling criteria (check the project itself for more details, https://github.com/awagen/response-juggler). If you want to execute the jobs against an actual search system, you only need to reference the right connections in your job definition (and you can comment out response-juggler in the docker-compose file).

Let's go over the settings configurable by passing env vars, taking the docker-compose file as an example:

kolibri1:
    image: awagen/kolibri-base:0.1.0-rc0
    ports:
      - "8000:8000"
      - "5266:5266"
      - "9095:9095"
    user: "1000:1000"
    environment:
      JVM_OPTS: >
        -XX:+UseG1GC
        -Xms1024m
        -Xmx4096m
      PROFILE: prod
      ROLES: httpserver
      KAMON_PROMETHEUS_PORT: 9095
      KAMON_STATUSPAGE_PORT: 5266
      CLUSTER_NODE_HOST: kolibri1
      CLUSTER_NODE_PORT: 8001
      HTTP_SERVER_INTERFACE: kolibri1
      HTTP_SERVER_PORT: 8000
      MANAGEMENT_HOST: kolibri1
      MANAGEMENT_PORT: 8558
      MANAGEMENT_BIND_HOSTNAME: '0.0.0.0'
      MANAGEMENT_BIND_PORT: 8558
      CLUSTER_NODE_BIND_HOST: '0.0.0.0'
      CLUSTER_NODE_BIND_PORT: 8001
      DISCOVERY_SERVICE_NAME: kolibri-service
      KOLIBRI_ACTOR_SYSTEM_NAME: KolibriAppSystem
      DISCOVERY_METHOD: config
      REQUEST_PARALLELISM: 16
      USE_CONNECTION_POOL_FLOW: 'false'
      RUNNING_TASK_BASELINE_COUNT: 2
      KOLIBRI_DISPATCHER_PARALLELISM_MIN: 8
      KOLIBRI_DISPATCHER_PARALLELISM_FACTOR: 8.0
      KOLIBRI_DISPATCHER_PARALLELISM_MAX: 32
      KOLIBRI_DISPATCHER_THROUGHPUT: 10
      DEFAULT_DISPATCHER_PARALLELISM_FACTOR: 1.0
      DEFAULT_DISPATCHER_PARALLELISM_MAX: 2
      DEFAULT_DISPATCHER_PARALLELISM_MIN: 1
      HTTP_CLIENT_CONNECTION_TIMEOUT: '5s'
      HTTP_CLIENT_IDLE_TIMEOUT: '10s'
      HTTP_CONNECTION_POOL_MAX_OPEN_REQUESTS: 1024
      HTTP_CONNECTION_POOL_MAX_RETRIES: 3
      HTTP_CONNECTION_POOL_MAX_CONNECTIONS: 1024
      HTTP_CONNECTION_POOL_SUBSCRIPTION_TIMEOUT: '60 seconds'
      USE_RESULT_ELEMENT_GROUPING: 'true'
      RESULT_ELEMENT_GROUPING_COUNT: 2000
      RESULT_ELEMENT_GROUPING_INTERVAL_IN_MS: 1000
      RESULT_ELEMENT_GROUPING_PARALLELISM: 1
      USE_AGGREGATOR_BACKPRESSURE: 'true'
      AGGREGATOR_RECEIVE_PARALLELISM: 32
      MAX_NR_BATCH_RETRIES: 2
      # persistence mode is one of ['AWS', 'GCP', 'LOCAL', 'CLASS']
      PERSISTENCE_MODE: 'CLASS'
      PERSISTENCE_MODULE_CLASS: 'de.awagen.kolibri.base.config.di.modules.persistence.LocalPersistenceModule'
      # properties in case PERSISTENCE_MODE is 'AWS' (or 'CLASS' and AwsPersistenceModule is referenced in PERSISTENCE_MODULE_CLASS)
      AWS_PROFILE: 'developer'
      AWS_S3_BUCKET: 'kolibri-dev'
      AWS_S3_PATH: 'metric_test'
      AWS_S3_REGION: 'EU_CENTRAL_1'
      # properties in case PERSISTENCE_MODE is 'LOCAL' (or 'CLASS' and LocalPersistenceModule is referenced in PERSISTENCE_MODULE_CLASS)
      LOCAL_STORAGE_WRITE_BASE_PATH: '/app/data'
      LOCAL_STORAGE_WRITE_RESULTS_SUBPATH: 'test-results'
      LOCAL_STORAGE_READ_BASE_PATH: '/app/data'
      # properties in case PERSISTENCE_MODE is 'GCP' (or 'CLASS' and GCPPersistenceModule is referenced in PERSISTENCE_MODULE_CLASS)
      GCP_GS_BUCKET: [bucket name without gs:// prefix]
      GCP_GS_PATH: [path from bucket root to append to all paths that are requested]
      GCP_GS_PROJECT_ID: [the project id for which the used service account is defined and for which the gs bucket was created]
      # JOB_TEMPLATES_PATH must be relative to the base path or bucket path, depending on the persistence selected
      JOB_TEMPLATES_PATH: 'templates/jobs'
      JUDGEMENT_FILE_SOURCE_TYPE: 'CSV'
      # if the judgement file format is set to 'JSON_LINES', set 'DOUBLE' in case the judgements are numeric in the json;
      # if the numeric value is represented as a string, use 'STRING'. This only refers to how the json value is interpreted,
      # since it will be cast to double later either way
      JUDGEMENT_FILE_JSON_LINES_JUDGEMENT_VALUE_TYPE_CAST: 'STRING'
      # if you're requesting via https but the server provides no valid certificate
      USE_INSECURE_SSL_ENGINE: 'true'
      # properties if discovery mode is kubernetes-api
      # (Optional) path to the file the namespace is written to. Available by default when setting the namespace in the k8s charts for the deployment / pod
      K8S_DISCOVERY_POD_NAMESPACE_PATH: '/var/run/secrets/kubernetes.io/serviceaccount/namespace'
      # namespace the pods are assigned to; if not set properly, k8s access rights will prevent the pods from finding each other.
      # make sure this corresponds to the namespace you deploy in. If set, K8S_DISCOVERY_POD_NAMESPACE_PATH has no effect.
      K8S_DISCOVERY_POD_NAMESPACE: 'kolibri'
      # label selector by which to identify the right pods
      K8S_DISCOVERY_POD_LABEL_SELECTOR: 'app=%s'
    volumes:
      - ./test-files:/app/data
      - [absolute-path-containing-your-aws-config-folder]/.aws:/home/kolibri/.aws:ro
      - [path-to-dir-containing-key-file-on-local-machine]/:/home/kolibri/gcp:ro

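Once the containers are up, a quick smoke test against the ports mapped above might look like the following. The response paths are assumptions derived from the port mapping in the compose file (kamon prometheus scrape endpoint, kamon status page, application http server), not verified endpoints:

```shell
# Assumed smoke test against the ports exposed in the compose file above.
# Paths are assumptions; adjust to what your deployment actually serves.
curl -s "http://localhost:9095/metrics" | head -n 5      # kamon prometheus metrics
curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:5266/"  # kamon status page
curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:8000/"  # application http server
```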
Configuration Options in Detail

Exposed Port Settings: see the ports definition above
HTTP_SERVER_PORT: the port under which to reach the endpoints exposed by the application
KAMON_PROMETHEUS_PORT: the port under which to scrape metrics in prometheus format
KAMON_STATUSPAGE_PORT: the kamon status page port. Kamon exposes this endpoint as a status overview of which metrics are collected
General Setup Settings
PROFILE: determines which application-[profile].conf file is picked up for config settings
ROLES: either httpserver, compute, or both as comma-separated httpserver,compute. If httpserver is among the roles, the node starts an http server on the defined HTTP_SERVER_PORT and exposes the application endpoints. If compute is set, the node is used for computations.
CLUSTER_NODE_HOST: cluster node host
CLUSTER_NODE_PORT: cluster node port
HTTP_SERVER_INTERFACE: interface for the http server (only needed if httpserver is one of the defined ROLES)
HTTP_SERVER_PORT: in case one of the node roles is ‘httpserver’, the routes are exposed on this port
MANAGEMENT_HOST: management host
MANAGEMENT_PORT: management port
MANAGEMENT_BIND_HOSTNAME: management bind host name
MANAGEMENT_BIND_PORT: management bind port
CLUSTER_NODE_BIND_HOST: bind host for the cluster node
CLUSTER_NODE_BIND_PORT: bind port for the cluster node; should be the same for all nodes in the cluster
DISCOVERY_SERVICE_NAME: the service name used for node discovery. Must be the same for all nodes of the cluster
KOLIBRI_ACTOR_SYSTEM_NAME: the name of the ActorSystem to use for the Kolibri application
DISCOVERY_METHOD: cluster node discovery method to use. Can be ‘aws-api-ec2-tag-based’ (based on ec2 instance tags in AWS), ‘config’ (defining the endpoints per config), ‘dns’ or ‘kubernetes-api’ (see examples for ‘config’ in the docker-compose.yml and ‘kubernetes-api’ in the example helm setup). Refer to the akka discovery documentation for details on the different modes
REQUEST_PARALLELISM: parallelism with which http requests are executed
USE_CONNECTION_POOL_FLOW: ‘true’/‘false’. If ‘false’, the single-request API is used (which should use a connection pool under the hood); if ‘true’, requests are streamed through a connection pool flow (supposed to be more efficient than the single-request API). Note that it is essential to consume responses as soon as they are available to avoid running into timeouts: every ‘.via(someFlow)’ call adds another processing stage whose processing can be delayed relative to the previous one, which under backpressure might cause responses not to be consumed within the timeout. The single-request usage seems to be safer in this regard.
RUNNING_TASK_BASELINE_COUNT: the initial baseline count of concurrently processed batches (this number can be increased per job via the exposed API)
Kolibri dispatcher settings: used for the processing done by Kolibri. Should get the majority of resources
KOLIBRI_DISPATCHER_PARALLELISM_MIN: minimal parallelism
KOLIBRI_DISPATCHER_PARALLELISM_FACTOR: factor applied to the number of available processors to determine the number of threads
KOLIBRI_DISPATCHER_PARALLELISM_MAX: maximal parallelism
KOLIBRI_DISPATCHER_THROUGHPUT: number of messages a thread processes for one actor before moving on to the next actor
Default dispatcher settings: only used for some internals and kamon metrics handling. Should use only a fraction of resources, since the majority should be reserved for the kolibri dispatcher (settings above)
DEFAULT_DISPATCHER_PARALLELISM_FACTOR: factor applied to the number of available processors to determine the number of threads
DEFAULT_DISPATCHER_PARALLELISM_MAX: maximal parallelism
DEFAULT_DISPATCHER_PARALLELISM_MIN: minimal parallelism
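The three parallelism settings combine in the usual akka fork-join-executor fashion: the thread count is the processor count times the factor, clamped between min and max. A small sketch with the compose-file values of the kolibri dispatcher (cores and the integer factor are assumptions for illustration; the actual factor is a double):

```shell
# Sketch: threads = clamp(min, cores * factor, max), the akka
# fork-join-executor sizing rule. Values mirror the compose file above;
# cores=8 is an assumed machine, factor rounded to an integer for shell math.
cores=8
factor=8   # KOLIBRI_DISPATCHER_PARALLELISM_FACTOR (8.0 in the compose file)
min=8      # KOLIBRI_DISPATCHER_PARALLELISM_MIN
max=32     # KOLIBRI_DISPATCHER_PARALLELISM_MAX

threads=$(( cores * factor ))
[ "$threads" -lt "$min" ] && threads=$min
[ "$threads" -gt "$max" ] && threads=$max
echo "$threads"   # 8 * 8 = 64, clamped to the max of 32
```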
Http client/connection pool settings
HTTP_CLIENT_CONNECTION_TIMEOUT: http client connection timeout, in the format ‘5s’ (or ‘1m’ or similar)
HTTP_CLIENT_IDLE_TIMEOUT: http client idle timeout, in the format ‘10s’ (or ‘1m’ or similar)
HTTP_CONNECTION_POOL_MAX_OPEN_REQUESTS: max concurrently open requests in the connection pool
HTTP_CONNECTION_POOL_MAX_RETRIES: max retries when executing a request
HTTP_CONNECTION_POOL_MAX_CONNECTIONS: max number of connections for the connection pool
HTTP_CONNECTION_POOL_SUBSCRIPTION_TIMEOUT: a FiniteDuration, e.g. ‘60 seconds’, used as the connection pool subscription timeout
Partial result grouping: the ‘RESULT_ELEMENT_GROUPING_*’ settings determine how many single results are at most collected over a given timespan in an “aggregator buffer” before the elements collected so far are sent to the actual aggregator. This considerably reduces the number of single messages sent to the aggregator.
USE_RESULT_ELEMENT_GROUPING: ‘true’/‘false’. Turns the partial result buffering on/off. If turned off, the ‘RESULT_ELEMENT_GROUPING_*’ parameters below don’t have any effect
RESULT_ELEMENT_GROUPING_COUNT: max number of elements to group, after which the buffered result is sent to the actual aggregator
RESULT_ELEMENT_GROUPING_INTERVAL_IN_MS: maximal interval in ms after which the buffered result is sent to the actual aggregator, irrespective of how many elements have been buffered so far
RESULT_ELEMENT_GROUPING_PARALLELISM: the grouping parallelism
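To get a feel for the effect of the grouping, here is a rough estimate of the message reduction with the compose-file value of RESULT_ELEMENT_GROUPING_COUNT. The result count is an assumed example, and interval-based (time-triggered) flushes are ignored, so the real message count can be somewhat higher:

```shell
# Sketch: estimated aggregator messages with partial result grouping.
# Assumption: each full buffer of group_count elements becomes one message;
# time-triggered flushes via RESULT_ELEMENT_GROUPING_INTERVAL_IN_MS ignored.
results=100000     # assumed number of single results produced by a job
group_count=2000   # RESULT_ELEMENT_GROUPING_COUNT from the compose file

# ceiling division: number of buffered messages sent to the aggregator
messages=$(( (results + group_count - 1) / group_count ))
echo "$messages"   # 50 aggregator messages instead of 100000 single ones
```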
Aggregators, retries, persistence mode
USE_AGGREGATOR_BACKPRESSURE: ‘true’/‘false’. If set to ‘true’, ACK messages from the aggregator are used to apply backpressure on the processing in case the aggregator cannot aggregate fast enough
AGGREGATOR_RECEIVE_PARALLELISM: parallelism with which result messages are sent to the aggregator
MAX_NR_BATCH_RETRIES: the maximal number of retries executed for failed batches (batches are retried until they succeed, at most this many times)
PERSISTENCE_MODE: defines where to write to and read from. Valid values: ‘LOCAL’, ‘AWS’, ‘GCP’, ‘CLASS’. If ‘LOCAL’, set the ‘LOCAL_STORAGE_*’ vars below; if ‘AWS’, set the ‘AWS_*’ vars; if ‘GCP’, set the ‘GCP_*’ vars; if ‘CLASS’, set PERSISTENCE_MODULE_CLASS
PERSISTENCE_MODULE_CLASS: if PERSISTENCE_MODE is ‘CLASS’, the full class path of the module to be loaded (needs a no-args constructor and must extend PersistenceDIModule), e.g. ‘de.awagen.kolibri.base.config.di.modules.persistence.LocalPersistenceModule’
The following settings are only valid if PERSISTENCE_MODE is ‘AWS’:
AWS_PROFILE: this name should match a profile for which a configuration exists in the .aws folder volume-mounted in the docker-compose definition (e.g. ‘developer’ if such a profile exists)
AWS_S3_BUCKET: the bucket name (without the s3:// prefix). The selected AWS_PROFILE should have rights to read from and write to the bucket
AWS_S3_PATH: the “directory” path within the bucket defined by AWS_S3_BUCKET, e.g. ‘metric_test’ or ‘folder1/folder2’ (strictly speaking the concept of directories does not exist in s3, but it can be used analogously)
AWS_S3_REGION: the AWS region to use. Check the com.amazonaws.regions.Regions enum in the used AWS lib for valid values, e.g. ‘EU_CENTRAL_1’
The following settings are only valid if PERSISTENCE_MODE is ‘LOCAL’:
LOCAL_STORAGE_WRITE_BASE_PATH: should be set to a path that the local volume on the host machine is mounted to. Files are written relative to this path
LOCAL_STORAGE_WRITE_RESULTS_SUBPATH: path relative to LOCAL_STORAGE_WRITE_BASE_PATH where results are written, in subfolders corresponding to the result output identifiers used in the jobs
LOCAL_STORAGE_READ_BASE_PATH: should be set to a path that the local volume on the host machine is mounted to. Files are read relative to this path
The following settings are only valid if PERSISTENCE_MODE is ‘GCP’:
GCP_GS_BUCKET: bucket name without the gs:// prefix
GCP_GS_PATH: path from the bucket root to append to all requested paths
GCP_GS_PROJECT_ID: the project id for which the used service account is defined and for which the gs bucket was created
GOOGLE_APPLICATION_CREDENTIALS: full path within the container to the service account key json file (see the volume mount below), e.g. ‘/home/kolibri/gcp/[sa-key-file-name].json’
The following settings are only valid if DISCOVERY_METHOD is ‘kubernetes-api’:
K8S_DISCOVERY_POD_NAMESPACE_PATH: (Optional) path to the file the namespace is written to. Available by default when setting the namespace in the k8s charts for the deployment / pod
K8S_DISCOVERY_POD_NAMESPACE: namespace the pods are assigned to; if not set properly, k8s access rights will prevent the pods from finding each other. Make sure this corresponds to the namespace you deploy in. If set, K8S_DISCOVERY_POD_NAMESPACE_PATH has no effect
K8S_DISCOVERY_POD_LABEL_SELECTOR: label selector by which to identify the right pods
Additional general settings:
JOB_TEMPLATES_PATH: must be relative to the base path or bucket path, depending on the selected persistence
JUDGEMENT_FILE_SOURCE_TYPE: format the judgement file is given in
JUDGEMENT_FILE_JSON_LINES_JUDGEMENT_VALUE_TYPE_CAST: gives the type to cast the judgement value to in case JUDGEMENT_FILE_SOURCE_TYPE is a json format. E.g. if numerical values are wrapped in strings, such as “0.55”, select ‘STRING’; in case they are given as numbers, use ‘DOUBLE’
USE_INSECURE_SSL_ENGINE: ‘true’/‘false’. Set to ‘true’ if you request via https but the server does not provide a valid certificate
Volumes: some mounts needed to access data within the docker container
./test-files:/app/data: mounts the workspace test-files folder to the /app/data folder within the container
[absolute-path-containing-your-aws-config-folder]/.aws:/home/kolibri/.aws:ro: read-only mount of the folder on the local machine containing the aws credentials into the standard location in the container, where it is picked up automatically by the aws lib
[path-to-dir-containing-key-file-on-local-machine]/:/home/kolibri/gcp:ro: read-only mount of the folder containing the json key file for the gcp service account into a location in the container; it is picked up by setting the env variable GOOGLE_APPLICATION_CREDENTIALS to the file path