Performance Considerations

Our connectors and the query pipelines are designed for heavy duty and high load scenarios.

Our query pipelines run with optimal complexity and generate one API request towards the search engine(s) per run. The search interfaces generate as many requests against the query pipelines as result lists are configured on the respective vertical.

The connectors are designed to perform a minimum amount of API calls towards the content source, to implement smart crawl strategies and caching where needed. Our strategies try to avoid combinatorial explosions as well as nested loops.

O(n) is our goal and sometimes we see O(nlogn) complexity where we cannot avoid a better complexity.

Crawling

However, there are boundaries which limit crawl speeds:

  1. Hardware of the underlying VM Technical Prerequisites

  2. Content source performance. If the connector is configured to crawl the entire content source, it will need to access all contents. Even though the crawl strategies are smart, the content source needs to answer many API calls. If the source is under high load or slow, then crawling gets delayed.

  3. Database connectivity and performance (cf. Database Settings ). The database is the heart of the

  4. connector and the connector needs to maintain a lot of information in this database.

  5. Content transformation performance. Embeddings, content categorization through LLMs, etc. take time and will slow down the initial crawl speed. Text extractions from Words, PDFs, etc. however should be fast operations, if CPU is available and fast enough.

  6. Search engine. If the search engine is not fast in accepting sent documents, then this may delay the connector.

You can always double check the journey of a document, by finding the document in the document state, e.g.,

Query APIs and Search Interface

Query processing and searches via the search interface mainly rely on the processing steps within the configured query pipelines and the performance of the search engine. A main cost driver are embeddings and a potential vector search.

If performance is not as wanted, please consider the measurements which are visible within the log files.