Document Processing in Azure AI Search

February 6, 2025

Azure AI Search is a scalable and flexible search engine from Microsoft. Hosted in Azure, it supports a wide range of use cases, from website search and enterprise search to Q&A bots and interactive knowledge applications.

In this blog post, we show how to index content in Azure AI Search and discuss the advantages and disadvantages of the two approaches described below.

Azure AI Search REST APIs

Our connectors use the Azure AI Search REST APIs to index documents.

Using the REST APIs has advantages and disadvantages. On the plus side, it gives a connector or crawler framework direct control to add documents to the index or remove them from it, without any detours. A detour would be to retrieve the documents and store them in a file share or a similar location, from which Azure AI Search must then retrieve them again.
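To illustrate this direct control, here is a minimal Python sketch of how a connector might assemble a batch for the documents endpoint of Azure AI Search. It is not the implementation of our connectors. The helper names `build_batch` and `build_request`, the key field `id`, and the API version string are assumptions for illustration; please check them against your index schema and the current REST API reference.

```python
import json
from urllib import request

# Assumption: adjust to the API version your search service supports.
API_VERSION = "2024-07-01"


def build_batch(upserts, deleted_keys, key_field="id"):
    """Build the JSON body for one indexing batch.

    upserts:      list of dicts whose fields match the index schema
    deleted_keys: keys of documents that should be removed from the index
    key_field:    name of the key field in the index (assumed to be "id")
    """
    actions = []
    for doc in upserts:
        # mergeOrUpload updates an existing document or creates a new one
        actions.append({"@search.action": "mergeOrUpload", **doc})
    for key in deleted_keys:
        # a delete action only needs the document key
        actions.append({"@search.action": "delete", key_field: key})
    return {"value": actions}


def build_request(service, index, api_key, batch):
    """Prepare (but do not send) the POST request for the batch."""
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index}/docs/index?api-version={API_VERSION}"
    )
    return request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )
```

Sending the prepared request, for example with `urllib.request.urlopen`, adds and removes documents in a single call, so no intermediate store is involved.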

However, no further transformations take place “behind” the REST APIs of Azure AI Search. This means that the connector or web crawler framework typically needs to perform the following processing steps before the actual indexing, i.e., before sending the documents to the indexing APIs:

  • Text extraction from binary documents (such as PDFs) or from HTML pages

  • Image recognition

  • Named Entity Recognition

  • Document classification / tagging

  • Speech to text

  • Vectorization or embedding

The connector or crawler framework must carry out these steps when using the REST APIs.
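The following Python sketch shows what such a pre-indexing pipeline could look like in its simplest form. It only covers text extraction from HTML, tagging and vectorization, and it is not how the RheinInsights Retrieval Suite implements these steps. The functions `tag_document` and `embed` are hypothetical placeholders: a real pipeline would call a classifier and an embedding model here.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(doc):
    """Step 1: text extraction from an HTML page."""
    parser = _TextExtractor()
    parser.feed(doc["html"])
    parser.close()
    return {**doc, "content": " ".join(parser.parts)}


def tag_document(doc):
    """Step 2: placeholder tagger based on simple keyword matching."""
    keywords = ("invoice", "contract")
    tags = [k for k in keywords if k in doc["content"].lower()]
    return {**doc, "tags": tags}


def embed(doc):
    """Step 3: placeholder for an embedding model (NOT a real embedding)."""
    text = doc["content"]
    return {**doc, "vector": [float(len(text)), float(len(text.split()))]}


PIPELINE = [extract_text, tag_document, embed]


def transform(doc):
    """Run all steps in order and drop the raw HTML before indexing."""
    for step in PIPELINE:
        doc = step(doc)
    return {k: v for k, v in doc.items() if k != "html"}


def transform_all(docs, workers=4):
    """Process several documents in parallel; result order matches input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, docs))
```

The documents returned by `transform_all` contain only fields that are ready to be sent to the indexing APIs, provided the index schema defines matching fields.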

Therefore, the RheinInsights Retrieval Suite provides document transformation pipelines. These make it possible to carry out the above-mentioned processing steps efficiently and in parallel before indexing (see Content Transformation). Our Suite offers an administration interface for no-code configuration of the respective processing steps.

Use of AI Search Indexers

Azure AI Search offers so-called indexers as another indexing method (see Indexer overview). These can capture content from various sources, such as Azure Blob Storage, various Azure SQL databases or Azure Files. Indexers have the benefit that you can also configure document processing using the so-called data wizards (see Import wizards in Azure portal). This kind of low-code interface makes it easy to configure the content (and query) transformation steps in Azure AI Search.

What can you do if your content is not originally located in Azure Blob Storage or an Azure SQL database, but you still want to use the data wizards? In this case, the connectors or crawlers have to store the retrieved documents in one of these stores. Documents that are deleted in the source must in turn be removed from these stores so that they are also removed from the search index.
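The synchronization logic described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the planned implementation in our Retrieval Suite: a plain dictionary stands in for the staging store (for example a blob container), and the function names `plan_sync` and `sync` are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Fingerprint used to detect changed documents."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(source_docs, staged_hashes):
    """Compare the source system with the staging store.

    source_docs:   dict mapping document key -> current content (bytes)
    staged_hashes: dict mapping document key -> hash of the staged copy
    Returns the keys to upload (new or changed) and the keys to delete.
    """
    to_upload = sorted(
        key
        for key, data in source_docs.items()
        if staged_hashes.get(key) != content_hash(data)
    )
    to_delete = sorted(key for key in staged_hashes if key not in source_docs)
    return to_upload, to_delete


def sync(source_docs, store):
    """Apply the plan to `store`, a dict standing in for the staging store."""
    staged = {key: content_hash(data) for key, data in store.items()}
    to_upload, to_delete = plan_sync(source_docs, staged)
    for key in to_upload:
        store[key] = source_docs[key]  # write new or changed document
    for key in to_delete:
        del store[key]  # remove document that no longer exists in the source
    return to_upload, to_delete
```

With a real store, the two loops in `sync` would be replaced by upload and delete calls against that store, after which the indexer picks up the changes on its next run.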

We plan to support this functionality in our Retrieval Suite from March 2025.