Retrieval Augmented Generation for Regulatory Documents

March 2025

Retrieval Augmented Generation (RAG) and Enterprise Search can significantly reduce the cost of training staff in highly regulated areas. Examples are railway networks, air traffic, or internet service providers, where administrators and operators have to study and keep track of many rules and regulations.

As is well known by now, large language models are very good at handling legal questions. As a small disclaimer, however, they never replace skilled personnel, and you should always get advice from professionals rather than blindly following what a search engine tells you.

What is the benefit of having regulations in a RAG application?

Of course, trainees still need to learn many concepts by heart. A Q&A bot, however, can answer regulatory questions right away. Moreover, it can cite the relevant regulations, which shortens the learning process significantly. Even imprecisely formulated questions can be answered reasonably well with this approach.

Setting Up Your RAG Environment

A Retrieval Augmented Generation setup always comprises a search engine (retrieval), indexed data, a large language model, and some kind of user interface.

Setting up the environment involves the following steps:

1. Decide on a (vector) search engine, then deploy and configure it. Cloud search engines such as Azure AI Search are available as managed services, so you can start quite easily without setting up hardware or running your own deployment. On the other hand, an on-premises search engine might be more cost-effective.
2. Decide on a model for embeddings, i.e., for transforming your text into vectors, and on a large language model for completions, i.e., for formulating answers to questions. We often use OpenAI models or open-source models, for example text-embedding-ada-002 for vectorization and gpt-4o-mini for completions.
3. Index your data. Here you normally need to crawl a file share, an enterprise content system such as SharePoint, or web pages. This is handled by a connector or crawler.
4. Prior to pushing the documents to the search engine, extract the texts and contexts. Here you can use Apache Tika to convert binary documents (for instance PDFs or Word documents) into plain text.
5. Split each document into smaller chunks. We usually recommend this, as relevancy becomes much better and answers become more precise.
6. Vectorize the document contents using your embedding model.
7. Index these vectors in the search engine.
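The splitting step can be illustrated with a minimal sketch. It assumes word-based chunks with a small overlap between neighbours; the function name `split_into_chunks` and the defaults of 200 words and 40 words of overlap are illustrative assumptions, not values prescribed by this article or by any particular search engine. Splitting along sentences, paragraphs, or section headings of a regulation is an equally valid choice.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    The sizes are illustrative assumptions, not recommended values.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in split_into_chunks(sample, max_words=4, overlap=1):
        print(chunk)
```

Each resulting chunk would then be passed to the embedding model and indexed as its own search document, ideally together with a reference to the source regulation so that answers can cite it.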

Regulatory documents usually come as lengthy PDFs or as series of long PDFs. Sometimes they are published on web pages. By configuring your crawler in Step 3 appropriately, you can index all applicable laws and regulations into your search engine.

Retrieving Answers Based on Your Regulations

Now all your documents are in your vector search engine. What remains is to expose this knowledge to your users. For this, you need to build the user-facing part of your RAG application. This works as follows.

1. Create a small RESTful service that accepts user queries or questions.
2. Vectorize the user input, using the same embedding model as in Step 6 of the setup.
3. Query the search engine using this vector.
4. Pass the search results, together with the user query, to the completions API of your large language model deployment, for instance with the following prompt:
   "A user asks the following question '<question>'. Here are matching regulations '<searchResult1>', '<searchResult2>', '<searchResult3>'. Can you answer the question using these results?"
5. Return the completion through the REST API.
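The query flow described above can be sketched as follows. This is a simplified illustration under stated assumptions: the in-memory list and the `cosine_similarity` ranking stand in for the vector search engine, and the `embed` and `complete` callables are hypothetical placeholders for your embedding and completions API calls. Only the prompt wording follows the example given in this article.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Vector, index: List[Tuple[str, Vector]], k: int = 3) -> List[str]:
    """Stand-in for the vector search: return the k chunk texts most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, results: List[str]) -> str:
    """Assemble the completion prompt from the question and the retrieved chunks."""
    quoted = ", ".join(f"'{r}'" for r in results)
    return (
        f"A user asks the following question '{question}'. "
        f"Here are matching regulations {quoted}. "
        "Can you answer the question using these results?"
    )


def answer(
    question: str,
    embed: Callable[[str], Vector],      # hypothetical: wraps your embedding model
    index: List[Tuple[str, Vector]],     # hypothetical: (chunk text, chunk vector) pairs
    complete: Callable[[str], str],      # hypothetical: wraps your completions API
    k: int = 3,
) -> str:
    """Vectorize the question, retrieve matching chunks, and ask the model."""
    query_vec = embed(question)
    results = top_k(query_vec, index, k)
    return complete(build_prompt(question, results))
```

In a real deployment, the search engine performs the similarity search itself, so `top_k` would be replaced by a query against its API. The important constraint is that `embed` uses the same embedding model for questions as was used for the indexed chunks.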

You can integrate such a REST API into Teams, Slack, or a web page, wherever your users and trainees might need support.
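As an illustration of how small such a REST service can be, here is a minimal sketch using only the Python standard library. The `/ask` path, the JSON field names `question` and `answer`, and the `answer_fn` callable are assumptions made for this example; a production service would typically add authentication, logging, and a proper web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple


def handle_question(body: bytes, answer_fn: Callable[[str], str]) -> Tuple[int, dict]:
    """Turn a raw JSON request body into an (HTTP status, response payload) pair."""
    try:
        payload = json.loads(body or b"{}")
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "request body must be valid JSON"}
    question = payload.get("question") if isinstance(payload, dict) else None
    if not isinstance(question, str) or not question.strip():
        return 400, {"error": "field 'question' is required"}
    return 200, {"answer": answer_fn(question.strip())}


def make_handler(answer_fn: Callable[[str], str]):
    """Build a request handler class bound to the given answer function."""

    class QuestionHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/ask":  # hypothetical endpoint path
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            status, payload = handle_question(self.rfile.read(length), answer_fn)
            data = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return QuestionHandler


def serve(answer_fn: Callable[[str], str], port: int = 8080) -> None:
    """Run the service locally until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(answer_fn)).serve_forever()
```

Calling `serve(my_answer_fn)` starts the service, where `my_answer_fn` is whatever function implements the retrieval and completion steps. A Teams bot, a Slack app, or a web page would then send a POST request with a body such as `{"question": "..."}` to `/ask` and display the returned `answer` field.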

Our Product - the RheinInsights Retrieval Suite

As outlined above, you can easily build your own regulatory Q&A application. The benefit of the RheinInsights Retrieval Suite is that it offers crawling, text processing, document splitting, and vectorization on the one hand, and a rich interface and REST APIs on the other. This simplifies getting started while you keep full control over the underlying data and search engine.
