Text Extractor

This stage uses Apache Tika to extract the text content from a binary, plain-text or HTML document. Any metadata that is generated during text extraction is also added to the document metadata.
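
As a rough illustration of what such a stage does, the following Python sketch extracts text from a document body and merges the extraction metadata into the document metadata. It is not the product's implementation. The `Document` class, the function names and the rule that existing metadata keys are kept on conflict are all assumptions made for this example, and `plain_text_extractor` is a simple stand-in for Apache Tika so that the sketch runs without extra dependencies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


# Hypothetical document model; the product's real data structures are not shown here.
@dataclass
class Document:
    body: bytes
    text: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)


# An extractor takes raw bytes and returns (extracted_text, extraction_metadata).
Extractor = Callable[[bytes], Tuple[str, Dict[str, str]]]


def plain_text_extractor(body: bytes) -> Tuple[str, Dict[str, str]]:
    """Stand-in for Apache Tika: decodes UTF-8 and reports two metadata fields."""
    text = body.decode("utf-8", errors="replace")
    return text, {"Content-Encoding": "UTF-8", "Content-Length": str(len(body))}


def text_extractor_stage(doc: Document, extract: Extractor = plain_text_extractor) -> Document:
    """Extract text from the body and add the extraction metadata to the document."""
    text, extracted = extract(doc.body)
    doc.text = text
    # Assumption: metadata already present on the document wins on key conflicts.
    for key, value in extracted.items():
        doc.metadata.setdefault(key, value)
    return doc


doc = text_extractor_stage(Document(body=b"Hello, Tika!", metadata={"source": "demo"}))
print(doc.text)
print(doc.metadata)
```

In a real deployment the extractor would be backed by Apache Tika, which also detects the document format and typically returns a much larger set of metadata fields.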

This stage has no additional configuration parameters.

This stage does not remove any HTML tokens from an HTML document. To strip them, use the Html Token Remover stage.

© 2025 RheinInsights GmbH · Data privacy · Imprint · Copyrights and Open Source Licenses · Contact