Apache Tika document loader

Largestack AI includes an opt-in Apache Tika parser for broad document extraction. Use it when a workflow needs a single parser for formats such as PDF, DOC/DOCX, PPT/PPTX, XLS/XLSX, HTML, RTF, and the other formats Apache Tika supports.

Tika server backend

Run Apache Tika in a trusted environment, then point Largestack at the server:

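export TIKA_SERVER_URL=http://127.0.0.1:9998

from largestack._loaders.tika import load_with_tika

# load_with_tika is a coroutine: await it inside an async function
# (or an asyncio-aware REPL or notebook).
docs = await load_with_tika(
    "file.pdf",
    server_url="http://127.0.0.1:9998",
)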
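
The shell export and the explicit server_url argument above can be tied together in caller code. The following is a minimal sketch, assuming the caller reads TIKA_SERVER_URL itself; whether load_with_tika consults the variable on its own is not stated here, and resolve_tika_url is a hypothetical helper, not part of Largestack.

```python
import os

# Local address used in the example above.
DEFAULT_TIKA_URL = "http://127.0.0.1:9998"


def resolve_tika_url(env=None):
    """Return the Tika server URL from TIKA_SERVER_URL, or the local default.

    Hypothetical helper for illustration; not part of Largestack.
    """
    env = os.environ if env is None else env
    url = env.get("TIKA_SERVER_URL", "").strip()
    # Drop a trailing slash so endpoint paths can be appended consistently.
    return (url or DEFAULT_TIKA_URL).rstrip("/")
```

The result could then be passed along as server_url=resolve_tika_url().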

The HTTP backend is the default. It uses /rmeta/text for recursive metadata and content extraction, then falls back to /tika/text if recursive metadata is unavailable.
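
The endpoint order described above can be sketched as a small try-then-fall-back routine. This is an illustration only, not Largestack's implementation: the function names are hypothetical, the exact conditions under which Largestack falls back are not documented here, and the sketch simply treats any HTTP or connection error on /rmeta/text as "unavailable". The response body is returned as raw bytes so that no assumption is made about its format.

```python
import urllib.request


def http_put(url, data):
    """PUT raw document bytes to a Tika server endpoint; return the raw reply."""
    request = urllib.request.Request(url, data=data, method="PUT")
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()


def extract_with_fallback(data, server_url, put=http_put):
    """Try /rmeta/text first, then /tika/text if that request fails.

    Returns (endpoint_used, response_bytes). Hypothetical helper.
    """
    base = server_url.rstrip("/")
    try:
        return "/rmeta/text", put(base + "/rmeta/text", data)
    except OSError:
        # urllib's HTTPError and URLError are OSError subclasses, so this
        # covers HTTP error statuses as well as connection failures.
        return "/tika/text", put(base + "/tika/text", data)
```

The put parameter exists so the routine can be exercised with a stub instead of a live server.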

Dispatcher usage

Use the Tika parser through the loader dispatcher without changing default loader behavior:

from largestack._loaders import load

docs = await load("file.docx", parser="tika")

Calling load("file.docx") without parser="tika" continues to use the built-in Largestack loader for that file type.
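
One way to picture this opt-in behavior is a dispatcher that only consults an extra registry when a parser is named explicitly. The sketch below is purely illustrative: the registries, loader labels, and choose_loader are hypothetical and do not describe Largestack's internals.

```python
from pathlib import Path

# Hypothetical registries; the names and labels are illustrative only.
BUILTIN_BY_SUFFIX = {".docx": "builtin-docx", ".pdf": "builtin-pdf"}
OPT_IN_PARSERS = {"tika": "tika"}


def choose_loader(path, parser=None):
    """Return the label of the loader that would handle `path`."""
    if parser is not None:
        # An explicit parser wins, but only if it is a registered opt-in.
        if parser not in OPT_IN_PARSERS:
            raise ValueError(f"unknown parser: {parser!r}")
        return OPT_IN_PARSERS[parser]
    # No parser given: the default per-suffix mapping is left untouched.
    suffix = Path(path).suffix.lower()
    if suffix not in BUILTIN_BY_SUFFIX:
        raise ValueError(f"no built-in loader for {suffix!r}")
    return BUILTIN_BY_SUFFIX[suffix]
```

Because the opt-in registry is never read unless parser is set, adding an entry to it cannot change what a plain call resolves to.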

Python package backend

The optional Python backend uses the PyPI tika package:

pip install "largestack[tika]"

from largestack._loaders.tika import load_with_tika

docs = await load_with_tika("file.pdf", backend="python")

This backend may require Java and may start or download Apache Tika server assets depending on the user environment. Prefer the HTTP server backend for production and controlled-pilot deployments.
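
Since the Python backend may need a Java runtime, a caller might check for one before choosing it. The following is a minimal sketch under stated assumptions: java_available and pick_backend are hypothetical helpers, and the returned strings are labels for this sketch (only "python" appears above as a backend value).

```python
import shutil


def java_available(which=shutil.which):
    """Return True if a `java` executable is found on PATH."""
    return which("java") is not None


def pick_backend(server_url=None, which=shutil.which):
    """Prefer a configured Tika server; otherwise require Java.

    Returns "http" or "python" as labels. Hypothetical helper.
    """
    if server_url:
        # A known server avoids local Java and asset downloads entirely.
        return "http"
    if java_available(which):
        return "python"
    raise RuntimeError(
        "No Tika server URL configured and no Java runtime found on PATH"
    )
```

Finding java on PATH does not guarantee a compatible version, so this check is a quick precondition rather than a full validation.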

Security note

Files are sent to the configured Apache Tika server. For production workloads, run Tika on trusted internal infrastructure, avoid sending sensitive documents to unknown remote servers, and do not log extracted document content.
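
Two of these precautions can be approximated in caller code. The sketch below assumes a deployment where the Tika server is addressed by IP literal; both helpers are hypothetical, and hostnames are deliberately not resolved, so they are reported as unverified rather than trusted.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_address(server_url):
    """True if the URL's host is a loopback or private IP literal."""
    host = urlsplit(server_url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames are not resolved here; treat them as unverified.
        return False
    return address.is_loopback or address.is_private


def log_safe_summary(text):
    """Describe extracted text without including any of its content."""
    return {"chars": len(text), "lines": len(text.splitlines())}
```

A caller could refuse to send files when is_internal_address returns False, and log log_safe_summary(text) in place of the extracted text itself.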