
📖 feat: Add Native ODT Document Parser Support #12303

Merged: danny-avila merged 8 commits into danny-avila:dev from pol9061:fix/odt-native-parser on Mar 19, 2026

@pol9061 commented Mar 18, 2026 (Contributor)

Summary

Fixes #12302

ODT support was partially introduced in #11959, which added the MIME type to the allowlist, but two pieces were missing for the full pipeline to work. This PR adds both:

  1. Added application/vnd.oasis.opendocument.text to documentParserMimeTypes in packages/data-provider/src/file-config.ts, routing ODT files through the document parser pipeline in processAgentFileUpload
  2. Added odtToText() parser in packages/api/src/files/documents/crud.ts that extracts text using jszip to read content.xml from the ODT ZIP archive
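
The routing step in item 1 can be pictured with a minimal sketch. It assumes the routing list is an array of regular expressions, as the review summary suggests; `parserMimeTypesSketch` and `usesDocumentParser` are hypothetical names for illustration, and the PDF entry is only a placeholder, not a claim about the real list.

```typescript
// Hypothetical, trimmed-down stand-in for the routing list in file-config.ts.
// Only the ODT entry is taken from this PR; the PDF entry is a placeholder.
const parserMimeTypesSketch: RegExp[] = [
  /^application\/pdf$/,
  /^application\/vnd\.oasis\.opendocument\.text$/,
];

// Returns true when an upload with this MIME type should go through
// the document parser instead of being read as raw bytes.
function usesDocumentParser(mimeType: string): boolean {
  return parserMimeTypesSketch.some((pattern) => pattern.test(mimeType));
}
```

Without the ODT pattern in the list, an `.odt` upload would pass the allowlist but skip the parser, which matches the "raw binary/ZIP dump" symptom described under Testing.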

Change Type

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Translation update

Testing

Added the test "parseDocument() parses text from odt" to packages/api/src/files/documents/crud.spec.ts, using a real sample.odt fixture. Run with:

cd packages/api
npx jest --testPathPatterns=crud.spec --no-coverage

Manual end-to-end test

  1. Run registry.librechat.ai/danny-avila/librechat-dev:latest
  2. No fileConfig overrides in librechat.yaml
  3. Create an agent with the context capability
  4. Upload an .odt file using "Upload as Text"
  5. Ask the model about the file content
  6. Model correctly returns the extracted text content

Before fix: the model describes a raw binary/ZIP dump and cannot read the content
After fix: the model correctly reads and answers questions about the document content

Test Configuration:

No special environment variables required. Tests run without any external services or API keys.

Checklist

  • My code adheres to this project's style guidelines
  • I have performed a self-review of my own code
  • I have commented in any complex areas of my code
  • I have made pertinent documentation changes
  • My changes do not introduce new warnings
  • I have written tests demonstrating that my changes are effective or that my feature works
  • Local unit tests pass with my changes
  • Any changes dependent on mine have been merged and published in downstream modules.
  • A pull request for updating the documentation has been submitted.

Copilot AI review requested due to automatic review settings March 18, 2026 21:50

Copilot AI left a comment

Pull request overview

Adds full native ODT parsing support so .odt uploads are routed through the document parser pipeline and have text extracted (fixing #12302) instead of being treated as raw binary/ZIP.

Changes:

  • Route application/vnd.oasis.opendocument.text through the document parser via documentParserMimeTypes.
  • Add an ODT parser (odtToText) and wire it into getParserForMimeType().
  • Add a Jest integration test + sample.odt fixture.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| packages/data-provider/src/file-config.ts | Adds ODT MIME regex to the native document-parser routing list. |
| packages/api/src/files/documents/crud.ts | Adds ODT parsing implementation and selects it for the ODT MIME type. |
| packages/api/src/files/documents/crud.spec.ts | Adds a test asserting ODT text extraction works end-to-end. |
| packages/api/src/files/documents/sample.odt | Adds a real ODT fixture used by the new test. |

Comment thread packages/data-provider/src/file-config.ts
Comment thread packages/api/src/files/documents/crud.ts Outdated
Comment thread packages/api/src/files/documents/crud.ts Outdated
Comment thread packages/api/src/files/documents/crud.ts Outdated
Comment thread packages/data-provider/src/file-config.ts Outdated
@pol9061 pol9061 marked this pull request as draft March 18, 2026 21:56
@pol9061 pol9061 force-pushed the fix/odt-native-parser branch from cd5a3be to 3dbc166 Compare March 18, 2026 21:59
@pol9061 pol9061 marked this pull request as ready for review March 18, 2026 22:31
@danny-avila danny-avila changed the title from "fix: add ODT support to native document parser" to "📖 feat: Add Native ODT Document Parser Support" Mar 19, 2026
@danny-avila danny-avila merged commit 7e74165 into danny-avila:dev Mar 19, 2026
12 checks passed
jcbartle pushed a commit to jcbartle/LibreChat that referenced this pull request May 11, 2026
* fix: add ODT support to native document parser

* fix: replace execSync with jszip for ODT parsing

* docs: update documentParserMimeTypes comment to include odt

* fix: improve ODT XML extraction and add empty.odt fixture

- Scope extraction to <office:body> to exclude metadata/style nodes
- Map </text:p> and </text:h> closings to newlines, preserving paragraph
  structure instead of collapsing everything to a single line
- Handle <text:line-break/> as explicit newlines
- Strip remaining tags, normalize horizontal whitespace, cap consecutive
  blank lines at one
- Regenerate sample.odt as a two-paragraph fixture so the test exercises
  multi-paragraph output
- Add empty.odt fixture and test asserting 'No text found in document'
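
The extraction steps listed in this commit could look roughly like the sketch below. It is regex-based and operates on the content.xml string only; `odtXmlToText` is a hypothetical name, and the actual odtToText() in crud.ts may differ in detail. The error message is the one the empty.odt test asserts.

```typescript
// Sketch only: turns the content.xml string of an ODT into plain text.
function odtXmlToText(contentXml: string): string {
  // Scope to <office:body> so metadata and style nodes are ignored.
  const bodyMatch = contentXml.match(/<office:body[^>]*>([\s\S]*?)<\/office:body>/);
  const body = bodyMatch ? bodyMatch[1] : '';

  const text = body
    .replace(/<\/text:(p|h)>/g, '\n') // paragraph and heading closings become newlines
    .replace(/<text:line-break\s*\/>/g, '\n') // explicit line breaks
    .replace(/<[^>]+>/g, '') // strip every remaining tag
    .replace(/[ \t]+/g, ' ') // normalize horizontal whitespace
    .replace(/\n{3,}/g, '\n\n') // cap consecutive blank lines at one
    .trim();

  if (!text) {
    throw new Error('No text found in document');
  }
  return text;
}
```

Mapping closings to newlines before the generic tag strip is what keeps paragraphs separate; running the strip first would collapse the whole body into one line.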

* fix: address review findings in ODT parser

- Use static `import JSZip from 'jszip'` instead of dynamic import;
  jszip is CommonJS-only with no ESM/Jest-isolation concern (F1)
- Decode the five standard XML entities after tag-stripping so
  documents with &, <, >, ", ' send correct text to the LLM (F2)
- Remove @types/jszip devDependency; jszip ships bundled declarations
  and @types/jszip is a stale 2020 stub that would shadow them (F3)
- Handle <text:tab/> → \t and <text:s .../> → ' ' before the generic
  tag stripper so tab-aligned and multi-space content is preserved (F4)
- Add sample-entities.odt fixture and test covering entity decoding,
  tab, and spacing-element handling (F5)
- Rename 'throws for empty odt' → 'throws for odt with no extractable
  text' to distinguish from a zero-byte/corrupt file case (F8)
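
Findings F2 and F4 can be illustrated with two small helpers. This is a sketch under assumptions: `decodeXmlEntities` and `replaceSpacingElements` are hypothetical names, and the `text:c` repeat count that `<text:s>` may carry is ignored here, following the commit's "→ ' '" wording.

```typescript
// The five predefined XML entities.
const XML_ENTITIES: Record<string, string> = {
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&apos;': "'",
  '&amp;': '&',
};

// Decode in a single pass so that "&amp;lt;" becomes "&lt;" and is not decoded twice.
function decodeXmlEntities(text: string): string {
  return text.replace(/&(lt|gt|quot|apos|amp);/g, (entity) => XML_ENTITIES[entity] ?? entity);
}

// Run before the generic tag stripper, otherwise these elements vanish without a trace.
// The \b keeps <text:span> and <text:soft-page-break/> from matching the <text:s> rule.
function replaceSpacingElements(xml: string): string {
  return xml
    .replace(/<text:tab\s*\/>/g, '\t')
    .replace(/<text:s\b[^>]*\/>/g, ' ');
}
```

Decoding after tag-stripping matters: decoding `&lt;` first would produce a literal `<` that the tag stripper could then mistake for markup.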

* fix: add decompressed content size cap to odtToText (F6)

Reads uncompressed entry sizes from the JSZip internal metadata before
extracting any content. Throws if the total exceeds 50MB, preventing a
crafted ODT with a high-ratio compressed payload from exhausting heap.

Adds a corresponding test using a real DEFLATE-compressed ZIP (~51KB on
disk, 51MB uncompressed) to verify the guard fires before any extraction.
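
A minimal sketch of the size guard, assuming each archive entry exposes its uncompressed size under `_data.uncompressedSize` as described in this PR. `ZipEntryLike` and `assertWithinSizeCap` are hypothetical names, and reading 50MB as 50 * 1024 * 1024 bytes is an assumption.

```typescript
// Assumed interpretation of the 50MB cap.
const MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024;

// Hypothetical shape mirroring the private metadata the commit message mentions.
interface ZipEntryLike {
  name: string;
  _data?: { uncompressedSize?: number };
}

// Sums the declared uncompressed sizes and throws before anything is extracted.
// A missing size counts as 0, so the guard fails open if the field is absent.
function assertWithinSizeCap(entries: ZipEntryLike[], limit: number = MAX_UNCOMPRESSED_BYTES): number {
  let total = 0;
  for (const entry of entries) {
    total += entry._data?.uncompressedSize ?? 0;
    if (total > limit) {
      throw new Error(`ODT content exceeds ${limit} bytes when decompressed`);
    }
  }
  return total;
}
```

Checking declared sizes up front is cheap because no entry has to be inflated, but it trusts the archive's own metadata and depends on a private field.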

* fix: add java to codeTypeMapping for file upload support

.java files were rejected with "Unable to determine file type" because
browsers send an empty MIME type for them and codeTypeMapping had no
'java' entry for inferMimeType() to fall back on.

text/x-java was already present in all five validation lists
(fullMimeTypesList, codeInterpreterMimeTypesList, retrievalMimeTypesList,
textMimeTypes, retrievalMimeTypes), so mapping to it (not text/plain)
ensures .java uploads work for both File Search and Code Interpreter.

Closes danny-avila#12307
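
The fallback described in this commit can be sketched as follows. `codeTypeMappingSketch` and `inferMimeTypeSketch` are hypothetical stand-ins; the real codeTypeMapping and inferMimeType() may have different shapes and signatures.

```typescript
// Hypothetical stand-in for codeTypeMapping: file extension to MIME type.
const codeTypeMappingSketch: Record<string, string> = {
  java: 'text/x-java',
};

// Prefer the browser-supplied type; fall back to the extension mapping when it is empty.
function inferMimeTypeSketch(fileName: string, browserType: string): string {
  if (browserType) {
    return browserType;
  }
  const dot = fileName.lastIndexOf('.');
  const extension = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : '';
  return codeTypeMappingSketch[extension] ?? '';
}
```

With no `java` entry the fallback returns an empty string, which corresponds to the "Unable to determine file type" rejection described above.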

* fix: address follow-up review findings (A-E)

A: regenerate package-lock.json after removing @types/jszip from
   package.json; without this npm ci was still installing the stale
   2020 type stubs and TypeScript was resolving against them
B: replace dynamic import('jszip') in the zip-bomb test with the same
   static import already used in production; jszip is CJS-only with no
   ESM/Jest isolation concern
C: document that the _data.uncompressedSize guard fails open if jszip
   renames the private field (accepted limitation, test would catch it)
D: rename 'preserves tabs' test to 'normalizes tab and spacing elements
   to spaces' since <text:tab> is collapsed to a space, not kept as \t
E: fix test.each([ formatting artifact (missing newline after '[')

---------

Co-authored-by: Danny Avila <danny@librechat.ai>