📖 feat: Add Native ODT Document Parser Support by pol9061 · Pull Request #12303 · danny-avila/LibreChat

pol9061 · 2026-03-18T21:50:53Z

Summary

ODT support was partially introduced in #11959, which added the MIME type to the allowlist, but two pieces were still missing for the full pipeline to work. This PR adds them:

Added application/vnd.oasis.opendocument.text to documentParserMimeTypes in packages/data-provider/src/file-config.ts, routing ODT files through the document parser pipeline in processAgentFileUpload
Added odtToText() parser in packages/api/src/files/documents/crud.ts that extracts text using jszip to read content.xml from the ODT ZIP archive
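
As a rough illustration of the routing change, the sketch below checks a MIME type against a list of patterns. It is not the code from file-config.ts: `documentParserMimeTypesSketch` and `routesToDocumentParser` are hypothetical names, and the real list presumably holds entries for other formats as well.

```typescript
// Hypothetical stand-in for documentParserMimeTypes. Only the ODT entry is
// taken from this PR; the real list lives in
// packages/data-provider/src/file-config.ts.
const documentParserMimeTypesSketch: RegExp[] = [
  /^application\/vnd\.oasis\.opendocument\.text$/,
];

// Returns true when an upload with this MIME type should be handed to the
// document parser instead of being treated as an opaque binary.
function routesToDocumentParser(mimeType: string): boolean {
  return documentParserMimeTypesSketch.some((pattern) => pattern.test(mimeType));
}

console.log(routesToDocumentParser('application/vnd.oasis.opendocument.text')); // true
console.log(routesToDocumentParser('application/zip')); // false
```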

Change Type

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update
Translation update

Testing

Added the test "parseDocument() parses text from odt" to packages/api/src/files/documents/crud.spec.ts, using a real sample.odt fixture. Run with:

cd packages/api
npx jest --testPathPatterns=crud.spec --no-coverage

Manual end-to-end test

Run registry.librechat.ai/danny-avila/librechat-dev:latest
No fileConfig overrides in librechat.yaml
Create an agent with the context capability
Upload an .odt file using "Upload as Text"
Ask the model about the file content
Model correctly returns the extracted text content

Before fix: model responds with raw binary/ZIP dump description and cannot read the content
After fix: model correctly reads and answers questions about the document content

Test Configuration:

No special environment variables required. Tests run without any external services or API keys.

Checklist

My code adheres to this project's style guidelines
I have performed a self-review of my own code
I have commented in any complex areas of my code
I have made pertinent documentation changes
My changes do not introduce new warnings
I have written tests demonstrating that my changes are effective or that my feature works
Local unit tests pass with my changes
Any changes dependent on mine have been merged and published in downstream modules.
A pull request for updating the documentation has been submitted.

Copilot

Pull request overview

Adds full native ODT parsing support so .odt uploads are routed through the document parser pipeline and have text extracted (fixing #12302) instead of being treated as raw binary/ZIP.

Changes:

Route application/vnd.oasis.opendocument.text through the document parser via documentParserMimeTypes.
Add an ODT parser (odtToText) and wire it into getParserForMimeType().
Add a Jest integration test + sample.odt fixture.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.

File	Description
packages/data-provider/src/file-config.ts	Adds ODT MIME regex to the native document-parser routing list.
packages/api/src/files/documents/crud.ts	Adds ODT parsing implementation and selects it for the ODT MIME type.
packages/api/src/files/documents/crud.spec.ts	Adds a test asserting ODT text extraction works end-to-end.
packages/api/src/files/documents/sample.odt	Adds a real ODT fixture used by the new test.

- Scope extraction to <office:body> to exclude metadata/style nodes
- Map </text:p> and </text:h> closings to newlines, preserving paragraph structure instead of collapsing everything to a single line
- Handle <text:line-break/> as explicit newlines
- Strip remaining tags, normalize horizontal whitespace, cap consecutive blank lines at one
- Regenerate sample.odt as a two-paragraph fixture so the test exercises multi-paragraph output
- Add empty.odt fixture and test asserting 'No text found in document'
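
The extraction steps listed above can be sketched roughly as follows. This is a minimal approximation and not the code from crud.ts: the function name `extractOdtBodyText` and the exact regular expressions are assumptions, and it operates on a content.xml string that has already been read from the archive.

```typescript
// Hypothetical helper: turns the content.xml of an ODT file into plain text.
function extractOdtBodyText(contentXml: string): string {
  // Scope to <office:body> so metadata and style nodes are ignored.
  const bodyMatch = /<office:body[^>]*>([\s\S]*?)<\/office:body>/.exec(contentXml);
  const body = bodyMatch?.[1] ?? '';

  return body
    .replace(/<\/text:(?:p|h)>/g, '\n') // paragraph and heading ends become newlines
    .replace(/<text:line-break\s*\/>/g, '\n') // explicit line breaks
    .replace(/<[^>]+>/g, '') // strip every remaining tag
    .replace(/[ \t]+/g, ' ') // normalize horizontal whitespace
    .replace(/ *\n */g, '\n') // drop spaces around newlines
    .replace(/\n{3,}/g, '\n\n') // cap consecutive blank lines at one
    .trim();
}

const sampleXml =
  '<office:document-content><office:meta><text:p>ignored</text:p></office:meta>' +
  '<office:body><office:text><text:h>Title</text:h>' +
  '<text:p>First  line<text:line-break/>second</text:p>' +
  '<text:p></text:p><text:p></text:p><text:p>Last</text:p>' +
  '</office:text></office:body></office:document-content>';

console.log(JSON.stringify(extractOdtBodyText(sampleXml)));
```

A regex pass like this is only reasonable because the goal is plain text for an LLM; anything that needs the document structure would call for a real XML parser.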

- Use static `import JSZip from 'jszip'` instead of dynamic import; jszip is CommonJS-only with no ESM/Jest-isolation concern (F1)
- Decode the five standard XML entities after tag-stripping so documents with &, <, >, ", ' send correct text to the LLM (F2)
- Remove @types/jszip devDependency; jszip ships bundled declarations and @types/jszip is a stale 2020 stub that would shadow them (F3)
- Handle <text:tab/> → \t and <text:s .../> → ' ' before the generic tag stripper so tab-aligned and multi-space content is preserved (F4)
- Add sample-entities.odt fixture and test covering entity decoding, tab, and spacing-element handling (F5)
- Rename 'throws for empty odt' → 'throws for odt with no extractable text' to distinguish from a zero-byte/corrupt file case (F8)
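
Findings F2 and F4 might look something like the sketch below. The helper names `decodeXmlEntities` and `replaceSpacingElements` are hypothetical, and the single-pass decoding is one possible approach rather than a confirmed detail of the PR. Note that a later whitespace-normalization step can still collapse the tab into a space, which is what follow-up finding D describes.

```typescript
// The five predefined XML entities and their literal characters.
const XML_ENTITIES: Record<string, string> = {
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&apos;': "'",
  '&amp;': '&',
};

// Decodes entities in a single pass, so "&amp;lt;" becomes "&lt;" and is not
// decoded a second time into "<".
function decodeXmlEntities(text: string): string {
  return text.replace(/&(?:lt|gt|quot|apos|amp);/g, (entity) => XML_ENTITIES[entity] ?? entity);
}

// Replaces ODT spacing elements before the generic tag stripper runs,
// otherwise they would vanish along with every other tag.
function replaceSpacingElements(xml: string): string {
  return xml
    .replace(/<text:tab\s*\/>/g, '\t')
    .replace(/<text:s\b[^>]*\/>/g, ' ');
}

console.log(decodeXmlEntities('a &amp; b &lt;c&gt;'));
console.log(JSON.stringify(replaceSpacingElements('a<text:tab/>b<text:s text:c="3"/>c')));
```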

Reads uncompressed entry sizes from the JSZip internal metadata before extracting any content. Throws if the total exceeds 50MB, preventing a crafted ODT with a high-ratio compressed payload from exhausting heap. Adds a corresponding test using a real DEFLATE-compressed ZIP (~51KB on disk, 51MB uncompressed) to verify the guard fires before any extraction.
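
The idea behind the size cap can be sketched without jszip at all. In the example below, `ZipEntryMeta` is a hypothetical stand-in for whatever per-entry metadata the archive library exposes, `assertWithinSizeCap` is an invented name, and interpreting 50MB as 50 * 1024 * 1024 bytes is an assumption.

```typescript
// Hypothetical shape for per-entry metadata read before any decompression.
interface ZipEntryMeta {
  name: string;
  uncompressedSize?: number;
}

// Assumed interpretation of the 50MB limit mentioned in the commit message.
const MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024;

// Sums the declared uncompressed sizes and throws as soon as the cap is
// exceeded, so nothing is inflated for an oversized archive.
function assertWithinSizeCap(
  entries: ZipEntryMeta[],
  capBytes: number = MAX_UNCOMPRESSED_BYTES,
): number {
  let total = 0;
  for (const entry of entries) {
    // A missing size counts as 0, so the guard fails open if the field is absent.
    total += entry.uncompressedSize ?? 0;
    if (total > capBytes) {
      throw new Error(`Decompressed content exceeds ${capBytes} bytes`);
    }
  }
  return total;
}

console.log(assertWithinSizeCap([{ name: 'content.xml', uncompressedSize: 2048 }]));
```

Checking declared sizes up front is cheap, but it trusts the archive's own metadata; a stricter variant would also count bytes while streaming the decompressed output.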

.java files were rejected with "Unable to determine file type" because browsers send an empty MIME type for them and codeTypeMapping had no 'java' entry for inferMimeType() to fall back on. text/x-java was already present in all five validation lists (fullMimeTypesList, codeInterpreterMimeTypesList, retrievalMimeTypesList, textMimeTypes, retrievalMimeTypes), so mapping to it (not text/plain) ensures .java uploads work for both File Search and Code Interpreter. Closes danny-avila#12307
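
The fallback described above could work roughly like this. The sketch uses hypothetical names (`codeTypeMappingSketch`, `inferMimeTypeSketch`) and a guessed signature; only the 'java' to text/x-java mapping comes from the commit message.

```typescript
// Hypothetical extension-to-MIME table; only the java entry is from this PR.
const codeTypeMappingSketch: Record<string, string> = {
  java: 'text/x-java',
};

// Prefers the browser-reported type and falls back to the file extension
// when the browser sends an empty MIME type.
function inferMimeTypeSketch(fileName: string, browserType: string): string {
  if (browserType) {
    return browserType;
  }
  const extension = fileName.split('.').pop()?.toLowerCase() ?? '';
  return codeTypeMappingSketch[extension] ?? '';
}

console.log(inferMimeTypeSketch('Main.java', '')); // text/x-java
```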

A: regenerate package-lock.json after removing @types/jszip from package.json; without this npm ci was still installing the stale 2020 type stubs and TypeScript was resolving against them
B: replace dynamic import('jszip') in the zip-bomb test with the same static import already used in production; jszip is CJS-only with no ESM/Jest isolation concern
C: document that the _data.uncompressedSize guard fails open if jszip renames the private field (accepted limitation, test would catch it)
D: rename 'preserves tabs' test to 'normalizes tab and spacing elements to spaces' since <text:tab> is collapsed to a space, not kept as \t
E: fix test.each([ formatting artifact (missing newline after '[')

* fix: add ODT support to native document parser
* fix: replace execSync with jszip for ODT parsing
* docs: update documentParserMimeTypes comment to include odt
* fix: improve ODT XML extraction and add empty.odt fixture
* fix: address review findings in ODT parser
* fix: add decompressed content size cap to odtToText (F6)
* fix: add java to codeTypeMapping for file upload support
* fix: address follow-up review findings (A-E)

---------

Co-authored-by: Danny Avila <danny@librechat.ai>

Copilot AI review requested due to automatic review settings March 18, 2026 21:50

pol9061 mentioned this pull request Mar 18, 2026

[Bug]: ODT file upload silently fails in native document parser despite PR #11959 #12302

Closed

1 task

Copilot AI reviewed Mar 18, 2026

pol9061 marked this pull request as draft March 18, 2026 21:56

fix: add ODT support to native document parser

3dbc166

pol9061 force-pushed the fix/odt-native-parser branch from cd5a3be to 3dbc166 March 18, 2026 21:59

pol9061 added 2 commits March 18, 2026 23:17

fix: replace execSync with jszip for ODT parsing

b87d033

docs: update documentParserMimeTypes comment to include odt

86a549a

pol9061 marked this pull request as ready for review March 18, 2026 22:31

danny-avila added 5 commits March 19, 2026 13:13

danny-avila changed the title from "fix: add ODT support to native document parser" to "📖 feat: Add Native ODT Document Parser Support" Mar 19, 2026

This was linked to issues Mar 19, 2026

[Bug]: ODT file upload silently fails in native document parser despite PR #11959 #12302

Closed

[Enhancement]: Add missing 'java' extension to codeTypeMapping for file search support #12307

Closed

danny-avila merged commit 7e74165 into danny-avila:dev Mar 19, 2026
12 checks passed

danny-avila merged 8 commits into danny-avila:dev from pol9061:fix/odt-native-parser

3 participants