📖 feat: Add Native ODT Document Parser Support#12303
Merged
Merged
Conversation
1 task
Contributor
There was a problem hiding this comment.
Pull request overview
Adds full native ODT parsing support so .odt uploads are routed through the document parser pipeline and have text extracted (fixing #12302) instead of being treated as raw binary/ZIP.
Changes:
- Route
application/vnd.oasis.opendocument.textthrough the document parser viadocumentParserMimeTypes. - Add an ODT parser (
odtToText) and wire it intogetParserForMimeType(). - Add a Jest integration test +
sample.odtfixture.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/data-provider/src/file-config.ts | Adds ODT MIME regex to the native document-parser routing list. |
| packages/api/src/files/documents/crud.ts | Adds ODT parsing implementation and selects it for the ODT MIME type. |
| packages/api/src/files/documents/crud.spec.ts | Adds a test asserting ODT text extraction works end-to-end. |
| packages/api/src/files/documents/sample.odt | Adds a real ODT fixture used by the new test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cd5a3be to
3dbc166
Compare
- Scope extraction to <office:body> to exclude metadata/style nodes - Map </text:p> and </text:h> closings to newlines, preserving paragraph structure instead of collapsing everything to a single line - Handle <text:line-break/> as explicit newlines - Strip remaining tags, normalize horizontal whitespace, cap consecutive blank lines at one - Regenerate sample.odt as a two-paragraph fixture so the test exercises multi-paragraph output - Add empty.odt fixture and test asserting 'No text found in document'
- Use static `import JSZip from 'jszip'` instead of dynamic import; jszip is CommonJS-only with no ESM/Jest-isolation concern (F1) - Decode the five standard XML entities after tag-stripping so documents with &, <, >, ", ' send correct text to the LLM (F2) - Remove @types/jszip devDependency; jszip ships bundled declarations and @types/jszip is a stale 2020 stub that would shadow them (F3) - Handle <text:tab/> → \t and <text:s .../> → ' ' before the generic tag stripper so tab-aligned and multi-space content is preserved (F4) - Add sample-entities.odt fixture and test covering entity decoding, tab, and spacing-element handling (F5) - Rename 'throws for empty odt' → 'throws for odt with no extractable text' to distinguish from a zero-byte/corrupt file case (F8)
Reads uncompressed entry sizes from the JSZip internal metadata before extracting any content. Throws if the total exceeds 50MB, preventing a crafted ODT with a high-ratio compressed payload from exhausting heap. Adds a corresponding test using a real DEFLATE-compressed ZIP (~51KB on disk, 51MB uncompressed) to verify the guard fires before any extraction.
.java files were rejected with "Unable to determine file type" because browsers send an empty MIME type for them and codeTypeMapping had no 'java' entry for inferMimeType() to fall back on. text/x-java was already present in all five validation lists (fullMimeTypesList, codeInterpreterMimeTypesList, retrievalMimeTypesList, textMimeTypes, retrievalMimeTypes), so mapping to it (not text/plain) ensures .java uploads work for both File Search and Code Interpreter. Closes danny-avila#12307
A: regenerate package-lock.json after removing @types/jszip from
package.json; without this npm ci was still installing the stale
2020 type stubs and TypeScript was resolving against them
B: replace dynamic import('jszip') in the zip-bomb test with the same
static import already used in production; jszip is CJS-only with no
ESM/Jest isolation concern
C: document that the _data.uncompressedSize guard fails open if jszip
renames the private field (accepted limitation, test would catch it)
D: rename 'preserves tabs' test to 'normalizes tab and spacing elements
to spaces' since <text:tab> is collapsed to a space, not kept as \t
E: fix test.each([ formatting artifact (missing newline after '[')
This was
linked to
issues
Mar 19, 2026
jcbartle
pushed a commit
to jcbartle/LibreChat
that referenced
this pull request
May 11, 2026
* fix: add ODT support to native document parser * fix: replace execSync with jszip for ODT parsing * docs: update documentParserMimeTypes comment to include odt * fix: improve ODT XML extraction and add empty.odt fixture - Scope extraction to <office:body> to exclude metadata/style nodes - Map </text:p> and </text:h> closings to newlines, preserving paragraph structure instead of collapsing everything to a single line - Handle <text:line-break/> as explicit newlines - Strip remaining tags, normalize horizontal whitespace, cap consecutive blank lines at one - Regenerate sample.odt as a two-paragraph fixture so the test exercises multi-paragraph output - Add empty.odt fixture and test asserting 'No text found in document' * fix: address review findings in ODT parser - Use static `import JSZip from 'jszip'` instead of dynamic import; jszip is CommonJS-only with no ESM/Jest-isolation concern (F1) - Decode the five standard XML entities after tag-stripping so documents with &, <, >, ", ' send correct text to the LLM (F2) - Remove @types/jszip devDependency; jszip ships bundled declarations and @types/jszip is a stale 2020 stub that would shadow them (F3) - Handle <text:tab/> → \t and <text:s .../> → ' ' before the generic tag stripper so tab-aligned and multi-space content is preserved (F4) - Add sample-entities.odt fixture and test covering entity decoding, tab, and spacing-element handling (F5) - Rename 'throws for empty odt' → 'throws for odt with no extractable text' to distinguish from a zero-byte/corrupt file case (F8) * fix: add decompressed content size cap to odtToText (F6) Reads uncompressed entry sizes from the JSZip internal metadata before extracting any content. Throws if the total exceeds 50MB, preventing a crafted ODT with a high-ratio compressed payload from exhausting heap. Adds a corresponding test using a real DEFLATE-compressed ZIP (~51KB on disk, 51MB uncompressed) to verify the guard fires before any extraction. * fix: add java to codeTypeMapping for file upload support .java files were rejected with "Unable to determine file type" because browsers send an empty MIME type for them and codeTypeMapping had no 'java' entry for inferMimeType() to fall back on. text/x-java was already present in all five validation lists (fullMimeTypesList, codeInterpreterMimeTypesList, retrievalMimeTypesList, textMimeTypes, retrievalMimeTypes), so mapping to it (not text/plain) ensures .java uploads work for both File Search and Code Interpreter. Closes danny-avila#12307 * fix: address follow-up review findings (A-E) A: regenerate package-lock.json after removing @types/jszip from package.json; without this npm ci was still installing the stale 2020 type stubs and TypeScript was resolving against them B: replace dynamic import('jszip') in the zip-bomb test with the same static import already used in production; jszip is CJS-only with no ESM/Jest isolation concern C: document that the _data.uncompressedSize guard fails open if jszip renames the private field (accepted limitation, test would catch it) D: rename 'preserves tabs' test to 'normalizes tab and spacing elements to spaces' since <text:tab> is collapsed to a space, not kept as \t E: fix test.each([ formatting artifact (missing newline after '[') --------- Co-authored-by: Danny Avila <danny@librechat.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #12302
ODT support was partially introduced in #11959, which added the MIME type to the allowlist but two pieces were missing for the full pipeline to work:
application/vnd.oasis.opendocument.texttodocumentParserMimeTypesinpackages/data-provider/src/file-config.ts, routing ODT files through the document parser pipeline inprocessAgentFileUploadodtToText()parser inpackages/api/src/files/documents/crud.tsthat extracts text usingjszipto readcontent.xmlfrom the ODT ZIP archiveChange Type
Testing
Added
parseDocument() parses text from odtinpackages/api/src/files/documents/crud.spec.tsusing a realsample.odtfixture. Run with:Manual end-to-end test
registry.librechat.ai/danny-avila/librechat-dev:latestfileConfigoverrides inlibrechat.yamlcontextcapability.odtfile using "Upload as Text"Before fix: model responds with raw binary/ZIP dump description and cannot read the content
After fix: model correctly reads and answers questions about the document content
Test Configuration:
No special environment variables required. Tests run without any external services or API keys.
Checklist