DocumentExchange

Supported DocumentExchange file formats (IdClassification.DocumentExchange). Document exchange formats are not specific to a single program: different applications can output these exchangeable document formats.

  • All entries in the table below are supported for file format identification.
  • 'X' in "Text" column indicates text extraction is supported for the file format.
  • 'X**' in "Text" column indicates text extraction is supported, but binary-to-text filtering is used on partially parsed document records.
  • 'X' in "Metadata" column indicates metadata extraction is supported for the file format.
  • 'X' in "EmbeddedItem" column indicates embedded item/attachment extraction is supported for the file format.
  • 'X' in "ContentHash" column indicates a content hash is supported for the file format (see MD5ContentHash and SHA1ContentHash).

If a file format has no supported content extractor that extracts text, then by default (this behavior is optional) a binary-to-text content extractor is used to extract UTF-8, UTF-16, Windows-1252, and ASCII text from the binary. In many cases, indexable text can be extracted from unknown document formats.
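The binary-to-text fallback described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name `binary_to_text`, the `MIN_RUN` threshold, and the choice of Python are assumptions made for illustration, and the sketch covers only ASCII and UTF-16LE runs, not UTF-8 multi-byte sequences or Windows-1252.

```python
import re

# Hypothetical threshold: ignore runs shorter than this many characters.
MIN_RUN = 4

# Runs of printable ASCII characters (space through tilde, plus tab).
ASCII_RUN = re.compile(rb"[\x20-\x7e\t]{%d,}" % MIN_RUN)
# The same characters as UTF-16LE: each printable byte is followed by a zero byte.
UTF16LE_RUN = re.compile(rb"(?:[\x20-\x7e\t]\x00){%d,}" % MIN_RUN)


def binary_to_text(data: bytes) -> list[str]:
    """Return printable text runs found in raw bytes, in order of appearance."""
    hits = []
    for match in ASCII_RUN.finditer(data):
        hits.append((match.start(), match.group().decode("ascii")))
    for match in UTF16LE_RUN.finditer(data):
        hits.append((match.start(), match.group().decode("utf-16-le")))
    # Merge both scans back into file order.
    hits.sort(key=lambda hit: hit[0])
    return [text for _, text in hits]


if __name__ == "__main__":
    sample = b"\x00\x01Hello world\xff\xfe" + "Invoice".encode("utf-16-le") + b"\x00\x02ab\x03"
    print(binary_to_text(sample))
```

Runs shorter than the threshold are dropped because short printable sequences in binary data are usually noise rather than indexable text.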

DocumentExchange Supported File Formats

File Format Id Enum Value

Text

Metadata

EmbeddedItem

ContentHash

Description

MicrosoftXPS

X

X

Microsoft XPS (Open XML Paper Specification) (.xps).

MicrosoftXPSCorrupted

X

X

Microsoft XPS (Open XML Paper Specification) that is potentially corrupted. The format's zip container failed inspection (zip potentially truncated) and format had to be identified using an alternate means (.xps).

AdobePDF

X

X

X

Adobe Portable Document Format (PDF) (.pdf).

AdobePDFEncrypted

X

X

X

Encrypted Adobe Portable Document Format (PDF) (.pdf).

AdobePDF_Portfolio

X

X

X

Adobe Portable Document Format (PDF) Portfolio. A PDF Portfolio contains multiple files assembled into an integrated PDF unit (.pdf).

AdobePDF_PortfolioEncrypted

X

X

X

Encrypted Adobe Portable Document Format (PDF) Portfolio. A PDF Portfolio contains multiple files assembled into an integrated PDF unit (.pdf).

AdobePDF_XFA

X

X

X

Adobe Portable Document Format (PDF) XML Forms Architecture (XFA). An XFA PDF is an interactive and dynamic form created with AEM Forms Designer (.pdf).

AdobePDF_XFAEncrypted

X

X

X

Encrypted Adobe Portable Document Format (PDF) XML Forms Architecture (XFA). An XFA PDF is an interactive and dynamic form created with AEM Forms Designer (.pdf).

AdobePDFAcroForm

X

X

X

Adobe Portable Document Format (PDF) AcroForm. AcroForm is Adobe’s older interactive form technology (.pdf).

AdobePDFAcroFormEncrypted

X

X

X

Encrypted Adobe Portable Document Format (PDF) AcroForm. AcroForm is Adobe’s older interactive form technology (.pdf).

AdobeFormsDataFormat

Acrobat Forms Data Format (FDF) (.fdf).

AdobeXDP

Adobe XML Data Package (XDP) format. This format allows PDF and/or XFA content resources to be packaged within an XML container (.xdp).

AdobeXFDF

X

Adobe XML Forms Data Format (XFDF) is a format for representing forms data and annotations in a PDF document. XFDF is an XML version of Forms Data Format (FDF) (.xfdf).

RichTextFormat

X

X

X

Microsoft Rich Text Format (.rtf).

DjVu

DjVu file format. This file format was designed primarily to store scanned documents but is also used as an eBook format and has been promoted as an alternative to PDF (.djv;.djvu).

DjVuEncrypted

Encrypted (secure) DjVu file format. This format was designed primarily to store scanned documents but is also used as an eBook format and has been promoted as an alternative to PDF (.djv;.djvu).

PostScript

Adobe PostScript (.ps).

EncapsulatedPostScript

Adobe Encapsulated PostScript (.eps;.epsf;.ps).

EncapsulatedPostScriptWithPreviewImage

Encapsulated PostScript with content preview image (usually TIFF image) (.eps;.epsf;.epsi).

DocBookXml

X

OASIS DocBook XML document for general and technical publishing (.xml).
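
The MicrosoftXPSCorrupted entry above describes a zip container that fails inspection (potentially truncated) and is then identified by alternate means. The following minimal sketch shows one way such a check could look, assuming Python's standard `zipfile` module. The function name `inspect_zip_container` and its return labels are hypothetical, and the library's actual inspection logic is not documented here.

```python
import io
import zipfile

# Signature that starts a ZIP local file header.
ZIP_LOCAL_HEADER = b"PK\x03\x04"


def inspect_zip_container(data: bytes) -> str:
    """Classify raw bytes as 'intact', 'suspect', or 'not-zip' (hypothetical labels)."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() returns the first member with a bad CRC, or None if all pass.
            if archive.testzip() is None:
                return "intact"
            return "suspect"
    except zipfile.BadZipFile:
        # The end-of-central-directory record could not be read, which is typical
        # when the tail of the file is missing. Fall back to the leading signature
        # as the alternate means of identification.
        if data.startswith(ZIP_LOCAL_HEADER):
            return "suspect"
        return "not-zip"
```

A 'suspect' result only says the bytes look like a damaged zip. Deciding that the package is XPS rather than another zip-based format would need a further check that is outside this sketch.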