Text

Supported Text file formats (IdClassification.Text - Text document encoding formats)

  • All entries in table below are supported for file format identification.
  • 'X' in "Text" column indicates text extraction is supported for the file format.
  • 'X**' in "Text" column indicates text extraction is supported BUT binary-to-text filtering is used on partially parsed document records.
  • 'X' in "Metadata" column indicates metadata extraction is supported for the file format.
  • 'X' in "EmbeddedItem" column indicates embedded item/attachment extraction is supported for the file format.
  • 'X' in "ContentHash" column indicates a content hash is supported for the file format (see MD5ContentHash and SHA1ContentHash).

If a file format has no supported content extractor that extracts text, a binary-to-text content extractor is used instead (optional, enabled by default) to extract UTF-8, UTF-16, Windows-1252, and ASCII text from the binary. In many cases, indexable text can be extracted from unknown document formats.
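The binary-to-text fallback described above can be illustrated with a minimal sketch. It is not the product's actual filter: the function name `extract_text_runs`, the minimum run length, and the restriction to printable ASCII and UTF-16LE runs are all assumptions made for illustration (UTF-8 and Windows-1252 handling is omitted).

```python
import re

# Hypothetical threshold: runs shorter than this are treated as noise.
MIN_RUN = 4

# Runs of printable 7-bit ASCII (tab, and space through tilde).
ASCII_RUN = re.compile(rb"[\t\x20-\x7e]{%d,}" % MIN_RUN)

# The same characters stored as UTF-16LE: each one followed by a 0x00 byte.
UTF16LE_RUN = re.compile(rb"(?:[\t\x20-\x7e]\x00){%d,}" % MIN_RUN)


def extract_text_runs(data: bytes) -> list[str]:
    """Return printable text runs found in an arbitrary binary buffer,
    in the order they appear."""
    runs = []  # list of (offset, decoded text)
    for match in UTF16LE_RUN.finditer(data):
        runs.append((match.start(), match.group().decode("utf-16-le")))
    for match in ASCII_RUN.finditer(data):
        runs.append((match.start(), match.group().decode("ascii")))
    runs.sort(key=lambda item: item[0])
    return [text for _, text in runs]


if __name__ == "__main__":
    sample = (
        b"\x00\x01\x02Hello, world\xff\xfe"
        + "Unicode text".encode("utf-16-le")
        + b"\x03\x04"
    )
    for run in extract_text_runs(sample):
        print(run)
```

A real filter would also need to handle multi-byte UTF-8 sequences and Windows-1252 high bytes, and to separate adjacent ASCII and UTF-16 runs more carefully than this sketch does.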

Text Supported File Formats

File Format Id Enum Value  Text  Metadata  EmbeddedItem  ContentHash  Description
Text7BitASCII              X                                          Text in 7-bit ASCII encoding (.txt).
TextUTF7                   X                                          Text in Unicode UTF-7 encoding. UTF-7 is not part of the Unicode Standard; the Unicode Standard 5.0 lists only UTF-8, UTF-16, and UTF-32.
TextUTF8                   X                                          Text in Unicode UTF-8 encoding.
TextUnicode16LE            X                                          Text in Unicode UTF-16LE encoding.
TextUnicode16BE            X                                          Text in Unicode UTF-16BE encoding.
TextUnicode32LE            X                                          Text in Unicode UTF-32LE encoding.
TextUnicode32BE            X                                          Text in Unicode UTF-32BE encoding.
TextUnicodeEBCDIC          X                                          Text in Unicode UTF-EBCDIC encoding (no longer part of the Unicode Standard). UTF-EBCDIC is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes can process the characters without much difficulty.
Text_ANSI8                                                            Text in an 8-bit undetermined or unsupported OEM code page (the buffer used for identification contains high bytes, with values above 127, and no unprintable ASCII characters).
Text_ISO_8859_1            X                                          Text encoded in ISO-8859-1 Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish (.txt).
Text_ISO_8859_2            X                                          Text encoded in ISO-8859-2 Czech, Hungarian, Polish, Romanian (.txt).
Text_ISO_8859_5            X                                          Text encoded in ISO-8859-5 Russian (.txt).
Text_ISO_8859_6            X                                          Text encoded in ISO-8859-6 Arabic (.txt).
Text_ISO_8859_7            X                                          Text encoded in ISO-8859-7 Greek (.txt).
Text_ISO_8859_8            X                                          Text encoded in ISO-8859-8-I Hebrew (.txt).
Text_ISO_8859_9            X                                          Text encoded in ISO-8859-9 Turkish (.txt).
Text_Windows_1250          X                                          Text encoded in Windows-1250 Czech, Hungarian, Polish, Romanian (.txt).
Text_Windows_1251          X                                          Text encoded in Windows-1251 Russian (.txt).
Text_Windows_1252          X                                          Text encoded in Windows-1252 Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish (.txt).
Text_Windows_1253          X                                          Text encoded in Windows-1253 Greek (.txt).
Text_Windows_1254          X                                          Text encoded in Windows-1254 Turkish (.txt).
Text_Windows_1255          X                                          Text encoded in Windows-1255 Hebrew (.txt).
Text_Windows_1256          X                                          Text encoded in Windows-1256 Arabic (.txt).
Text_KOI8_R                X                                          Text encoded in KOI8-R, designed to cover Russian, which uses a Cyrillic alphabet (.txt).
Text_IBM_424               X                                          Text encoded in IBM 424 Hebrew (.txt).
Text_IBM_420               X                                          Text encoded in IBM 420 Arabic (.txt).
Text_EBCDIC_500            X                                          Text encoded in EBCDIC 500, full Latin-1 character set (.txt).
Text_IBM_866               X                                          Text encoded in IBM 866 Russian (.txt).
Text_Shift_JIS             X                                          Text encoded in Shift_JIS Japanese (.txt).
Text_ISO_2022_JP           X                                          Text encoded in ISO-2022-JP Japanese (.txt).
Text_ISO_2022_CN           X                                          Text encoded in ISO-2022-CN Simplified Chinese (.txt).
Text_ISO_2022_KR           X                                          Text encoded in ISO-2022-KR Korean (.txt).
Text_GB18030               X                                          Text encoded in GB18030 Chinese (.txt).
Text_Big5                  X                                          Text encoded in Big5 Traditional Chinese (.txt).
Text_EUC_JP                X                                          Text encoded in EUC-JP Japanese (.txt).
Text_EUC_KR                X                                          Text encoded in EUC-KR Korean (.txt).
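
Once a file has been identified as one of the formats listed above, its bytes still have to be decoded with a matching codec. The sketch below shows one way this could look in Python. The `PYTHON_CODECS` mapping and the `decode_as` helper are hypothetical and not part of this API; the mapping covers only a subset of the enum values, paired with codec names from Python's standard library.

```python
# Hypothetical mapping from some File Format Id enum names to Python codec names.
# Only a subset is listed; extend it as needed for other encodings.
PYTHON_CODECS = {
    "Text7BitASCII": "ascii",
    "TextUTF8": "utf_8",
    "TextUnicode16LE": "utf_16_le",
    "TextUnicode16BE": "utf_16_be",
    "Text_ISO_8859_1": "iso8859_1",
    "Text_Windows_1251": "cp1251",
    "Text_Windows_1252": "cp1252",
    "Text_KOI8_R": "koi8_r",
    "Text_Shift_JIS": "shift_jis",
    "Text_EUC_KR": "euc_kr",
}


def decode_as(format_id: str, data: bytes) -> str:
    """Decode raw bytes using the codec mapped to the given format id.

    Undecodable bytes are replaced with U+FFFD rather than raising, so a
    partially damaged file still yields indexable text.
    """
    codec = PYTHON_CODECS.get(format_id)
    if codec is None:
        # For example Text_ANSI8, whose code page is undetermined.
        raise ValueError(f"no codec mapped for format id {format_id!r}")
    return data.decode(codec, errors="replace")


if __name__ == "__main__":
    raw = "Привет".encode("cp1251")
    print(decode_as("Text_Windows_1251", raw))
```

Formats with no entry in the mapping, such as Text_ANSI8, raise `ValueError` here; in practice they would fall back to some other strategy, such as the binary-to-text extraction described earlier.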