Text
Supported Text file formats (IdClassification.Text - Text document encoding formats)
If a file format has no supported content extractor that extracts text, a binary-to-text content extractor is optionally used (this is the default) to extract UTF-8, UTF-16, Windows-1252, and ASCII text from the binary. In many cases, indexable text can be extracted from unknown document formats.
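
To illustrate the general idea of binary-to-text extraction, here is a minimal Python sketch that pulls runs of printable characters out of raw bytes, similar in spirit to the Unix `strings` utility. It is not the extractor described above: the helper name `extract_text_runs`, the `MIN_RUN` threshold, and the restriction to ASCII and UTF-16LE runs are all assumptions made to keep the example short.

```python
import re

# Assumed minimum number of consecutive printable characters to keep.
MIN_RUN = 4

# Runs of printable 7-bit ASCII bytes (space through tilde).
_ASCII_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_RUN)
# Runs of printable ASCII characters stored as UTF-16LE (each followed by a zero byte).
_UTF16LE_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_RUN)


def extract_text_runs(data: bytes) -> list[str]:
    """Hypothetical helper: return printable text runs found in a binary buffer."""
    runs = [m.group().decode("ascii") for m in _ASCII_RUN.finditer(data)]
    runs += [m.group().decode("utf-16-le") for m in _UTF16LE_RUN.finditer(data)]
    return runs


sample = b"\x00\x01Hello world\xff\xfe" + "Unicode".encode("utf-16-le") + b"\x00\x02"
print(extract_text_runs(sample))
```

A production extractor would also need to handle UTF-8 and Windows-1252 sequences and filter out runs that are printable by coincidence; this sketch only shows why text can often be recovered from formats that have no dedicated extractor.
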
| File Format Id Enum Value | Text | Metadata | EmbeddedItem | ContentHash | Description |
|---|---|---|---|---|---|
| | X | | | | Text in 7-bit ASCII encoding (.txt). |
| | X | | | | Text in Unicode UTF-7 encoding. UTF-7 is not an official Unicode Standard encoding; the Unicode Standard 5.0 lists only UTF-8, UTF-16, and UTF-32. |
| | X | | | | Text in Unicode UTF-8 encoding. |
| | X | | | | Text in Unicode UTF-16LE encoding. |
| | X | | | | Text in Unicode UTF-16BE encoding. |
| | X | | | | Text in Unicode UTF-32LE encoding. |
| | X | | | | Text in Unicode UTF-32BE encoding. |
| | X | | | | Text in Unicode UTF-EBCDIC encoding (no longer part of the Unicode Standard). UTF-EBCDIC is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes can process the characters without much difficulty. |
| | | | | | Text in an 8-bit undetermined or unsupported OEM code page (the buffer used for identification has high bytes > 128 and no unprintable ASCII characters). |
| | X | | | | Text encoded in ISO-8859-1 Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish (.txt). |
| | X | | | | Text encoded in ISO-8859-2 Czech, Hungarian, Polish, Romanian (.txt). |
| | X | | | | Text encoded in ISO-8859-5 Russian (.txt). |
| | X | | | | Text encoded in ISO-8859-6 Arabic (.txt). |
| | X | | | | Text encoded in ISO-8859-7 Greek (.txt). |
| | X | | | | Text encoded in ISO-8859-8-I Hebrew (.txt). |
| | X | | | | Text encoded in ISO-8859-9 Turkish (.txt). |
| | X | | | | Text encoded in Windows-1250 Czech, Hungarian, Polish, Romanian (.txt). |
| | X | | | | Text encoded in Windows-1251 Russian (.txt). |
| | X | | | | Text encoded in Windows-1252 Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish (.txt). |
| | X | | | | Text encoded in Windows-1253 Greek (.txt). |
| | X | | | | Text encoded in Windows-1254 Turkish (.txt). |
| | X | | | | Text encoded in Windows-1255 Hebrew (.txt). |
| | X | | | | Text encoded in Windows-1256 Arabic (.txt). |
| | X | | | | Text encoded in KOI8-R, designed to cover Russian, which uses a Cyrillic alphabet (.txt). |
| | X | | | | Text encoded in IBM 424 Hebrew (.txt). |
| | X | | | | Text encoded in IBM 420 Arabic (.txt). |
| | X | | | | Text encoded in EBCDIC 500, full Latin-1 character set (.txt). |
| | X | | | | Text encoded in IBM 866 Russian (.txt). |
| | X | | | | Text encoded in Shift_JIS Japanese (.txt). |
| | X | | | | Text encoded in ISO-2022-JP Japanese (.txt). |
| | X | | | | Text encoded in ISO-2022-CN Simplified Chinese (.txt). |
| | X | | | | Text encoded in ISO-2022-KR Korean (.txt). |
| | X | | | | Text encoded in GB18030 Chinese (.txt). |
| | X | | | | Text encoded in Big5 Traditional Chinese (.txt). |
| | X | | | | Text encoded in EUC-JP Japanese (.txt). |
| | X | | | | Text encoded in EUC-KR Korean (.txt). |
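
The distinction the table draws between 7-bit ASCII, the Unicode encodings, and the "8-bit undetermined" case can be sketched in Python as follows. This is a rough illustration only, not the identification logic the product uses: the function name `classify_buffer`, the returned labels, the set of permitted control characters, and the use of `>= 0x80` as the high-byte test are all assumptions, and real detection of specific code pages (ISO-8859-x, Windows-125x, and so on) requires statistical analysis that is out of scope here.

```python
import codecs

# Byte order marks, UTF-32 first because the UTF-32LE BOM starts with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

# Assumption: tab, line feed, form feed, and carriage return count as printable text.
_ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}


def classify_buffer(buf: bytes) -> str:
    """Hypothetical helper: coarse classification of an identification buffer."""
    for bom, name in _BOMS:
        if buf.startswith(bom):
            return name
    has_high = False
    for b in buf:
        if (b < 0x20 and b not in _ALLOWED_CONTROLS) or b == 0x7F:
            return "binary"  # unprintable ASCII control character found
        if b >= 0x80:
            has_high = True
    if not has_high:
        return "7-bit ASCII"
    try:
        buf.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # High bytes, no unprintable characters, and not valid UTF-8.
        return "8-bit undetermined"


print(classify_buffer(b"plain text\n"))
print(classify_buffer("café".encode("cp1252")))
```

The BOM check runs first because it is the only unambiguous signal; everything after it is a heuristic over byte values.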