DataFile

Supported DataFile file formats (IdClassification.DataFile - Data and data serialization document formats)

  • All entries in table below are supported for file format identification.
  • 'X' in "Text" column indicates text extraction is supported for the file format.
  • 'X**' in "Text" column indicates text extraction is supported, but binary-to-text filtering is applied to partially parsed document records.
  • 'X' in "Metadata" column indicates metadata extraction is supported for the file format.
  • 'X' in "EmbeddedItem" column indicates embedded item/attachment extraction is supported for the file format.
  • 'X' in "ContentHash" column indicates a content hash is supported for the file format (see MD5ContentHash and SHA1ContentHash).

If a file format has no supported content extractor that extracts text, a binary-to-text content extractor is used by default (this behavior is optional) to extract UTF-8, UTF-16, Windows-1252, and ASCII text from the binary. In many cases, indexable text can be extracted from unknown document formats.
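The binary-to-text fallback described above can be pictured with a minimal sketch. This is not the SDK's implementation; it only illustrates the general idea of scanning raw bytes for runs of printable characters. The function names and the `min_len` threshold are hypothetical, and only ASCII and UTF-16LE runs are covered here (UTF-8 and Windows-1252 would need additional byte-range rules).

```python
import re
from typing import List


def extract_ascii_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII bytes (0x20-0x7E)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def extract_utf16le_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII characters stored as
    UTF-16LE, i.e. each character byte followed by a zero byte."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]


if __name__ == "__main__":
    # Hypothetical blob: binary noise around one ASCII and one UTF-16LE string.
    blob = (
        b"\x00\x01Hello, world\xff\xfe"
        + "Invoice".encode("utf-16-le")
        + b"\x02\x03ab\x00"
    )
    print(extract_ascii_runs(blob))
    print(extract_utf16le_runs(blob))
```

The minimum run length is the main tuning knob in a filter like this: a lower value recovers more short words but also lets more binary noise through as "text".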

DataFile Supported File Formats

File Format Id Enum Value

Text

Metadata

EmbeddedItem

ContentHash

Description

OpenDiscoverDocumentArchive

Open Discover document data archive output file that stores extracted document metadata and attributes (.dda).

OpenDiscoverAttachmentArchive

Open Discover attachment data archive output file that stores extracted container items and extracted document attachments (.ada).

OpenDiscoverTextArchive

Open Discover text data archive output file that stores extracted document text (.tda).

BinaryPropertyList

Binary Property List format for storing program settings and other data in Apple OS X, iOS, and NeXTSTEP applications (.plist).

XmlPropertyList

XML Property List format for storing program settings and other data in Apple OS X, iOS, and NeXTSTEP applications (this XML format was introduced by Apple to replace the earlier format used in NeXTSTEP) (.plist).

JSON

X

JavaScript Object Notation (JSON), an open standard, text-based format for transmitting data objects consisting of attribute–value pairs (.json).

JSON_LD

X

JavaScript Object Notation for Linked Data (JSON-LD) format for encoding Linked Data using JSON (.jsonld).

CBOR

Concise Binary Object Representation (CBOR) data format (.cbor).

BabylonGlossaryBuilder

Babylon Glossary Builder glossary file (.bgl).

GnuGMO

GNU Gettext Machine Object file. MO (Machine Object) files are compiled, machine-readable PO (Portable Object) files (.mo;.gmo).

MicrosoftServiceQualityMonitoringFile

Microsoft Service Quality Monitoring file used to assist in monitoring quality of applications such as Windows Live Messenger, Microsoft Office, etc. (.sqm).

MacBinHex4

X

Mac OS BinHex 4.0 (binary-to-hexadecimal) format, used for sending binary files through email (.hqx).

AppleSingle1

X

AppleSingle version 1. This format contains both file contents and attributes.

AppleSingle2

X

AppleSingle version 2. This format contains both file contents and attributes.

AppleDouble1

X

AppleDouble resource fork version 1 (The AppleDouble format keeps the data fork of the file in its original format and filename). This format only stores the file attributes.

AppleDouble2

X

AppleDouble resource fork version 2 (The AppleDouble format keeps the data fork of the file in its original format and filename). This format only stores the file attributes.

MSBinder

X

X

Microsoft Binder (Microsoft Office 95, 97, and 2000; discontinued after Office 2000) (.obd).

MicrosoftOleDataMso

X

Microsoft "oledata.mso": The MSO file allows other HTML email clients (other than Outlook) to render HTML email messages sent by Microsoft Outlook correctly. Other formats can contain MSO files, and these MSO files can contain useful embedded objects such as MS Office documents.

TimeStampedDataEnvelope

X

Time-stamped data that is used to bind a file with one or more time-stamp tokens obtained for that file. A Cryptographic Message Syntax (CMS) envelope is used as the time-stamped data content envelope (.tsd).

CommaSeparatedValuesFile

X

Comma-separated values (CSV) file (.csv).

TabSeparatedValuesFile

X

Tab-separated values (TSV) file (.tsv;.tab).

Parquet

Apache Parquet is an open source, column-oriented data file format (.parquet).
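
As a rough illustration of what file format identification involves for some of the binary formats in this table, the sketch below checks well-known leading (and, for Parquet, trailing) magic bytes. It is not how the SDK identifies files. The function name `sniff_data_file` is hypothetical, the returned strings simply reuse the enum names from the table for readability, and the CBOR check only recognizes files that begin with the optional self-described CBOR tag.

```python
import struct
from typing import Optional

# AppleSingle/AppleDouble magic numbers and version values (big-endian).
_APPLE_MAGIC = {0x00051600: "AppleSingle", 0x00051607: "AppleDouble"}
_APPLE_VERSION = {0x00010000: "1", 0x00020000: "2"}


def sniff_data_file(data: bytes) -> Optional[str]:
    """Guess a format name from magic bytes; return None if nothing matches."""
    if data.startswith(b"bplist"):
        return "BinaryPropertyList"
    if len(data) >= 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1":
        return "Parquet"
    if data.startswith(b"\xd9\xd9\xf7"):  # optional self-described CBOR tag
        return "CBOR"
    # GNU MO magic 0x950412de, written in either byte order.
    if data[:4] in (b"\xde\x12\x04\x95", b"\x95\x04\x12\xde"):
        return "GnuGMO"
    if data.startswith(b"(This file must be converted with BinHex"):
        return "MacBinHex4"
    if len(data) >= 8:
        magic, version = struct.unpack(">II", data[:8])
        if magic in _APPLE_MAGIC and version in _APPLE_VERSION:
            return _APPLE_MAGIC[magic] + _APPLE_VERSION[version]
    return None


if __name__ == "__main__":
    header = struct.pack(">II", 0x00051607, 0x00020000)
    print(sniff_data_file(header))
```

Text-based formats such as JSON, CSV, and TSV have no magic bytes, so a check of this kind cannot recognize them; identifying those requires inspecting the content itself.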