Markup

Supported Markup file formats (IdClassification.Markup - Markup document (e.g., XML or HTML))

  • All entries in table below are supported for file format identification.
  • 'X' in "Text" column indicates text extraction is supported for the file format.
  • 'X**' in "Text" column indicates text extraction is supported BUT binary-to-text filtering is used on partially parsed document records.
  • 'X' in "Metadata" column indicates metadata extraction is supported for the file format.
  • 'X' in "EmbeddedItem" column indicates embedded item/attachment extraction is supported for the file format.
  • 'X' in "ContentHash" column indicates a content hash is supported for the file format (see MD5ContentHash and SHA1ContentHash).

If a file format does not have a supported content extractor that extracts text, then optionally (this is the default) a binary-to-text content extractor is used to extract UTF-8, UTF-16, Windows-1252, and ASCII text from the binary. In many cases, indexable text can be extracted from unknown document formats.
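To make the idea of binary-to-text filtering concrete, the following is a minimal Python sketch, not the SDK's implementation. The function name `extract_text_runs` and the `min_len` threshold are hypothetical, and the sketch only recovers printable ASCII and ASCII-range UTF-16LE runs; it ignores multi-byte UTF-8 and Windows-1252 characters, which a real filter would also handle.

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return printable text runs found in arbitrary binary data (illustrative only)."""
    # Runs of printable ASCII bytes (space through tilde, plus tab).
    ascii_pat = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    # Runs of the same characters encoded as UTF-16LE (each followed by a zero byte).
    utf16_pat = re.compile(rb"(?:[\x20-\x7e\t]\x00){%d,}" % min_len)

    runs = []
    for m in ascii_pat.finditer(data):
        runs.append((m.start(), m.group().decode("ascii")))
    for m in utf16_pat.finditer(data):
        runs.append((m.start(), m.group().decode("utf-16-le")))

    # Report runs in the order they appear in the input.
    runs.sort(key=lambda r: r[0])
    return [text for _, text in runs]
```

Runs shorter than `min_len` are dropped because short byte sequences in binary data often look like text by chance; the threshold of 4 is an arbitrary choice for this sketch.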

Markup Supported File Formats

File Format Id Enum Value

Text

Metadata

EmbeddedItem

ContentHash

Description

HTML

X

X

HyperText Markup Language (HTML) (.htm;.html).

HTML5

X

X

The fifth and current version of the HyperText Markup Language (HTML) standard (.htm;.html).

XHTML

X

X

Extensible Hypertext Markup Language (XHTML).

CSS

Cascading Style Sheet (.css).

MHTML

X

X

X

Microsoft Web Archive: MHT is a web page archive file format based on the MHTML (short for MIME HTML) document type (.mht;.mhtml).

MimeGeneric

X

X

X

Generic MIME (RFC 822) format.

SMimeGenericClearSigned

X

X

X

Generic (non-email) secure MIME (S/MIME) clear-signed. Clear-signed MIMEs have MIME media type "multipart/signed" (.p7s).

SMimeGenericOpaqueSigned

Generic (non-email) secure MIME (S/MIME) opaque-signed. Opaque-signed MIMEs have exactly one MIME entity and this MIME entity usually has the media type "application/pkcs7-mime" (.p7s).

SMimeGenericCompressed

X

X

X

Generic (non-email) secure MIME (S/MIME) with compression (.p7z;.txt).

SMimeGenericEncrypted

X

X

X

Generic (non-email) secure MIME (S/MIME) with encryption (enveloped-data) (.p7m;.txt).

XML

X

Extensible Markup Language (XML) file of unknown format/use. Includes files with XML-like markup that do not have an XML declaration at the beginning of the file (.xml).

RSS

X

RSS (Rich Site Summary) feed format (.rss).

XMLSchemaDefinition

X

XML Schema Definition (.xsd;.xml).

EnrichedText

Enriched text - a simple formatted text developed for MIME (Content-Type: "text/enriched" or "text/richtext").

WirelessMarkupLanguage

X

Wireless Markup Language (WML) is an XML-based markup language intended for devices that implement the Wireless Application Protocol (WAP) specification, such as mobile phones (.wml).

MusicXML

X

MusicXML is an XML-based file format for representing Western musical notation (.mxl;.xml).

MathML

X

Mathematical Markup Language (MathML) XML file format (XML for describing mathematical notations) (.mml).

MathcadXml

X

Mathcad XML document (.xmcd).

WindowMediaPlayList

X

Windows Media Playlist (.wpl).

DominoXmlGeneric

X

X

X

X

Domino XML (DXL) generic (unknown, document is missing 'form' attribute) export file format (.dxl).

DominoXmlCustomForm

X

X

X

X

Domino XML (DXL) custom form document (unknown 'form' attribute type) export file format (.dxl).

LaTeX

X

LaTeX markup language widely used in academia for the communication and publication of scientific documents (.tex).

EBML

Extensible Binary Meta Language (EBML) is a generalized file format for any kind of data, aiming to be a binary equivalent to XML (Matroska, WebM, and other formats are based on this format).

XMP

X

Extensible Metadata Platform Packet (XMP) metadata file format. XMP is an ISO standard, originally created by Adobe Systems Inc., for the creation, processing and interchange of standardized and custom metadata for all kinds of resources (.xmp).
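
The MIME and S/MIME rows in the table above are distinguished mainly by MIME media type. As a rough illustration only, the following Python sketch uses the standard `email` package to map a message's top-level Content-Type to the enum names listed in the table. The function `classify_smime` is hypothetical, and its use of the `smime-type` parameter (values `signed-data`, `enveloped-data`, `compressed-data`, as defined in the S/MIME specification) is an assumption about how such a classifier could work, not a description of how the SDK identifies these formats.

```python
from email import message_from_bytes


def classify_smime(raw: bytes) -> str:
    """Map a raw MIME entity to one of the table's enum names (illustrative only)."""
    msg = message_from_bytes(raw)
    ctype = msg.get_content_type()  # lower-cased "maintype/subtype"

    # Clear-signed S/MIME uses the "multipart/signed" media type.
    if ctype == "multipart/signed":
        return "SMimeGenericClearSigned"

    # Opaque-signed, encrypted, and compressed S/MIME share one media type
    # and are told apart here by the optional "smime-type" parameter.
    if ctype in ("application/pkcs7-mime", "application/x-pkcs7-mime"):
        smime_type = str(msg.get_param("smime-type", "")).lower()
        if smime_type == "signed-data":
            return "SMimeGenericOpaqueSigned"
        if smime_type == "enveloped-data":
            return "SMimeGenericEncrypted"
        if smime_type == "compressed-data":
            return "SMimeGenericCompressed"

    # Anything else is treated as generic MIME in this sketch.
    return "MimeGeneric"
```

A header-only check like this cannot confirm that the payload is valid; it only shows how the media types named in the descriptions separate the formats.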