Document Class

Represents a document and all of its extracted content.

Definition

Namespace: OpenDiscoverSDK.Interfaces.Platform
Assembly: OpenDiscoverSDK.Interfaces (in OpenDiscoverSDK.Interfaces.dll) Version: 2026.2.6.0 (2026.02.06)
C#
[DataContractAttribute]
[KnownTypeAttribute(typeof(BooleanProperty))]
[KnownTypeAttribute(typeof(DateTimeProperty))]
[KnownTypeAttribute(typeof(DoubleProperty))]
[KnownTypeAttribute(typeof(Int32Property))]
[KnownTypeAttribute(typeof(Int64Property))]
[KnownTypeAttribute(typeof(StringProperty))]
[KnownTypeAttribute(typeof(BooleanListProperty))]
[KnownTypeAttribute(typeof(DateTimeListProperty))]
[KnownTypeAttribute(typeof(DoubleListProperty))]
[KnownTypeAttribute(typeof(Int32ListProperty))]
[KnownTypeAttribute(typeof(Int64ListProperty))]
[KnownTypeAttribute(typeof(StringListProperty))]
public class Document
Inheritance
Object → Document

Constructors

Document Constructor.
Document(Byte) Constructor. This constructor is useful for ICustomDocumentSource implementers.
Document(Document, ChildDocument, ControlNumberingType, String) Constructor.

Properties

AttachmentArchiveRecordOffset The zero-based byte offset of the attachment document in the attachment data archive (.ada) file given by FilePath.
Attributes Document attributes (see DocumentAttributes enumeration).
BccRecipients "Bcc" recipients email address information.
CcRecipients "Cc" recipients email address information.
ChildDocuments Child documents (attachments, embedded, or contained).
ContainerComment Parent container comment associated with this contained document. Some archive container types support user comments for container items.
ContainerPassword Decryption password that successfully decrypted the document from its parent container (e.g., password-encrypted archive items).
ContainerRelativePath The relative path of this document inside its parent document container.
ContentExtractionTimeTicks Time, in CPU timer ticks, to extract the document's content.
CreationTimeFileSystem File system creation time (UTC) of the document.
CustomMetadata Contains custom (user-defined) document metadata as a dictionary of metadata field names as keys and metadata field data as corresponding values.
DatabaseTableInfo If document is a supported database format (see help for supported formats), then this property contains a list of TableInfo objects that describe the tables and table columns in the database.
DatabaseTableMaxRowCount If document is a supported database format (see help for supported formats), then this property contains the maximum number of table rows found in all tables in the database (i.e., the row count of the table in DatabaseTableInfo that has the most rows).
DocControlNumber

[Optional] Document control number.

If this property is set on an input document (see Documents), then each child document (recursively) will have its DocControlNumber set to its parent's DocControlNumber with its one-based child index appended. For example, if an input document has a DocControlNumber of "00000265", then its second child document will have a DocControlNumber of "00000265_2" (and so on, if the child document itself has child documents).

For child documents of input documents, this property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file.

DocGuid Unique GUID document ID generated for every document by the Document class constructor.
DocId

This is a convenience property for users who may want to attach their own unique ID to the document (e.g., SQL table primary key for the document). This value is not used in processing of documents other than logging. Default value is an empty string (""), which indicates not set.

If this property is set on an input document (see Documents), then each child document (recursively) will have its DocId set to its parent's DocId with its one-based child index appended. For example, if an input document has a DocId of "00000265", then its second child document will have a DocId of "00000265.2" (and so on, if the child document itself has child documents).

DuplicateGroupId This is a convenience property for users who may want to track duplicate documents (de-duplication based on hashes). This value is not used or set during processing of documents. Default value is null, which indicates not set. If the property is set, it must be set to the unique ID of the document selected to represent a duplicate document group, and that representing document must have IsRepresentingDuplicate set to true.
EDRMMessageIdentificationHash The EDRM Message Identification Hash (MIH) is the MD5 hash value of the ASCII string comprised of the Message-ID header field of RFC-compliant email messages.
EmailBody Email text body used in extracted email text and content hash.
EmailBodyType The source email body format type for the text of EmailBody.
EmailCreationDate Email creation time, if document is an email (see IsEmailType).
EmailEntryId Entry ID for Outlook PST/OST extracted message objects in hexadecimal string format.
EmailLastModificationTime Email last modified time, if document is an email (see IsEmailType).
EmailReceivedTime Email received date/time, if document is an email (see IsEmailType).
EmailSentTime Email sent date/time, if document is an email (see IsEmailType).
EmailSubject Email subject.
EnrichedBody If HasEnrichedBody is true, then this property contains the Enriched formatted email body.
EntityExtractionTimeTicks Time, in CPU timer ticks, to extract entity items in this document's extracted text and metadata.
EntityResult Entity detection results from scanning the document's extracted text, metadata, and URLs.
Extension The document's file extension.
ExtractedText Extracted document text. See remarks for text extraction limitations.
ExtractedTextSize Gets the size of the document's extracted text, in characters.
FailedPdfPages If the document is a PDF (a Document whose FormatId property has an ID equal to AdobePDF), then this property contains a list of PDF pages (PdfPageInfo) in the PDF document where either an exception occurred while processing the page or the extracted text length was below the PageExtractedTextCriteria criteria.
FamilyControlNumber [Optional] Document family control number. This is the DocControlNumber of the top-most non-container document in a parent/child hierarchy.
FamilyDocGuid [Optional] Document family GUID. This is the DocGuid of the top-most non-container document in a parent/child hierarchy.
FamilySHA1Hash

[Optional] This is the SHA-1 hash of all documents in a family given by FamilyControlNumber.

This property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file.

FamilySHA256Hash

[Optional] This is the SHA-256 hash of all documents in a family given by FamilyControlNumber.

This property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file.

FileAttributes Document file system attributes.
FileEntropy Shannon entropy calculated from the document's raw bytes.
FilePath Document's file path.
FirstContainerParentId The first container "ID" ** (see remarks) in the parent/child hierarchy, if any, that contains this document. A document can be contained in a container that itself is contained in a container (and so on). This property returns the "ID" of the first container, if any, that contains this document. This property returns null if this document does not have a container parent.
FlowedBody If HasFlowedBody is true, then this property contains the Flowed formatted email body.
FormatId Document format identification result.
From Specifies the author(s) of the message; that is, the EmailAddress(es) of the person(s) or system(s) responsible for the writing of the message.
HasEnrichedBody Specifies if this email has an Enriched formatted body.
HasFlowedBody Specifies if this email has a Flowed formatted body.
HasHtmlBody Specifies if this email has an HTML body.
HasRtfBody Specifies if this email has an RTF body.
HasTextBody Specifies if this email has a plain-text body.
HtmlBaseUrl HTML document "base" element tag specifies the base URL/target for all relative URLs in a document.
HtmlBody If HasHtmlBody is true, then this property contains the email HTML body.
HtmlImageTags HTML document 'img' tag info.
HtmlTitle HTML document "title" element text.
HyperLinks Document hyperlinks.
IdentificationTimeTicks Time, in CPU timer ticks, to identify this document.
ImageHeightInPixels For supported raster image file formats, this property will hold the image's height in pixels; otherwise the default value is -1, meaning not determined.
ImageWidthInPixels For supported raster image file formats, this property will hold the image's width in pixels; otherwise the default value is -1, meaning not determined.
Index Zero-offset index of the document in the order it is found embedded in the parent container document for most container types. For some mail store container types it is the internal file format node ID of the message object.
InlineImageContentId If IsInlineEmailImage is true, then this property contains the "Content-ID" of the inline image, if it exists.
IsAttachmentArchiveFilePath If true, then the file was an extracted attachment saved to an attachment archive file given by FilePath, and its zero-based byte offset in the archive is given by AttachmentArchiveRecordOffset.
IsContainerType Specifies whether the document is a document container (i.e., FormatId.Classification is equal to IdClassification.Archive, IdClassification.MailStore, IdClassification.MessagingStore, or IdClassification.MediaImage).
IsEmailType Specifies whether this document is an email type.
IsEmbedded If true, this document is embedded in a Microsoft Office document.
IsEncrypted Specifies whether the document is password protected/encrypted.
IsEncryptedInContainer Specifies whether document is encrypted inside of its parent container. This property only applies to archive child documents.
IsExcludedType Specifies whether the document was excluded from processing based on its format type (ID).
IsExternalAttachment Specifies whether this file is an external file attachment to a parent document (example: MS OneNote .onebin files).
IsHtmlType Specifies whether this document is an HTML document.
IsInlineEmailImage Specifies whether this document is an inline email image.
IsLargeFile Specifies whether this file is considered a "large" document, i.e., its file size exceeds a configurable size, LargeDocumentCritera.
IsMimePartialMessage True if this is a MIME email partial message (ContentType MIME header with MIME-type = "message/partial"); false otherwise.
IsNist Specifies whether the document's SHA1 hash is contained in the National Software Reference Library (NSRL) Reference Data Set (RDS) database.
IsRepresentingDuplicate This is a convenience property for users who may want to track duplicate documents (de-duplication based on hashes). This value is not used or set during processing of documents. Default value is true, which indicates this document is the representing duplicate of a group of duplicate documents. A de-duplication process will need to set this property to false for the non-representing duplicates and also set the DuplicateGroupId property.
IsTextArchivePath Specifies whether the extracted text was saved to a text archive file, in which case TextFilePath points to the containing archive file.
LanguageIdResults Extracted text language identification results (see LanguageIdResult).
LanguageIdTimeTicks Time, in CPU timer ticks, to identify languages in this document's extracted text.
LastAccessTimeFileSystem File system last access time (UTC) of the document.
LastModifiedTimeFileSystem File system last modified time (UTC) of the document.
MD5BinaryHash MD5 document hash (hash of all document bytes).
MD5ContentHash MD5 content hash is a proprietary hash (very similar to EDRM email hash) on only the content part of document file format.
Metadata Contains standard (non-user-defined) document metadata as a dictionary of metadata field names as keys and metadata field data as corresponding values.
MimePartialMessageId If IsMimePartialMessage is true, then this property holds the unique partial message 'id' that all parts (separate MIME files) of the partial message contain.
MimePartialMessagePartNumber If IsMimePartialMessage is true, then this property holds the index (from 1 to MimePartialMessageTotalParts) of this message part.
MimePartialMessageTotalParts If IsMimePartialMessage is true, then this property holds the total number of message parts. Not every message part is guaranteed to have the total parts set in its MIME 'ContentType' header value; often only the first or last message part has this value set.
Name Filename of the document with extension.
NumOfContainerItems If the document is a supported container format, this property holds the number of container child items.
OriginalFilePath Embedded document original file path. Some embedded documents with container wrappers store the location from where the embedded child document was originally located on the file system before being embedded into the parent container. For most child documents, this value will be null.
PackedSize Document's packed size, in bytes, within its parent archive container. If the document is a compressed item in a parent container then this value is the compressed size. This property only applies to archive child documents and to the archive formats that store this item metadata.
ParentChildSourcePath Hierarchical parent/child relative path. This path is the same as ParentChildVirtualPath property, except for emails where the email subject is used as part of the virtual path. If the email does not have a subject then "(No Subject)" is used as the email's subject.
ParentChildVirtualPath Hierarchical parent/child virtual path. This path is only set for child documents of processing input documents and this property is set by the DocumentDataArchiveReader class when de-serializing a document data archive (.dda) processing output file.
ParentDocControlNumber

[Optional] Parent document control number. See DocControlNumber.

For child documents of input documents, this property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file.

ParentDocGuid The GUID ID of this document's immediate parent. The property only gets set if this document is a child document of an input document.
ParentId This is a convenience property for users that may want to attach their own unique id for the parent document of this document (e.g., SQL table primary key for document's parent document). This value is not used in processing of documents other than logging. Default value is empty string (""), which indicates not set.
Password Decryption password found that successfully decrypted the document.
PrimaryDate PrimaryDate can be used to sort documents by a common property date that is meaningful across document formats. See PrimaryDateType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's PrimaryDate.
PrimaryDateType This property specifies which metadata date field was used for this document's PrimaryDate. See DatePrecedenceType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's PrimaryDate.
Result Content extraction result. Check this value to see the state of the content extraction.
ResultErrorMessage Error message associated with Result, if any. This property is only set when Result is not set to 'OK'.
ResultErrorStackTrace Exception stack trace that is only set if ResultErrorMessage is due to a caught internal exception.
RtfBody If HasRtfBody is true, then this property contains the email RTF body.
Sender Sender information. The sender is the EmailAddress of the agent responsible for the actual transmission of the message. The sender and the From are often the same but can be different.
SHA1BinaryHash SHA-1 document hash (hash of all document bytes).
SHA1ContentHash SHA-1 content hash is a proprietary hash on only the content part of document file format.
SHA1EmailAttachmentHash SHA-1 hash of the concatenated SHA-1 hashes of each attachment's binary data (includes hashes of inline images).
SHA1EmailAttachmentSortedHash SHA-1 hash of the SORTED and then concatenated SHA-1 hashes of each attachment's binary data (includes hashes of inline images).
SHA1EmailBodyHash SHA-1 hash of the EmailBody text (converted to lower case and with all white space removed).
SHA1EmailHeaderHash SHA-1 email header hash.
SHA1EmailRecipientNamesHash SHA-1 hash of all recipient names concatenated together (all lower case).
SHA1EmailRecipientsHash SHA-1 hash of recipient names and email addresses concatenated together (all lower case).
SHA256BinaryHash SHA-256 document hash (hash of all document bytes).
SHA256ContentHash SHA-256 content hash is a proprietary hash on only the content part of document file format.
Size Gets the document's native file size, in bytes.
SortDate

SortDate can be used to sort documents by a common property date that is meaningful across document formats. For parent documents, this property is set to the PrimaryDate property value, and for child documents of non-container parent types, this property is set to the SortDate of the parent document.

See DatePrecedenceType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's SortDate.

SortDateType This property specifies which metadata date field was used for this document's SortDate. See DatePrecedenceType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's SortDate.
TestedResult This property is set for archive items that have been tested for actual expansion size and extraction result (see also TestedSize).
TestedSize This property is set for archives and archive items that have been tested for expanded (de-compressed) size (see also TestedResult).
TextArchiveRecordOffset Zero-based byte offset of the extracted text in the text data archive (.tda) file given by TextFilePath.
TextBody If HasTextBody is true, then this property contains the email plain-text body.
TextFilePath File path to document's extracted text.
TextSourceType Gets how the document's text was extracted.
TopMostContainerParentId Top-most container in the parent/child hierarchy, if any, that contains this document. A document can be contained in a container that itself is contained in a container (and so on). This property returns the DocGuid of the top-most container, if any, in the hierarchy that contains this document. This property returns null if this document does not have a container parent.
ToRecipients "To" recipients email address information.
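
The child numbering described for DocControlNumber and DocId can be illustrated with a small sketch. The ChildNumbering class and its ChildId method are hypothetical helpers written for this example; they are not part of OpenDiscoverSDK and only mirror the convention stated in the property descriptions (parent value, a separator, then the one-based child index).

```csharp
using System;

// Hypothetical helper, NOT part of OpenDiscoverSDK. It only mirrors the
// numbering convention described for DocControlNumber and DocId.
public static class ChildNumbering
{
    // Appends the one-based child index to the parent's identifier.
    public static string ChildId(string parentId, int oneBasedChildIndex, string separator)
    {
        if (string.IsNullOrEmpty(parentId))
            throw new ArgumentException("Parent identifier must be set.", nameof(parentId));
        if (oneBasedChildIndex < 1)
            throw new ArgumentOutOfRangeException(nameof(oneBasedChildIndex));
        return parentId + separator + oneBasedChildIndex;
    }

    public static void Main()
    {
        // DocControlNumber uses "_" as the separator.
        string secondChild = ChildId("00000265", 2, "_");
        Console.WriteLine(secondChild);                  // 00000265_2

        // A grandchild repeats the rule on the child's value.
        Console.WriteLine(ChildId(secondChild, 1, "_")); // 00000265_2_1

        // DocId uses "." as the separator.
        Console.WriteLine(ChildId("00000265", 2, "."));  // 00000265.2
    }
}
```

Because each child value starts with its parent's value, a prefix match on the identifier is enough to collect an input document's descendants, provided the input identifiers themselves are of fixed width.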

Methods

Clear Clears all data from document instance.
Equals Determines whether the specified object is equal to the current object. (Inherited from Object)
GetHashCode Serves as the default hash function. (Inherited from Object)
GetType Gets the Type of the current instance. (Inherited from Object)
ToString Returns a string that represents the current object. (Inherited from Object)
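
The descriptions of SHA1EmailBodyHash (lower-cased text with all white space removed) and EDRMMessageIdentificationHash (MD5 of the ASCII Message-ID string) in the Properties list can be sketched with standard .NET hashing classes. This is a minimal illustration, not the SDK's implementation: the EmailHashSketch class is hypothetical, and the UTF-8 encoding of the body, the invariant-culture lower-casing, and hashing the Message-ID exactly as given are all assumptions, so the results may not match the values the SDK produces.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch, NOT part of OpenDiscoverSDK.
public static class EmailHashSketch
{
    // Lower-cases the text and drops every white-space character.
    public static string NormalizeBody(string body) =>
        new string(body.ToLowerInvariant().Where(c => !char.IsWhiteSpace(c)).ToArray());

    // SHA-1 of the normalized body. UTF-8 is an assumption; the source
    // does not state which encoding is hashed.
    public static string Sha1OfBody(string body)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(NormalizeBody(body)));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // MD5 of the ASCII bytes of a Message-ID value, hashed as given
    // (assumption: no trimming or bracket removal is applied here).
    public static string Md5OfMessageId(string messageId)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.ASCII.GetBytes(messageId));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    public static void Main()
    {
        // Case and white-space differences do not change the body hash.
        Console.WriteLine(Sha1OfBody("Hello World\r\n") == Sha1OfBody("helloworld")); // True

        // Made-up Message-ID, used only for illustration.
        Console.WriteLine(Md5OfMessageId("<1234@example.com>"));
    }
}
```

Normalizing before hashing is what lets two copies of the same message compare equal even when mail clients re-wrap lines or change letter case.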
