Document Properties

The Document type exposes the following members.

Properties
Name Description
Public propertyAttachmentArchiveRecordOffset The zero-based byte offset of the attachment document in the attachment data archive (.ada) file given by FilePath.
Public propertyAttributes Document attributes (see DocumentAttributes enumeration).
Public propertyBccRecipients "Bcc" recipients email address information.
Public propertyCcRecipients "Cc" recipients email address information.
Public propertyChildDocuments Child documents (attachments, embedded, or contained).
Public propertyContainerComment Parent container comment associated with this contained document. Some archive container types support user comments for container items.
Public propertyContainerPassword Decryption password found that successfully decrypted the document from its parent container (e.g., password-encrypted archive items).
Public propertyContainerRelativePath The relative path of this document inside its parent document container.
Public propertyContentExtractionTimeTicks Time, in CPU timer ticks, to extract the document's content.
Public propertyCreationTimeFileSystem File system creation time (UTC) of the document.
Public propertyCustomMetadata Contains custom (user-defined) document metadata as a dictionary of metadata field names as keys and metadata field data as corresponding values.
Public propertyDatabaseTableInfo If the document is a supported database format (see help for supported formats), then this property contains a list of TableInfo objects that describe the tables and table columns in the database.
Public propertyDatabaseTableMaxRowCount If the document is a supported database format (see help for supported formats), then this property contains the maximum number of table rows found in any table in the database (i.e., the row count of the table in DatabaseTableInfo that has the most rows).
Public propertyDocControlNumber

[Optional] Document control number.

If this property is set on an input document (see Documents), then each child document (recursively) has its DocControlNumber set to its parent's DocControlNumber with its child index (one-based) appended. For example, if an input document has a DocControlNumber of "00000265", then its second child document has a DocControlNumber of "00000265_2" (and so on, if the child document itself has child documents).

For child documents of input documents, this property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file.

Public propertyDocGuid Unique GUID document ID generated for every document by the Document class constructor.
Public propertyDocId

This is a convenience property for users who may want to attach their own unique ID to the document (e.g., SQL table primary key for the document). This value is not used in processing of documents other than logging. The default value is an empty string (""), which indicates not set.

If this property is set on an input document (see Documents), then each child document (recursively) has its DocId set to its parent's DocId with its child index (one-based) appended. For example, if an input document has a DocId of "00000265", then its second child document has a DocId of "00000265.2" (and so on, if the child document itself has child documents).

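The child numbering described for DocControlNumber and DocId can be illustrated with a minimal sketch. It is not the library's implementation: the `Doc` class, its `control_number` field, and `assign_child_numbers` are hypothetical stand-ins, and the separators ("_" for control numbers, "." for DocId) are taken only from the examples in the descriptions above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a document node; not the library's Document class."""
    name: str
    control_number: str = ""
    children: List["Doc"] = field(default_factory=list)


def assign_child_numbers(parent: Doc, separator: str = "_") -> None:
    """Give each child the parent's number plus its one-based child index, recursively."""
    if not parent.control_number:
        return  # nothing to propagate when the parent has no number set
    for index, child in enumerate(parent.children, start=1):
        child.control_number = f"{parent.control_number}{separator}{index}"
        assign_child_numbers(child, separator)


email = Doc("message.msg", "00000265", [
    Doc("a.pdf"),
    Doc("b.zip", children=[Doc("inner.txt")]),
])
assign_child_numbers(email)
print(email.children[1].control_number)              # 00000265_2
print(email.children[1].children[0].control_number)  # 00000265_2_1
```

Passing `separator="."` gives the DocId form, e.g. "00000265.2" for the second child.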
Public propertyDuplicateGroupId This is a convenience property for users who may want to track duplicate documents (de-duplication based on hashes). This value is not used or set during processing of documents. The default value is null, which indicates not set. If the property is set, it must be set to the unique ID of the document selected to represent a duplicate document group, and that representing document must have IsRepresentingDuplicate set to true.
Public propertyEDRMMessageIdentificationHash The EDRM Message Identification Hash (MIH) is the MD5 hash value of the ASCII string comprised of the Message-ID header field of RFC-compliant email messages.
Public propertyEmailBody Email text body used in extracted email text and content hash.
Public propertyEmailBodyType The source email body format type for the text of EmailBody.
Public propertyEmailCreationDate Email creation time, if document is an email (see IsEmailType).
Public propertyEmailEntryId Entry ID for Outlook PST/OST extracted message objects in hexadecimal string format.
Public propertyEmailLastModificationTime Email last modified time, if document is an email (see IsEmailType).
Public propertyEmailReceivedTime Email received date/time, if document is an email (see IsEmailType).
Public propertyEmailSentTime Email sent date/time, if document is an email (see IsEmailType).
Public propertyEmailSubject Email subject.
Public propertyEnrichedBody If HasFlowedBody is true, then this property contains the Enriched formatted email body.
Public propertyEntityExtractionTimeTicks Time, in CPU timer ticks, to extract entity items in this document's extracted text and metadata.
Public propertyEntityResult Entity detection results from scanning the document's extracted text, metadata, and URLs.
Public propertyExtension The document's file extension.
Public propertyExtractedText Extracted document text. See remarks for text extraction limitations.
Public propertyExtractedTextSize Gets the size of the document's extracted text, in characters.
Public propertyFailedPdfPages If the document is a PDF (a Document whose FormatId property has an ID equal to AdobePDF), then this property contains a list of PDF pages (PdfPageInfo) in the PDF document where either an exception occurred processing the page or the extracted text length was below the PageExtractedTextCriteria criteria.
Public propertyFamilyControlNumber [Optional] Document family control number. This is the DocControlNumber of the top-most non-container document in a parent/child hierarchy.
Public propertyFamilyDocGuid [Optional] Document family GUID. This is the DocGuid of the top-most non-container document in a parent/child hierarchy.
Public propertyFamilySHA1Hash

[Optional] This is the SHA-1 hash of all documents in a family given by FamilyControlNumber.

This property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file.

Public propertyFamilySHA256Hash

[Optional] This is the SHA-256 hash of all documents in a family given by FamilyControlNumber.

This property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file.

Public propertyFileAttributes Document file system attributes.
Public propertyFileEntropy Shannon entropy calculated from the document's raw bytes.
Public propertyFilePath Document's file path.
Public propertyFirstContainerParentId The first container "ID" ** (see remarks) in the parent/child hierarchy, if any, that contains this document. A document can be contained in a container that itself is contained in a container (and so on). This property returns the "ID" of the first container, if any, that contains this document. This property returns null if this document does not have a container parent.
Public propertyFlowedBody If HasFlowedBody is true, then this property contains the Flowed formatted email body.
Public propertyFormatId Document format identification result.
Public propertyFrom Specifies the author(s) of the message; that is, the EmailAddress(es) of the person(s) or system(s) responsible for the writing of the message.
Public propertyHasEnrichedBody Specifies if this email has an Enriched formatted body.
Public propertyHasFlowedBody Specifies if this email has a Flowed formatted body.
Public propertyHasHtmlBody Specifies if this email has an HTML body.
Public propertyHasRtfBody Specifies if this email has an RTF body.
Public propertyHasTextBody Specifies if this email has a plain-text body.
Public propertyHtmlBaseUrl HTML document "base" element tag specifies the base URL/target for all relative URLs in a document.
Public propertyHtmlBody If HasHtmlBody is true, then this property contains the email HTML body.
Public propertyHtmlImageTags HTML document 'img' tag info.
Public propertyHtmlTitle HTML document "title" element text.
Public propertyHyperLinks Document hyperlinks.
Public propertyIdentificationTimeTicks Time, in CPU timer ticks, to identify this document.
Public propertyImageHeightInPixels For supported raster image file formats, this property will hold the image's height in pixels; otherwise the default value is -1, meaning not determined.
Public propertyImageWidthInPixels For supported raster image file formats, this property will hold the image's width in pixels; otherwise the default value is -1, meaning not determined.
Public propertyIndex Zero-offset index of the document in the order it is found embedded in the parent container document for most container types. For some mail store container types it is the internal file format node ID of the message object.
Public propertyInlineImageContentId If IsInlineEmailImage is true, then this property contains the "Content-ID" of the inline image, if it exists.
Public propertyIsAttachmentArchiveFilePath If true, then the file was an extracted attachment saved to an attachment archive file given by FilePath, and its zero-based byte offset in the archive is given by AttachmentArchiveRecordOffset.
Public propertyIsContainerType Specifies whether the document is a document container (i.e., FormatId.Classification is equal to IdClassification.Archive, IdClassification.MailStore, IdClassification.MessagingStore, or IdClassification.MediaImage).
Public propertyIsEmailType Specifies whether this document is an email type.
Public propertyIsEmbedded If true, this document is embedded in a Microsoft Office document.
Public propertyIsEncrypted Specifies whether the document is password protected/encrypted.
Public propertyIsEncryptedInContainer Specifies whether the document is encrypted inside of its parent container. This property only applies to archive child documents.
Public propertyIsExcludedType Specifies whether the document was excluded from processing based on its format type (ID).
Public propertyIsExternalAttachment Specifies whether this file is an external file attachment to a parent document (example: MS OneNote .onebin files).
Public propertyIsHtmlType Specifies whether this document is an HTML document.
Public propertyIsInlineEmailImage Specifies whether this document is an inline email image.
Public propertyIsLargeFile Specifies whether this file is considered a "large" document, i.e., its file size exceeds a configurable size, LargeDocumentCritera.
Public propertyIsMimePartialMessage True if this is a MIME email partial message (ContentType MIME header with MIME-type = "message/partial"); false otherwise.
Public propertyIsNist Specifies whether the document's SHA-1 hash is contained in the National Software Reference Library (NSRL) Reference Data Set (RDS) database.
Public propertyIsRepresentingDuplicate This is a convenience property for users who may want to track duplicate documents (de-duplication based on hashes). This value is not used or set during processing of documents. The default value is true, which indicates this document is the representing duplicate of a group of duplicate documents. A de-duplication process will need to set this property to false for the non-representing duplicates and also set the DuplicateGroupId property.
Public propertyIsTextArchivePath Specifies whether the extracted text was saved to a text archive file, in which case TextFilePath points to the containing archive file.
Public propertyLanguageIdResults Extracted text language identification results (see LanguageIdResult).
Public propertyLanguageIdTimeTicks Time, in CPU timer ticks, to identify languages in this document's extracted text.
Public propertyLastAccessTimeFileSystem File system last access time (UTC) of the document.
Public propertyLastModifiedTimeFileSystem File system last modified time (UTC) of the document.
Public propertyMD5BinaryHash MD5 document hash (hash of all document bytes).
Public propertyMD5ContentHash MD5 content hash is a proprietary hash on only the content part of document file format.
Public propertyMetadata Contains standard (non-user-defined) document metadata as a dictionary of metadata field names as keys and metadata field data as corresponding values.
Public propertyMimePartialMessageId If IsMimePartialMessage is true, then this property holds the unique partial message 'id' that all parts (separate MIME files) of the partial message contain.
Public propertyMimePartialMessagePartNumber If IsMimePartialMessage is true, then this property holds the index (from 1 to MimePartialMessageTotalParts) of this message part.
Public propertyMimePartialMessageTotalParts If IsMimePartialMessage is true, then this property holds the total number of message parts. Not every message part is guaranteed to have the total parts set in its MIME 'ContentType' header value; often only the first or last message part has this value set.
Public propertyName Filename of the document with extension.
Public propertyNumOfContainerItems If the document is a supported container format, this property holds the number of container child items.
Public propertyOriginalFilePath Embedded document original file path. Some embedded documents with container wrappers store the location from where the embedded child document was originally located on the file system before being embedded into the parent container. For most child documents, this value will be null.
Public propertyPackedSize Document's packed size, in bytes, within its parent archive container. If the document is a compressed item in a parent container then this value is the compressed size. This property only applies to archive child documents and to the archive formats that store this item metadata.
Public propertyParentChildSourcePath Hierarchical parent/child relative path. This path is the same as ParentChildVirtualPath property, except for emails where the email subject is used as part of the virtual path. If the email does not have a subject then "(No Subject)" is used as the email's subject.
Public propertyParentChildVirtualPath Hierarchical parent/child virtual path. This path is only set for child documents of processing input documents and this property is set by the DocumentDataArchiveReader class when de-serializing a document data archive (.dda) processing output file.
Public propertyParentDocControlNumber

[Optional] Parent document control number. See DocControlNumber.

For child documents of input documents, this property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file.

Public propertyParentDocGuid The GUID ID of this document's immediate parent. The property only gets set if this document is a child document of an input document.
Public propertyParentId This is a convenience property for users who may want to attach their own unique ID for the parent document of this document (e.g., SQL table primary key for the document's parent document). This value is not used in processing of documents other than logging. The default value is an empty string (""), which indicates not set.
Public propertyPassword Decryption password found that successfully decrypted the document.
Public propertyPrimaryDate PrimaryDate can be used to sort documents by a common property date that is meaningful across document formats. See PrimaryDateType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's PrimaryDate.
Public propertyPrimaryDateType This property specifies which metadata date field was used for this document's PrimaryDate. See DatePrecedenceType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's PrimaryDate.
Public propertyResult Content extraction result. Check this value to see the state of the content extraction.
Public propertyResultErrorMessage Error message associated with Result, if any. This property is only set when Result is not set to 'OK'.
Public propertyResultErrorStackTrace Exception stack trace that is only set if ResultErrorMessage is due to caught internal exception.
Public propertyRtfBody If HasRtfBody is true, then this property contains the email RTF body.
Public propertySender Sender information. The sender is the EmailAddress of the agent responsible for the actual transmission of the message. The sender and the From are often the same but can be different.
Public propertySHA1BinaryHash SHA-1 document hash (hash of all document bytes).
Public propertySHA1ContentHash SHA-1 content hash is a proprietary hash on only the content part of document file format.
Public propertySHA1EmailAttachmentHash SHA-1 hash of the concatenated SHA-1 hashes of each attachment's binary data (includes hashes of inline images).
Public propertySHA1EmailAttachmentSortedHash SHA-1 hash of the SORTED and then concatenated SHA-1 hashes of each attachment's binary data (includes hashes of inline images).
Public propertySHA1EmailBodyHash SHA-1 hash of the email's EmailBody text (converted to lower case and with all white space removed).
Public propertySHA1EmailHeaderHash SHA-1 email header hash.
Public propertySHA1EmailRecipientNamesHash SHA-1 hash of all recipient names concatenated together (all lower case).
Public propertySHA1EmailRecipientsHash SHA-1 hash of recipient names and email addresses concatenated together (all lower case).
Public propertySHA256BinaryHash SHA-256 document hash (hash of all document bytes).
Public propertySHA256ContentHash SHA-256 content hash is a proprietary hash on only the content part of document file format.
Public propertySize Gets the document's native file size, in bytes.
Public propertySortDate

SortDate can be used to sort documents by a common property date that is meaningful across document formats. For parent documents, this property is set to the PrimaryDate property value, and for child documents of non-container parent types, this property is set to the SortDate of the parent document.

See DatePrecedenceType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's SortDate.

Public propertySortDateType This property specifies which metadata date field was used for this document's SortDate. See DatePrecedenceType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's SortDate.
Public propertyTestedResult This property is set for archive items that have been tested for actual expansion size and extraction result (see also TestedSize).
Public propertyTestedSize This property is set for archives and archive items that have been tested for expanded (de-compressed) size (see also TestedResult).
Public propertyTextArchiveRecordOffset Zero-based byte offset of the extracted text in the text data archive (.tda) file given by TextFilePath.
Public propertyTextBody If HasTextBody is true, then this property contains the email plain-text body.
Public propertyTextFilePath File path to document's extracted text.
Public propertyTextSourceType Gets how the document's text was extracted.
Public propertyTopMostContainerParentId Top-most container in the parent/child hierarchy, if any, that contains this document. A document can be contained in a container that itself is contained in a container (and so on). This property returns the DocGuid of the top-most container, if any, in the hierarchy that contains this document. This property returns null if this document does not have a container parent.
Public propertyToRecipients "To" recipients email address information.
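
The DuplicateGroupId and IsRepresentingDuplicate entries leave de-duplication to the caller. A minimal sketch of that bookkeeping follows. It assumes a hypothetical `DocRecord` stand-in (not the library's Document class), grouping by the binary SHA-1 hash, using the document GUID as the "unique ID", and picking the first document seen as the representative; a real process may choose differently on each point.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class DocRecord:
    """Hypothetical stand-in holding only the fields this sketch needs."""
    doc_guid: str
    sha1_binary_hash: str
    duplicate_group_id: Optional[str] = None   # default: not set
    is_representing_duplicate: bool = True     # default per the table above


def mark_duplicates(docs: Iterable[DocRecord]) -> None:
    """Group documents by binary hash and flag one representative per group."""
    groups: Dict[str, List[DocRecord]] = defaultdict(list)
    for doc in docs:
        groups[doc.sha1_binary_hash].append(doc)

    for members in groups.values():
        if len(members) < 2:
            continue  # unique documents keep their default values
        representative = members[0]  # assumption: first one seen represents the group
        for doc in members:
            doc.duplicate_group_id = representative.doc_guid
            doc.is_representing_duplicate = doc is representative


docs = [
    DocRecord("guid-a", "hash-1"),
    DocRecord("guid-b", "hash-2"),
    DocRecord("guid-c", "hash-1"),
]
mark_duplicates(docs)
for doc in docs:
    print(doc.doc_guid, doc.duplicate_group_id, doc.is_representing_duplicate)
```

Here "guid-a" and "guid-c" share a hash, so both receive the group ID "guid-a" and only "guid-a" stays flagged as the representative; "guid-b" is left at its defaults.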