Document Properties
The Document type exposes the following members.
| Name | Description | |
|---|---|---|
| AttachmentArchiveRecordOffset | The zero-based byte offset of the attachment document in the attachment data archive (.ada) file given by FilePath. | |
| Attributes | Document attributes (see DocumentAttributes enumeration). | |
| BccRecipients | "Bcc" recipients email address information. | |
| CcRecipients | "Cc" recipients email address information. | |
| ChildDocuments | Child documents (attachments, embedded, or contained). | |
| ContainerComment | Parent container comment associated with this contained document. Some archive container types support user comments for container items. | |
| ContainerPassword | Decryption password that successfully decrypted the document from its parent container (e.g., password-encrypted archive items). | |
| ContainerRelativePath | The relative path of this document inside its parent document container. | |
| ContentExtractionTimeTicks | Time, in CPU timer ticks, to extract the document's content. | |
| CreationTimeFileSystem | File system creation time (UTC) of the document. | |
| CustomMetadata | Contains custom (user-defined) document metadata as a dictionary of metadata field names as keys and metadata field data as corresponding values. | |
| DatabaseTableInfo | If the document is a supported database format (see help for supported formats), then this property contains a list of TableInfo objects that describe the tables and table columns in the database. | |
| DatabaseTableMaxRowCount | If the document is a supported database format (see help for supported formats), then this property contains the maximum number of table rows found in any table in the database (i.e., the row count of the table in DatabaseTableInfo that has the most rows). | |
| DocControlNumber | [Optional] Document control number. If this property is set on an input document (see Documents), then each child document (recursively) will have its DocControlNumber set to its parent's DocControlNumber with its one-based child index appended. For example, if an input document has a DocControlNumber of "00000265", then its second child document will have a DocControlNumber of "00000265_2" (and so on, if the child document itself has child documents). For child documents of input documents, this property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file. | |
| DocGuid | Unique GUID document ID generated for every document by the Document class constructor. | |
| DocId | This is a convenience property for users who may want to attach their own unique ID to the document (e.g., SQL table primary key for the document). This value is not used in processing of documents other than logging. The default value is an empty string (""), which indicates not set. If this property is set on an input document (see Documents), then each child document (recursively) will have its DocId set to its parent's DocId with its one-based child index appended. For example, if an input document has a DocId of "00000265", then its second child document will have a DocId of "00000265.2" (and so on, if the child document itself has child documents). | |
| DuplicateGroupId | This is a convenience property for users who may want to track duplicate documents (de-duplication based on hashes). This value is not used or set during processing of documents. The default value is null, which indicates not set. If the property is set, it must be set to the unique ID of the document selected to represent a duplicate document group, and that representing document must have IsRepresentingDuplicate set to true. | |
| EDRMMessageIdentificationHash | The EDRM Message Identification Hash (MIH) is the MD5 hash value of the ASCII string comprised of the Message-ID header field of RFC-compliant email messages. | |
| EmailBody | Email text body used in extracted email text and content hash. | |
| EmailBodyType | The source email body format type for the text of EmailBody. | |
| EmailCreationDate | Email creation time, if document is an email (see IsEmailType). | |
| EmailEntryId | Entry ID for Outlook PST/OST extracted message objects in hexadecimal string format. | |
| EmailLastModificationTime | Email last modified time, if document is an email (see IsEmailType). | |
| EmailReceivedTime | Email received date/time, if document is an email (see IsEmailType). | |
| EmailSentTime | Email sent date/time, if document is an email (see IsEmailType). | |
| EmailSubject | Email subject. | |
| EnrichedBody | If HasEnrichedBody is true, then this property contains the Enriched formatted email body. | |
| EntityExtractionTimeTicks | Time, in CPU timer ticks, to extract entity items in this document's extracted text and metadata. | |
| EntityResult | Entity detection results from scanning the document's extracted text, metadata, and URLs. | |
| Extension | The document's file extension. | |
| ExtractedText | Extracted document text. See remarks for text extraction limitations. | |
| ExtractedTextSize | Gets the size of the document's extracted text, in characters. | |
| FailedPdfPages | If the document is a PDF (a Document whose FormatId property has an ID equal to AdobePDF), then this property contains a list of PDF pages (PdfPageInfo) in the PDF document where either an exception occurred while processing the page or the extracted text length fell below the PageExtractedTextCriteria value. | |
| FamilyControlNumber | [Optional] Document family control number. This is the DocControlNumber of the top-most non-container document in a parent/child hierarchy. | |
| FamilyDocGuid | [Optional] Document family GUID. This is the DocGuid of the top-most non-container document in a parent/child hierarchy. | |
| FamilySHA1Hash | [Optional] This is the SHA-1 hash of all documents in a family given by FamilyControlNumber. This property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file. | |
| FamilySHA256Hash | [Optional] This is the SHA-256 hash of all documents in a family given by FamilyControlNumber. This property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file. | |
| FileAttributes | Document file system attributes. | |
| FileEntropy | Shannon entropy calculated from the document's raw bytes. | |
| FilePath | Document's file path. | |
| FirstContainerParentId | The first container "ID" ** (see remarks) in the parent/child hierarchy, if any, that contains this document. A document can be contained in a container that itself is contained in a container (and so on). This property returns the "ID" of the first container, if any, that contains this document. This property returns null if this document does not have a container parent. | |
| FlowedBody | If HasFlowedBody is true, then this property contains the Flowed formatted email body. | |
| FormatId | Document format identification result. | |
| From | Specifies the author(s) of the message; that is, the EmailAddress(es) of the person(s) or system(s) responsible for the writing of the message. | |
| HasEnrichedBody | Specifies if this email has an Enriched formatted body. | |
| HasFlowedBody | Specifies if this email has a Flowed formatted body. | |
| HasHtmlBody | Specifies if this email has an HTML body. | |
| HasRtfBody | Specifies if this email has an RTF body. | |
| HasTextBody | Specifies if this email has a plain-text body. | |
| HtmlBaseUrl | HTML document "base" element tag specifies the base URL/target for all relative URLs in a document. | |
| HtmlBody | If HasHtmlBody is true, then this property contains the email HTML body. | |
| HtmlImageTags | HTML document 'img' tag info. | |
| HtmlTitle | HTML document "title" element text. | |
| HyperLinks | Document hyperlinks. | |
| IdentificationTimeTicks | Time, in CPU timer ticks, to identify this document. | |
| ImageHeightInPixels | For supported raster image file formats, this property will hold the image's height in pixels; otherwise the default value is -1, meaning not determined. | |
| ImageWidthInPixels | For supported raster image file formats, this property will hold the image's width in pixels; otherwise the default value is -1, meaning not determined. | |
| Index | Zero-offset index of the document in the order it is found embedded in the parent container document for most container types. For some mail store container types it is the internal file format node ID of the message object. | |
| InlineImageContentId | If IsInlineEmailImage is true, then this property contains the "Content-ID" of the inline image, if it exists. | |
| IsAttachmentArchiveFilePath | If true, then the file was an extracted attachment that was saved to the attachment archive file given by FilePath, and its zero-based byte offset in the archive is given by AttachmentArchiveRecordOffset. | |
| IsContainerType | Specifies whether the document is a document container (i.e., FormatId.Classification is equal to IdClassification.Archive, IdClassification.MailStore, IdClassification.MessagingStore, or IdClassification.MediaImage). | |
| IsEmailType | Specifies whether this document is an email type. | |
| IsEmbedded | If true, this document is embedded in a Microsoft Office document. | |
| IsEncrypted | Specifies whether the document is password protected/encrypted. | |
| IsEncryptedInContainer | Specifies whether the document is encrypted inside its parent container. This property only applies to archive child documents. | |
| IsExcludedType | Specifies whether the document was excluded from processing based on its format type (ID). | |
| IsExternalAttachment | Specifies whether this file is an external file attachment to a parent document (example: MS OneNote .onebin files). | |
| IsHtmlType | Specifies whether this document is an HTML document. | |
| IsInlineEmailImage | Specifies whether this document is an inline email image. | |
| IsLargeFile | Specifies whether this file is considered a "large" document, i.e., its file size exceeds a configurable size, LargeDocumentCritera. | |
| IsMimePartialMessage | True if this is a MIME email partial message (ContentType MIME header with MIME-type = "message/partial"); false otherwise. | |
| IsNist | Specifies whether the document's SHA1 hash is contained in the National Software Reference Library (NSRL) Reference Data Set (RDS) database. | |
| IsRepresentingDuplicate | This is a convenience property for users who may want to track duplicate documents (de-duplication based on hashes). This value is not used or set during processing of documents. The default value is true, which indicates this document is the representing duplicate of a group of duplicate documents. A de-duplication process will need to set this property to false for the non-representing duplicates and also set the DuplicateGroupId property. | |
| IsTextArchivePath | Specifies whether the extracted text was saved to a text archive file, in which case TextFilePath points to the containing archive file. | |
| LanguageIdResults | Extracted text language identification results (see LanguageIdResult). | |
| LanguageIdTimeTicks | Time, in CPU timer ticks, to identify languages in this document's extracted text. | |
| LastAccessTimeFileSystem | File system last access time (UTC) of the document. | |
| LastModifiedTimeFileSystem | File system last modified time (UTC) of the document. | |
| MD5BinaryHash | MD5 document hash (hash of all document bytes). | |
| MD5ContentHash | MD5 content hash is a proprietary hash on only the content part of document file format. | |
| Metadata | Contains standard (non-user-defined) document metadata as a dictionary of metadata field names as keys and metadata field data as corresponding values. | |
| MimePartialMessageId | If IsMimePartialMessage is true, then this property holds the unique partial message 'id' that all parts (separate MIME files) of the partial message contain. | |
| MimePartialMessagePartNumber | If IsMimePartialMessage is true, then this property holds the index (from 1 to MimePartialMessageTotalParts) of this message part. | |
| MimePartialMessageTotalParts | If IsMimePartialMessage is true, then this property holds the total number of message parts. It is not guaranteed that each message part will have the total parts set in the MIME 'ContentType' header value; often only the first or last message part has this value set. | |
| Name | Filename of the document with extension. | |
| NumOfContainerItems | If the document is a supported container format, this property holds the number of container child items. | |
| OriginalFilePath | Embedded document original file path. Some embedded documents with container wrappers store the location from where the embedded child document was originally located on the file system before being embedded into the parent container. For most child documents, this value will be null. | |
| PackedSize | Document's packed size, in bytes, within its parent archive container. If the document is a compressed item in a parent container then this value is the compressed size. This property only applies to archive child documents and to the archive formats that store this item metadata. | |
| ParentChildSourcePath | Hierarchical parent/child relative path. This path is the same as ParentChildVirtualPath property, except for emails where the email subject is used as part of the virtual path. If the email does not have a subject then "(No Subject)" is used as the email's subject. | |
| ParentChildVirtualPath | Hierarchical parent/child virtual path. This path is only set for child documents of processing input documents and this property is set by the DocumentDataArchiveReader class when de-serializing a document data archive (.dda) processing output file. | |
| ParentDocControlNumber | [Optional] Parent document control number. See DocControlNumber. For child documents of input documents, this property value is only populated by the DocumentDataArchiveReader when deserializing a document data archive (.dda) file. | |
| ParentDocGuid | The GUID ID of this document's immediate parent. The property only gets set if this document is a child document of an input document. | |
| ParentId | This is a convenience property for users that may want to attach their own unique id for the parent document of this document (e.g., SQL table primary key for document's parent document). This value is not used in processing of documents other than logging. Default value is empty string (""), which indicates not set. | |
| Password | Decryption password found that successfully decrypted the document. | |
| PrimaryDate | PrimaryDate can be used to sort documents by a common property date that is meaningful across document formats. See PrimaryDateType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's PrimaryDate. | |
| PrimaryDateType | This property specifies which metadata date field was used for this document's PrimaryDate. See DatePrecedenceType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's PrimaryDate. | |
| Result | Content extraction result. Check this value to see the state of the content extraction. | |
| ResultErrorMessage | Error message associated with Result, if any. This property is only set when Result is not set to 'OK'. | |
| ResultErrorStackTrace | Exception stack trace. This property is only set if ResultErrorMessage is due to a caught internal exception. | |
| RtfBody | If HasRtfBody is true, then this property contains the email RTF body. | |
| Sender | Sender information. The sender is the EmailAddress of the agent responsible for the actual transmission of the message. The sender and the From are often the same but can be different. | |
| SHA1BinaryHash | SHA-1 document hash (hash of all document bytes). | |
| SHA1ContentHash | SHA-1 content hash is a proprietary hash on only the content part of document file format. | |
| SHA1EmailAttachmentHash | SHA-1 hash of the concatenated SHA-1 hashes of each attachment's binary data (includes hashes of inline images). | |
| SHA1EmailAttachmentSortedHash | SHA-1 hash of the sorted and then concatenated SHA-1 hashes of each attachment's binary data (includes hashes of inline images). | |
| SHA1EmailBodyHash | SHA-1 hash of the email EmailBody text (converted to lower case and with all white space removed). | |
| SHA1EmailHeaderHash | SHA-1 email header hash. | |
| SHA1EmailRecipientNamesHash | SHA-1 hash of all recipient names concatenated together (all lower case). | |
| SHA1EmailRecipientsHash | SHA-1 hash of recipient names and email addresses concatenated together (all lower case). | |
| SHA256BinaryHash | SHA-256 document hash (hash of all document bytes). | |
| SHA256ContentHash | SHA-256 content hash is a proprietary hash on only the content part of document file format. | |
| Size | Gets the document's native file size, in bytes. | |
| SortDate | SortDate can be used to sort documents by a common property date that is meaningful across document formats. For parent documents, this property is set to the PrimaryDate property value, and for child documents of non-container parent types, this property is set to the SortDate of the parent document. See DatePrecedenceType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's SortDate. | |
| SortDateType | This property specifies which metadata date field was used for this document's SortDate. See DatePrecedenceType for the metadata field precedence rules (in order of DatePrecedenceType enumeration values) used in calculating a document's SortDate. | |
| TestedResult | This property is set for archive items that have been tested for actual expansion size and extraction result (see also TestedSize). | |
| TestedSize | This property is set for archives and archive items that have been tested for expanded (de-compressed) size (see also TestedResult). | |
| TextArchiveRecordOffset | Zero-based byte offset of the extracted text in the text data archive (.tda) file given by TextFilePath. | |
| TextBody | If HasTextBody is true, then this property contains the email plain-text body. | |
| TextFilePath | File path to document's extracted text. | |
| TextSourceType | Gets how the document's text was extracted. | |
| TopMostContainerParentId | Top-most container in the parent/child hierarchy, if any, that contains this document. A document can be contained in a container that itself is contained in a container (and so on). This property returns the DocGuid of the top-most container, if any, in the hierarchy that contains this document. This property returns null if this document does not have a container parent. | |
| ToRecipients | "To" recipients email address information. |
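
The child numbering rule described for DocId and DocControlNumber can be illustrated with a small, language-neutral sketch. This is not the SDK's code or its Document class: the `Doc` dataclass and the `assign_child_ids` function are hypothetical names introduced only for illustration. The only facts taken from the table are the separators ("." for DocId, "_" for DocControlNumber) and the one-based child index. The form of identifiers below the first child level is an assumption based on the word "recursively".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a document node (not the SDK's Document type)."""
    name: str
    doc_id: str = ""                # empty string means "not set", as in the table
    doc_control_number: str = ""
    children: List["Doc"] = field(default_factory=list)


def assign_child_ids(parent: Doc) -> None:
    """Derive each child's identifiers from its parent's, recursively."""
    # Child positions are counted from 1, matching the "second child" example.
    for position, child in enumerate(parent.children, start=1):
        if parent.doc_id:
            child.doc_id = f"{parent.doc_id}.{position}"
        if parent.doc_control_number:
            child.doc_control_number = f"{parent.doc_control_number}_{position}"
        # Assumption: grandchildren extend the child's identifier the same way.
        assign_child_ids(child)


# An input document with two children; the second child has its own child.
root = Doc(
    "message.msg",
    doc_id="00000265",
    doc_control_number="00000265",
    children=[Doc("a.txt"), Doc("b.zip", children=[Doc("inner.doc")])],
)
assign_child_ids(root)

second = root.children[1]
print(second.doc_id, second.doc_control_number)
print(second.children[0].doc_id)
```

The second child receives "00000265.2" and "00000265_2", which matches the examples in the DocId and DocControlNumber rows. A document whose parent has no identifier set keeps its default empty value in this sketch.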