public class DocumentDataArchiveReader : IDisposable
The DocumentDataArchiveReader type exposes the following members.
| Constructor | Description |
|---|---|
| DocumentDataArchiveReader | Constructor. |
| Property | Description |
|---|---|
| ClassificationCount | Gets a dictionary with IdClassification as key and, as value, the count of documents that have that file format classification. |
| ContentResultCount | Gets a dictionary with ContentResult as key and a ContentResultInfo as value. |
| CreationDate | Archive creation date (UTC). |
| DirectoryHierarchy | All document data in the input directory hierarchy. The hierarchy also contains document parent/child relationships. |
| DocumentArchiveFolderPath | The root folder of the document data archive. |
| DocumentByControlNumber | Returns all documents by DocControlNumber, provided that the documents had DocControlNumber set. |
| DocumentByDocGuid | Gets a dictionary with DocGuid as key and the associated document as value. |
| EntityItemDocuments | All documents with at least 1 entity item found in extracted text and/or metadata. |
| ExcludedDocuments | All documents with Result set to ExcludedType. |
| FlatRecords | Gets all archive document entries as a flattened (non-hierarchical) list. |
| FormatIdCount | Gets a dictionary with Id as key and, as value, the count of documents that have that file format identification. |
| HasReadErrors | True if there were errors reading the document data archive (.dda). |
| HierarchicalRecords | Gets all document data archive entries with parent/child hierarchy. |
| IssueDocuments | All documents whose Result value is not Ok, EmptyFile, ExcludedType, or RequeueAsSeparateTask. |
| LongRunningDocuments | All documents that have Result set to LongRunningProcessingError. |
| NistDocuments | All documents whose SHA1BinaryHash matches a SHA1 hash in the NIST hash database (see PerformNistCheck and NistRdsDatabasePath). |
| PdfDocumentsWithFailedPages | All PDF documents with at least 1 failed PDF page. |
| ReaderMode | The DocumentDataArchiveReaderMode of this instance. |
| ReadErrors | If HasReadErrors is true, this property holds the read error information. |
| RequeueDocuments | All documents with Result set to either RequeueAsSeparateTask or UserRequeueAsSeparateTask. |
| Settings | Task settings that were used to create this document data archive output. |
| SHA1BinaryHashMatchGroups | Gets a list of HashMatchGroup, each containing documents that have the same SHA1BinaryHash value. |
| SHA1ContentHashMatchGroups | Gets a list of HashMatchGroup, each containing documents that have the same SHA1ContentHash value. |
| TotalFlatRecordSize | Total size in bytes of all documents in FlatRecords. |
| TotalNumOfDocumentRecords | Total number of document records in the document data archive. |
| TotalSHA1BinaryHashMatches | Gets the total number of documents that have the same SHA1BinaryHash. |
| TotalSHA1ContentHashMatches | Gets the total number of documents that have the same SHA1ContentHash. |
| UnknownDocuments | All documents with FormatId set to either Unknown or UnknownCompoundFile. |
| Version | Archive format version. |
| Method | Description |
|---|---|
| Dispose | Dispose. |
| Equals | Determines whether the specified object is equal to the current object. (Inherited from Object) |
| GetDuplicateDocumentGroups | Gets all duplicate document groups present in the document data archive (.dda). |
| GetHashCode | Serves as the default hash function. (Inherited from Object) |
| GetType | Gets the Type of the current instance. (Inherited from Object) |
| ReadDocumentFromControlNumberIndex | Reads a document from the DocumentControlNumberIndex given by the 'docControlNumber' argument. This archive must have been constructed with ControlNumberIndexAndHeaderOnly mode, or else this method will throw an exception. |
| ToString | Returns a string that represents the current object. (Inherited from Object) |
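
The hash match group members listed above bring together documents that share a SHA-1 hash. The sketch below illustrates that idea only and does not use the library: `HashGrouping`, `Sha1Hex`, and `FindMatchGroups` are hypothetical names, documents are modeled as a plain name-to-bytes dictionary, and treating a "group" as two or more documents with an identical hash is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical illustration; not part of the library's API.
public static class HashGrouping
{
    // Computes the SHA-1 hash of the content as an uppercase hex string.
    public static string Sha1Hex(byte[] content)
    {
        using (var sha1 = SHA1.Create())
        {
            return BitConverter.ToString(sha1.ComputeHash(content)).Replace("-", "");
        }
    }

    // Groups document names by the hash of their content and keeps only
    // groups with at least two members (assumed meaning of a "match group").
    public static List<List<string>> FindMatchGroups(IDictionary<string, byte[]> documents)
    {
        return documents
            .GroupBy(d => Sha1Hex(d.Value), d => d.Key)
            .Where(g => g.Count() > 1)
            .Select(g => g.OrderBy(name => name).ToList())
            .ToList();
    }
}
```

In a real workflow the groups would come from the reader's own properties rather than from rehashing content; the sketch only shows what "same hash" grouping means.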
Dispose of this DocumentDataArchiveReader instance when done with it so that it releases an internal BinaryReader and all resources.
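
A `using` block is the usual C# way to guarantee that disposal happens, even when an exception is thrown. Because the constructor arguments of DocumentDataArchiveReader are not shown on this page, the sketch below uses a hypothetical stand-in class (`ArchiveReaderStandIn`) that owns a BinaryReader in the same way; only the disposal pattern is meant to carry over.

```csharp
using System;
using System.IO;

// Hypothetical stand-in that owns a BinaryReader. It is NOT the real
// DocumentDataArchiveReader; it exists only to demonstrate the pattern.
public sealed class ArchiveReaderStandIn : IDisposable
{
    private readonly BinaryReader _reader;

    public bool IsDisposed { get; private set; }

    public ArchiveReaderStandIn(Stream stream)
    {
        _reader = new BinaryReader(stream);
    }

    // Reads a 32-bit little-endian integer from the underlying stream.
    public int ReadVersion() => _reader.ReadInt32();

    public void Dispose()
    {
        if (IsDisposed) return;
        _reader.Dispose(); // also closes the underlying stream
        IsDisposed = true;
    }
}

public static class DisposeExample
{
    // Returns true if the stand-in was disposed when the using block ended.
    public static bool Run()
    {
        ArchiveReaderStandIn captured;
        using (var reader = new ArchiveReaderStandIn(new MemoryStream(new byte[] { 3, 0, 0, 0 })))
        {
            captured = reader;
            Console.WriteLine(reader.ReadVersion());
        } // Dispose() is called here, releasing the BinaryReader
        return captured.IsDisposed;
    }
}
```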
If a document data archive (.dda) is too large to read into memory, or if you do not need the extra summary information or hierarchical document relationships that this class returns, consider using the DDARecordReader class instead.
To use this class, keep each document processing task to 3-5 GB of total input so that the whole document data archive can be read into memory. Large archives and mail stores should each have their own task, and if they are too large for a single task, they should be partitioned into smaller processing tasks (see IsPartitioned and TotalPartitions for more information).
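
One way to follow this sizing guidance is to plan tasks against a byte budget before processing starts. The sketch below is an illustration under stated assumptions, not part of the library: `TaskPlanner` and `PlanTasks` are hypothetical names, inputs are modeled as (path, size) pairs, and the budget is chosen by the caller (for example, somewhere in the 3-5 GB range).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not part of the library's API.
public static class TaskPlanner
{
    // Groups inputs, in order, into tasks whose total size stays at or below
    // maxTaskBytes. An input that alone exceeds the budget ends up in a task
    // of its own, which marks it as a candidate for partitioning.
    public static List<List<string>> PlanTasks(
        IEnumerable<(string Path, long Size)> inputs, long maxTaskBytes)
    {
        var tasks = new List<List<string>>();
        var current = new List<string>();
        long currentBytes = 0;

        foreach (var (path, size) in inputs)
        {
            // Start a new task when adding this input would exceed the budget.
            if (current.Count > 0 && currentBytes + size > maxTaskBytes)
            {
                tasks.Add(current);
                current = new List<string>();
                currentBytes = 0;
            }
            current.Add(path);
            currentBytes += size;
        }

        if (current.Count > 0) tasks.Add(current);
        return tasks;
    }
}
```

This greedy pass preserves input order and does not try to pack tasks optimally. It also does not give a large archive or mail store its own task unless the archive exceeds the budget, so those inputs may need to be separated before planning.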
A document data archive (always named "DocumentDataArchive.dda") holds the data extracted from processed documents (metadata, attributes, etc.). Documents stored in a document data archive may also link either to individual external extracted text/attachment files or to external text/attachment archive files that act as compact containers for this information.