DocumentDataArchiveReader Class

Reads all document records from an existing document data archive (.dda) into memory.

Definition

Namespace: OpenDiscoverSDK.Platform.Archive
Assembly: OpenDiscoverSDK (in OpenDiscoverSDK.dll) Version: 2026.2.6.0 (2026.02.06)
C#
public class DocumentDataArchiveReader : IDisposable
Inheritance
Object → DocumentDataArchiveReader
Implements
IDisposable

Remarks

Dispose of the DocumentDataArchiveReader instance when you are finished with it so that it releases its internal BinaryReader and all other resources.
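The disposal advice above follows the standard .NET `using` pattern. Because the constructor signatures for DocumentDataArchiveReader are not listed in this section, the minimal sketch below uses a plain `BinaryReader` over an in-memory stream as a stand-in; the byte payload is made up for illustration. The same `using` block shape should apply to any IDisposable type, including this class.

```csharp
using System;
using System.IO;
using System.Text;

// Stand-in data: three arbitrary bytes, not a real .dda file.
byte[] payload = Encoding.UTF8.GetBytes("dda");
var stream = new MemoryStream(payload);

int firstByte;
// BinaryReader stands in for DocumentDataArchiveReader; both implement IDisposable.
using (var reader = new BinaryReader(stream))
{
    firstByte = reader.ReadByte();
} // Dispose() runs here, even if ReadByte() throws.

// Disposing a BinaryReader also closes the stream it wraps (default leaveOpen: false).
bool streamClosed = !stream.CanRead;
Console.WriteLine($"first byte: {firstByte}, stream closed: {streamClosed}");
```

Wrapping the reader in `using` keeps the release deterministic rather than leaving it to the garbage collector.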

If a document data archive (.dda) is too large to read into memory, or if you do not need the extra summary information or hierarchical document relationships returned by this class, consider using the DDARecordReader class instead.

To use this class, keep document processing job tasks to 3-5 GB in total input size so that whole document data archives can be read into memory. Large archives and mail stores should have their own tasks; if they are still too large, partition them into smaller processing tasks (see IsPartitioned and TotalPartitions for more information).
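One way to apply the sizing guidance above is to plan tasks before processing. The sketch below is not part of the SDK: `PlanTasks`, the file names, and the 4 GB budget are all hypothetical. It greedily packs inputs into tasks under a byte budget and gives any input that exceeds the budget its own task, which would then be a candidate for partitioning.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an SDK API): packs inputs into tasks of at most budgetBytes.
static List<List<(string Path, long Size)>> PlanTasks(
    IEnumerable<(string Path, long Size)> inputs, long budgetBytes)
{
    var tasks = new List<List<(string Path, long Size)>>();
    var current = new List<(string Path, long Size)>();
    long currentSize = 0;

    // Largest inputs first, so oversized ones are isolated early.
    foreach (var input in inputs.OrderByDescending(i => i.Size))
    {
        if (input.Size > budgetBytes)
        {
            // Too large for any shared task: give it its own task.
            tasks.Add(new List<(string Path, long Size)> { input });
            continue;
        }
        if (current.Count > 0 && currentSize + input.Size > budgetBytes)
        {
            // Current task is full: close it and start a new one.
            tasks.Add(current);
            current = new List<(string Path, long Size)>();
            currentSize = 0;
        }
        current.Add(input);
        currentSize += input.Size;
    }
    if (current.Count > 0) tasks.Add(current);
    return tasks;
}

const long GB = 1L << 30;
var inputs = new List<(string Path, long Size)>
{
    ("mail.pst", 6 * GB), ("docs-a", 2 * GB), ("docs-b", 2 * GB), ("docs-c", 1 * GB),
};

var plan = PlanTasks(inputs, 4 * GB);
foreach (var task in plan)
    Console.WriteLine(string.Join(", ", task.Select(t => t.Path)));
```

The greedy pass is deliberately simple; it does not try to find an optimal packing, only to keep each task within the budget.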

A document data archive (always named "DocumentDataArchive.dda") holds extracted document data (metadata, attributes, and so on) from processed documents. Documents stored in a document data archive may also link either to individual external extracted text/attachment files or to external text/attachment archive files that act as compact containers for this information.

Constructors

Properties

ClassificationCount Gets a dictionary that maps each IdClassification to the number of documents with that file format classification.
ContentResultCount Gets a dictionary that contains ContentResult as key and a ContentResultInfo as value.
CreationDate Archive creation date (UTC).
DirectoryHierarchy All document data in the input directory hierarchy. The hierarchy also contains document parent/child relationships.
DocumentArchiveFolderPath The root folder of the document data archive.
DocumentByControlNumber Gets all documents by DocControlNumber, provided that the documents had DocControlNumber set.
DocumentByDocGuid Gets a dictionary with DocGuid key and associated document value.
EntityItemDocuments All documents with at least 1 entity item found in extracted text and/or metadata.
ExcludedDocuments All documents with Result set to ExcludedType.
FlatRecords Gets all archive document entries as a flattened (non-hierarchical) list.
FormatIdCount Gets a dictionary that maps each Id to the number of documents with that file format identification.
HasReadErrors True if there were errors reading the document data archive (.dda).
HierarchicalRecords Gets all document data archive entries with parent/child hierarchy.
IssueDocuments All documents whose Result value is not Ok, EmptyFile, ExcludedType, or RequeueAsSeparateTask.
LongRunningDocuments All documents that have Result values set to LongRunningProcessingError.
NistDocuments All documents whose SHA1BinaryHash matches a SHA1 hash in the NIST hash database (see PerformNistCheck and NistRdsDatabasePath).
PdfDocumentsWithFailedPages All PDF documents with at least 1 failed PDF page.
ReaderMode The DocumentDataArchiveReaderMode of this instance.
ReadErrors If HasReadErrors is true, this property will hold read error information.
RequeueDocuments All documents with Result set to either RequeueAsSeparateTask or UserRequeueAsSeparateTask.
Settings Task settings that were used to create this document data archive output.
SHA1BinaryHashMatchGroups Gets a list of HashMatchGroup that contain documents that have the same SHA1BinaryHash value.
SHA1ContentHashMatchGroups Gets a list of HashMatchGroup that contain documents that have the same SHA1ContentHash value.
TotalFlatRecordSize Total size in bytes of all documents in FlatRecords.
TotalNumOfDocumentRecords Total number of document records in the document data archive.
TotalSHA1BinaryHashMatches Gets the total number of documents that have the same SHA1BinaryHash.
TotalSHA1ContentHashMatches Gets the total number of documents that have the same SHA1ContentHash.
UnknownDocuments All documents with FormatId set to either Unknown or UnknownCompoundFile.
Version Archive format version.
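The hash match properties above describe grouping documents by a shared SHA1 value. As a rough illustration of that idea, the sketch below groups stand-in `(Name, Sha1)` tuples with LINQ; the names and hash strings are invented, and it does not call the SDK. Whether the SDK's totals count every member of a group, as this sketch does, is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in records (not SDK document objects); hash values are shortened placeholders.
var records = new List<(string Name, string Sha1)>
{
    ("a.docx", "AAA"), ("b.docx", "BBB"), ("copy-of-a.docx", "AAA"),
    ("c.pdf", "CCC"), ("a-again.docx", "AAA"), ("b2.docx", "BBB"),
};

// Keep only hashes shared by two or more documents.
var matchGroups = records
    .GroupBy(r => r.Sha1)
    .Where(g => g.Count() > 1)
    .Select(g => g.Select(r => r.Name).ToList())
    .ToList();

// Assumption: the total counts every document that belongs to a match group.
int totalMatches = matchGroups.Sum(g => g.Count);

foreach (var group in matchGroups)
    Console.WriteLine(string.Join(", ", group));
Console.WriteLine($"groups: {matchGroups.Count}, documents in groups: {totalMatches}");
```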

Methods

Dispose Releases the internal BinaryReader and all other resources held by this instance.
Equals Determines whether the specified object is equal to the current object. (Inherited from Object)
GetDuplicateDocumentGroups Gets all duplicate document groups present in the document data archive (.dda).
GetHashCode Serves as the default hash function. (Inherited from Object)
GetType Gets the Type of the current instance. (Inherited from Object)
ReadDocumentFromControlNumberIndex Reads the document identified by the 'docControlNumber' argument from the DocumentControlNumberIndex. This reader must have been constructed in ControlNumberIndexAndHeaderOnly mode; otherwise, this method throws an exception.
ToString Returns a string that represents the current object. (Inherited from Object)

See Also