
DocumentTaskEngine Class

Provides functionality to extract content from hundreds to thousands of documents as a single task (see DocumentTaskSettings), or from "large" archives and mail store containers that deserve their own separate tasks. The DocumentTaskEngine is a highly parallel document extraction engine that completely unrolls and processes deep parent document/child document (attachments/embedded objects/media) hierarchies.
Inheritance Hierarchy
System.Object
  OpenDiscoverSDK.Platform.DocumentTaskEngine

Namespace: OpenDiscoverSDK.Platform
Assembly: OpenDiscoverSDK (in OpenDiscoverSDK.dll) Version: 2025.4.4.0 (2025.4.4)
Syntax
C#
public class DocumentTaskEngine

The DocumentTaskEngine type exposes the following members.

Constructors
Kind | Name | Description
Public method | DocumentTaskEngine | Constructor.
Properties
Kind | Name | Description
Public property | CurrentNumberOfArchivesInProcess | If a task is running, returns the number of archives (Zip/7zip/Tar/Rar/etc) currently being processed.
Public property | CurrentNumberOfDatabasesInProcess | If a task is running, returns the number of databases currently being processed.
Public property | CurrentNumberOfLargeDocumentsInProcess | If a task is running, returns the number of "large" documents currently being processed. "Large" documents are defined by LargeDocumentCritera.
Public property | CurrentNumberOfMailStoresInProcess | If a task is running, returns the number of mail stores (PST/OST/MBOX/etc) currently being processed.
Public property | DocumentMetadataToFileStoreQueueCount | If a task is running, returns the current number of documents waiting to have their extracted metadata written to file store. This is the last step for a processed document.
Public property | EmbeddedAndTextToFileStoreQueueCount | If a task is running, returns the current number of extracted embedded documents and documents with extracted text waiting to be written to file store.
Public property | ExtractedDocumentQueueCount | If a task is running, returns the current number of extracted documents waiting to be processed.
Public property | InputDocumentQueueCount | If a task is running, returns the current number of documents waiting to be read for processing.
Public property | IsFileStoreWriterComplete | Returns true if a task is currently running and the task has completed writing all attachments and extracted text to ar or flat files; false if still busy or a task is not running.
Public property | IsolateCorruptDocument | RESERVED - DO NOT USE OR SET PROPERTY. Reserved for internal testing.
Public property | IsProcessingDocumentsComplete | Returns true if a task is currently running and the task has completed processing all documents; false if still busy or a task is not running.
Public property | IsReadingDocumentComplete | Returns true if a task is currently running and the task has completed reading all the input documents; false if it still has documents to read or a task is not running.
Public property | IsTaskRunning | True if a task is currently being executed via a previous call to method RunTask.
Public property | LargeDocumentQueueCount | If a task is running, returns the current number of "large" documents that require special processing and are waiting to be processed. "Large" documents are defined by LargeDocumentCritera.
Public property | NumInputDocuments | If a task is running, returns the total number of input documents to be processed for this task.
Public property | ProcessedDocuments | If a task is currently running, returns null. If the task has finished running, this property returns the input document hierarchy; that is, extracted child documents (embedded, attachments, and container items) are populated (if any) in ChildDocuments.
Public property | ReadDocumentQueueCount | If a task is running, returns the current number of documents read and waiting to be processed.
Public property | TaskPercentComplete | If a task is running, returns the task's estimated percent complete (0-100).
Public property | TotalArchivesProcessed | If a task is running, returns the total number of archives (Zip/7zip/Tar/Rar/etc) that have been fully processed so far.
Public property | TotalDatabasesProcessed | If a task is running, returns the total number of databases that have been fully processed so far.
Public property | TotalDocumentsProcessed | If a task is running, returns the total number of documents that have been fully processed so far (includes extracted embedded and container items).
Public property | TotalInputDocumentsProcessed | If a task is running, returns the total number of input documents to the task that have been fully processed so far (does NOT include extracted embedded and container items).
Public property | TotalMailStoresProcessed | If a task is running, returns the total number of mail stores (PST/OST/MBOX/etc) that have been fully processed so far.
Methods
Kind | Name | Description
Public method | AbortTask | Aborts the currently executing task started by RunTask or RunTaskBlocking. Aborting may cause the host to crash, so it should only be used to stop a rogue or long-running document task. Any cleanup, database updating, and task scheduler notifications should be done prior to calling this method.
Public method, static member | CreateNistRdsDatabase | Creates a NIST National Software Reference Library (NSRL) Reference Data Set (RDS) database that can be used by DocumentTaskEngine to de-NIST documents while processing (see PerformNistCheck and NistRdsDatabasePath).
Public method | Equals | Determines whether the specified object is equal to the current object. (Inherited from Object)
Public method | GetHashCode | Serves as the default hash function. (Inherited from Object)
Public method | GetType | Gets the Type of the current instance. (Inherited from Object)
Public method | RunTask | Asynchronously executes the document task defined by the constructor DocumentTaskSettings argument.
Public method | RunTaskBlocking | Executes the document task defined by the constructor DocumentTaskSettings argument synchronously (blocking).
Public method | ToString | Returns a string that represents the current object. (Inherited from Object)
Events
Kind | Name | Description
Public event | Completed | Task is completed event.
Public event | FatalException | Fatal exception event.
Public event | LogUpdated | Task log updated event.
Public event | LongProcessingDocumentWarning | Long processing document warning event.
Remarks

See methods RunTask and RunTaskBlocking. These methods provide highly parallel, very fast processing of batches of documents, and can process a "large" archive or mail store as a single task. Large archives and mail stores can also be broken into multiple separate tasks; see IsPartitioned.

When processing hundreds to thousands of documents as a single task, their combined file size should not exceed 4-5 gigabytes; otherwise the output document data archive (.dda) could become too large to read into memory (see DocumentDataArchiveReader). When processing more than 10 gigabytes of documents, break them into DocumentTaskSettings tasks of 2 to 4 gigabytes each. For "large" archives and mail stores ("large" being a subjective term), create separate tasks using the SingleArchive or SingleMailStore processing type, respectively.
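
The size guidance above can be applied before any task is created. The following is a minimal sketch and not part of the SDK: TaskBatcher and PackIntoBatches are hypothetical names, and the 4 GB default cap is simply the upper end of the per-task range suggested above. It greedily groups (path, size) pairs into batches whose combined size stays at or below the cap. A file that exceeds the cap on its own ends up alone in a batch, which marks it as a candidate for a separate task.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of OpenDiscoverSDK): groups input files into
// batches whose combined size stays at or below a byte cap.
public static class TaskBatcher
{
    // Assumption: 4 GB, the upper end of the per-task range suggested above.
    public const long DefaultMaxBatchBytes = 4L * 1024 * 1024 * 1024;

    public static List<List<string>> PackIntoBatches(
        IEnumerable<(string Path, long SizeBytes)> files,
        long maxBatchBytes = DefaultMaxBatchBytes)
    {
        if (maxBatchBytes <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxBatchBytes));

        var batches = new List<List<string>>();
        var current = new List<string>();
        long currentBytes = 0;

        foreach (var (path, size) in files)
        {
            // Close the current batch if adding this file would exceed the cap.
            if (current.Count > 0 && currentBytes + size > maxBatchBytes)
            {
                batches.Add(current);
                current = new List<string>();
                currentBytes = 0;
            }

            // A file larger than the cap still gets added, alone in its batch.
            current.Add(path);
            currentBytes += size;
        }

        if (current.Count > 0)
            batches.Add(current);

        return batches;
    }
}
```

File sizes can be read with new FileInfo(path).Length. Each resulting batch would then become the input document set of one DocumentTaskSettings task; how the paths are passed to the settings object is not shown here.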

For archives, be aware of the expansion size before deciding how to process them. Archives can have very high compression ratios; for example, a 500 MB archive could expand into 50 GB of files. It is wise to test archives for their true expansion size before expanding or extracting them.
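
One way to check expansion size for Zip archives, sketched here with the standard .NET System.IO.Compression classes rather than any SDK facility, is to sum the sizes recorded for each entry without extracting anything. ArchiveSizeProbe and Measure are hypothetical names. The sizes come from the archive's own headers, so a malformed or hostile archive can misreport them, and archives nested inside the archive are not expanded. Other formats (7zip, Tar, Rar) would need other libraries.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper (not part of OpenDiscoverSDK): reads the sizes recorded
// in a Zip archive's entry headers without extracting any data.
public static class ArchiveSizeProbe
{
    // Returns the summed compressed and uncompressed sizes of all entries.
    public static (long Compressed, long Expanded) Measure(Stream zipStream)
    {
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            long compressed = 0;
            long expanded = 0;
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                compressed += entry.CompressedLength; // size stored in the archive
                expanded += entry.Length;             // size after extraction
            }
            return (compressed, expanded);
        }
    }
}
```

For a file on disk, pass File.OpenRead(path) as the stream. Comparing the Expanded total with the size guidance for a single task gives a rough basis for deciding whether the archive should be processed as its own task.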

Breaking large document processing sets into separate tasks aids distribution across multiple desktops or VMs and also aids re-queuing of any failed tasks.

Note: DocumentTaskEngine can handle long file paths (>255 characters in length) for input documents.

See Also