DocumentTaskSettings Class

Represents a document processing task (1 or more documents) that a DocumentTaskEngine instance can process.
Inheritance Hierarchy
System.Object
  OpenDiscoverSDK.Interfaces.Settings.ContentExtractionSettings
    OpenDiscoverSDK.Interfaces.Platform.Settings.DocumentTaskSettings

Namespace: OpenDiscoverSDK.Interfaces.Platform.Settings
Assembly: OpenDiscoverSDK.Interfaces (in OpenDiscoverSDK.Interfaces.dll) Version: 2025.4.4.0 (2025.4.4)
Syntax
C#
[DataContractAttribute]
public class DocumentTaskSettings : ContentExtractionSettings

The DocumentTaskSettings type exposes the following members.

Constructors
 Name Description
Public methodDocumentTaskSettings Constructor.
Properties
 Name Description
Public propertyAllowLongProcessingDocumentTest RESERVED - DO NOT USE OR SET THIS PROPERTY. Reserved for internal SDK testing; setting this property to true can lead to unpredictable behavior.
Public propertyAttachmentArchiveFilenameFormat Base string format of attachment archive filenames (read-only). Only used when OutputMode is set to Archive.
Public propertyAttachmentArchiveMaxSize Attachment archive file maximum size (default is 4GB).
Public propertyClientId [Optional] User defined client Id that owns the ProjectId.
Public propertyClientName [Optional] A user defined client name.
Public propertyCollectionId [Required] A user defined document collection (job) Id that uniquely identifies the document processing set that this processing task defined by TaskId belongs to.
Public propertyCollectionName [Optional] A user defined document collection (job) name.
Public propertyCollectionSourcePath The ingestion file path or directory path used as the source for the processing document collection.
Public propertyCollectionStartDateUtc The processing start DateTime in UTC for this document collection (see CollectionId).
Public propertyControlNumberingType Determines how DocControlNumber numbers are formatted. These numbers are only generated when de-serializing a data document archive (.dda) file using the DocumentDataArchiveReader class. The default control numbering (ID) format type is ParentChildWithUnderscoreSeparator.
Public propertyCountMailStoreMessageObjects If property ProcessingTaskType is set to SingleMailStore and this value is true, then mail store message objects are always counted before processing. If false, mail store message objects are not counted and property MailStoreMessageCount must be set to the number of message objects in the mail store.
Public propertyCpuCoreMode Specifies the DocumentTaskEngine CPU core usage mode.
Public propertyCreateEmptyTextFilesWhenNoTextExtracted If true (default value) and property OutputMode equals IndividualFiles, then the task will write out an empty extracted text file for any document whose text could not be extracted.
Public propertyCustodianId [Optional] User defined custodian Id that owns the CollectionId.
Public propertyCustodianName [Optional] A user defined custodian name.
Public propertyCustomDocumentSource The custom document source. Only set if ProcessingTaskType is set to CustomDocumentSource.
Public propertyDebugLogging If true, verbose debug logging is enabled (default value: false). Setting this to true is not recommended unless you are tracking down a failed task or related issue, because it degrades task execution performance.
Public propertyDocumentArchiveFilename Standardized name of all document archive output files (read-only) (.dda).
Public propertyDocumentArchiveRootPath Gets or sets the root extracted path for document archive file output.
Public propertyDocuments The input documents to process. See remarks.
Public propertyEmbeddedObjectExtraction Embedded document/attachment and embedded office media extraction setting.
(Inherited from ContentExtractionSettings)
Public propertyEntityExtractionSettings Options for entity extraction in extracted text, metadata, and URLs.
(Inherited from ContentExtractionSettings)
Public propertyExcludedDocumentTypes HashSet of document Ids to "exclude" from content extraction.
Public propertyExcludeInlineEmailImages Exclude all inline email images. The default value is true.
Public propertyExtractionType Document content extraction type. This property is read-only and this property's value is controlled by the value of property ProcessingMode.
(Overrides ContentExtractionSettings.ExtractionType)
Public propertyExtractOfficeTrackedChanges If true, appends tracked change information/text from office document formats (that support tracked changes) to the end of the document's extracted text; otherwise, tracked changes text is not appended to document's extracted text.
(Inherited from ContentExtractionSettings)
Public propertyHashing Document hashing settings.
(Inherited from ContentExtractionSettings)
Public propertyIgnoreCorruptedDocuments RESERVED. Used by Open Discover Platform Workflow Management System (WMS).
Public propertyInMemorySizeMaxCritera The maximum size (in bytes) of documents that are processed in memory.
Public propertyIsPartitioned True if this task is to process a partition (subset) of items in a single archive or single mail store; otherwise, if false, this task is to process all the archive or mail store container items. This property is ignored if ProcessingTaskType property is set to DocumentSet. See remarks.
Public propertyLanguageId Language identification of extracted text settings.
(Inherited from ContentExtractionSettings)
Public propertyLargeDocumentCritera Defines the "large" document criteria, in bytes, that determines what type of content extractor is returned by the content extractor factory for "large" unknown/unsupported formats and also "large" encoded text based formats.
(Inherited from ContentExtractionSettings)
Public propertyLongProcessingDocumentCriteriaInSec Long processing document criteria, in seconds. Specifies the elapsed processing time after which the first DocumentTaskEngine.LongProcessingDocumentWarning event is fired (if at all).
Public propertyMailStoreMessageCount If CountMailStoreMessageObjects is false and property ProcessingTaskType is set to SingleMailStore, then this value must contain the already counted number of message objects in the mail store. Otherwise this property value is ignored.
Public propertyMaxArchiveCompressionRatio Archive maximum compression ratio security feature to help protect against archive compression 'ZIP-bombs'.
Public propertyMaxNumDatabaseTableRowsToOutput Maximum number of database table rows to output to extracted table text. The default value is -1, which means all rows. Database tables can potentially have tens of millions of rows, so users should use caution when processing databases of unknown origin. If the value is 0, only the table name and table column names are output to the table text file.
Public propertyMetrics Task execution metrics. Metrics will be populated at end of task execution.
Public propertyNistRdsDatabasePath Full directory path to NIST RDS hash database.
Public propertyOutputEmailBodies If true, for email types, saves all extracted email bodies to the task's outputted document data archive file (DocumentDataArchive.dda). If false (default value), email bodies are not saved.
Public propertyOutputMode Output mode for extracted attachments and text. See remarks.
Public propertyPartitionTarget The archive or mail store partition to process out of the TotalPartitions container partitions.
Public propertyPasswords Gets or sets the password array to use for decryption of supported password-protected formats.
Public propertyPdfDocument PDF document extraction settings.
(Inherited from ContentExtractionSettings)
Public propertyPerformNistCheck Perform NIST check on document using SHA1BinaryHash hashes. The default value is true.
Public propertyPhysicalProcessorAffinity Specifies the DocumentTaskEngine physical processor (CPU) affinity for multi-processor workstations or servers. Do not set for machines with virtual processor (CPU) cores.
Public propertyProcessingMode Processing mode. Determines the type of content that is extracted from documents.
Public propertyProcessingTaskType Processing type. Determines if we are processing a document set, a single archive (or a multi-part archive), or a single mail store.
Public propertyProjectId [Optional] User defined project Id that owns the CustodianId.
Public propertyProjectName [Optional] A user defined project name.
Public propertyRequeueLargeContainersAsOwnTask If true, "large" (defined by RequeueLargeContainerSizeCriteria) archive and mail store containers found in DocumentTaskSettings input documents (Documents) don't have their container items extracted but get their Result property set to RequeueAsSeparateTask and are not processed further.
Public propertyRequeueLargeContainerSizeCriteria Defines the size criteria, in bytes, above which archive and mail store containers found in a DocumentTaskSettings task are considered too "large" for the task and should get their own task.
Public propertySaveAttachments If true (default value), the task will save extracted attachments, embedded documents, and embedded media. See remarks.
Public propertyTaskId [Optional] User defined task Id for this task.
Public propertyTaskParameter [Optional] User defined task parameter.
Public propertyTestArchives Test archives for actual expanded size and actual compression ratio. If false, no expansion test is performed before extracting from archive.
Public propertyTextArchiveFilenameFormat Base string format of text archive filenames (read-only). Only used when OutputMode is set to Archive.
Public propertyTextArchiveMaxSize Maximum text BLOB file size (default is 4GB). Only used when OutputMode is set to Archive.
Public propertyTextFileEncoding If property OutputMode is set to IndividualFiles, then this property determines if text files are written out as UTF-16 (default) or UTF-8 encoded (with BOM).
Public propertyTimeZoneAndEmail Settings for document collection time zone and related extracted DateTime metadata and email extracted text DateTime display.
(Inherited from ContentExtractionSettings)
Public propertyTotalPartitions The total number of processing partition tasks to partition an archive or mail store. This property must be set if property IsPartitioned is set to true and must be greater than or equal to 2.
Public propertyUnsupportedFiltering Binary-to-text filtering of unsupported/unknown document file format settings.
(Inherited from ContentExtractionSettings)
Public propertyUseLargeDocumentUTF16Encoding 'Large' document extracted text encoding (see base class property LargeDocumentCritera). This property is read-only: the base class setter is overridden and the property value is controlled by the value of property TextFileEncoding.
(Overrides ContentExtractionSettings.UseLargeDocumentUTF16Encoding)
Public propertyUserRequeueDocumentTypes User defined HashSet of document format Ids (Id) to not process further and to mark as "requeue" for a user's custom processing workflow. If a document is found to have a format Id contained in this hash set, then its Result is set to UserRequeueAsSeparateTask and it is not processed further.
Methods
 Name Description
Public methodEquals Determines whether the specified object is equal to the current object.
(Inherited from Object)
Public methodGetHashCode Serves as the default hash function.
(Inherited from Object)
Public methodGetType Gets the Type of the current instance.
(Inherited from Object)
Public methodToString Returns a string that represents the current object.
(Inherited from Object)
Remarks

This class stores the processing task settings for a document set, single archive, or single mail store that is to be processed by a DocumentTaskEngine instance.

A set of documents, a single archive**, or a single mail store** to process should not have a combined size greater than 5 gigabytes; otherwise the outputted document data archive (.dda) could become too large to read into memory. Large document sets should be broken up into subsets of 1-2 gigabytes to process as tasks. Breaking a large document processing job (e.g., 100 gigabytes of documents) into 1-2 gigabyte subsets also helps distribute the job across multiple desktops or server VMs, each running one or more DocumentTaskEngine instances.

** "Large" archives or mail stores can be processed as a single task, or can also be partitioned into many sub-tasks for distributable processing. See properties IsPartitioned, TotalPartitions and PartitionTarget.
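
The subset guidance above can be sketched with plain .NET file APIs. This is only an illustration, not part of the SDK: TaskBatcher, Batch, and BatchDirectory are hypothetical names, and the greedy grouping shown here is one possible strategy. Each resulting batch of file paths would then be turned into the input documents of its own task.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not an SDK type): groups files into size-limited batches.
public static class TaskBatcher
{
    // Greedily groups (path, size) pairs, in input order, into batches whose
    // combined size stays at or below maxBatchBytes. A single file larger than
    // the limit ends up in a batch of its own.
    public static List<List<string>> Batch(
        IEnumerable<(string Path, long Length)> files, long maxBatchBytes)
    {
        if (maxBatchBytes <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxBatchBytes));

        var batches = new List<List<string>>();
        var current = new List<string>();
        long currentBytes = 0;

        foreach (var (path, length) in files)
        {
            // Close the current batch if adding this file would exceed the limit.
            if (current.Count > 0 && currentBytes + length > maxBatchBytes)
            {
                batches.Add(current);
                current = new List<string>();
                currentBytes = 0;
            }
            current.Add(path);
            currentBytes += length;
        }

        if (current.Count > 0)
            batches.Add(current);
        return batches;
    }

    // Convenience overload: batches every file under rootPath (recursively).
    public static List<List<string>> BatchDirectory(string rootPath, long maxBatchBytes)
    {
        var files = new DirectoryInfo(rootPath)
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .Select(f => (Path: f.FullName, Length: f.Length));
        return Batch(files, maxBatchBytes);
    }
}
```

For a 2 gigabyte limit, pass 2L * 1024 * 1024 * 1024 so that the arithmetic is done in 64-bit rather than overflowing a 32-bit int.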

Example
This example code snippet shows how to set up a DocumentTaskSettings object to process a single very "large" archive and how to break up the large archive into 4 separate partitions for distributable processing. In the snippet below, this task will only work on the 2nd partition (PartitionTarget) out of the 4 (TotalPartitions) total partitions. The other 3 partitions can be run as separate DocumentTaskEngine tasks with PartitionTarget property set to 1, 3, and 4, respectively.
C#
var taskSettings = new OpenDiscoverSDK.Interfaces.Platform.Settings.DocumentTaskSettings();
taskSettings.CollectionId = "101";
taskSettings.TaskId       = Guid.NewGuid().ToString();

taskSettings.ProcessingTaskType = ProcessingType.SingleArchive; // Single (or multi-part) archive processing task
taskSettings.IsPartitioned   = true; // Archive will be partitioned
taskSettings.TotalPartitions = 4;    // Archive will be broken up and processed as 4 separate partitions (tasks).
taskSettings.PartitionTarget = 2;    // The partition # this task will work on (other DocumentTaskEngine instances can process the other partitions simultaneously)

var archivePath    = @"D:\InputDocuments\Archives\VeryLargeArchive.zip";
var outputRootPath = @"D:\Output\"; // Root path to store task output
var taskOutputPath = System.IO.Path.Combine(outputRootPath, string.Format(@"CollectionId_{0}\Task_{1}", taskSettings.CollectionId, taskSettings.TaskId));

// 
// For single archive or single mail store tasks, the input document(s) Document.FilePath and Document.FormatId properties should be set:
// 
var archiveDocument = new Document();
archiveDocument.FilePath = archivePath;
using (var docStream = System.IO.File.OpenRead(archivePath))
{
    archiveDocument.FormatId = OpenDiscoverSDK.DocumentIdentifier.Identify(docStream, archivePath);
}

// For a split (multi-part) archive, we would pass in a list of the split segment documents in order:
taskSettings.Documents      = new List<Document> { archiveDocument };
taskSettings.ProcessingMode = ProcessingMode.TextAndMetadata;
taskSettings.OutputMode     = OutputMode.IndividualFiles;  // Extracted attachments/ text will be saved as individual (flat) files.

// Set root path for processing output files:
taskSettings.DocumentArchiveRootPath = taskOutputPath;

taskSettings.Passwords        = null;  // No passwords to cycle through
taskSettings.PerformNistCheck = false; // No checking document binary hashes against NIST database.

taskSettings.EmbeddedObjectExtraction = EmbeddedExtractionType.EmbeddedDocumentsAndMedia;
taskSettings.ExcludeInlineEmailImages = true;

taskSettings.PdfDocument.ImageExtraction = PdfImageExtraction.OnlyFailedPdfPages;
taskSettings.PdfDocument.PageExtractedTextCriteria = 1;

taskSettings.TimeZoneAndEmail.CollectionTimeZone      = TimeZoneInfo.Utc;
taskSettings.TimeZoneAndEmail.ApplyTimeZoneToMetadata = false;
taskSettings.TimeZoneAndEmail.EmailDateTimeFormat     = EmailDateTimeFormat.MonthDayYearTime;
taskSettings.TimeZoneAndEmail.ShowUtcOffsetForTime    = true;

taskSettings.Hashing.HashingType = HashingType.BinaryAndContentHash;
taskSettings.Hashing.MaxBinaryHashLength = 10L * 1024 * 1024 * 1024;  // Hash up to a maximum of the first 10GB of a file (long literal avoids 32-bit int overflow)
taskSettings.Hashing.IncludeBccRecipientsInEmailContentHash = false;

taskSettings.LanguageId.IdentifyLanguages = true;

taskSettings.UnsupportedFiltering.FilteringType = UnsupportedFilterType.Unsupported;
taskSettings.UnsupportedFiltering.LargeUnsupportedMaxFilteredChars = 1024 * 1024 * 1024; // Binary-to-text filter at max 1 billion chars

// 
// Create a document task engine instance to process the task:
// 
var documentTaskEngine = new DocumentTaskEngine(taskSettings);
documentTaskEngine.Completed      += _documentTaskEngine_Completed;
documentTaskEngine.FatalException += _documentTaskEngine_FatalException;
documentTaskEngine.LongProcessingDocumentWarning += _documentTaskEngine_LongProcessingDocumentWarning;

// 
// Run task synchronously (blocking):
// 
documentTaskEngine.RunTaskBlocking();

// TODO: do something with the output, like bulk insert into a document store or an eDiscovery document review system.
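
The example above configures only partition 2 of 4. As a rough sketch of how a dispatcher might enumerate the settings for every partition, the code below builds one assignment per partition. It deliberately avoids the SDK types: PartitionAssignment and PartitionPlanner are hypothetical stand-ins whose members mirror the IsPartitioned, TotalPartitions, and PartitionTarget properties. It assumes partition targets are numbered from 1, as in the example, and enforces the documented rule that TotalPartitions must be at least 2.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the three partition-related settings (not an SDK type).
// Records require C# 9 or later.
public sealed record PartitionAssignment(bool IsPartitioned, int TotalPartitions, int PartitionTarget);

public static class PartitionPlanner
{
    // Returns one assignment per partition, with PartitionTarget running from 1 to totalPartitions.
    public static IReadOnlyList<PartitionAssignment> Plan(int totalPartitions)
    {
        if (totalPartitions < 2)
            throw new ArgumentOutOfRangeException(
                nameof(totalPartitions),
                "TotalPartitions must be greater than or equal to 2 when IsPartitioned is true.");

        return Enumerable.Range(1, totalPartitions)
            .Select(target => new PartitionAssignment(true, totalPartitions, target))
            .ToList();
    }
}
```

Each assignment would be copied onto its own DocumentTaskSettings instance, with the remaining settings kept identical across partitions, and handed to a separate DocumentTaskEngine.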
See Also