DocumentTaskSettings Class

Represents a document processing task (1 or more documents) that a DocumentTaskEngine instance can process.

Inheritance Hierarchy

System.Object
OpenDiscoverSDK.Interfaces.Settings.ContentExtractionSettings
OpenDiscoverSDK.Interfaces.Platform.Settings.DocumentTaskSettings

Namespace: OpenDiscoverSDK.Interfaces.Platform.Settings
Assembly: OpenDiscoverSDK.Interfaces (in OpenDiscoverSDK.Interfaces.dll) Version: 2025.4.6.0 (2025.4.6)

Syntax

[DataContractAttribute]
public class DocumentTaskSettings : ContentExtractionSettings

The DocumentTaskSettings type exposes the following members.

Constructors

	Name	Description
	DocumentTaskSettings	Constructor.

Properties

	Name	Description
	AllowLongProcessingDocumentTest	RESERVED - DO NOT USE OR SET THIS PROPERTY. Reserved for internal SDK testing. Setting this property to true can lead to unpredictable behavior.
	AttachmentArchiveFilenameFormat	Base format string for attachment archive file names (read-only). Only used when OutputMode is set to Archive.
	AttachmentArchiveMaxSize	Attachment archive file maximum size (default is 4GB).
	ClientId	[Optional] User defined client Id that owns the ProjectId.
	ClientName	[Optional] A user defined client name.
	CollectionId	[Required] A user defined document collection (job) Id that uniquely identifies the document processing set that this processing task defined by TaskId belongs to.
	CollectionName	[Optional] A user defined document collection (job) name.
	CollectionSourcePath	The ingestion file path or directory path used as the source for the processing document collection.
	CollectionStartDateUtc	The processing start DateTime in UTC for this document collection (see CollectionId).
	ControlNumberingType	Determines how DocControlNumber numbers are formatted. These numbers are only generated when de-serializing a data document archive (.dda) file using the DocumentDataArchiveReader class. The default control numbering (ID) format type is ParentChildWithUnderscoreSeparator.
	CountMailStoreMessageObjects	If property ProcessingTaskType is set to SingleMailStore and this value is true, then mail store message objects are always counted before processing. If false, mail store message objects are not counted and property MailStoreMessageCount must be set to the number of message objects in the mail store.
	CpuCoreMode	Specifies the DocumentTaskEngine CPU core usage mode.
	CreateEmptyTextFilesWhenNoTextExtracted	If true (default value) and property OutputMode equals IndividualFiles, then the task will write out an empty extracted text file for any document whose text could not be extracted.
	CustodianId	[Optional] User defined custodian Id that owns the CollectionId.
	CustodianName	[Optional] A user defined custodian name.
	CustomDocumentSource	The custom document source. Only set if ProcessingTaskType is set to CustomDocumentSource.
	DebugLogging	If true, verbose debug logging is enabled (default value: false). Setting to true is not a recommended setting unless trying to track down a failed task or related issue. Setting to true degrades task execution performance.
	DocumentArchiveFilename	Standardized name of all document archive output files (read-only) (.dda).
	DocumentArchiveRootPath	Gets or sets the root extracted path for document archive file output.
	Documents	The input documents to process. See remarks.
	EmbeddedObjectExtraction	Embedded document/attachment and embedded office media extraction setting. (Inherited from ContentExtractionSettings)
	EntityExtractionSettings	Options for entity extraction in extracted text, metadata, and URLs. (Inherited from ContentExtractionSettings)
	ExcludedDocumentTypes	HashSet of document Ids to "exclude" from content extraction.
	ExcludeInlineEmailImages	Exclude all inline email images. The default value is true.
	ExtractionType	Document content extraction type. This property is read-only; its value is controlled by the value of property ProcessingMode. (Overrides ContentExtractionSettings.ExtractionType)
	ExtractOfficeTrackedChanges	If true, appends tracked change information/text from office document formats (that support tracked changes) to the end of the document's extracted text; otherwise, tracked changes text is not appended to document's extracted text. (Inherited from ContentExtractionSettings)
	Hashing	Document hashing settings. (Inherited from ContentExtractionSettings)
	IgnoreCorruptedDocuments	RESERVED. Used by Open Discover Platform Workflow Management System (WMS).
	InMemorySizeMaxCritera	The maximum size (in bytes) of documents that are processed in memory.
	IsPartitioned	True if this task is to process a partition (subset) of items in a single archive or single mail store; otherwise, if false, this task is to process all the archive or mail store container items. This property is ignored if ProcessingTaskType property is set to DocumentSet. See remarks.
	LanguageId	Language identification of extracted text settings. (Inherited from ContentExtractionSettings)
	LargeDocumentCritera	Defines the "large" document criteria, in bytes, that determines what type of content extractor is returned by the content extractor factory for "large" unknown/unsupported formats and also "large" encoded text based formats. (Inherited from ContentExtractionSettings)
	LongProcessingDocumentCriteriaInSec	Long processing document criteria: the elapsed processing time, in seconds, after which the first DocumentTaskEngine.LongProcessingDocumentWarning event is fired (if at all).
	MailStoreMessageCount	If CountMailStoreMessageObjects is false and property ProcessingTaskType is set to SingleMailStore, then this value must contain the already counted number of message objects in the mail store. Otherwise this property value is ignored.
	MaxArchiveCompressionRatio	Archive maximum compression ratio security feature to help protect against archive compression 'ZIP-bombs'.
	MaxNumDatabaseTableRowsToOutput	Maximum number of database table rows to output to extracted table text. The default value is -1, which means all rows. Database tables can potentially have tens of millions of rows, so users should use caution when processing databases of unknown origin. If the value is 0, only the table name and table column names are output to the table text file.
	Metrics	Task execution metrics. Metrics will be populated at end of task execution.
	NistRdsDatabasePath	Full directory path to NIST RDS hash database.
	OutputEmailBodies	If true, for email types, saves all extracted email bodies to the task's outputted document data archive file (DocumentDataArchive.dda). If false (default value), email bodies are not saved.
	OutputMode	Output mode for extracted attachments and text. See remarks.
	PartitionTarget	The archive or mail store partition to process out of the TotalPartitions container partitions.
	Passwords	Gets or sets the password array to use for decryption of supported password-protected formats.
	PdfDocument	PDF document extraction settings. (Inherited from ContentExtractionSettings)
	PerformNistCheck	Perform NIST check on document using SHA1BinaryHash hashes. The default value is true.
	PhysicalProcessorAffinity	Specifies the DocumentTaskEngine physical processor (CPU) affinity for multi-processor workstations or servers. Do not set for machines with virtual processor (CPU) cores.
	ProcessingMode	Processing mode. Determines the type of content that is extracted from documents.
	ProcessingTaskType	Processing type. Determines if we are processing a document set, a single archive (or a multi-part archive), or a single mail store.
	ProjectId	[Optional] User defined project Id that owns the CustodianId.
	ProjectName	[Optional] A user defined project name.
	RequeueLargeContainersAsOwnTask	If true, "large" (defined by RequeueLargeContainerSizeCriteria) archive and mail store containers found in DocumentTaskSettings input documents (Documents) don't have their container items extracted but get their Result property set to RequeueAsSeparateTask and are not processed further.
	RequeueLargeContainerSizeCriteria	Defines the size, in bytes, criteria for archives and mail store containers found in a DocumentTaskSettings task where they are considered too "large" for the task and should get their own task.
	SaveAttachments	If true (default value), the task will save extracted attachments, embedded documents, and embedded media. See remarks.
	TaskId	[Optional] User defined task Id for this task.
	TaskParameter	[Optional] User defined task parameter.
	TestArchives	Test archives for actual expanded size and actual compression ratio. If false, no expansion test is performed before extracting from archive.
	TextArchiveFilenameFormat	Base format string for text archive file names (read-only). Only used when OutputMode is set to Archive.
	TextArchiveMaxSize	Maximum text BLOB file size (default is 4GB) and only used when OutputMode is set to Archive.
	TextFileEncoding	If property OutputMode is set to IndividualFiles, then this property determines if text files are written out as UTF-16 (default) or UTF-8 encoded (with BOM).
	TimeZoneAndEmail	Settings for document collection time zone and related extracted DateTime metadata and email extracted text DateTime display. (Inherited from ContentExtractionSettings)
	TotalPartitions	The total number of processing partition tasks to partition an archive or mail store. This property must be set if property IsPartitioned is set to true and must be greater than or equal to 2.
	UnsupportedFiltering	Binary-to-text filtering of unsupported/unknown document file format settings. (Inherited from ContentExtractionSettings)
	UseLargeDocumentUTF16Encoding	'Large' document extracted text encoding (see base class property LargeDocumentCritera). This property is read-only: the base class setter is overridden and this property's value is controlled by the value of property TextFileEncoding. (Overrides ContentExtractionSettings.UseLargeDocumentUTF16Encoding)
	UserRequeueDocumentTypes	User defined HashSet of document format Ids (Id) to not process further and mark as "requeue" for a custom user processing workflow. If a document is found to have a format Id contained in this hash set, then its Result is set to UserRequeueAsSeparateTask and it is not processed further.

Methods

	Name	Description
	Equals	Determines whether the specified object is equal to the current object. (Inherited from Object)
	GetHashCode	Serves as the default hash function. (Inherited from Object)
	GetType	Gets the Type of the current instance. (Inherited from Object)
	ToString	Returns a string that represents the current object. (Inherited from Object)

Remarks

This class stores the processing task settings for a document set, single archive, or single mail store that is to be processed by a DocumentTaskEngine instance.

A set of documents, a single archive**, or a single mail store** to process should not have a combined size greater than 5 gigabytes; otherwise the outputted document data archive (.dda) could become too large to read into memory. Large document sets should be broken up into subsets of 1-2 gigabytes that are processed as separate tasks. Breaking a large document processing job (e.g., 100 gigabytes of documents) into 1-2 gigabyte subsets also helps distribute the job across multiple desktops or server VMs, each running one or more DocumentTaskEngine instances.

** "Large" archives or mail stores can be processed as a single task, or can also be partitioned into many sub-tasks for distributable processing. See properties IsPartitioned, TotalPartitions and PartitionTarget.
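The 1-2 gigabyte guidance above can be sketched as a simple size-based batching step that runs before any DocumentTaskSettings objects are created. This is only a minimal illustration, not part of the SDK: the class name TaskBatcher, the method SplitBySize, and the greedy "fill until the limit is reached" strategy are all assumptions made for this example. Each resulting batch would then become the input document list of one task.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an SDK type): groups files into batches by combined size.
public static class TaskBatcher
{
    // Returns batches of file paths whose combined size stays at or below maxBatchBytes.
    // A single file larger than the limit ends up in a batch of its own.
    public static List<List<string>> SplitBySize(
        IEnumerable<(string Path, long Size)> files, long maxBatchBytes)
    {
        if (files == null) throw new ArgumentNullException(nameof(files));
        if (maxBatchBytes <= 0) throw new ArgumentOutOfRangeException(nameof(maxBatchBytes));

        var batches = new List<List<string>>();
        var current = new List<string>();
        long currentBytes = 0;

        foreach (var (path, size) in files)
        {
            // Close the current batch if adding this file would exceed the limit.
            if (current.Count > 0 && currentBytes + size > maxBatchBytes)
            {
                batches.Add(current);
                current = new List<string>();
                currentBytes = 0;
            }

            current.Add(path);
            currentBytes += size;
        }

        // Flush the last, partially filled batch.
        if (current.Count > 0) batches.Add(current);
        return batches;
    }
}
```

For example, calling TaskBatcher.SplitBySize with a limit of 2L * 1024 * 1024 * 1024 keeps every batch at or under 2 gigabytes, except for a single file that is itself larger than the limit. File sizes can be gathered beforehand with System.IO.FileInfo.Length.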

Example

This example code snippet shows how to set up a DocumentTaskSettings object to process a single very "large" archive and how to break up the large archive into 4 separate partitions for distributable processing. In the snippet below, this task will only work on the 2nd partition (PartitionTarget) out of the 4 (TotalPartitions) total partitions. The other 3 partitions can be run as separate DocumentTaskEngine tasks with PartitionTarget property set to 1, 3, and 4, respectively.

var taskSettings = new OpenDiscoverSDK.Interfaces.Platform.Settings.DocumentTaskSettings();
taskSettings.CollectionId = "101";
taskSettings.TaskId       = Guid.NewGuid().ToString();

taskSettings.ProcessingTaskType = ProcessingType.SingleArchive; // Single (or multi-part) archive processing task
taskSettings.IsPartitioned   = true; // Archive will be partitioned
taskSettings.TotalPartitions = 4;    // Archive will be broken up and processed as 4 separate partitions (tasks).
taskSettings.PartitionTarget = 2;    // The partition # this task will work on (other DocumentTaskEngine instances can process the other partitions simultaneously)

var archivePath    = @"D:\InputDocuments\Archives\VeryLargeArchive.zip";
var outputRootPath = @"D:\Output\"; // Root path to store task output
var taskOutputPath = System.IO.Path.Combine(outputRootPath, string.Format(@"CollectionId_{0}\Task_{1}", taskSettings.CollectionId, taskSettings.TaskId));

// 
// For single archive or single mail store tasks, the input document(s) Document.FilePath and Document.FormatId properties should be set:
// 
var archiveDocument = new Document();
archiveDocument.FilePath = archivePath;
using (var docStream = System.IO.File.OpenRead(archivePath))
{
    archiveDocument.FormatId = OpenDiscoverSDK.DocumentIdentifier.Identify(docStream, archivePath);
}

// For a split (multi-part) archive, we would pass in a list of the split segment documents in order:
taskSettings.Documents      = new List<Document>() { archiveDocument };
taskSettings.ProcessingMode = ProcessingMode.TextAndMetadata;
taskSettings.OutputMode     = OutputMode.IndividualFiles;  // Extracted attachments/ text will be saved as individual (flat) files.

// Set root path for processing output files:
taskSettings.DocumentArchiveRootPath = taskOutputPath;

taskSettings.Passwords        = null;  // No passwords to cycle through
taskSettings.PerformNistCheck = false; // No checking document binary hashes against NIST database.

taskSettings.EmbeddedObjectExtraction = EmbeddedExtractionType.EmbeddedDocumentsAndMedia;
taskSettings.ExcludeInlineEmailImages = true;

taskSettings.PdfDocument.ImageExtraction = PdfImageExtraction.OnlyFailedPdfPages;
taskSettings.PdfDocument.PageExtractedTextCriteria = 1;

taskSettings.TimeZoneAndEmail.CollectionTimeZone      = TimeZoneInfo.Utc;
taskSettings.TimeZoneAndEmail.ApplyTimeZoneToMetadata = false;
taskSettings.TimeZoneAndEmail.EmailDateTimeFormat     = EmailDateTimeFormat.MonthDayYearTime;
taskSettings.TimeZoneAndEmail.ShowUtcOffsetForTime    = true;

taskSettings.Hashing.HashingType = HashingType.BinaryAndContentHash;
taskSettings.Hashing.MaxBinaryHashLength = 10L * 1024 * 1024 * 1024;  // Hash up to a maximum of the first 10GB of a file (long arithmetic; the int expression overflows at compile time)
taskSettings.Hashing.IncludeBccRecipientsInEmailContentHash = false;

taskSettings.LanguageId.IdentifyLanguages = true;

taskSettings.UnsupportedFiltering.FilteringType = UnsupportedFilterType.Unsupported;
taskSettings.UnsupportedFiltering.LargeUnsupportedMaxFilteredChars = 1024 * 1024 * 1024; // Binary-to-text filter at max 1 billion chars

// 
// Create a document task engine instance to process the task:
// 
var documentTaskEngine = new DocumentTaskEngine(taskSettings);
documentTaskEngine.Completed      += _documentTaskEngine_Completed;
documentTaskEngine.FatalException += _documentTaskEngine_FatalException;
documentTaskEngine.LongProcessingDocumentWarning += _documentTaskEngine_LongProcessingDocumentWarning;

// 
// Run task synchronously (blocking):
// 
documentTaskEngine.RunTaskBlocking();

// TODO: do something with the output, like bulk insert into a document store or an eDiscovery document review system.
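
The example above configures only partition 2 of 4. As a rough sketch of how a dispatcher might enumerate all of the partitions to hand out, the helper below produces one (TotalPartitions, PartitionTarget) pair per task. It is not part of the SDK: PartitionPlanner and Plan are hypothetical names, and the 1-based numbering is inferred from the example text (partitions 1, 2, 3, and 4). The minimum of 2 partitions follows the TotalPartitions property description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an SDK type): lists the partition assignments for one container.
public static class PartitionPlanner
{
    // Returns one (TotalPartitions, PartitionTarget) pair per partition task,
    // with PartitionTarget running from 1 to totalPartitions inclusive.
    public static List<(int TotalPartitions, int PartitionTarget)> Plan(int totalPartitions)
    {
        // TotalPartitions must be at least 2 when a task is partitioned.
        if (totalPartitions < 2)
            throw new ArgumentOutOfRangeException(
                nameof(totalPartitions), "A partitioned task needs at least 2 partitions.");

        var plan = new List<(int TotalPartitions, int PartitionTarget)>(totalPartitions);
        for (int target = 1; target <= totalPartitions; target++)
        {
            plan.Add((totalPartitions, target));
        }
        return plan;
    }
}
```

Each pair would be copied into its own DocumentTaskSettings object (IsPartitioned set to true, plus TotalPartitions and PartitionTarget), with every task pointing at the same archive or mail store.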

Reference

OpenDiscoverSDK.Interfaces.Platform.Settings Namespace