    [DataContractAttribute]
    public class DocumentTaskSettings : ContentExtractionSettings

The DocumentTaskSettings type exposes the following members.

Constructors

| Name | Description | |
|---|---|---|
| DocumentTaskSettings | Constructor. | |

Properties

| Name | Description | |
|---|---|---|
| AllowLongProcessingDocumentTest | RESERVED - DO NOT USE OR SET THIS PROPERTY. Reserved for internal SDK testing; setting this property to true can lead to unpredictable behavior. | |
| AttachmentArchiveFilenameFormat | Base string format of attachment archive file names (read-only). Only used when OutputMode is set to Archive. | |
| AttachmentArchiveMaxSize | Attachment archive file maximum size (default is 4GB). | |
| ClientId | [Optional] User defined client Id that owns the ProjectId. | |
| ClientName | [Optional] A user defined client name. | |
| CollectionId | [Required] A user defined document collection (job) Id that uniquely identifies the document processing set that this processing task defined by TaskId belongs to. | |
| CollectionName | [Optional] A user defined document collection (job) name. | |
| CollectionSourcePath | The ingestion file path or directory path used as the source for the processing document collection. | |
| CollectionStartDateUtc | The processing start DateTime in UTC for this document collection (see CollectionId). | |
| ControlNumberingType | Determines how DocControlNumber numbers are formatted. These numbers are only generated when de-serializing a data document archive (.dda) file using the DocumentDataArchiveReader class. The default control numbering (ID) format type is ParentChildWithUnderscoreSeparator. | |
| CountMailStoreMessageObjects | If property ProcessingTaskType is set to SingleMailStore and this value is true, then mail store message objects are always counted before processing. If false, mail store message objects are not counted and property MailStoreMessageCount must be set to the number of message objects in the mail store. | |
| CpuCoreMode | Specifies the DocumentTaskEngine CPU core usage mode. | |
| CreateEmptyTextFilesWhenNoTextExtracted | If true (default value) and property OutputMode equals IndividualFiles, then the task will write out an empty extracted text file for any document whose text could not be extracted. | |
| CustodianId | [Optional] User defined custodian Id that owns the CollectionId. | |
| CustodianName | [Optional] A user defined custodian name. | |
| CustomDocumentSource | The custom document source. Only set if ProcessingTaskType is set to CustomDocumentSource. | |
| DebugLogging | If true, verbose debug logging is enabled (default value: false). Enabling it is not recommended unless you are tracking down a failed task or related issue, because it degrades task execution performance. | |
| DocumentArchiveFilename | Standardized name of all document archive (.dda) output files (read-only). | |
| DocumentArchiveRootPath | Gets or sets the root extracted path for document archive file output. | |
| Documents | The input documents to process. See remarks. | |
| EmbeddedObjectExtraction | Embedded document/attachment and embedded office media extraction setting. (Inherited from ContentExtractionSettings) | |
| EntityExtractionSettings | Options for entity extraction in extracted text, metadata, and URLs. (Inherited from ContentExtractionSettings) | |
| ExcludedDocumentTypes | HashSet of document Ids to "exclude" from content extraction. | |
| ExcludeInlineEmailImages | Exclude all inline email images. The default value is true. | |
| ExtractionType | Document content extraction type. This property is read-only; its value is controlled by the ProcessingMode property. (Overrides ContentExtractionSettings.ExtractionType) | |
| ExtractOfficeTrackedChanges | If true, appends tracked change information/text from office document formats (that support tracked changes) to the end of the document's extracted text; otherwise, tracked changes text is not appended to the document's extracted text. (Inherited from ContentExtractionSettings) | |
| Hashing | Document hashing settings. (Inherited from ContentExtractionSettings) | |
| IgnoreCorruptedDocuments | RESERVED. Used by Open Discover Platform Workflow Management System (WMS). | |
| InMemorySizeMaxCritera | The maximum size (in bytes) of documents that are processed in-memory. | |
| IsPartitioned | True if this task is to process a partition (subset) of items in a single archive or single mail store; otherwise, if false, this task is to process all the archive or mail store container items. This property is ignored if ProcessingTaskType property is set to DocumentSet. See remarks. | |
| LanguageId | Language identification of extracted text settings. (Inherited from ContentExtractionSettings) | |
| LargeDocumentCritera | Defines the "large" document criteria, in bytes, that determines what type of content extractor is returned by the content extractor factory for "large" unknown/unsupported formats and also "large" encoded text based formats. (Inherited from ContentExtractionSettings) | |
| LongProcessingDocumentCriteriaInSec | Long processing document criteria: the elapsed time, in seconds, after which the first DocumentTaskEngine.LongProcessingDocumentWarning event is fired (if at all). | |
| MailStoreMessageCount | If CountMailStoreMessageObjects is false and property ProcessingTaskType is set to SingleMailStore, then this value must contain the already counted number of message objects in the mail store. Otherwise this property value is ignored. | |
| MaxArchiveCompressionRatio | Archive maximum compression ratio security feature to help protect against archive compression 'ZIP-bombs'. | |
| MaxNumDatabaseTableRowsToOutput | Maximum number of database table rows to output to extracted table text. The default value is -1, which means all rows. Database tables can potentially have tens of millions of rows, so users should use caution when processing databases of unknown origin. If the value is 0, only the table name and table column names are output to the table text file. | |
| Metrics | Task execution metrics. Metrics will be populated at end of task execution. | |
| NistRdsDatabasePath | Full directory path to NIST RDS hash database. | |
| OutputEmailBodies | If true, for email types, saves all extracted email bodies to the task's outputted document data archive file (DocumentDataArchive.dda). If false (default value), email bodies are not saved. | |
| OutputMode | Output mode for extracted attachments and text. See remarks. | |
| PartitionTarget | The archive or mail store partition to process out of the TotalPartitions container partitions. | |
| Passwords | Gets or sets the password array to use for decryption of supported password-protected formats. | |
| PdfDocument | PDF document extraction settings. (Inherited from ContentExtractionSettings) | |
| PerformNistCheck | Perform NIST check on document using SHA1BinaryHash hashes. The default value is true. | |
| PhysicalProcessorAffinity | Specifies the DocumentTaskEngine physical processor (CPU) affinity for multi-processor workstations or servers. Do not set for machines with virtual processor (CPU) cores. | |
| ProcessingMode | Processing mode. Determines the type of content that is extracted from documents. | |
| ProcessingTaskType | Processing type. Determines if we are processing a document set, a single archive (or a multi-part archive), or a single mail store. | |
| ProjectId | [Optional] User defined project Id that owns the CustodianId. | |
| ProjectName | [Optional] A user defined project name. | |
| RequeueLargeContainersAsOwnTask | If true, "large" (defined by RequeueLargeContainerSizeCriteria) archive and mail store containers found in DocumentTaskSettings input documents (Documents) don't have their container items extracted but get their Result property set to RequeueAsSeparateTask and are not processed further. | |
| RequeueLargeContainerSizeCriteria | Defines the size criteria, in bytes, used to decide whether archives and mail store containers found in a DocumentTaskSettings task are considered too "large" for the task and should get their own task. | |
| SaveAttachments | If true (default value), the task will save extracted attachments, embedded documents, and embedded media. See remarks. | |
| TaskId | [Optional] User defined task Id for this task. | |
| TaskParameter | [Optional] User defined task parameter. | |
| TestArchives | Test archives for actual expanded size and actual compression ratio. If false, no expansion test is performed before extracting from archive. | |
| TextArchiveFilenameFormat | Base string format of text archive file names (read-only). Only used when OutputMode is set to Archive. | |
| TextArchiveMaxSize | Maximum text BLOB file size (default is 4GB) and only used when OutputMode is set to Archive. | |
| TextFileEncoding | If property OutputMode is set to IndividualFiles, then this property determines if text files are written out as UTF-16 (default) or UTF-8 encoded (with BOM). | |
| TimeZoneAndEmail | Settings for document collection time zone and related extracted DateTime metadata and email extracted text DateTime display. (Inherited from ContentExtractionSettings) | |
| TotalPartitions | The total number of processing partition tasks to partition an archive or mail store. This property must be set if property IsPartitioned is set to true and must be greater than or equal to 2. | |
| UnsupportedFiltering | Binary-to-text filtering of unsupported/unknown document file format settings. (Inherited from ContentExtractionSettings) | |
| UseLargeDocumentUTF16Encoding | 'Large' document extracted text encoding (see base class property LargeDocumentCritera). This property is read-only: the base class setter is overridden, and the property value is controlled by the TextFileEncoding property. (Overrides ContentExtractionSettings.UseLargeDocumentUTF16Encoding) | |
| UserRequeueDocumentTypes | User defined HashSet of document format Ids (Id) to not process further and mark as "requeue" for user custom processing workflow. If a document is found to have a format Id contained in this hash set, then its Result gets set to UserRequeueAsSeparateTask and is not processed further. | |

Methods

| Name | Description | |
|---|---|---|
| Equals | Determines whether the specified object is equal to the current object. (Inherited from Object) | |
| GetHashCode | Serves as the default hash function. (Inherited from Object) | |
| GetType | Gets the Type of the current instance. (Inherited from Object) | |
| ToString | Returns a string that represents the current object. (Inherited from Object) | |

Remarks

This class stores the processing task settings for a document set, single archive, or single mail store that is to be processed by a DocumentTaskEngine instance.
A set of documents, a single archive, or a single mail store to process should not have a combined size greater than 5 gigabytes; otherwise the outputted document data archive (.dda) could become too large to read into memory. Large document sets should be broken up into subsets of 1-2 gigabytes, each processed as its own task. Breaking a large processing job (e.g., 100 gigabytes of documents) into 1-2 gigabyte subsets helps distribute the job across multiple desktops or server VMs, each running one or more DocumentTaskEngine instances.

"Large" archives or mail stores can be processed as a single task, or they can be partitioned into many sub-tasks for distributed processing. See properties IsPartitioned, TotalPartitions and PartitionTarget.
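The 1-2 gigabyte guidance above can be approximated with a simple greedy grouping by file size. The sketch below is not part of the SDK: `DocumentBatcher` and `SplitBySize` are hypothetical names, and it assumes that file size on disk is a good enough proxy for how large the task output will be. Each resulting group would become the Documents list of one task.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not an SDK type): groups files into subsets whose
// combined size stays at or under a byte budget, preserving input order.
public static class DocumentBatcher
{
    public static List<List<string>> SplitBySize(
        IEnumerable<(string Path, long Length)> files, long maxBatchBytes)
    {
        var batches = new List<List<string>>();
        var current = new List<string>();
        long currentBytes = 0;

        foreach (var file in files)
        {
            // Close the current batch when adding this file would exceed the budget.
            // A single file larger than the budget ends up in a batch of its own.
            if (current.Count > 0 && currentBytes + file.Length > maxBatchBytes)
            {
                batches.Add(current);
                current = new List<string>();
                currentBytes = 0;
            }

            current.Add(file.Path);
            currentBytes += file.Length;
        }

        if (current.Count > 0)
        {
            batches.Add(current);
        }

        return batches;
    }
}
```

A budget such as `2L * 1024 * 1024 * 1024` (2 gigabytes) matches the upper end of the guidance. A file that exceeds the budget on its own is a candidate for the partitioning properties described in the note above rather than for a document-set task.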
Examples

    var taskSettings = new OpenDiscoverPlatform.DocumentTaskSettings();
    taskSettings.CollectionId = "101";
    taskSettings.TaskId = Guid.NewGuid().ToString();

    taskSettings.ProcessingTaskType = ProcessingType.SingleArchive; // Single (or multi-part) archive processing task
    taskSettings.IsPartitioned = true;   // Archive will be partitioned
    taskSettings.TotalPartitions = 4;    // Archive will be broken up and processed as 4 separate partitions (tasks).
    taskSettings.PartitionTarget = 2;    // The partition # this task will work on (other DocumentTaskEngine instances can process the other partitions simultaneously)

    var archivePath = @"D:\InputDocuments\Archives\VeryLargeArchive.zip";
    var outputRootPath = @"D:\Output\"; // Root path to store task output
    var taskOutputPath = System.IO.Path.Combine(outputRootPath,
        string.Format(@"CollectionId_{0}\Task_{1}", taskSettings.CollectionId, taskSettings.TaskId));

    //
    // For single archive or single mail store tasks, the input document(s) Document.FilePath and Document.FormatId properties should be set:
    //
    var archiveDocument = new Document();
    archiveDocument.FilePath = archivePath;
    using (var docStream = System.IO.File.OpenRead(archivePath))
    {
        archiveDocument.FormatId = OpenDiscoverSDK.DocumentIdentifier.Identify(docStream, archivePath);
    }

    // For a split (multi-part) archive, we would pass in a list of the split segment documents in order:
    taskSettings.Documents = new List<Document>() { archiveDocument };

    taskSettings.ProcessingMode = ProcessingMode.TextAndMetadata;
    taskSettings.OutputMode = OutputMode.IndividualFiles; // Extracted attachments/text will be saved as individual (flat) files.

    // Set root path for processing output files:
    taskSettings.DocumentArchiveRootPath = taskOutputPath;

    taskSettings.Passwords = null;          // No passwords to cycle through
    taskSettings.PerformNistCheck = false;  // No checking document binary hashes against NIST database.

    taskSettings.EmbeddedObjectExtraction = EmbeddedExtractionType.EmbeddedDocumentsAndMedia;
    taskSettings.ExcludeInlineEmailImages = true;

    taskSettings.PdfDocument.ImageExtraction = PdfImageExtraction.OnlyFailedPdfPages;
    taskSettings.PdfDocument.PageExtractedTextCriteria = 1;

    taskSettings.TimeZoneAndEmail.CollectionTimeZone = TimeZoneInfo.Utc;
    taskSettings.TimeZoneAndEmail.ApplyTimeZoneToMetadata = false;
    taskSettings.TimeZoneAndEmail.EmailDateTimeFormat = EmailDateTimeFormat.MonthDayYearTime;
    taskSettings.TimeZoneAndEmail.ShowUtcOffsetForTime = true;

    taskSettings.Hashing.HashingType = HashingType.BinaryAndContentHash;
    // Hash up to a maximum of the first 10GB of a file. The 10L literal forces 64-bit
    // arithmetic; 10*1024*1024*1024 as an int constant expression overflows and does not compile.
    taskSettings.Hashing.MaxBinaryHashLength = 10L * 1024 * 1024 * 1024;
    taskSettings.Hashing.IncludeBccRecipientsInEmailContentHash = false;

    taskSettings.LanguageId.IdentifyLanguages = true;

    taskSettings.UnsupportedFiltering.FilteringType = UnsupportedFilterType.Unsupported;
    taskSettings.UnsupportedFiltering.LargeUnsupportedMaxFilteredChars = 1024 * 1024 * 1024; // Binary-to-text filter at max ~1 billion chars

    //
    // Create a document task engine instance to process the task:
    //
    var documentTaskEngine = new DocumentTaskEngine(taskSettings);
    documentTaskEngine.Completed += _documentTaskEngine_Completed;
    documentTaskEngine.FatalException += _documentTaskEngine_FatalException;
    documentTaskEngine.LongProcessingDocumentWarning += _documentTaskEngine_LongProcessingDocumentWarning;

    //
    // Run task synchronously (blocking):
    //
    documentTaskEngine.RunTaskBlocking();

    // TODO: do something with the output, like bulk insert into a document store or an eDiscovery document review system.
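The example above configures only one partition (PartitionTarget = 2 of 4). A coordinator that hands the remaining partitions to other DocumentTaskEngine instances needs one settings object per partition, which might be planned as sketched below. `PartitionPlan` and `PartitionPlanner` are hypothetical illustration types, not SDK classes, and the 1-based numbering of partition targets is an assumption that this page does not confirm; check the PartitionTarget documentation before relying on it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical plain class used only for illustration. In real code these values
// would be copied to DocumentTaskSettings.TaskId, TotalPartitions and PartitionTarget.
public sealed class PartitionPlan
{
    public string TaskId { get; set; }
    public int TotalPartitions { get; set; }
    public int PartitionTarget { get; set; }
}

public static class PartitionPlanner
{
    public static List<PartitionPlan> Create(int totalPartitions)
    {
        // The TotalPartitions description requires a value >= 2 when IsPartitioned is true.
        if (totalPartitions < 2)
        {
            throw new ArgumentOutOfRangeException(
                nameof(totalPartitions), "TotalPartitions must be greater than or equal to 2.");
        }

        var plans = new List<PartitionPlan>();

        // ASSUMPTION: partition targets are numbered 1..totalPartitions.
        for (int target = 1; target <= totalPartitions; target++)
        {
            plans.Add(new PartitionPlan
            {
                TaskId = Guid.NewGuid().ToString(), // each partition runs as its own task
                TotalPartitions = totalPartitions,
                PartitionTarget = target
            });
        }

        return plans;
    }
}
```

Every plan shares the same archive path and CollectionId and differs only in TaskId and PartitionTarget, so each partition can write to its own task output directory.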