ILargeUnsupportedExtractor.ExtractContent Method

Extracts document text using a proprietary binary-to-text extractor. Important: read the documentation for the 'textFileOutputStream' argument before calling this method.

Namespace: OpenDiscoverSDK.Interfaces.Extractors
Assembly: OpenDiscoverSDK.Interfaces (in OpenDiscoverSDK.Interfaces.dll) Version: 2025.4.6.0 (2025.4.6)

Syntax

DocumentContent ExtractContent(
	Stream textFileOutputStream
)

Parameters

textFileOutputStream Stream: Destination output stream for the extracted text. The stream SHOULD be a FileStream instance, because a "large" document can yield hundreds of megabytes or even gigabytes of extracted text, depending on its file size. If sensitive item detection is to be run on the extracted text, the stream must be both readable and writable (FileAccess.ReadWrite).
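
The stream requirement above can be illustrated with a minimal sketch. It assumes only standard System.IO types. The helper name CreateOutputStream, the temp-file location, and the 'extractor' variable in the comment are hypothetical and not part of the SDK. The call to ExtractContent is left as a comment so the sketch runs without the SDK referenced.

```csharp
using System;
using System.IO;
using System.Text;

public static class LargeExtractionSketch
{
    // Hypothetical helper: creates a file-backed stream that is both readable
    // and writable, as needed when sensitive item detection will be run on
    // the extracted text.
    public static FileStream CreateOutputStream(string path)
    {
        return new FileStream(path, FileMode.Create, FileAccess.ReadWrite, FileShare.None);
    }

    public static void Main()
    {
        // Hypothetical output location; a real caller would choose its own path.
        string path = Path.GetTempFileName();
        try
        {
            using (FileStream textFileOutputStream = CreateOutputStream(path))
            {
                // With the SDK referenced, the call would go here. 'extractor' is
                // an assumed variable holding an ILargeUnsupportedExtractor:
                //   DocumentContent content = extractor.ExtractContent(textFileOutputStream);

                // Stand-in for the extractor so this sketch runs on its own.
                byte[] sample = Encoding.UTF8.GetBytes("extracted text");
                textFileOutputStream.Write(sample, 0, sample.Length);

                // Confirm the stream supports both directions.
                Console.WriteLine(textFileOutputStream.CanRead);
                Console.WriteLine(textFileOutputStream.CanWrite);
            }
        }
        finally
        {
            // Remove the temporary file once the stream has been disposed.
            File.Delete(path);
        }
    }
}
```

A file-backed stream keeps the extracted text on disk rather than in memory, which matters when the output may run to gigabytes.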

Return Value

DocumentContent
DocumentContent object.

Remarks

"Large" is a subjective term defined by the LargeDocumentCritera property value.

The maximum number of binary-to-text filtered characters from a 'large' unsupported/unknown file is limited to LargeUnsupportedMaxFilteredChars.

This method extracts useful text (if any exists in the supported encodings) from binary data via a proprietary binary-to-text algorithm. The algorithm supports binary-to-text for UTF-16 (Latin Unicode range), UTF-8, and code page 1252 encodings.

If sensitive item detection is enabled (EntityExtractionSettings), only the first 200 million characters of the text file are scanned for sensitive items. If the actual number of characters exceeds 200 million, the following attribute is added to the Attributes set: EntityDetectionScanLimited

Reference

ILargeUnsupportedExtractor Interface

OpenDiscoverSDK.Interfaces.Extractors Namespace