ILargeUnsupportedExtractor.ExtractContent Method
Extracts document text using a proprietary binary-to-text extractor. Note: it is important to read the documentation for the 'textFileOutputStream' argument.
Namespace: OpenDiscoverSDK.Interfaces.Extractors
Assembly: OpenDiscoverSDK.Interfaces (in OpenDiscoverSDK.Interfaces.dll) Version: 2025.4.4.0 (2025.4.4)
Syntax
DocumentContent ExtractContent(
Stream textFileOutputStream
)
Parameters
- textFileOutputStream Stream
Destination output stream for the extracted text. The stream SHOULD be a FileStream instance for this method: depending on the "large"
document's file size, hundreds of megabytes or even gigabytes of text could be extracted. If sensitive item detection will be run on the
extracted text, the stream must be both writable and readable (FileAccess.ReadWrite).
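The stream requirement above can be sketched as follows. This is a minimal illustration, not SDK code: only the ExtractContent(Stream) signature and the FileStream / FileAccess.ReadWrite recommendation come from this page. The class name LargeExtractionSketch and the helper WithReadWriteFile are hypothetical, and the sketch assumes the extractor no longer needs the stream once ExtractContent returns.

```csharp
using System;
using System.IO;

public static class LargeExtractionSketch
{
    // Hypothetical helper (not part of the SDK): opens a file-backed stream that is
    // both readable and writable, hands it to the supplied extraction call, and
    // closes the stream afterwards.
    // Assumption: the extractor is finished with the stream when the call returns.
    public static TResult WithReadWriteFile<TResult>(
        string textFilePath,
        Func<Stream, TResult> extract)
    {
        using (var output = new FileStream(
            textFilePath,
            FileMode.Create,       // create the text file, or overwrite an existing one
            FileAccess.ReadWrite,  // writable for extraction, readable for sensitive item detection
            FileShare.None))       // no other handle may open the file while extraction runs
        {
            return extract(output);
        }
    }
}
```

With an ILargeUnsupportedExtractor instance named extractor (a placeholder variable), a call might look like `var content = LargeExtractionSketch.WithReadWriteFile(path, extractor.ExtractContent);`. Backing the stream with a file rather than memory keeps gigabyte-scale output off the managed heap.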
Return Value
DocumentContent
DocumentContent object.
Remarks
"Large" is a subjective term defined by the LargeDocumentCritera property value.
The maximum number of binary-to-text filtered characters from a 'large' unsupported/unknown file is limited to LargeUnsupportedMaxFilteredChars.
This method extracts useful text (if any exists in the supported encodings) from binary data using a proprietary binary-to-text algorithm. The algorithm
supports UTF-16 (Latin Unicode range), UTF-8, and code page 1252 encodings.
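The SDK's algorithm is proprietary and is not shown on this page. Purely to illustrate the general idea of binary-to-text filtering, the sketch below collects runs of printable ASCII bytes, similar in spirit to the Unix strings utility. It is an assumption-laden stand-in: the names BinaryToTextSketch and PrintableRuns and the minRun threshold are hypothetical, and it handles only the ASCII subset shared by UTF-8 and code page 1252, not UTF-16.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BinaryToTextSketch
{
    // Illustrative only, NOT the SDK's algorithm: returns every run of at least
    // minRun consecutive printable ASCII bytes (0x20..0x7E, plus tab) found in data.
    public static List<string> PrintableRuns(byte[] data, int minRun = 4)
    {
        var runs = new List<string>();
        var current = new StringBuilder();

        foreach (byte b in data)
        {
            bool printable = (b >= 0x20 && b <= 0x7E) || b == (byte)'\t';
            if (printable)
            {
                current.Append((char)b);
            }
            else
            {
                // A non-printable byte ends the run; keep it only if it is long enough.
                if (current.Length >= minRun) runs.Add(current.ToString());
                current.Clear();
            }
        }

        // Flush a run that extends to the end of the buffer.
        if (current.Length >= minRun) runs.Add(current.ToString());
        return runs;
    }
}
```

Short runs are discarded because random binary data frequently contains a few printable bytes in a row; a minimum length is a common heuristic for reducing that noise.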
If sensitive item detection is enabled (EntityExtractionSettings), only the first 200 million characters of the text file
are scanned for sensitive items. If the actual number of characters exceeds 200 million, the following attribute is added
to the Attributes set: EntityDetectionScanLimited