ContentExtractionSettings.LargeDocumentCritera Property

Defines the size threshold, in bytes, above which a document is considered "large". The threshold determines which type of content extractor the content extractor factory returns for "large" unknown/unsupported formats and for "large" encoded-text formats.

Namespace: OpenDiscoverSDK.Interfaces.Settings
Assembly: OpenDiscoverSDK.Interfaces (in OpenDiscoverSDK.Interfaces.dll) Version: 2025.4.4.0 (2025.4.4)
Syntax
C#
[DataMemberAttribute]
public long LargeDocumentCritera { get; set; }

Property Value

Int64
Remarks

"Large" is a relative term; this property defines what it means for content extraction.

If the document size is greater than this property value, the content extractor factory returns an ILargeUnsupportedExtractor for an unknown/unsupported format or an ILargeEncodedTextExtractor for a "large" encoded text file. The ExtractContent method of both interfaces takes a stream argument, which should be a FileStream, and writes the potentially very large amount of extracted text to it. The extracted text is not set in the ExtractedText property of the DocumentContent object returned by ExtractContent; it is only written to the stream passed to the method.

See property UseLargeDocumentUTF16Encoding for choosing UTF-16 or UTF-8 as the text encoding written to the stream (UTF-16 is the default). If an encoded text file is already in a text encoding suitable for indexing, the user may want to skip the ExtractContent(Stream) call altogether, or use the interface method only to binary hash the original text file. To only binary hash a "large" encoded text file, set ExtractionType to MetadataOnly and HashingSettings to BinaryHashOnly. See the "How To" section of the help file for more information.
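
Because the extracted text ends up in a file rather than in a string, downstream code has to read it back with the matching encoding. The following is a minimal sketch that uses only standard .NET classes and does not call the SDK. The class name LargeTextOutputSketch and its methods are hypothetical, and it assumes that the UTF-16 output is little-endian (Encoding.Unicode), which this page does not state.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class; not part of OpenDiscoverSDK.
public static class LargeTextOutputSketch
{
    // Picks the encoding used to read the output file, mirroring the
    // UTF-16 (default) / UTF-8 choice described above.
    // Assumption: UTF-16 means little-endian (Encoding.Unicode).
    public static Encoding OutputEncoding(bool useUtf16)
    {
        return useUtf16 ? Encoding.Unicode : Encoding.UTF8;
    }

    // Reads the extracted-text file in fixed-size chunks so that the whole
    // text never has to fit into a single .NET string.
    public static long CountCharacters(string path, bool useUtf16)
    {
        long total = 0;
        var buffer = new char[64 * 1024];

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        // The third argument lets a byte order mark, if present, override the fallback encoding.
        using (var reader = new StreamReader(stream, OutputEncoding(useUtf16), true))
        {
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                total += read; // replace with indexing or other per-chunk work
            }
        }

        return total;
    }
}
```

Chunked reading is the point of the sketch: calling ReadToEnd on such a file could fail for the same reason the SDK avoids returning the text in ExtractedText.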

Default property value: 104,857,600 bytes (100 MB). Any unknown/unsupported or encoded text file whose length is greater than this value is considered a "large" document.
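
The comparison described above is a strict "greater than" test, so a file of exactly 104,857,600 bytes is not "large" under the default. A minimal sketch of that rule follows. The ExtractorKind enum, the LargeDocumentSketch class, and the Standard member (standing in for whatever extractor the factory returns otherwise) are hypothetical names for illustration and are not SDK types.

```csharp
using System;

// Hypothetical stand-in for the kind of extractor the factory would return.
public enum ExtractorKind
{
    Standard,          // assumption: placeholder for the non-"large" case
    LargeUnsupported,  // corresponds to ILargeUnsupportedExtractor
    LargeEncodedText   // corresponds to ILargeEncodedTextExtractor
}

// Hypothetical helper class; not part of OpenDiscoverSDK.
public static class LargeDocumentSketch
{
    // Default threshold from this page: 100 MB = 104,857,600 bytes.
    public const long DefaultCriteria = 100L * 1024 * 1024;

    // A file is "large" only when its length is strictly greater than the threshold.
    public static bool IsLarge(long fileLength, long criteria = DefaultCriteria)
    {
        return fileLength > criteria;
    }

    // Assumes the file is already known to be either an encoded text file
    // or an unknown/unsupported format.
    public static ExtractorKind Classify(long fileLength, bool isEncodedText,
                                         long criteria = DefaultCriteria)
    {
        if (!IsLarge(fileLength, criteria))
        {
            return ExtractorKind.Standard;
        }

        return isEncodedText
            ? ExtractorKind.LargeEncodedText
            : ExtractorKind.LargeUnsupported;
    }
}
```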

Valid range of property values: 50 MB (megabytes) to 2 GB (gigabytes). Because the maximum value of this property is 2 GB, any file whose length exceeds 2 GB is always considered a "large" document. This limit also corresponds to the maximum size of a .NET string (or of a CLR array).
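
A caller may want to check a candidate value against this range before assigning it. The sketch below assumes binary units (1 MB = 1,048,576 bytes, consistent with the 104,857,600-byte default) and inclusive bounds. This page does not say what the SDK does with out-of-range values, so the helper only reports validity. The class and member names are hypothetical.

```csharp
using System;

// Hypothetical helper class; not part of OpenDiscoverSDK.
public static class LargeDocumentRangeSketch
{
    // Assumption: binary units and inclusive bounds.
    public const long MinCriteria = 50L * 1024 * 1024;        // 52,428,800 bytes
    public const long MaxCriteria = 2L * 1024 * 1024 * 1024;  // 2,147,483,648 bytes

    // True when the value lies inside the documented 50 MB to 2 GB range.
    public static bool IsValidCriteria(long value)
    {
        return value >= MinCriteria && value <= MaxCriteria;
    }

    // Files longer than the maximum threshold are "large" for every valid setting.
    public static bool IsAlwaysLarge(long fileLength)
    {
        return fileLength > MaxCriteria;
    }
}
```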

See Also