UnsupportedFilterType Enumeration

C#

[DataContractAttribute]
public enum UnsupportedFilterType

Member name	Value	Description
None	0	No binary-to-text filtering is performed. For this enum value, the SDK API method "ContentExtratorFactory.GetContentExtractor" will NOT return a content extractor interface for unsupported format types.
Unsupported	1	Perform binary-to-text filtering on unsupported/unknown document formats to extract text. If the unsupported format is encrypted, it will not be filtered (see UnsupportedAndEncrypted). The proprietary binary-to-text filtering algorithm attempts to extract as much UTF-8, UTF-16LE (Latin languages only), and code page 1252 encoded text as possible from the document's binary. In many cases, useful text for indexing or searching can be extracted from unknown/corrupted/unsupported file formats this way. For this enum value, the SDK API method "ContentExtratorFactory.GetContentExtractor" will return either an IUnsupportedExtractor or an ILargeUnsupportedExtractor interface, depending on the value of the property LargeDocumentCritera and the document's file size.
UnsupportedAndEncrypted	2	Perform binary-to-text filtering on unknown/unsupported document formats to extract text, even if the unsupported format is identified as being encrypted. For encrypted document formats, no meaningful text can be extracted via binary-to-text filtering unless parts of the document happen to reside in unencrypted regions (if any) of the document format. For encrypted formats, this setting is useful mainly for document forensic analysis, not for extracting text for indexing/searching. Unless you are doing document forensic analysis, it is recommended to use Unsupported instead. For this enum value, the SDK API method "ContentExtratorFactory.GetContentExtractor" will return either an IUnsupportedExtractor or an ILargeUnsupportedExtractor interface, depending on the value of the property LargeDocumentCritera and the document's file size.
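
The member descriptions above can be summarized as a small decision function. The following is only a sketch of the documented behavior, not SDK code: the enum is redeclared locally (with the values from the table) so the example compiles without the SDK, and UnsupportedFilterSketch, WouldFilter, and the isEncrypted parameter are hypothetical names introduced for illustration.

```csharp
// Local mirror of the documented enum so this sketch compiles without the SDK.
// The real type also carries [DataContractAttribute].
public enum UnsupportedFilterType
{
    None = 0,
    Unsupported = 1,
    UnsupportedAndEncrypted = 2
}

// Hypothetical helper class, not part of the SDK.
public static class UnsupportedFilterSketch
{
    // Returns true when, per the table above, binary-to-text filtering would be
    // attempted for a document whose format is unsupported or unknown.
    // isEncrypted: whether the unsupported format was identified as encrypted.
    public static bool WouldFilter(UnsupportedFilterType setting, bool isEncrypted)
    {
        switch (setting)
        {
            case UnsupportedFilterType.Unsupported:
                // Encrypted unsupported formats are skipped.
                return !isEncrypted;
            case UnsupportedFilterType.UnsupportedAndEncrypted:
                // Filtered regardless of encryption (mainly for forensic analysis).
                return true;
            default:
                // None: no binary-to-text filtering.
                return false;
        }
    }
}
```

For example, WouldFilter(UnsupportedFilterType.Unsupported, true) evaluates to false, while the same call with UnsupportedAndEncrypted evaluates to true.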

Text can be extracted from unsupported, unknown, and corrupted documents via a proprietary binary-to-text filtering algorithm.

The algorithm attempts to extract as much UTF-8, UTF-16LE (Latin languages only), and code page 1252 encoded text as possible from the document's binary. In many cases, this yields useful text for indexing or searching from unknown, corrupted, or unsupported file formats.
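
To give a rough sense of what binary-to-text filtering means in general, the sketch below collects runs of printable ASCII bytes from a byte array, similar in spirit to the Unix strings utility. This is NOT the SDK's proprietary algorithm, which is not documented here. It handles only the ASCII subset shared by UTF-8 and code page 1252, and PrintableRunSketch, ExtractAsciiRuns, and minLength are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only; not part of the SDK.
public static class PrintableRunSketch
{
    // Returns every run of at least minLength consecutive printable
    // ASCII bytes (0x20 through 0x7E) found in data, in order.
    public static List<string> ExtractAsciiRuns(byte[] data, int minLength = 4)
    {
        var runs = new List<string>();
        var current = new StringBuilder();

        foreach (byte b in data)
        {
            if (b >= 0x20 && b <= 0x7E)
            {
                // Printable byte: extend the current run.
                current.Append((char)b);
            }
            else
            {
                // Non-printable byte ends the run; keep it only if long enough.
                if (current.Length >= minLength)
                {
                    runs.Add(current.ToString());
                }
                current.Clear();
            }
        }

        // Flush a run that reaches the end of the data.
        if (current.Length >= minLength)
        {
            runs.Add(current.ToString());
        }

        return runs;
    }
}
```

The minimum run length is a design choice that trades recall for noise: short runs in binary data are usually random byte sequences rather than real text.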

Reference

OpenDiscoverSDK.Interfaces.Settings Namespace