PdfDocumentSettings.PageExtractedTextCriteria Property

Minimum extracted text length (in characters) that a PDF page must have to pass this criterion. See Remarks.

Namespace: OpenDiscoverSDK.Interfaces.Settings
Assembly: OpenDiscoverSDK.Interfaces (in OpenDiscoverSDK.Interfaces.dll) Version: 2025.4.6.0 (2025.4.6)

Syntax

[DataMemberAttribute]
public int PageExtractedTextCriteria { get; set; }

Property Value

Int32

Remarks

If ExtractionType is set to MetadataOnly then this property is ignored.

If the extracted text length of any PDF page is below the value of this property then the following data is added to PdfDocumentContent:

PdfHasFailedPages is added to Attributes (note: DocumentContent is the base class for PdfDocumentContent).
A PdfPageInfo entry is added to FailedPdfPages for this page.

The PdfPageInfo information can help users who plan to augment text extraction with OCR (optical character recognition) determine which PDF pages, if any, are candidates for OCR.

The default value of 1 (see below) was chosen because it is not uncommon to find a PDF page that is blank except for a page number. Users are encouraged to experiment on a PDF document collection and find the values that work best for their particular needs.

Default property value: 1 [character]; at least 1 character of text must be extracted from a PDF page for the page to pass this criterion (e.g., at least a page number on an otherwise blank PDF page).

Valid range: 0 - 500 [characters]; a value of 0 means that any length of text (including no text) passes this criterion, so no pages will be marked as failed because of their extracted text length.
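
The threshold rule described above can be illustrated with a minimal, self-contained sketch. It does not call the OpenDiscover SDK: PageTextCriteriaSketch and FindFailedPages are hypothetical names used only for illustration, the 1-based page numbering is an assumption, and so is counting every character (including whitespace) toward the text length. How the SDK itself treats out-of-range values is not stated here; the sketch simply rejects them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in, NOT an OpenDiscoverSDK type. It models only the
// documented rule: a page fails when its extracted text length is below
// the criteria value.
public static class PageTextCriteriaSketch
{
    public const int DefaultCriteria = 1; // documented default value
    public const int MinCriteria = 0;     // documented valid range: 0 - 500
    public const int MaxCriteria = 500;

    // Returns the 1-based numbers (an assumption) of the pages whose
    // extracted text is shorter than 'criteria' characters.
    public static List<int> FindFailedPages(
        IReadOnlyList<string> pageTexts, int criteria = DefaultCriteria)
    {
        // Assumption: out-of-range values are rejected. The SDK's actual
        // behavior for such values is not documented in this topic.
        if (criteria < MinCriteria || criteria > MaxCriteria)
            throw new ArgumentOutOfRangeException(
                nameof(criteria), "Expected a value from 0 to 500.");

        var failed = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            // Treat a missing (null) page text as empty.
            int length = (pageTexts[i] ?? string.Empty).Length;

            // With criteria == 0 this comparison is never true,
            // so no page is ever marked as failed.
            if (length < criteria)
                failed.Add(i + 1);
        }
        return failed;
    }
}
```

For three pages whose extracted text is "Chapter 1: Introduction", "" and "7", the default value of 1 flags only page 2 (the page with no text), a value of 2 also flags page 3 (the page that contains only a page number), and a value of 0 flags nothing.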

Reference

PdfDocumentSettings Class

OpenDiscoverSDK.Interfaces.Settings Namespace