
LanguageIdSettings.LatinScriptRegionPartitionSize Property

Used by the language identification algorithm (see IdentifyLanguages) to partition detected Latin script regions into smaller character ranges of this size.

Namespace: OpenDiscoverSDK.Interfaces.Settings
Assembly: OpenDiscoverSDK.Interfaces (in OpenDiscoverSDK.Interfaces.dll) Version: 2025.4.4.0 (2025.4.4)
Syntax
C#
[DataMemberAttribute]
public int LatinScriptRegionPartitionSize { get; set; }

Property Value

Int32
Remarks

This property is ignored if either IdentifyLanguages or PartitionLatinScriptRegions is false.

If IdentifyLanguages is true and PartitionLatinScriptRegions is false, Latin script regions are not partitioned, and the whole detected Latin script region is used to identify the Latin-based language. The Latin-based language with the highest score for the region is the identified language for that region.

If IdentifyLanguages and PartitionLatinScriptRegions are both true, any detected Latin script region (character range) larger than this property value is broken into smaller partitions of this size (in characters). Each partition then has its language detected separately, rather than the whole original Latin script range being evaluated at once. The Latin-based language with the highest score for each partition is the identified language for that partition.
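The partitioning described above can be pictured with a small standalone sketch. This is not the SDK's implementation: the class name LatinRegionPartitionSketch and the method Partition are hypothetical, and the handling of the final, shorter partition is an assumption, since the documentation does not say how a remainder is treated.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of the Open Discover SDK.
public static class LatinRegionPartitionSketch
{
    // Splits the character range [start, start + length) into consecutive
    // partitions of at most partitionSize characters.
    // Assumption: the last partition simply holds whatever remains.
    public static List<(int Start, int Length)> Partition(int start, int length, int partitionSize)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        if (partitionSize <= 0) throw new ArgumentOutOfRangeException(nameof(partitionSize));

        var partitions = new List<(int Start, int Length)>();
        for (int offset = 0; offset < length; offset += partitionSize)
        {
            // A region no larger than partitionSize yields a single partition.
            int size = Math.Min(partitionSize, length - offset);
            partitions.Add((start + offset, size));
        }
        return partitions;
    }
}
```

Under these assumptions, a 2500-character Latin script region with a partition size of 1000 would be evaluated as three ranges (1000, 1000, and 500 characters), each receiving its own highest-scoring language.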

If a Latin script region contains multiple Latin-based languages, only the highest-scoring language is returned for that script region. Because Latin script has many languages associated with it, identification of Latin-based languages can be less reliable than identification of CJK and other non-Latin scripts. Furthermore, if a detected Latin script region does not contain normal conversational language but instead contains tabular data (as in a spreadsheet), acronyms, entity names, and so on, the detected Latin-based language can be unreliable.

If two or more Latin-based languages appear sequentially in a document's extracted text, the combined character range will have 'Latin' as the detected script. Breaking this larger Latin script region into smaller regions aids in detecting these languages, but at a cost: a region that is too small contains fewer words from which to identify the language, which can make it difficult for the algorithm to detect the languages accurately and can lead to incorrect identification of the smaller regions.

This property is ignored for file formats with Classification equal to Spreadsheet, because spreadsheet cell values and column naming make language identification less reliable for Latin scripts. Language detection for spreadsheets is generally not as good as for other office document formats such as word processing documents or slide presentations. However, word processing documents or slide presentations with a lot of tabular data, or with regions containing long lists of names, places, and acronyms, can also suffer from incorrect language identification.
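Taken together, the remarks so far name three conditions under which the partition size has any effect. The following sketch restates them as a single predicate. The class and method names are hypothetical, and the isSpreadsheet flag stands in for a check of the document's Classification, whose exact type is not shown on this page.

```csharp
// Hypothetical illustration only; not part of the Open Discover SDK.
public static class PartitionSizeApplicability
{
    // Returns true only when the documented conditions for using
    // LatinScriptRegionPartitionSize all hold:
    //   - IdentifyLanguages is true
    //   - PartitionLatinScriptRegions is true
    //   - the document is not classified as a spreadsheet
    // Assumption: isSpreadsheet is computed by the caller from the
    // document's Classification value.
    public static bool IsPartitionSizeUsed(
        bool identifyLanguages,
        bool partitionLatinScriptRegions,
        bool isSpreadsheet)
    {
        return identifyLanguages && partitionLatinScriptRegions && !isSpreadsheet;
    }
}
```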

Note that smaller Latin script partitions require more processing for documents with Latin-based languages. The default value of this property generally works well for most documents, balancing Latin-based language detection against performance and the risk of incorrect identification.

Developers should test the effects of the PartitionLatinScriptRegions and LatinScriptRegionPartitionSize properties to determine the best fit for their needs and data.

The default property value: 1000 [characters]. The minimum allowed property value: 150 [characters].
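This page does not say what the SDK does when a value below the minimum is assigned, so a caller may prefer to normalize a requested size before setting the property. A minimal sketch of such a guard follows. The class PartitionSizeGuard and its members are hypothetical, and clamping to the minimum is an assumption made for illustration, not documented SDK behavior.

```csharp
using System;

// Hypothetical illustration only; not part of the Open Discover SDK.
public static class PartitionSizeGuard
{
    // Values taken from the documentation above.
    public const int DefaultPartitionSize = 1000; // characters
    public const int MinimumPartitionSize = 150;  // characters

    // Returns a partition size that respects the documented minimum.
    // Assumption: a too-small request is raised to the minimum and a
    // missing request falls back to the documented default.
    public static int Normalize(int? requestedSize)
    {
        if (requestedSize == null)
        {
            return DefaultPartitionSize;
        }
        return Math.Max(MinimumPartitionSize, requestedSize.Value);
    }
}
```

The result of Normalize could then be assigned to LatinScriptRegionPartitionSize when experimenting with different partition sizes on representative documents.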

See Also