Release Notes

Release History

Changes in Release 2025.4.6 11/12/2025

  • Split the Medicare Beneficiary Identifier (MBI) out of EntityType.HealthInsuranceClaimNumber (SSN based) into a new entity type: EntityType.MedicareBeneficiaryIdentfier. Like the SSN, the MBI is confidential and should be protected as Personally Identifiable Information.
  • SDK: Improved person name finder and other PII/PHI entity extractions.

Changes in Release 2025.3.6 7/28/2025

  • SDK: Added new file format Id's (IdClassification.AIChatsAndPrompts): OutlookCopilotTeamsChat, OutlookCopilotExcel, OutlookCopilotWord, OutlookCopilotPowerPoint, OutlookCopilotOneNote, OutlookCopilotSharePoint, OutlookCopilotOutlook, OutlookCopilotLoop, OutlookCopilotOffice365, OutlookCopilotForms, OutlookCopilotBizChat, OutlookCopilotWebChat, OutlookCopilotWhiteboard, OutlookMobileSMS, OutlookMobileMMS, and OutlookInfoPathForm.
  • SDK: Added new email/email system IdClassification types to help filter emails from email system objects and reports: EmailSystemObjectType, EmailSystemReport, and VoiceAndFaxMail. Also added IdClassification.AIChatsAndPrompts for new Office 365 Copilot chats and prompts archiving in Exchange.
  • SDK: Improved person name finder and other entity extractions

Changes in Release 2024.4.8 11/14/2024

  • Upgraded SDK/Platform to .NET 8.
  • SDK: Added new file format Id's: RSMF, OutlookEmailRefAttachment, OutlookEmailRefOnlyAttachment, OutlookEmailWebRefAttachment, and Parquet.
  • SDK: Added metadata and file format Id support for Outlook Reference type attachments
  • SDK: Improved person name finder and other entity extractions

Changes in Release 2024.3.1 7/31/2024

  • SDK: Added person name normalization to Entity.Associated property
  • SDK: Added postal address normalization to Entity.Associated property
  • SDK: Improved person name finder and other entity extractions

Changes in Release 2023.4.4 11/01/2023

  • SDK: Fixed issues with some international phone number formats
  • SDK: Improved person name finder
  • SDK: Added ability to extract only user selected EntityTypes by setting properties EntityExtractionSettings.EnableExtractedEntityTypeFilter and EntityExtractionSettings.ExtractedEntityTypeFilter. If enabled, only EntityTypes in EntityExtractionSettings.ExtractedEntityTypeFilter are extracted.

Changes in Release 2023.4.0 9/27/2023

  • SDK: Improved person name finder and various PII/PHI extractions
  • Platform: Added file format Id's: AdobePDFAcroForm and AdobePDFAcroFormEncrypted

Changes in Release 2023.2.4 5/14/2023

  • SDK: Replaced all SensitiveItemType, EntityItemType, and EntityIdentifierType with combined EntityType.
  • Platform: Load creator re-factored to use ExportField(s).

Changes in Release 2022.4.0 10/7/2022

  • SDK: Added the following EntityItemType(s): BorrowerNameEntry
  • SDK: Added the following EntityIdentifierType(s): MortgageElectronicRegSystemsNumber, MobileIdentificationNumber, InstrumentId, ParcelId, EscrowId, FHACaseId
  • SDK: Various improvements to EntityItemType.PersonName detection.
  • SDK: Various file identification improvements with XML/HTML files.

Changes in Release 2022.3.8 8/28/2022 (Extended Release 1/30/23)

  • SDK: Added the following EntityItemType(s): CityStateAndZip, ServiceDateEntry
  • SDK: Added the following EntityIdentifierType(s): ProductKey, MaintenanceId, NaturalizationCertificateNumber, CertificateOfCitizenshipNumber
  • SDK: Various improvements to EntityItemType.PersonName detection.

Changes in Release 2022.2.4 6/26/2022

  • SDK: Added the following EntityItemType(s): RenewalDateEntry, TestDateEntry
  • SDK: Added the following EntityIdentifierType(s): PensionId, AppointmentId, FICECode, OPEIDCode, CEEBCode, LockBoxId
  • SDK: Various improvements to EntityItemType.PersonName detection.
  • SDK: Fixed Excel 97-2003 SST record (text) bug
  • SDK: Fixed Excel 97-2003 embedded item extraction (fallback from ExtractOleObjectXBin to ExtractOle10Native)

Changes in Release 2022.2.2 5/10/2022

  • SDK: Added the following EntityItemType(s): GuarantorNameEntry, BuyerNameEntry, SellerNameEntry, LessorNameEntry, LesseeNameEntry
  • SDK: Added the following EntityIdentifierType(s): EmployeeAccountNumber, EmployerAccountNumber
  • SDK: Various improvements to EntityItemType.PersonName detection.

Changes in Release 2022.2.1 4/19/2022

  • SDK: Various improvements to EntityItemType.PersonName detection.

Changes in Release 2022.1.7 2/28/2022

  • SDK: Added the following EntityIdentifierType(s): DocuSignEnvelopeId, EnvelopeId, FileId, LexisNexisId
  • SDK: Various improvements in entity item detection.

Changes in Release 2022.1.6 2/10/2022

  • SDK: Added the following EntityItemType(s): PersonName
  • SDK: Added the following EntityIdentifierType(s): CheckId, BankId, PayeeId, PayerId, RxBinId, RxGroupId
  • SDK: Various improvements in sensitive item detection.

Changes in Release 2022.1.5 2/1/2022

  • SDK: Added the following EntityItemType(s): ProviderNameEntry, BeneficiaryNameEntry, DoctorOfMedicineNameEntry, SubmittalDateEntry, RequestDateEntry, ReportDateEntry, ReceiveDateEntry, ProcedureDate, AdjudicationDate
  • SDK: Added the following EntityIdentifierType(s): RecipientId, CertificationId, PaymentId, ProviderId, ProviderTransactionAccessNumber, ProcessorControlNumber, AccessionId, CurrentProceduralTerminologyCodeId, InternationalClassificationOfDiseasesCodeId, RemittanceId, HospitalAccountRecordId, StateId, SpecimenId, CheckId, BankId, PayeeId, PayerId, RxBinId, RxGroupId
  • SDK: Added new file identification: Id.CrystalReports
  • SDK: Various improvements in sensitive item detection.

Changes in Release 2021.4.12 12/29/2021

  • SDK: Added the following entity item: NationalityType
  • SDK: Various improvements in sensitive item detection of general accounts and bank accounts

Changes in Release 2021.4.10 12/20/2021

  • SDK: Added the following entity identifier items: ICRReferenceNumber
  • SDK: Added the following entity items: DoDInformationCollectionFormRelated
  • SDK: Various improvements in sensitive item detection.

Changes in Release 2021.4.7 11/20/2021

  • SDK: Fixed issue with sensitive item deduplication setting.
  • Platform: Improvements in handling very large text and extracted text edge cases.

Changes in Release 2021.4.4 11/10/2021

  • SDK: Fixed issue where sensitive items such as social security numbers and credit card numbers were not detected when immediately surrounded by underscore ('_') characters.
  • SDK: Added new file ID: AVIFImageFile
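
For readers who write their own patterns (for example, for custom item definitions), one common way this class of miss arises in regex-based detectors is the `\b` word-boundary anchor, which treats '_' as a word character. The sketch below is generic Python and is not the SDK's detection engine; the pattern and function names are hypothetical, and the release notes do not state the actual cause of the bug.

```python
import re

# Hypothetical SSN-like pattern, for illustration only.
SSN_CORE = r"\d{3}-\d{2}-\d{4}"

# \b treats '_' as a word character, so "_123-45-6789_" has no boundary
# between '_' and '1', and this pattern misses it.
WORD_BOUNDARY = re.compile(rf"\b{SSN_CORE}\b")

# Lookarounds that only reject adjacent ASCII letters or digits still
# match when the number is wrapped in underscores.
ALNUM_BOUNDARY = re.compile(rf"(?<![0-9A-Za-z]){SSN_CORE}(?![0-9A-Za-z])")


def find_candidates(pattern: re.Pattern, text: str) -> list[str]:
    """Return every substring of text matched by pattern."""
    return pattern.findall(text)
```

Calling `find_candidates` with each pattern on the same underscore-wrapped text shows the difference: only the lookaround version reports the number.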

Changes in Release 2021.4.3 11/06/2021

  • SDK: Fixed issue with PST partitioning mode failing on any message object parsing error.
  • SDK: Improved MBOX parsing.
  • SDK: Added properties to SensitiveItemResult class: IdentificationRelatedCounts, FinancialRelatedCounts, PersonNameRelatedCounts, PHIRelatedCounts, and FerpaRelatedCounts
  • Platform: Enhanced Relativity load file export with new field mapping types: Document.FamilySHA1Hash, Document.IsRepresentingDuplicate, Document.DuplicateGroupId, Document.IdentificationRelatedCount, Document.FinancialRelatedCounts, Document.PersonNameRelatedCount, Document.PHIRelatedCounts, Document.FerpaRelatedCounts, JSON_HIGHLIGHT_SensitiveItems, JSON_HIGHLIGHT_SensitiveCustomItems, JSON_HIGHLIGHT_EntityItems

Changes in Release 2021.4.2 10/14/2021

  • SDK: Fixed issue with OpenDiscoverSDK.ContentExtractorFactory.ClearCustomItemDefinitions method not working properly.
  • Platform: DocumentDataReader control numbering added that removes references to container parents. See Document.DocControlNumber, Document.ParentDocControlNumber, and Document.FamilyControlNumber.

Changes in Release 2021.3.21 10/4/2021

  • SDK: Added new EntityItemType value: NationalSecurityQuestionnaireFormRelated
  • Platform: Added property DocumentTaskSettings.ControlNumberingType

Changes in Release 2021.3.18 9/7/2021

  • SDK: iCalendar (.ics) and PDF bug fixes
  • SDK: Improvements in various entity item detections.
  • SDK: Added new EntityItemType (first cut, exploratory): CitizenshipEntry.
  • SDK: Added new property to SensitiveItemResult: SensitiveItemTypeCounts
  • Platform: Added Primary Date/Time to complement Sort Date/Time
  • Platform: Added document DocControlNumber and ParentDocControlNumber properties to Document class.
  • Platform: Added "CreateEmptyTextFilesWhenNoTextExtracted" to DocumentTaskSettings class. When set to true this property will write empty extracted text files during processing for documents that do not have any extracted text.

Changes in Release 2021.3.14 8/7/2021

  • SDK: Added new file Id type: OutlookRightsManagedEmailObject
  • SDK: Fixed bug with iCalendar format not displaying email address in extracted text under certain conditions. Also improved metadata extracted.

Changes in Release 2021.3.12 7/29/2021

  • SDK: Added new EntityItemTypes: GraduationDateEntry and EntryDateEntry
  • SDK: Various small bug fixes

Changes in Release 2021.3.10 7/15/2021

  • SDK: Added new file Id types: MetafileOLE2Container, DeviceIndependentBitmapOLE2Container, EnhancedMetafileOLE2Container
  • SDK: Added new CustomItemDefinition property 'RequireKeywordSequenceAtStartOfLine'.
  • Platform: Various improvements in SingleMailStore task processing.

Changes in Release 2021.3.9 7/7/2021

  • SDK: Added new file Id types: ArchiveWinAce, QuickBooksBackupFile, QuickBooksBackupFileCFD, QuickBooksAccountantReviewFileCFD, QuickBooksAccountantReviewFile, and eFaxDocument.
  • SDK: New EntityIdentifierTypes: RevenueId, PromoCodeId, and CouponCodeId
  • SDK: Various improvements with sensitive item detection.
  • Platform: Fixed bug with SingleMailStore task.

Changes in Release 2021.3.4 6/20/2021

  • SDK/Platform: Various bug fixes.
  • Platform: LoadFileCreator/LoadFileSettings API improvements and features.

Changes in Release 2021.3.2 6/01/2021

  • SDK/Platform: Added DocumentContent.SHA256BinaryHash and DocumentContent.SHA256ContentHash. Likewise, Platform Document class also has these new hashes.
  • SDK: New EntityIdentifierType types: VATRegistrationNumber, PersonalIdentificationNumber, SecurityCodeNumber, SecurityCardNumber, AccessCardNumber, HealthIdNumber, DepartmentOfDefenseIdCardNumber, VeteranIdCardNumber, BadgeNumber
  • SDK: New EntityItemType types: SecurityChallengeRelated, FederallyRecognizedTribe, SMBUrl, ImmigrationFormRelated
  • SDK: EntityIdentifierType type bug fixes
  • SDK: Added DocumentAttributes.PresentationHasSpeakerNotes and DocumentAttributes.HiddenText
  • SDK: Added EmailDocumentContent.ReceivedDate to the API.
  • SDK: Added methods ContentExtractorFactory.ValidateCustomItemDefinitions and ContentExtractorFactory.ClearCustomItemDefinitions to the API.
  • SDK: Added property SensitiveItemCheckSettings.CustomItemDefinitions to the API such that this property could be serialized and used by another process.
  • Platform: Added LoadFileCreator/LoadFileSettings class to the API.

Changes in Release 2021.2.7 5/11/2021

  • SDK: Fixed Url entity extraction bug
  • SDK: Now scans Document.Hyperlinks for specific Url entity types when sensitive item detection enabled. All unique Urls will be aggregated under SensitiveItemResult.EntityItems when sensitive item detection is enabled (see SensitiveItemCheckSettings.Check)

Changes in Release 2021.2.6 5/09/2021

  • SDK: New SensitiveItemType: StateIdCardNumber
  • SDK: New EntityIdentifierType types: AlienRegistrationNumber, TribalIdentificationNumber, CLIAIdentificationNumber, GenericIdCardNumber
  • SDK: New EntityItemType types: ExpirationDateEntry, EffectiveDateEntry, and IssuedDateEntry
  • SDK: Fixed email address sensitive item detection bug.
  • SDK: Fixed API to work with System.Text.Json (the new built-in .NET 5 JSON serializer).
  • SDK: Improved file identification of JSON files with no or misleading extension.
  • Platform: Enhanced and added more precedence rules to Document.SortDate calculation.
  • Platform: Added Document property Document.SortDateType that specifies the metadata field that was used for SortDate. See also new associated enumeration type

Changes in Release 2021.2.5 5/01/2021

  • SDK/Platform: Fixed JSON serialization with new .NET 5 System.Text.Json serializer.

Changes in Release 2021.2.4 4/28/2021

  • SDK: Improvements to person name entity detection for spreadsheet/database formats to reduce false positives.
  • Platform SDK: Added DocumentTaskSettings class property: JobStartDate. This property should be set by any workflow engine that manages Open Discover tasks.
  • Platform SDK: Added new user convenience (user set) DocumentTaskSettings class properties: CustodianId, CollectionId, and TaskParameter. These tasks settings properties will be serialized/de-serialized to/from document data archive (.dda) processing output files.
  • Platform SDK: Added new Document property Document.ParentChildRelativePath. This property is only set on child documents and is a path that contains the parent container file name(s) as well as Document.ContainerRelativePath as part of the parent/child relative path.

Changes in Release 2021.2.3 4/21/2021

  • SDK: Added person name entity detection for spreadsheet/database formats (only for specific cases with specific named columns)
  • Platform SDK: Added CustodianId, CollectionId, and TaskParameter as optional user defined DocumentTaskSettings properties.

Changes in Release 2021.2.2 4/15/2021

  • SDK: New EntityIdentifierType types: AttorneyDocketNumber
  • SDK: SensitiveItemType.PhoneNumber international number detection improvements

Changes in Release 2021.2.1 4/11/2021

  • SDK: New EntityIdentifierType types: PatentNumberApplication
  • SDK: Improvements in parsing EntityIdentifierType types: PatentNumber, PatentNumberApplication, and TrademarkNumber
  • Platform: New property on DocumentTaskSettings: JobStartDate

Changes in Release 2021.2.0 4/08/2021

  • SDK: New SensitiveItemType types: MacAddress and IMEINumber.
  • SDK: New EntityIdentifierType types: PersonalHealthNumber, MexicoConsularId, NationalId, AccessCodeNumber, EncounterNumber, VisitNumber, ClientIdentificationNumber, OMBControlNumber, LoanId, NMLSUniqueIdentifier, SpecificationNumber, PatentNumber, and TrademarkNumber
  • SDK: New EntityItemType types: PersonAgeEntry, ClientNameEntry, StudentNameEntry, EnrolleeNameEntry, ApplicantNameEntry, ContactNameEntry, GenderTypeEntry, EmploymentStartingDateEntry, EmploymentEndingDateEntry, ReferralDateEntry, EncounterDateEntry, InjuryDateEntry, InvestigativeOrganization, InvestigativeTerm, LegalContractType, LegalPrivilegedRelated, IRSTaxFormName, IRSTaxFormRelated, LegalContractTerm, and IntellectualPropertyTerm.
  • SDK: Improved detections for sensitive items in spreadsheet formats and general improvement of several sensitive item types.
  • SDK: Bug fix for EntityIdentifierType number/id parsing.
  • SDK: Added new file format Id: CythonScriptFile

Changes in Release 2021.1.2 02/8/2021

  • SDK: Added support for Emoji entity (formerly was marked as RESERVED). Will return emoji group, sub-group, and description. Nearly 4,600 emojis supported.
  • SDK: Added EntityItemType.DomainName entity. Will find domain names that begin with www, www2, or www3. If the domain name has a path (/), it gets upgraded to EntityItemType.Url. Prior to this release, an incomplete Url (i.e., no http/https prefix) would not get identified as a Url entity.
  • SDK: Added 8 new EntityItemType Url types (security related): FileUrl, FileTransferUrl, LDAPUrl, IMAPUrl, POP3Url, InternetRelayChatUrl, AppleFilingProtocolUrl, and AppleFaceTimeUrl
  • SDK: 2 new EntityIdentifierTypes related to the medical field: DEARegistrationNumber and NationalProviderIdentifier
  • SDK: New IdCategory.Telecommunications file format for identification and content extraction: SkypeChatHistoryExport
  • SDK: New IdCategory.Telecommunications file format for identification and content extraction: SlackExport
  • SDK: Added extraction support for Microsoft Outlook for Mac (.olm) format (Id.OutlookForMacMailbox). Message objects will get extracted as MIME (.eml) files. Unmapped Outlook for Mac fields are added as MIME headers to extracted MIME file(s).
  • SDK: Modified and simplified the IDatabaseExtractor interface.
  • SDK: Added new TextSourceType enumeration value: TextSourceType.ExtractionUserLimited (for cases like database tables, which can potentially have tens of millions of rows, the user can choose to limit the number of database table rows output).
  • SDK: New supported database file formats for identification and content extraction: EdgeIndexedDB, EdgeIndexedDBDirty, IEIndexedDBAppQuota, IEIndexedDBAppQuotaDirty, IEIndexedDB, IEIndexedDBDirty, EdgeFavorites, EdgeFavoritesDirty, WindowsSearch, WindowsSearchDirty, WindowsUpdate, WindowsUpdateDirty, WindowsSyncShareSvc, WindowsSyncShareSvcDirty, WindowsTileDataLayer, WindowsTileDataLayerDirty, WindowsLiveMsgrContacts, WindowsLiveMsgrContactsDirty, WindowsZuneMusic, WindowsZuneMusicDirty, Windows10MailAppDatabase, Windows10MailAppDatabaseDirty. (File formats with "Dirty" at the end require an extra workflow step to process: they are databases that were not shut down cleanly by Windows.)
  • SDK: Added new DocumentAttribute.SensitiveItemScanLimited to notify if sensitive item detection only used part of the extracted text for very "large" binary blobs or text files.
  • SDK: PDF Bug Fix: PDF signed dates and annotation dates were extracted to text as local date/time, they are now extracted as UTC date/time to text.
  • SDK: Bug Fix: ILargeEncodedTextExtractor and ILargeUnsupportedExtractor did not have sensitive item detection implemented. This would only affect large encoded text files and large unknown/unsupported binary blobs larger than 80 MB. The ILargeEncodedTextExtractor will scan the first 200 million characters for sensitive items while the ILargeUnsupportedExtractor will scan the first 100 million bytes. If the file size exceeds these values then the DocumentContent.Attributes gets a DocumentAttributes.SensitiveItemScanLimited to indicate that the sensitive item scan was limited in size.
  • Platform: Added DocumentTaskSettings processing task: ProcessingType.SingleDatabase. Process a single supported database file (WindowSearch/ESEDatabase/Access/Future additions) as a task. Typically, the user would want to create a separate task for a database only if it is too "large" to be included in a DocumentSet task (i.e., file size or number of rows in one or more TableInfo objects in DatabaseTableInfo are greater than a user defined criteria).

Changes in Release 2021.1.1 1/18/2021

  • SDK: New EntityIdentifierType types: ApplicationNumber, InsuranceClaimNumber, AdmissionId, InternalControlNumber, MedicalRecordNumber, and HealthInsuranceClaimNumber.
  • SDK: New EntityItemType types related to dates: DueDateEntry, BillingDateEntry, PaymentDateEntry, BalanceDateEntry, InvoiceDateEntry, MaturityDateEntry, ContractDateEntry, AdmissionDateEntry, DischargeDateEntry, and DateOfDeathEntry.
  • SDK: Improvements in detections for IP addresses, date of birth, network, phone numbers, and general account numbers.
  • SDK: Added new options for CustomItemExtractType: PreceedingTextOnSameLine, and PreviousTerm
  • SDK: Added partial support for text/metadata extraction for Microsoft Access database (versions 2000-2016, Office 365) and supporting extraction interfaces/content classes for future database format additions (see interface IDatabaseExtractor and content classes DatabaseContent, TableInfo, and ColumnInfo). Following releases will have more support for MS Access (namely attachment column extraction)
  • Platform: DocumentTaskEngine support for Microsoft Access database and general infrastructure for future database format additions.
  • Platform: Made Document.ContainerRelativePath ending (without end '\') consistent between archive and mail store containers.

Changes in Release 2020.4.7 10/30/2020

  • Platform: Re-exposed improved version of CustomItemDefinition API for custom sensitive item detection. See SensitiveItemCheckSettings.CustomItemCheck property.
  • SDK: Added the following new entity types: EntityItemType.StudentInformationRelated, EntityIdentifierType.StudentId, and EntityIdentifierType.PatientAccountNumber
  • SDK: Various improvements in sensitive item detection.
  • SDK/Platform: Changed Document.Index and ChildDocument.Index to Int64 value from Int32. This change was necessary for the next change listed below.
  • SDK: Added new method "GetMessagesByIndex" to IMailStoreExtractor interface. This method allows retrieving previously processed emails by their Document.Index property. This assumes the mail store has not changed since last processed.
  • Platform: Removed Document.SHA1EmailAttachmentNamesHash and replaced with Document.SHA1EmailAttachmentSortedHash.

Changes in Release 2020.4.4 10/6/2020

  • SDK: Added 5 new file format IDs (4 related to SQL database backup formats and 1 related to Windows MiniDump files)
  • Platform: Added OpenDiscoverPlatform.Settings.CpuCoreMode.TwoCore enumeration value for use on systems with low-end hardware.
  • Platform: Fixed a bug in DocumentTaskEngine.Inventory for folders without access permissions (access denied).

Changes in Release 2020.4.0 9/10/2020

  • SDK: Added Cryptocurrency address check to the sensitive items checked. See class OpenDiscoverSDK.Interfaces.Settings.SensitiveItemCheckSettings
  • SDK: Many sensitive item detection improvements.
  • Platform: Added new inventory mode OpenDiscoverPlatform.InventoryMode.InputDirectoryDocumentsOnly.

Changes in Release 2020.3.8 8/27/2020

  • SDK: Added 51 new file format identifications related to the IdClassification.SourceCode category
  • SDK: Various bug fixes and improvements related to content extraction.
  • SDK: Improvements related to sensitive item detection.

Changes in Release 2020.3.7 8/11/2020

  • SDK: Added option for deduplication of sensitive items and entity items.
  • SDK: Added MedicalPersonType, HospicePalliativeCareOrganization, LegalOrganization, LegalTerm, and LegalPersonType entity identification.
  • SDK: First release with full address extraction. If full address can't be identified a StateAndZipAddress entity is created.
  • SDK: Added new partial address entity types.

Changes in Release 2020.3.6 7/29/2020

  • SDK: This is the first release of the new SensitiveItem detection engine. It lowers false positives while increasing the detection rate of hard-to-detect items.
  • SDK: SensitiveItem API changes. Recognized entity items are returned in SensitiveItemResult.EntityItems
  • SDK: Detection for IPv6 was added. Zip codes are verified to reduce false positives for Address detection. Phone numbers have their US/Canadian city or country listed in SensitiveItem.Associated property.
  • Platform: Fixed issue with Document.SortDate property not being set properly in certain instances.
  • SDK: Fixed MIME header metadata extraction issue for multiple instances of same header.

Changes in Release 2020.3.4 7/12/2020

  • Platform: Added new processing mode to isolate corrupt documents.
  • SDK: Fixed PST extracted Outlook .msg file binary hash changes that were dependent on when the messages were extracted.
  • SDK: Added SocialMediaAccount, License Plate Number, Vehicle Identification Number (VIN), Health Care Number/Member ID checks and google/bing hyperlink address checks.
  • SDK: Improvements to sensitive item detection
  • SDK: Changed property SensitiveItem.IsMetadata to SensitiveItem.LocationType enumeration (now have Hyperlink as location type)
  • SDK: Added file ID for MediaImageVirtualHardDiskVer2
  • SDK: Changed file ID MediaImageVHD to new name MediaImageVirtualHardDiskVer1

Changes in Release 2020.3.3 7/7/2020

  • Platform: VHDX and VHD get requeued as UserRequeue.
  • SDK: Improvements to sensitive item detection.

Changes in Release 2020.3.1 6/20/2020

  • Platform: Added new processing type, ProcessingType.CustomDocumentSource, and associated property CustomDocumentSource to DocumentTaskSettings class. The new interface ICustomDocumentSource allows for users to provide a custom source of documents to process that may not exist on file system (for example, streamed from a SQL FILESTREAM table column).
  • SDK: Improvements to bank account sensitive item detection (added check for hyphenated account numbers).

Changes in Release 2020.3.0 6/12/2020

  • SDK: Added detected sensitive flags to DocumentAttributes enumeration (e.g., DocumentAttribute.DetectedCreditCard)
  • SDK: Added International Bank Account Number (IBAN) sensitive check option.
  • SDK: Added username and password sensitive check option.
  • SDK: Added maiden name sensitive check option. Next release will attempt to extract the maiden name text.
  • SDK: Added IPv4 sensitive check option.
  • SDK: Improvements in credit card and bank account sensitive identification. More planned for next release.
  • Platform: Added new document data archive (.dda) reader class, DDARecordReader. DDARecordReader reads .dda file records sequentially and has a very low memory footprint.
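
As background for the IBAN check option, IBAN candidates are commonly validated with the ISO 13616 mod-97 checksum. The sketch below is a minimal, generic Python illustration and is not the SDK's implementation. The function name is hypothetical, and it does not enforce country-specific lengths.

```python
def is_valid_iban(candidate: str) -> bool:
    """Check the ISO 13616 mod-97 checksum of an IBAN candidate."""
    iban = candidate.replace(" ", "").upper()
    # Shortest IBANs are 15 characters (Norway); the longest allowed is 34.
    if not (15 <= len(iban) <= 34) or not iban.isascii() or not iban.isalnum():
        return False
    # Two-letter country code followed by two check digits.
    if not (iban[:2].isalpha() and iban[2:4].isdigit()):
        return False
    # Move country code and check digits to the end, then map A-Z to 10-35.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

A checksum pass only says the string is well formed, not that the account exists, which is why a detector would typically combine it with surrounding keywords.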

Changes in Release 2020.2.5 6/07/2020

  • SDK: Added sensitive check for addresses and enhanced credit card and bank account checks.
  • SDK: Fixed thumbs.db image extraction bug.
  • SDK: Fixed compound file stack overflow bug for certain non-conforming compound files.
  • Platform: UserRequeueAsSeparateTask is now ignored for formats that are excluded format types.

Changes in Release 2020.2.4 5/11/2020

  • SDK: Added ContentResult enumeration values: MimePartialOrphanError and OCRError
  • Platform: Added Document properties: FirstContainerParentGuid and TopMostContainerParentGuid

Changes in Release 2020.2.3 5/2/2020

  • Platform: Fixed small resource leak if running DocumentTaskEngine in process (versus recommended out-of-process)

Changes in Release 2020.2.1 4/14/2020

  • SDK: EML identification bug fix (many extra MIME headers)

Changes in Release 2020.2.0 4/1/2020

  • SDK/Platform: Sensitive item bug fixes.

Changes in Release 2020.1.6 3/1/2020

  • SDK/Platform: Fixed sensitive check item "Date of Birth". Feature was not fully implemented.
  • Extended partner trial length to 1 month for SDK/Platform.

Changes in Release 2020.1.5 2/18/2020

  • SDK/Platform: Added new feature to extract Word tracked changes and append them to the end of the extracted text. PowerPoint does not have tracked changes and Excel uses a commenting system instead of detailed changes like Word. Excel comments are already extracted with Author information and text. To extract Word tracked changes set ExtractOfficeTrackedChanges property to true (the default value).
  • SDK/Platform: Added optional file entropy calculation. File entropy can be useful in the detection of unknown formats (e.g., is document encrypted or compressed). See property CalculateFileEntropy
  • Platform: Added Document.SortDate which calculates a date time from existing document metadata that is useful for sorting a set of documents by date.
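
To illustrate why file entropy helps with unknown formats: Shannon entropy over byte frequencies approaches 8 bits per byte for compressed or encrypted data and is much lower for plain text. The sketch below is a generic Python illustration of that calculation; it is an assumption about the general technique, not the formula the SDK uses behind CalculateFileEntropy, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value present
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A buffer of one repeated byte scores 0.0, while a buffer containing every byte value equally often scores 8.0.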

Changes in Release 2020.1.4 1/27/2020

  • Open Discover SDK/Platform new preliminary feature: sensitive item detection in extracted text and metadata. Future releases will have expanded sensitive item extraction options.
  • Open Discover SDK/Platform new feature: ability for users to specify additional MAPI properties and named properties to be extracted as metadata.
  • 7 new document file format Ids: Archive7ZipEncrypted, ArchiveRar4Encrypted, ArchiveRar4SplitEncrypted, ArchiveRar4SplitSegmentEncrypted, ArchiveRar5Encrypted, ArchiveRar5SplitEncrypted, ArchiveRar5SplitSegmentEncrypted. These new archive file Ids specify that the archive's headers are encrypted, which means no file item information is available without applying the correct password first.
  • Bug fix: Rar5 and Rar5Split archive extraction was not working. Some of the new file Ids in this release listed above were created to fix this issue.
  • Bug fix with 7-Zip split archives: Before this fix the expanded archive file item size was not available. This would affect metadata scan information and testing of archive true expansion size.

Changes in Release 2020.1.1 1/5/2020

  • 5 new document file format Ids: AdobeFDF, AdobeXFDF, AdobeXDAP, MicrosoftOwnerFile (Word, PowerPoint, and Excel temp lock files), and MicrosoftOwnerFileOLE.
  • Open Discover Platform bug fix: Fixed issue with a process locked file stream on a processing input file that caused duplicate entries in the document data archive (.dda).
  • Open Discover Platform feature: New processing mode: IdentificationWithContainerItemCount. This mode identifies, calculates binary hash, de-NISTs, gives container item count, and tests archives for expansion size/compression ratio. If user is doing full text/metadata document processing then this mode is ideal for the first workflow step in a processing workflow.

Changes in Release 2020.1.0 12/31/2019

  • Open Discover Platform: Fixed error introduced in a prior release that affected archive partitioning processing (that is, breaking "large" archives into several partitions (tasks) for distributed processing).
  • New feature for Open Discover Platform: Added new processing type “ProcessingType.MimePartialMessageSet” that should only be used at the very end of a document processing workflow to join all MIME partial-message parts into single valid MIME emails that are then processed. The task takes ALL MIME partial-message parts in a document collection, works out how to join them into the original MIME messages, and then processes the joined messages (assigning the extracted content to the first MIME partial-message part). MIME partial-messages are sometimes found in MBOX files where the original email was too large to be sent to the receiving client as one email (due to the message size limits of the mail server).
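
The joining step can be pictured with a small, generic sketch. RFC 2046 fragments carry an `id`, a `number`, and (at least on the last fragment) a `total` parameter on their `message/partial` Content-Type. The Python below groups fragments by `id` and concatenates their bodies in `number` order. It is a simplified illustration, not the Platform's implementation: the function name is hypothetical and the header-merging rules of RFC 2046 are omitted.

```python
from collections import defaultdict
from email.parser import HeaderParser


def join_partial_messages(raw_parts: list[str]) -> dict[str, str]:
    """Group message/partial fragments by id and join bodies by number."""
    parser = HeaderParser()  # parses headers only; the body stays a raw string
    groups: dict[str, dict[int, str]] = defaultdict(dict)
    totals: dict[str, int] = {}
    for raw in raw_parts:
        msg = parser.parsestr(raw)
        if msg.get_content_type() != "message/partial":
            continue
        part_id = msg.get_param("id")
        number = msg.get_param("number")
        if part_id is None or number is None:
            continue
        groups[part_id][int(number)] = msg.get_payload()
        total = msg.get_param("total")
        if total is not None:
            totals[part_id] = int(total)

    joined: dict[str, str] = {}
    for part_id, fragments in groups.items():
        total = totals.get(part_id)
        # Only join sets where every fragment 1..total is present.
        if total is not None and sorted(fragments) == list(range(1, total + 1)):
            joined[part_id] = "".join(fragments[n] for n in range(1, total + 1))
    return joined
```

Incomplete sets are left out of the result, which mirrors why such a task needs to see every fragment in the collection before it runs.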

Changes in Release 2019.4.5 12/10/2019

  • To prevent metadata name collisions the following was added: (1) custom metadata dictionary DocumentContent.CustomMedata, (2) new property types: BooleanListProperty, Int32ListProperty, Int64ListProperty, DoubleListProperty, DateTimeListProperty, and StringListProperty. All extracted custom metadata is stored in property DocumentContent.CustomMedata. If 2 or more metadata properties have the same name, then one of the new list types is created to hold all of these values, so that no metadata value is lost.
  • OpenDiscoverPlatform.DocumentTaskEngine.InventoryDirectory has a new optional InventoryMode argument. If InventoryMode.DirectoriesOnly is chosen, then the input directory and all sub-directories nested under it get enumerated along with their corresponding file system metadata. Files are not enumerated and returned, but the total number of files and total file size under a directory are returned as directory metadata (see DirectoryItem.TotalDocumentCount and DirectoryItem.TotalSizeInBytes). This mode allows a user to "scout" a very large file server and determine if multiple InventoryDirectory calls are necessary in order to enumerate and get info on possibly tens of millions of files.
  • Fixed OpenDiscoverPlatform.DocumentTaskEngine bug related to too many archives being processed at once.
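
The collision handling described for custom metadata (promote a repeated name to a list so no value is lost) can be sketched generically as follows. This is Python with a plain dictionary standing in for the metadata store; the function name is hypothetical and the SDK's typed list properties are not modeled.

```python
from typing import Any


def add_metadata(store: dict[str, Any], name: str, value: Any) -> None:
    """Add a value; on a repeated name, promote the entry to a list.

    Assumes individual values are scalars (not lists themselves).
    """
    if name not in store:
        store[name] = value
    elif isinstance(store[name], list):
        store[name].append(value)
    else:
        # Second occurrence of the name: keep both values in order.
        store[name] = [store[name], value]
```

Names that occur once keep their scalar value, so callers only see a list when a collision actually happened.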

Changes in Release 2019.4.4 11/30/2019

  • Added new Outlook email extracted metadata fields: CreatorName, LastModifierName, InternetReferences, LastVerbExecuted, LastVerbExecutionTime, ResponseRequested, ReplyRequested, IsReceived (MAPI property PidLidAppointmentStateFlags bit 1 flag)
  • Expanded all PidTagMessageFlags bit flags into new metadata boolean fields: Read, Unsent, Resend, Unmodified, Submitted, FromMe, FAIMessage, NotifyRead, NotifyUnread, Internet, Untrusted
  • Added better documentation for extracted Outlook email object metadata, i.e., their mapping to known MAPI property tags.

Changes in Release 2019.4.3 11/4/2019

  • HTML extraction content and performance improvements. Added more extracted content properties to Hyperlink and HtmlImage classes. Removed ExternalLinks property from HtmlDocumentContent class. All hyperlinks/links are now contained just in DocumentContent.Hyperlinks property.
  • HTML document content property HtmlDocumentContent.Hyperlinks was moved to base class DocumentContent. Hyperlink extraction has been extended to also include Office 2007+ (Word, Excel, PowerPoint), Open Document, and PDF formats.
  • 9 new WCF-related service and client examples for the SDK and Platform API.
  • Improved Domino DXL parsing, especially for “rawitemdata” fields that are parts of the original MIME email stored in Lotus Notes.
  • New file ID: DominoXmlAppointmentNotice

Changes in Release 2019.4.1 10/7/2019

  • SDK and Platform API DataContract serialization bug fixes. These bug fixes resolved issues with serializing/de-serializing SDK and Platform API classes (WCF).
  • 9 new WCF-related service and client examples for the SDK and Platform API.

Changes in Release 2019.4.0 10/2/2019

  • API changes - OpenDiscoverSDK.Interfaces was broken into 2 new namespaces: OpenDiscoverSDK.Interfaces.Content and OpenDiscoverSDK.Interfaces.Settings.
  • Performance Improvements - Quicker startup time and less memory usage while processing without performance (GB/hr) degradation.
  • Added 2 file format Ids: MachObject32 and MachObject64 (Mac OS X executables)

Changes in Release 2019.3.7 9/12/2019

  • Added IArchiveExtractor.TestItem and IArchiveExtractor.TestSolidBlockItems methods to the SDK to test for actual archive item expansion size. These methods can help detect potentially malicious zip bombs. For Platform-level DocumentTaskEngine protection, the TestArchives property must be set to true so that archives are tested for actual expansion size; this detects malicious archives whose headers have been modified to hide the true expansion size. If TestArchives is not set to true, the protection logic relies on archive header information, if it exists, and not all archive formats have expansion size header information. It is recommended to always set TestArchives to true.
  • Added ability to extract images from PDF pages. The user has the option to either: (1) extract no PDF page images, (2) only extract images from failed PDF pages (pages that failed the text length threshold criteria), or (3) extract all images from all PDF pages. The main purpose of this feature is to help those users that may want to OCR failed PDF pages - this allows skipping the step of using a print driver, Ghostscript, etc. to re-parse PDFs and generate images of the failed PDF pages to use for OCR.
  • Added the following 25 new file format Ids: ArchiveLZ4, ArchiveLZFSE, ArchiveZstandard, ArchiveLZIP, LibreOfficeImpress6, LibreOfficeImpressTemplate6, LibreOfficeCalc6, LibreOfficeCalcTemplate6, LibreOfficeDraw6, LibreOfficeDrawTemplate6, LibreOfficeWriter6, LibreOfficeWriterTemplate6, Word2007Corrupted, PowerPoint2007Corrupted, Excel2007BinaryCorrupted, Visio2013Corrupted, VisoCompoundFileCorrupted, MSProjectCompoundFileCorrupted, MSPublisherCompoundFileCorrupted, CorruptedCompoundFile, MSProjectCompoundFileCorrupted, MinitabGraphCompoundFile, AutodeskExchange, Navisworks, ShockwaveFlashObject
  • Task#19: Fixed file format identification bugs related to truncated (corrupted) ZIP container formats. These bug fixes relate to the following new file format Ids: Word2007Corrupted, PowerPoint2007Corrupted, Excel2007Corrupted, Excel2007BinaryCorrupted, Visio2013Corrupted. There are also new file Ids for Office 97-2003 corrupted compound file formats.
  • Added fallback content extractors for the above truncated (corrupted) Office 2007+ formats. A corrupted Office 2007+ document can also be one whose internal document relationships point to internal parts that do not exist (MS Office applications often won't even open these types of documents). However, in many cases text, metadata, and embedded objects can still be extracted by the new fallback extractor. When fallback extraction is used, it is reflected in the Document.TextSourceType property.
  • Bug #20: Fixed Excel 97-2003 embedded media (extraction option) not being extracted.
  • Added Microsoft Photo Editor 3 legacy format metafile extraction. This format shows up a lot as embedded items in Office 97-2003 formats and even in some Office 2007 formats. The Photo Editor file format specification is not published. If a Photo Editor 3 format has a metafile, it will be extracted as an embedded object and the user can view the image presentation.
  • Unknown compound file formats (ID of UnknownCompoundFile) will have their metafile presentation formats extracted as embedded items if they exist. Metafiles usually represent how the embedded object looks in the parent document, and metafiles can often have their text extracted.
  • HTML parsing performance improvement - ~30% increase in parsing performance.