Identify Document File Formats
The Open Discover® SDK file format identifier API (see DocumentIdentifier) identifies over 1,500 file formats using internal signatures specific to the identified formats.
The overloaded DocumentIdentifier.Identify method identifies a document's file format. It returns an IdResult object that contains information on the identified file format (see Id), the known file extensions of the format, a description of the format, the format classification (see IdClassification), the quality (confidence) of the identification (see IdMatchType), and more.
The following unit test example illustrates how to use the DocumentIdentifier.Identify method and showcases most of the properties of the returned IdResult object:
var filePath = @"C:\WordProcessing\Word2003.doc";
using (var stream = File.OpenRead(filePath))
{
    var idResult = DocumentIdentifier.Identify(stream, filePath);
    Assert.IsTrue(idResult.ID == Id.Word2003);
    Assert.IsTrue(idResult.Classification == IdClassification.WordProcessing);
    Assert.IsTrue(idResult.MatchType == IdMatchType.SignatureAndExtension);
    Assert.IsTrue(idResult.IsEncrypted == false);
    Assert.IsTrue(idResult.MediaType == "application/msword");
    Assert.IsTrue(idResult.Description != null);
    Assert.IsTrue(idResult.PrimaryExtension != null);
    Assert.IsTrue(idResult.Extensions != null);
}
In the above example, DocumentIdentifier.Identify returns an IdResult object, which contains useful information about the identified file format. If the document format cannot be identified, the IdResult.ID property is set to Id.Unknown.
For an example C# application that uses the DocumentIdentifier class, see our GitHub repository: Open Discover® SDK DocumentIdentifier Example
Open Discover SDK file format identification does not rely on file extensions, except to judge the quality of an identification and to help identify a small number of Ids whose signatures are not unique enough. One such case is encrypted Microsoft Office 2007 (and newer) Word, PowerPoint, and Excel documents. In their password-encrypted state, these documents all share the same internal file format: an OLE2 compound file with an encrypted OLE stream that contains the original document. Without the file extension (.docx, .pptx, .xlsx), it is impossible to know which specific Office format the document is until it has been decrypted. For this reason, an encrypted Office 2007 or newer document whose extension has been removed is identified as the generic Id.MicrosoftOfficeEncrypted. It is therefore recommended to always populate the DocumentIdentifier.Identify 'filePath' argument with the file's full path, or with the file name including its extension, if known.
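To illustrate why a container-level signature cannot name the specific Office format, here is a minimal sketch, independent of the SDK, that checks only the 8-byte OLE2 compound file header (D0 CF 11 E0 A1 B1 1A E1). The names Ole2SignatureSketch and StartsWithOle2Signature are hypothetical and are not part of Open Discover SDK, and this is not how the SDK itself identifies files. A password-encrypted .docx, .xlsx, or .pptx would each pass this check, as would a legacy .doc, so the check alone says nothing about which format is inside.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not part of Open Discover SDK.
public static class Ole2SignatureSketch
{
    // The 8-byte header signature that begins every OLE2 compound file.
    private static readonly byte[] Ole2Magic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns true if the stream, read from its current position, begins
    // with the OLE2 signature. The position is restored when the stream
    // supports seeking.
    public static bool StartsWithOle2Signature(Stream stream)
    {
        long start = stream.CanSeek ? stream.Position : -1;
        var header = new byte[Ole2Magic.Length];

        // Stream.Read may return fewer bytes than requested, so loop.
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // end of stream
            read += n;
        }

        if (stream.CanSeek) stream.Position = start;
        if (read < header.Length) return false;

        for (int i = 0; i < header.Length; i++)
        {
            if (header[i] != Ole2Magic[i]) return false;
        }
        return true;
    }
}
```

An unencrypted .docx is a ZIP package and starts with the bytes 50 4B ("PK"), so it fails this check, while its password-encrypted counterpart passes it. The extension is then the only remaining hint about the original format.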
The .NET assemblies that make up Open Discover SDK are x64 release builds (not AnyCPU) due to x64 dependencies. Therefore, applications that reference and use the SDK assemblies MUST also be x64 builds.
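As a defensive measure, an application could verify at startup that it is actually running as an x64 process before touching the SDK. The following is a minimal sketch using the standard RuntimeInformation.ProcessArchitecture property, which assumes .NET Framework 4.7.1 or later, or .NET Core / .NET 5+. The PlatformGuard class and its methods are hypothetical names and are not part of Open Discover SDK.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical startup guard for illustration only; not part of Open Discover SDK.
public static class PlatformGuard
{
    // True when the current process is running as x64.
    public static bool IsX64Process()
    {
        return RuntimeInformation.ProcessArchitecture == Architecture.X64;
    }

    // Throws early with a clear message instead of failing later
    // when an x64-only assembly is first loaded.
    public static void EnsureX64Process()
    {
        if (!IsX64Process())
        {
            throw new PlatformNotSupportedException(
                "This application must run as an x64 process; current architecture: "
                + RuntimeInformation.ProcessArchitecture);
        }
    }
}
```

Calling PlatformGuard.EnsureX64Process() at the top of Main is one way to use it. The build itself is typically pinned with the MSBuild PlatformTarget property set to x64 in the project file, rather than AnyCPU.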