How to Distinguish Xlsx and Docx Files from Zip Archives

How to distinguish xlsx and docx files from zip archives?

I used the MimeMagic gem and added custom magic (as the gem calls it) to identify the xlsx, docx, and pptx file formats. This approach does not rely on the file extension.

Here is the list of magic entries I added:

require 'mimemagic'

[
  # '_rels/.rels' is the first archive member (its name starts at offset 30),
  # followed somewhere in the first 5000 bytes by the format's directory
  ['application/vnd.openxmlformats-officedocument.wordprocessingml.document.custom', [[0, "PK\x03\x04", [[30, '_rels/.rels', [[0..5000, 'word/']]]]]]],
  ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.custom', [[0, "PK\x03\x04", [[30, '_rels/.rels', [[0..5000, 'xl/']]]]]]],
  ['application/vnd.openxmlformats-officedocument.presentationml.presentation.custom', [[0, "PK\x03\x04", [[30, '_rels/.rels', [[0..5000, 'ppt/']]]]]]],
  # the format's directory is itself the first archive member
  ['application/vnd.openxmlformats-officedocument.wordprocessingml.document.custom', [[0, "PK\x03\x04", [[30, 'word/']]]]],
  ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.custom', [[0, "PK\x03\x04", [[30, 'xl/']]]]],
  ['application/vnd.openxmlformats-officedocument.presentationml.presentation.custom', [[0, "PK\x03\x04", [[30, 'ppt/']]]]]
].each do |magic|
  MimeMagic.add(magic[0], magic: magic[1])
end
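
For comparison, here is a rough Java sketch of the same idea using only java.util.zip. It is a different technique from the byte-offset magic above: instead of matching bytes at fixed offsets, it walks the archive entries and looks for the word/, xl/, or ppt/ directory. The class name OoxmlSniffer and the helper zipWith are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OoxmlSniffer {

    // Returns "docx", "xlsx", "pptx", "zip" (a ZIP without any of the known
    // directories), or "unknown" (no ZIP entry could be read at all).
    public static String sniff(InputStream in) throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        boolean sawEntry = false;
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            sawEntry = true;
            String name = entry.getName();
            if (name.startsWith("word/")) return "docx";
            if (name.startsWith("xl/")) return "xlsx";
            if (name.startsWith("ppt/")) return "pptx";
        }
        return sawEntry ? "zip" : "unknown";
    }

    // Hypothetical demo helper: builds a small in-memory ZIP with the given entry names.
    public static byte[] zipWith(String... names) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            for (String name : names) {
                out.putNextEntry(new ZipEntry(name));
                out.write("x".getBytes(StandardCharsets.US_ASCII));
                out.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] sheet = zipWith("[Content_Types].xml", "_rels/.rels", "xl/workbook.xml");
        System.out.println(sniff(new ByteArrayInputStream(sheet))); // prints "xlsx"
    }
}
```

Like the magic entries, this only looks at entry names, so an ordinary ZIP that happens to contain a top-level word/ folder would also be reported as docx.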

Correct way to distinguish .xls from .doc file?

Both .doc and .xls documents are stored in the OLE2 storage format. org.apache.poi.poifs.filesystem.FileMagic detects only the storage format, so on its own it is not sufficient to distinguish .doc from .xls files.

Also, there does not appear to be any direct API in the POI library to determine the document type (spreadsheet or word-processing document) for a given input stream or file.
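
To illustrate why a storage-level check cannot tell the two apart, here is a minimal sketch that does not use POI at all. It only compares the leading bytes against the OLE2 header and the ZIP local-file header. The class name StorageFormatSniffer and its enum are invented for this example; this is not how FileMagic is implemented.

```java
import java.util.Arrays;

public class StorageFormatSniffer {

    public enum Storage { OLE2, ZIP, UNKNOWN }

    // First eight bytes of every OLE2 compound file (.doc, .xls, .ppt, ...)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature (.zip, .docx, .xlsx, ...)
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Classifies the container format from the first bytes of a file.
    public static Storage detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Storage.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Storage.ZIP;
        return Storage.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A .doc and a .xls both come back as OLE2 here, so the actual document type has to be decided from what is stored inside the container.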

The example below may help determine whether a given stream is a valid .xls (or .xlsx) file, with the caveat that it reads the given input stream and closes it.

// Reads (slurps) the content of the given input and closes it
public static boolean isExcelFile(InputStream in) throws IOException {
    try {
        // WorkbookFactory consumes the input stream while parsing
        Workbook workbook = org.apache.poi.ss.usermodel.WorkbookFactory.create(in);
        workbook.close();
        return true;
    } catch (java.lang.IllegalArgumentException | org.apache.poi.openxml4j.exceptions.InvalidFormatException e) {
        return false;
    }
}

You may find more information on the Excel file format at this link.

Update
Solution based on Apache Tika as suggested by gagravarr:

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;

public class TikaBasedFileTypeDetector {
    private Tika tika;
    private TemporaryResources temporaryResources;

    public void init() {
        this.tika = new Tika();
        this.temporaryResources = new TemporaryResources();
    }

    // clean up all the temporary resources
    public void destroy() throws IOException {
        temporaryResources.close();
    }

    // return the MIME type of the content
    public String detectType(InputStream in) throws IOException {
        TikaInputStream tikaInputStream = TikaInputStream.get(in, temporaryResources);

        return tika.detect(tikaInputStream);
    }

    public boolean isExcelFile(InputStream in) throws IOException {
        // see https://stackoverflow.com/a/4212908/1700467 for information on MIME types
        String type = detectType(in);
        return type.startsWith("application/vnd.ms-excel") // legacy Microsoft Excel (.xls)
            || type.startsWith("application/vnd.openxmlformats-officedocument.spreadsheetml"); // Office Open XML (.xlsx)
    }
}

See this answer on mime types.
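
Since the question was about telling .xls from .doc, the same prefix check can be extended to Word and PowerPoint. Below is a small sketch of that mapping as plain string logic, with no Tika dependency. The class OfficeMimeTypes and its Family enum are hypothetical, and it assumes the detector hands back the standard registered MIME types for these formats.

```java
public class OfficeMimeTypes {

    public enum Family { WORD, EXCEL, POWERPOINT, OTHER }

    private static final String OOXML = "application/vnd.openxmlformats-officedocument.";

    // Maps a detected MIME type to an Office document family.
    public static Family classify(String mimeType) {
        if (mimeType == null) {
            return Family.OTHER;
        }
        if (mimeType.startsWith("application/msword")
                || mimeType.startsWith(OOXML + "wordprocessingml")) {
            return Family.WORD;
        }
        if (mimeType.startsWith("application/vnd.ms-excel")
                || mimeType.startsWith(OOXML + "spreadsheetml")) {
            return Family.EXCEL;
        }
        if (mimeType.startsWith("application/vnd.ms-powerpoint")
                || mimeType.startsWith(OOXML + "presentationml")) {
            return Family.POWERPOINT;
        }
        return Family.OTHER;
    }
}
```

With the detector above this would be called as OfficeMimeTypes.classify(detectType(in)).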

Reading a file signature and telling the difference between a zip file and a docx file

I put the file signatures into a database table, together with each signature's length and the file extension it belongs to. If the file has no extension, it isn't uploaded. If the extension does not match the signature, the routine rejects the file. Here is the code in the routine that pulls the signatures and does the comparison:

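using var fileStream = file.OpenReadStream();

// Signatures registered for the uploaded file's extension
var signature = _context.FileSignatures
    .Select(f => new { f.FileSignature, f.AllowedFileType.FileExtension, f.SignatureLength })
    .Where(x => x.FileExtension == fileType);

// Read as many bytes as the longest candidate signature needs
int maxLength = signature.Max(x => x.SignatureLength);
byte[] bytes = new byte[maxLength];
fileStream.Read(bytes, 0, maxLength);

// BitConverter.ToString yields dash-separated hex, e.g. "25-50-44-46"
string hexData = BitConverter.ToString(bytes);

// FirstAsync throws if no stored signature equals the bytes read
var foundFile = await signature.FirstAsync(x => x.FileSignature == hexData);

return foundFile.FileExtension;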

File signatures are stored in the table like this:

File Extension    FileSignature    SignatureLength
.PDF              25-50-44-46      4

This way I can make sure to read the maximum number of bytes needed for the signature and get the extension. If I want to support more file types, I just add them to the database.
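
The comparison itself can be sketched without a database. The Java version below uses an in-memory map as a stand-in for the signature table and keeps the same dash-separated hex format. The class SignatureCheck is invented for this illustration, and the .DOCX and .ZIP rows are assumed additions (both use the ZIP local-header bytes 50-4B-03-04); only the .PDF row comes from the table above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class SignatureCheck {

    // Hypothetical stand-in for the signature table: extension -> allowed signatures
    private static final Map<String, List<String>> SIGNATURES = Map.of(
        ".PDF", List.of("25-50-44-46"),
        ".DOCX", List.of("50-4B-03-04"),
        ".ZIP", List.of("50-4B-03-04"));

    // Formats the first `length` bytes as dash-separated upper-case hex.
    public static String toHex(byte[] bytes, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length && i < bytes.length; i++) {
            if (i > 0) sb.append('-');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // True if the start of the file matches a signature stored for the extension.
    public static boolean matches(String extension, byte[] fileStart) {
        List<String> candidates = SIGNATURES.get(extension.toUpperCase(Locale.ROOT));
        if (candidates == null) {
            return false; // unknown extension: reject
        }
        for (String signature : candidates) {
            int length = (signature.length() + 1) / 3; // "25-50-44-46" -> 4 bytes
            if (fileStart.length >= length && toHex(fileStart, length).equals(signature)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that in this sketch .docx and .zip share the same signature, so a signature check alone cannot tell them apart. It only confirms that the content is consistent with the claimed extension.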


