How to distinguish xlsx and docx files from zip archives?
I used the MimeMagic gem and added custom magic (as the gem calls it) to identify the xlsx, docx, and pptx file formats. This approach does not rely on the file extension.
Here is the list of magic entries that I added:
require 'mimemagic'

# Each entry is [mime_type, rules]; a rule is [offset_or_range, bytes, nested_rules].
[
  ['application/vnd.openxmlformats-officedocument.wordprocessingml.document.custom', [[0, "PK\x03\x04", [[30, '_rels/.rels', [[0..5000, 'word/']]]]]]],
  ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.custom', [[0, "PK\x03\x04", [[30, '_rels/.rels', [[0..5000, 'xl/']]]]]]],
  ['application/vnd.openxmlformats-officedocument.presentationml.presentation.custom', [[0, "PK\x03\x04", [[30, '_rels/.rels', [[0..5000, 'ppt/']]]]]]],
  ['application/vnd.openxmlformats-officedocument.wordprocessingml.document.custom', [[0, "PK\x03\x04", [[30, 'word/']]]]],
  ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.custom', [[0, "PK\x03\x04", [[30, 'xl/']]]]],
  ['application/vnd.openxmlformats-officedocument.presentationml.presentation.custom', [[0, "PK\x03\x04", [[30, 'ppt/']]]]]
].each do |magic|
  MimeMagic.add(magic[0], magic: magic[1])
end
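To show why the rules above look at offset 30: in a ZIP local file header, the name of the first entry starts at byte 30, and its length is stored as a little-endian 16-bit value at byte 26. The following is a minimal sketch in plain Ruby that approximates those rules without the gem. The helper names `first_zip_entry_name` and `office_kind` are made up for illustration, and the 5000-byte search window is an assumption about how the nested `0..5000` rule behaves.

```ruby
# Sketch only: approximates the custom magic rules with plain Ruby (no gems).
ZIP_LOCAL_HEADER = "PK\x03\x04".b
MARKERS = { 'word/' => :docx, 'xl/' => :xlsx, 'ppt/' => :pptx }.freeze

# Hypothetical helper: name of the first ZIP entry, or nil if `bytes`
# does not start with a complete ZIP local file header.
def first_zip_entry_name(bytes)
  return nil unless bytes.start_with?(ZIP_LOCAL_HEADER) && bytes.bytesize >= 30

  name_length = bytes.byteslice(26, 2).unpack1('v') # uint16, little-endian
  name = bytes.byteslice(30, name_length)
  name.bytesize == name_length ? name : nil
end

# Hypothetical helper: returns :docx, :xlsx, :pptx, :zip, or nil for non-ZIP input.
def office_kind(data)
  bytes = data.b # work on a binary copy of the input
  name = first_zip_entry_name(bytes)
  return nil if name.nil?

  head = bytes.byteslice(0, 5000)
  MARKERS.each do |dir, kind|
    # Rule 1: the first entry itself lives in the marker directory.
    return kind if name.start_with?(dir)
    # Rule 2: the first entry is _rels/.rels and the marker appears early on.
    return kind if name == '_rels/.rels' && head.include?(dir)
  end
  :zip
end
```

For a file on disk, something like `office_kind(File.binread(path, 5000))` would feed it the same window of bytes. Files whose first entry is neither `_rels/.rels` nor inside one of the marker directories fall through to `:zip`, just as they would miss the custom magic.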
Correct way to distinguish .xls from .doc file?
Both .doc and .xls documents are stored in the OLE2 storage format. org.apache.poi.poifs.filesystem.FileMagic
detects only the file storage format, so on its own it is not sufficient to distinguish .doc from .xls files.
It also does not appear that the POI library offers any direct API to determine the document type (Excel or Word) of a given input stream or file.
The example below may help determine whether a given stream is a valid .xls (or .xlsx) file, with the caveat that it reads the given input stream and closes it.
// Reads (slurps) the content of the given input stream and closes it.
public static boolean isExcelFile(InputStream in) throws IOException {
    try {
        // WorkbookFactory consumes the input stream while parsing it.
        Workbook workbook = org.apache.poi.ss.usermodel.WorkbookFactory.create(in);
        workbook.close();
        return true;
    } catch (java.lang.IllegalArgumentException | org.apache.poi.openxml4j.exceptions.InvalidFormatException e) {
        return false;
    }
}
You may find more information on the Excel file format at this link.
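To illustrate why a storage-format check alone cannot separate .doc from .xls, here is a minimal Ruby sketch of such a check. It only compares the leading bytes against the OLE2 compound-file signature and the ZIP signature. The helper name `storage_format` and its return symbols are made up for this example and are not part of POI.

```ruby
# Sketch only: detects the container format, not the document type.
OLE2_SIGNATURE = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b # shared by .doc, .xls, .ppt
ZIP_SIGNATURE  = "PK\x03\x04".b                        # shared by .docx, .xlsx, .pptx

# Hypothetical helper: returns :ole2, :zip, or :unknown.
def storage_format(data)
  bytes = data.b # compare on a binary copy of the input
  return :ole2 if bytes.start_with?(OLE2_SIGNATURE)
  return :zip if bytes.start_with?(ZIP_SIGNATURE)

  :unknown
end
```

A .doc and an .xls file would both come back as `:ole2` here, which is the limitation described above: telling them apart requires looking inside the container, as the POI and Tika approaches do.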
Update
Solution based on Apache Tika as suggested by gagravarr:
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;

public class TikaBasedFileTypeDetector {
    private Tika tika;
    private TemporaryResources temporaryResources;

    public void init() {
        this.tika = new Tika();
        this.temporaryResources = new TemporaryResources();
    }

    // Cleans up all the temporary resources.
    public void destroy() throws IOException {
        temporaryResources.close();
    }

    // Returns the MIME type of the content.
    public String detectType(InputStream in) throws IOException {
        TikaInputStream tikaInputStream = TikaInputStream.get(in, temporaryResources);
        return tika.detect(tikaInputStream);
    }

    public boolean isExcelFile(InputStream in) throws IOException {
        // See https://stackoverflow.com/a/4212908/1700467 for information on MIME types.
        String type = detectType(in);
        return type.startsWith("application/vnd.ms-excel") // Microsoft Excel formats
            || type.startsWith("application/vnd.openxmlformats-officedocument.spreadsheetml"); // Office Open XML spreadsheet formats
    }
}
See this answer on MIME types.
Reading a file signature and telling the difference between a zip file and a docx file
I put the file signatures into a database table, along with the length of each signature and the file extension it belongs to. If a file doesn't have an extension, it isn't uploaded. If the file's extension does not match its signature, the routine rejects the file. Here is the code in the routine that pulls the signatures and does the comparison:
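using var fileStream = file.OpenReadStream();
// Load the signatures registered for the claimed file type.
var signatures = await _context.FileSignatures
    .Select(f => new { f.FileSignature, f.AllowedFileType.FileExtension, f.SignatureLength })
    .Where(x => x.FileExtension == fileType)
    .ToListAsync();
// Read enough bytes to cover the longest of those signatures.
int maxLength = signatures.Max(x => x.SignatureLength);
byte[] bytes = new byte[maxLength];
int bytesRead = fileStream.Read(bytes, 0, maxLength);
// Compare only the first SignatureLength bytes against each stored signature.
var foundFile = signatures.First(x => x.SignatureLength <= bytesRead
    && x.FileSignature == BitConverter.ToString(bytes, 0, x.SignatureLength));
return foundFile.FileExtension;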
File signatures are stored in the table like this:
FileExtension   FileSignature   SignatureLength
.PDF            25-50-44-46     4
This way I can make sure to read the maximum number of bytes needed for the signature and get the extension. If I want to support more file types, I just add them to the database.
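The same lookup can be sketched in Ruby with an in-memory stand-in for the signature table. This is only an illustration of the idea: the `Signature` struct, the `SIGNATURES` constant, and the helpers `to_hex` and `verified_extension` are hypothetical names, and the only row is the .PDF example from the table above.

```ruby
# Sketch only: an in-memory stand-in for the signature table.
Signature = Struct.new(:extension, :hex, :length)

SIGNATURES = [
  Signature.new('.PDF', '25-50-44-46', 4)
].freeze

# Hypothetical helper: formats bytes as dash-separated uppercase hex,
# the same shape as the stored signatures (for example "25-50-44-46").
def to_hex(bytes)
  bytes.unpack('C*').map { |b| format('%02X', b) }.join('-')
end

# Hypothetical helper: returns the stored extension when the data starts
# with a signature registered for the claimed extension, otherwise nil.
def verified_extension(data, claimed_extension, table = SIGNATURES)
  candidates = table.select { |s| s.extension.casecmp?(claimed_extension) }
  return nil if candidates.empty?

  # Read only as many bytes as the longest candidate signature needs.
  head = data.b.byteslice(0, candidates.map(&:length).max)
  match = candidates.find { |s| to_hex(head.byteslice(0, s.length)) == s.hex }
  match&.extension
end
```

A `nil` result covers both rejection cases described above: an extension with no registered signature, and content whose leading bytes do not match the claimed extension.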