DOCX File Type in PHP finfo_file Is application/zip

DOCX File type in PHP finfo_file is application/zip

As far as I know, the vendor-specific file types (vnd.) are not standardized by any RFC and are therefore not covered by finfo_file(). A .docx file is a zipped XML format, which is why finfo_file() returns application/zip (which is completely right). You could unzip the file and test the MIME type of the result, but that only yields XML (which is completely correct too) plus the other files the document uses. To distinguish between different XML formats, finfo_file() would have to analyze the content and know what each format looks like, which simply goes too far.
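The "unzip it and look inside" idea can be sketched with PHP's ZipArchive class (this assumes the zip extension is installed). The function name guessOoxmlMimeType() and the marker list are hypothetical: the entry names are the ones Microsoft Office conventionally writes, and files from other producers may name their main part differently, so a miss only means "unknown".

```php
<?php
// Hypothetical helper, not part of the original answers. Requires ext-zip.
function guessOoxmlMimeType(string $path): ?string
{
    // Conventional main-part names (an assumption, see the note above).
    $markers = [
        'word/document.xml'    => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
        'xl/workbook.xml'      => 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        'ppt/presentation.xml' => 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
    ];

    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }

    $result = null;
    // An OOXML package carries a [Content_Types].xml entry at its root.
    if ($zip->locateName('[Content_Types].xml') !== false) {
        foreach ($markers as $entry => $mime) {
            if ($zip->locateName($entry) !== false) {
                $result = $mime;
                break;
            }
        }
    }
    $zip->close();

    return $result;
}
```

A null result covers both "not a ZIP file" and "a ZIP file that does not look like a Word, Excel or PowerPoint package", so a plain .zip upload is not mistaken for a document.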

Is it right that PHP's finfo returns application/zip MimeType for a .docx?

A Microsoft Office Open XML Word document consists of a bunch of XML and other files stored in a ZIP file (unzip it and see). So yes, this is correct.

Correct way to detect mime type in php

Based on this I've ported it to PHP:

function getMicrosoftOfficeMimeInfo($file) {
    $fileInfo = array(
        'word/' => array(
            'type' => 'Microsoft Word 2007+',
            'mime' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
            'extension' => 'docx'
        ),
        'ppt/' => array(
            'type' => 'Microsoft PowerPoint 2007+',
            'mime' => 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
            'extension' => 'pptx'
        ),
        'xl/' => array(
            'type' => 'Microsoft Excel 2007+',
            'mime' => 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
            'extension' => 'xlsx'
        )
    );

    $pkEscapeSequence = "PK\x03\x04";

    $file = new BinaryFile($file);
    if ($file->bytesAre($pkEscapeSequence, 0x00)) {
        if ($file->bytesAre('[Content_Types].xml', 0x1E)) {
            if ($file->search($pkEscapeSequence, null, 2000)) {
                if ($file->search($pkEscapeSequence, null, 1000)) {
                    $offset = $file->tell() + 26;
                    foreach ($fileInfo as $searchWord => $info) {
                        $file->seek($offset);
                        if ($file->bytesAre($searchWord)) {
                            return $fileInfo[$searchWord];
                        }
                    }
                    return array(
                        'type' => 'Microsoft OOXML',
                        'mime' => null,
                        'extension' => null
                    );
                }
            }
        }
    }

    return false;
}

class BinaryFile_Exception extends Exception {}

class BinaryFile_Seek_Method {
    const ABSOLUTE = 1;
    const RELATIVE = 2;
}

class BinaryFile {
    const SEARCH_BUFFER_SIZE = 1024;

    private $handle;

    public function __construct($file) {
        // 'rb' keeps the read binary-safe on every platform.
        $this->handle = fopen($file, 'rb');
        if ($this->handle === false) {
            throw new BinaryFile_Exception('Cannot open file');
        }
    }

    public function __destruct() {
        // The handle is false when the constructor failed.
        if (is_resource($this->handle)) {
            fclose($this->handle);
        }
    }

    public function tell() {
        return ftell($this->handle);
    }

    // Returns true on success; a null offset leaves the pointer untouched.
    public function seek($offset, $seekMethod = null) {
        if ($offset === null) {
            return true;
        }
        if ($seekMethod === BinaryFile_Seek_Method::RELATIVE) {
            $offset += $this->tell();
        }
        return fseek($this->handle, $offset) === 0;
    }

    public function read($length) {
        return fread($this->handle, $length);
    }

    // Searches forward for $string and leaves the pointer just past the match.
    // Limitation: a match that straddles two read chunks is not found.
    public function search($string, $offset = null, $maxLength = null, $seekMethod = null) {
        $this->seek($offset, $seekMethod);
        // Absolute position of the chunk that is about to be read.
        $chunkStart = $this->tell();

        $bytesRead = 0;
        $bufferSize = ($maxLength !== null ? min(self::SEARCH_BUFFER_SIZE, $maxLength) : self::SEARCH_BUFFER_SIZE);

        while ($bufferSize > 0 && ($read = $this->read($bufferSize)) !== false && $read !== '') {
            $search = strpos($read, $string);

            if ($search !== false) {
                // Use the start of the current chunk, not the start of the search.
                $this->seek($chunkStart + $search + strlen($string));
                return true;
            }

            $bytesRead += strlen($read);
            $chunkStart += strlen($read);

            if ($maxLength !== null) {
                $bufferSize = min(self::SEARCH_BUFFER_SIZE, $maxLength - $bytesRead);
            }
        }
        return false;
    }

    public function getBytes($length, $offset = null, $seekMethod = null) {
        $this->seek($offset, $seekMethod);
        return $this->read($length);
    }

    public function bytesAre($string, $offset = null, $seekMethod = null) {
        return ($this->getBytes(strlen($string), $offset, $seekMethod) === $string);
    }
}

Usage:

$info = getMicrosoftOfficeMimeInfo('hi.docx');
/*
Array
(
[type] => Microsoft Word 2007+
[mime] => application/vnd.openxmlformats-officedocument.wordprocessingml.document
[extension] => docx
)
*/

$info = getMicrosoftOfficeMimeInfo('hi.xlsx');
/*
Array
(
[type] => Microsoft Excel 2007+
[mime] => application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
[extension] => xlsx
)
*/

$info = getMicrosoftOfficeMimeInfo('hi.pptx');
/*
Array
(
[type] => Microsoft PowerPoint 2007+
[mime] => application/vnd.openxmlformats-officedocument.presentationml.presentation
[extension] => pptx
)
*/

$info = getMicrosoftOfficeMimeInfo('hi.zip');
// bool(false)

Uploading .docx using mime types

If you look at the implementation of CFileValidator::validateFile() you'll notice that Yii will either use finfo_file() (since PHP 5.3.0) or mime_content_type() to find out the MIME type of your file.

  • finfo_open() will usually use the bundled magic database in PHP. But you can override this by setting a MAGIC environment variable as explained here.
  • mime_content_type() will use the magic file as specified in the mime_magic.magicfile configuration setting

So once you know which PHP version you are running, you can debug further or supply your own custom magic file.
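To see what the fileinfo extension itself reports, independent of Yii, a minimal sketch along these lines may help. The function name detectMimeType() and its $magicFile parameter are hypothetical; finfo_open() accepts the path to a magic database as its optional second argument, and without it PHP falls back to its default database.

```php
<?php
// Hypothetical helper, not part of Yii or of the original answer.
// $magicFile: optional path to a custom magic database; null uses PHP's default.
function detectMimeType(string $path, ?string $magicFile = null): ?string
{
    $finfo = $magicFile === null
        ? finfo_open(FILEINFO_MIME_TYPE)
        : finfo_open(FILEINFO_MIME_TYPE, $magicFile);
    if ($finfo === false) {
        return null; // the magic database could not be loaded
    }

    $mime = finfo_file($finfo, $path);

    // The finfo handle is released when it goes out of scope.
    return $mime === false ? null : $mime;
}
```

Calling it once with the default database and once with a custom magic file on the same upload shows whether the database is the reason a .docx comes back as application/zip.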

Different file mime type detected for same file

The same file can have different and multiple mime-types; that is totally normal.

Additionally the mime-type is only meta-information next to the file itself. Theoretically you can give any file any mime-type. That would not be very useful, but it works. It's just a concept.

The finfo library will try to obtain the mime-type of a file "magically" by looking into the file and trying to identify the format. Then it will return the mime-type according to its database.

Why is it not returning the same as while in uploading?

The mime-type within the request is given by the HTTP client. It might guess as well, but often it takes the value from whatever the underlying operating system reports for that file.

As you can see from your issue, the more common the file type is, the better the two values will match (as with the images).

However as pptx and docx files are actually zip-files, the finfo library will identify those as application/zip because the headers of those files (magic numbers) show that it is technically a zip file.
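The magic number in question is the four-byte ZIP local file header signature, "PK\x03\x04". A minimal sketch of that check, with hasZipSignature() as a hypothetical helper name:

```php
<?php
// Hypothetical helper: true when the file starts with the ZIP local file
// header signature, which is also how every .docx and .pptx begins.
function hasZipSignature(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $head = fread($handle, 4);
    fclose($handle);

    return $head === "PK\x03\x04";
}
```

A .docx, a .pptx and an ordinary .zip all pass this test, which is why a detector that stops at the header cannot tell them apart.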

Is there something wrong with my code or should I expect this?

You should not expect that the mime-type of finfo matches the request header mime-type. Those are two different things.

How do I decide which file type it is then?

That depends. You can decide to trust the HTTP header, you can decide to trust finfo, you can decide to compare the file extensions as well, or you can use a combination of all three.

Additionally you can decide to even add more. This entirely depends on what you do with the uploaded file.
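One possible policy, sketched under assumptions: keep an allow-list that maps each accepted extension to the MIME types finfo may plausibly report for it, and ignore the client-sent type because the client controls it. The constant ACCEPTED_UPLOADS, its contents and the function isAcceptedUpload() are all hypothetical and would need adjusting to your own application.

```php
<?php
// Hypothetical allow-list: extension => MIME types the detector may report.
const ACCEPTED_UPLOADS = [
    'docx' => [
        'application/zip',
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    ],
    'pptx' => [
        'application/zip',
        'application/vnd.openxmlformats-officedocument.presentationml.presentation',
    ],
    'png' => ['image/png'],
];

// $originalName: the file name sent by the client.
// $detectedMime: the type reported by finfo for the uploaded file.
function isAcceptedUpload(string $originalName, string $detectedMime): bool
{
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));

    return in_array($detectedMime, ACCEPTED_UPLOADS[$extension] ?? [], true);
}
```

With this table a .docx is accepted whether finfo reports application/zip or the full Word type, while a file named .docx that is detected as image/png is rejected.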


