Can PHP read text from a PowerPoint file?
I wanted to post my resolution to this.
Unfortunately, I was unable to get PHP to reliably read the binary data.
My solution was to write a small VB6 app that does the work by automating PowerPoint.
It's not what I was looking for, but it solves the issue for now.
That being said, the Zend option looks like it may be viable at some point, so I will watch that.
Thanks.
How to open a PowerPoint file with PHP
A .ppt file is a binary PowerPoint presentation, so copying the data out of it ranges from difficult to nearly impossible. You'd have to use a .NET language and Office interop to do this efficiently.
Phppowerpoint in getting text
Yes, it is.
PhpPresentation, formerly PHPPowerPoint, has several readers: PowerPoint2007, PowerPoint97 and ODPresentation. These readers let you extract shapes along with their content and formatting.
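As a rough sketch of how one of those readers could be used to pull plain text out of a presentation: the function name `presentationText` is made up for illustration, and the code assumes PhpPresentation was installed with Composer and that `vendor/autoload.php` has already been required. It only looks at top-level rich-text shapes, so text inside groups or tables would need extra handling.

```php
<?php
// Assumption: PhpPresentation is installed via Composer and the calling
// script has already run: require 'vendor/autoload.php';
use PhpOffice\PhpPresentation\IOFactory;
use PhpOffice\PhpPresentation\Shape\RichText;

/**
 * Hypothetical helper: returns the plain text of each slide, keyed by
 * 1-based slide number. $readerName is one of the reader names listed
 * above (PowerPoint2007, PowerPoint97, ODPresentation).
 */
function presentationText(string $path, string $readerName = 'PowerPoint2007'): array
{
    $reader = IOFactory::createReader($readerName);
    $presentation = $reader->load($path);

    $texts = [];
    foreach ($presentation->getAllSlides() as $index => $slide) {
        $parts = [];
        foreach ($slide->getShapeCollection() as $shape) {
            // Only rich-text shapes carry text runs; other shapes are skipped.
            if ($shape instanceof RichText) {
                $parts[] = $shape->getPlainText();
            }
        }
        $texts[$index + 1] = implode("\n", $parts);
    }
    return $texts;
}
```

Calling `presentationText('slides.pptx')` should then give one string per slide; pass `'PowerPoint97'` as the second argument for an old binary .ppt file.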
How to extract text from word file .doc,docx,.xlsx,.pptx php
Here is a simple class that does the job for .doc and .docx, with additional methods for .xlsx and .pptx:
PHP docx reader: Convert MS Word Docx files to text.
class DocxConversion
{
    private $filename;

    public function __construct($filePath)
    {
        $this->filename = $filePath;
    }

    // Legacy binary .doc: a crude heuristic that keeps only chunks without NUL bytes.
    private function read_doc()
    {
        // file_get_contents() is binary-safe and returns false on failure.
        $data = @file_get_contents($this->filename);
        if ($data === false) {
            return false;
        }
        $lines = explode(chr(0x0D), $data);
        $outtext = "";
        foreach ($lines as $thisline) {
            // Skip empty chunks and chunks containing NUL bytes (binary structures).
            if (strlen($thisline) == 0 || strpos($thisline, chr(0x00)) !== false) {
                continue;
            }
            $outtext .= $thisline . " ";
        }
        // Drop everything outside a small ASCII whitelist.
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        return $outtext;
    }
    // .docx: a zip archive whose main body lives in word/document.xml.
    private function read_docx()
    {
        // ZipArchive is used because the procedural zip_open() API is deprecated as of PHP 8.0.
        $zip = new ZipArchive;
        if ($zip->open($this->filename) !== true) {
            return false;
        }
        $content = $zip->getFromName("word/document.xml");
        $zip->close();
        if ($content === false) {
            // No main document part found: return empty text.
            return "";
        }
        // Turn table-cell and paragraph boundaries into whitespace before stripping tags.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        return strip_tags($content);
    }
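    /************************ Excel sheet ************************************/
    // convertToText() calls this without arguments, so default to $this->filename.
    function xlsx_to_text($input_file = null)
    {
        $input_file = $input_file ?? $this->filename;
        $xml_filename = "xl/sharedStrings.xml"; // shared string table
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            if (($xml_index = $zip_handle->locateName($xml_filename)) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() is an instance method; calling it statically fails on PHP 8.
                // LIBXML_NOENT and LIBXML_XINCLUDE are omitted so untrusted files
                // cannot trigger entity expansion or XInclude processing.
                $xml_handle = new DOMDocument();
                if ($xml_datas !== false && $xml_datas !== ""
                    && $xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                    $output_text = strip_tags($xml_handle->saveXML());
                }
            }
            $zip_handle->close();
        }
        return $output_text;
    }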
    /************************* PowerPoint files *****************************/
    // convertToText() calls this without arguments, so default to $this->filename.
    function pptx_to_text($input_file = null)
    {
        $input_file = $input_file ?? $this->filename;
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            $slide_number = 1; // slides are stored as ppt/slides/slide1.xml, slide2.xml, ...
            while (($xml_index = $zip_handle->locateName("ppt/slides/slide" . $slide_number . ".xml")) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() is an instance method; calling it statically fails on PHP 8.
                // LIBXML_NOENT and LIBXML_XINCLUDE are omitted for untrusted input.
                $xml_handle = new DOMDocument();
                if ($xml_datas !== false && $xml_datas !== ""
                    && $xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                    $output_text .= strip_tags($xml_handle->saveXML());
                }
                $slide_number++;
            }
            $zip_handle->close();
        }
        return $output_text;
    }
    public function convertToText()
    {
        if (!isset($this->filename) || !file_exists($this->filename)) {
            return "File does not exist";
        }
        // Compare the extension case-insensitively; a file without one falls through to default.
        $file_ext = strtolower(pathinfo($this->filename, PATHINFO_EXTENSION));
        switch ($file_ext) {
            case "doc":
                return $this->read_doc();
            case "docx":
                return $this->read_docx();
            case "xlsx":
                return $this->xlsx_to_text();
            case "pptx":
                return $this->pptx_to_text();
            default:
                return "Invalid File Type";
        }
    }
}
.doc files are binary blobs and can be read with fopen, while .docx files are just XML files in a zip container (source: Wikipedia, Document_file_format), so you can read them with PHP's zip extension: ZipArchive, or the older zip_open functions, which are deprecated as of PHP 8.0.
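Because the container is just a zip of XML parts, the text can also be pulled out per element rather than with strip_tags. The following is only a sketch under that assumption: `pptxSlideTexts` is a hypothetical helper, not part of the class above, and it collects the DrawingML `a:t` text runs of each slide so that runs stay separated by spaces.

```php
<?php
/**
 * Hypothetical helper: returns the text of each slide in a .pptx file,
 * keyed by 1-based slide number. Returns an empty array if the file
 * cannot be opened as a zip archive.
 */
function pptxSlideTexts(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return [];
    }

    $slides = [];
    // Slides are stored as ppt/slides/slide1.xml, slide2.xml, ...
    for ($n = 1; ($xml = $zip->getFromName("ppt/slides/slide{$n}.xml")) !== false; $n++) {
        $dom = new DOMDocument();
        if ($xml === '' || !$dom->loadXML($xml, LIBXML_NOERROR | LIBXML_NOWARNING)) {
            $slides[$n] = ''; // unreadable slide: keep the slot, leave it empty
            continue;
        }
        $xpath = new DOMXPath($dom);
        // DrawingML namespace, where the <a:t> text runs are defined.
        $xpath->registerNamespace('a', 'http://schemas.openxmlformats.org/drawingml/2006/main');

        $runs = [];
        foreach ($xpath->query('//a:t') as $node) {
            $runs[] = $node->textContent;
        }
        $slides[$n] = implode(' ', $runs);
    }

    $zip->close();
    return $slides;
}
```

Joining runs with a space is a design choice for readability; a single word split across two runs would come out with a space in the middle, so adjust the separator if that matters.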
Usage of the above class:
$docObj = new DocxConversion("test.doc");
//$docObj = new DocxConversion("test.docx");
//$docObj = new DocxConversion("test.xlsx");
//$docObj = new DocxConversion("test.pptx");
echo $docText= $docObj->convertToText();
read microsoft excel, word and powerpoint info using PHP
Yes, trying to determine the MimeType is your best bet, short of simply trying to load the files with PHPWord, PHPExcel and PHPPowerPoint directly inside a try/catch to see whether they throw an exception (Mark Baker, please correct me if they don't throw exceptions).
See my answer to
- PHP how can i check if a file is mp3 or image file?
for various ways to detect the MimeType.
You can find a number of possible MimeTypes for Office documents at
- http://filext.com/faq/office_mime_types.php and
- http://social.msdn.microsoft.com/Forums/en-US/exceldev/thread/87b9cd73-a41b-4fd0-94c7-dfe53e92947e
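One way to do the MimeType check is with PHP's fileinfo extension. This is a minimal sketch, assuming fileinfo is enabled; the function names `officeTypeFromMime` and `detectOfficeType` are made up for illustration. Depending on the libmagic database in use, an OOXML file may be reported as a generic type such as application/zip, so a null result means "unknown", not "definitely not an Office file".

```php
<?php
/**
 * Hypothetical helper: maps a MIME type string to a short Office format
 * label, or null if the type is not one of the listed Office types.
 */
function officeTypeFromMime(string $mime): ?string
{
    $known = [
        'application/msword' => 'doc',
        'application/vnd.ms-excel' => 'xls',
        'application/vnd.ms-powerpoint' => 'ppt',
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => 'docx',
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' => 'xlsx',
        'application/vnd.openxmlformats-officedocument.presentationml.presentation' => 'pptx',
    ];
    return $known[$mime] ?? null;
}

/**
 * Hypothetical helper: detects the MIME type from the file contents
 * (not the extension) and maps it with officeTypeFromMime().
 */
function detectOfficeType(string $path): ?string
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path);
    if ($mime === false) {
        return null; // file could not be inspected
    }
    return officeTypeFromMime($mime);
}
```

A caller could use the returned label to decide which library to hand the file to, and fall back to the try/catch approach when the result is null.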