Can PHP Read Text from a PowerPoint File

Can PHP read text from a PowerPoint file?

I wanted to post my resolution to this.

Unfortunately, I was unable to get PHP to reliably read the binary data.

My solution was to write a small VB6 app that does the work by automating PowerPoint.

It's not what I was looking for, but it solves the issue for now.

That being said, the Zend option looks like it may be viable at some point, so I will watch that.

Thanks.

How to open a PowerPoint file with PHP

A .ppt file is a binary PowerPoint presentation, and copying the data out of it ranges from difficult to nearly impossible. You'd have to use a .NET language and Office interop to do this efficiently.

Getting text with PHPPowerPoint

Yes, it is.

PHPPresentation, formerly PHPPowerPoint, has several readers: PowerPoint2007, PowerPoint97 and ODPresentation. These readers let you extract shapes along with their content and formatting.
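
To make the reader approach concrete, here is a minimal sketch that collects the plain text of every rich-text shape in a presentation. It assumes PHPPresentation is installed through Composer and its autoloader is already loaded; the function name extractPresentationText is made up for this illustration, and grouped shapes, tables and notes are not handled.

```php
<?php
// Assumption: Composer's autoloader has been loaded elsewhere, e.g.
// require 'vendor/autoload.php';

use PhpOffice\PhpPresentation\IOFactory;
use PhpOffice\PhpPresentation\Shape\RichText;

/**
 * Hypothetical helper: return the plain text of each slide, one array entry per slide.
 * $readerName is one of the reader names mentioned above, for example
 * 'PowerPoint2007' for .pptx or 'PowerPoint97' for .ppt.
 */
function extractPresentationText(string $path, string $readerName = 'PowerPoint2007'): array
{
    $reader = IOFactory::createReader($readerName);
    $presentation = $reader->load($path);

    $slidesText = [];
    foreach ($presentation->getAllSlides() as $slide) {
        $parts = [];
        foreach ($slide->getShapeCollection() as $shape) {
            // Only rich-text shapes carry paragraphs of text; pictures,
            // charts and other shape types are skipped in this sketch.
            if ($shape instanceof RichText) {
                $parts[] = $shape->getPlainText();
            }
        }
        $slidesText[] = implode("\n", $parts);
    }
    return $slidesText;
}
```

Calling extractPresentationText('deck.pptx') would then give one string per slide. Whether the PowerPoint97 reader copes with a given .ppt file is something to test against your own documents.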

How to extract text from .doc, .docx, .xlsx and .pptx files in PHP

Here is a simple class that does the job for .doc and .docx files, and also handles .xlsx and .pptx
(see "PHP docx reader: Convert MS Word Docx files to text").

    class DocxConversion
    {
        private $filename;

        public function __construct($filePath)
        {
            $this->filename = $filePath;
        }

        // Legacy .doc: crude heuristic that keeps the CR-delimited chunks
        // containing no NUL bytes, then strips unusual characters.
        private function read_doc()
        {
            $data = @file_get_contents($this->filename);
            if ($data === false) {
                return false;
            }
            $outtext = "";
            foreach (explode(chr(0x0D), $data) as $thisline) {
                if (strlen($thisline) == 0 || strpos($thisline, chr(0x00)) !== false) {
                    continue;
                }
                $outtext .= $thisline . " ";
            }
            return preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        }

        // .docx: the body text lives in word/document.xml inside the ZIP container.
        // ZipArchive replaces the zip_open() family, which is deprecated as of PHP 8.0.
        private function read_docx()
        {
            $zip = new ZipArchive;
            if ($zip->open($this->filename) !== true) {
                return false;
            }
            $content = $zip->getFromName("word/document.xml");
            $zip->close();
            if ($content === false) {
                return "";
            }
            // Turn table-cell and paragraph boundaries into whitespace before stripping tags.
            $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
            $content = str_replace('</w:r></w:p>', "\r\n", $content);
            return strip_tags($content);
        }

        // Load an XML part and return its text with all tags stripped.
        private function xml_to_text($xml_datas)
        {
            if ($xml_datas === false || $xml_datas === "") {
                return "";
            }
            // loadXML() is an instance method; calling it statically fails on PHP 8.
            // LIBXML_NOENT is omitted so entities in untrusted files are not expanded.
            $xml_handle = new DOMDocument();
            if (!$xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                return "";
            }
            return strip_tags($xml_handle->saveXML());
        }

        /************************ Excel sheet ************************/
        public function xlsx_to_text($input_file)
        {
            $output_text = "";
            $zip_handle = new ZipArchive;
            if (true === $zip_handle->open($input_file)) {
                // Shared cell strings are stored in xl/sharedStrings.xml;
                // numeric cell values are not included.
                $output_text = $this->xml_to_text($zip_handle->getFromName("xl/sharedStrings.xml"));
                $zip_handle->close();
            }
            return $output_text;
        }

        /************************ PowerPoint files ************************/
        public function pptx_to_text($input_file)
        {
            $output_text = "";
            $zip_handle = new ZipArchive;
            if (true === $zip_handle->open($input_file)) {
                // Walk ppt/slides/slide1.xml, slide2.xml, ... until one is missing.
                $slide_number = 1;
                while (($xml_index = $zip_handle->locateName("ppt/slides/slide" . $slide_number . ".xml")) !== false) {
                    $output_text .= $this->xml_to_text($zip_handle->getFromIndex($xml_index));
                    $slide_number++;
                }
                $zip_handle->close();
            }
            return $output_text;
        }

        public function convertToText()
        {
            if (isset($this->filename) && !file_exists($this->filename)) {
                return "File Not exists";
            }

            // PATHINFO_EXTENSION yields "" for names without an extension.
            $file_ext = strtolower(pathinfo((string) $this->filename, PATHINFO_EXTENSION));
            switch ($file_ext) {
                case "doc":
                    return $this->read_doc();
                case "docx":
                    return $this->read_docx();
                case "xlsx":
                    // The original called these without the required argument.
                    return $this->xlsx_to_text($this->filename);
                case "pptx":
                    return $this->pptx_to_text($this->filename);
                default:
                    return "Invalid File Type";
            }
        }
    }

Document file formats: .doc files are binary blobs, which can be read using fopen. .docx files, by contrast, are just XML files in a ZIP container (source: Wikipedia); you can read them with zip_open (deprecated as of PHP 8.0) or the ZipArchive class.

Usage of the above class:

    $docObj = new DocxConversion("test.doc");
    //$docObj = new DocxConversion("test.docx");
    //$docObj = new DocxConversion("test.xlsx");
    //$docObj = new DocxConversion("test.pptx");
    echo $docText = $docObj->convertToText();
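
One limitation of pptx_to_text is that strip_tags runs the text of neighbouring paragraphs together. As a variation, the sketch below walks the slide XML with DOM and emits one line per paragraph. It assumes slide text sits in a:t elements inside a:p elements of the DrawingML namespace, which is how .pptx slides normally store it; the names slideXmlToText and pptxToLines are invented for this example.

```php
<?php
// DrawingML namespace used by the a:p (paragraph) and a:t (text run) elements.
const DRAWINGML_NS = 'http://schemas.openxmlformats.org/drawingml/2006/main';

// Hypothetical helper: turn the XML of one slide into text, one line per paragraph.
function slideXmlToText(string $xml): string
{
    if ($xml === '') {
        return '';
    }
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }
    $lines = [];
    foreach ($doc->getElementsByTagNameNS(DRAWINGML_NS, 'p') as $paragraph) {
        $text = '';
        // Concatenate the text runs that make up this paragraph.
        foreach ($paragraph->getElementsByTagNameNS(DRAWINGML_NS, 't') as $run) {
            $text .= $run->textContent;
        }
        if ($text !== '') {
            $lines[] = $text;
        }
    }
    return implode("\n", $lines);
}

// Hypothetical helper: read slide1.xml, slide2.xml, ... from a .pptx file.
function pptxToLines(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    $slides = [];
    for ($n = 1; ($xml = $zip->getFromName("ppt/slides/slide{$n}.xml")) !== false; $n++) {
        $slides[] = slideXmlToText($xml);
    }
    $zip->close();
    return implode("\n", $slides);
}
```

Like the class above, this reads slides by file number, which is not guaranteed to match the order in which the slides are shown.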

Read Microsoft Excel, Word and PowerPoint info using PHP

Yes. Trying to determine the MimeType is your best bet, short of simply attempting to load the files with PHPWord, PHPExcel and PHPPowerPoint inside a try/catch to see whether they throw an exception (Mark Baker, please correct me if they don't throw exceptions).

See my answer to

  • PHP how can i check if a file is mp3 or image file?

for various ways to detect the MimeType.

You can find a number of possible MimeTypes for Office documents at

  • http://filext.com/faq/office_mime_types.php and
  • http://social.msdn.microsoft.com/Forums/en-US/exceldev/thread/87b9cd73-a41b-4fd0-94c7-dfe53e92947e
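
As a rough illustration of the MimeType approach, the sketch below uses PHP's fileinfo extension to detect a file's type and maps it to an Office family. The helper names classifyOfficeMime and detectOfficeType are made up for this example, and the table covers only the most common Word, Excel and PowerPoint types. Depending on the magic database, finfo may report a more generic type for some Office files, so treat a null result as "unknown" rather than "not an Office file".

```php
<?php
// Hypothetical helper: map a MIME type string to a coarse Office family,
// or null when the type is not in this (deliberately short) table.
function classifyOfficeMime(string $mime): ?string
{
    static $map = [
        'application/msword' => 'word',
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => 'word',
        'application/vnd.ms-excel' => 'excel',
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' => 'excel',
        'application/vnd.ms-powerpoint' => 'powerpoint',
        'application/vnd.openxmlformats-officedocument.presentationml.presentation' => 'powerpoint',
    ];
    return $map[strtolower(trim($mime))] ?? null;
}

// Hypothetical helper: detect the MIME type from the file contents, then classify it.
function detectOfficeType(string $path): ?string
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path);
    return $mime === false ? null : classifyOfficeMime($mime);
}
```

The result of detectOfficeType can then decide which library to try first, with a try/catch around the actual load as a fallback.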

