Read Word Document in PHP

Reading/Writing a MS Word file in PHP

Reading binary Word documents would involve writing a parser according to the published file format specification for the DOC format. I don't think this is a realistically feasible solution.

You could use the Microsoft Office XML formats for reading and writing Word files. These are compatible with the 2003 and 2007 versions of Word. For reading, you have to ensure that the Word documents are saved in the correct format (it's called Word 2003 XML-Document in Word 2007). For writing, you just have to follow the openly available XML schema. I've never used this format for writing Office documents from PHP, but I'm using it for reading an Excel worksheet (saved as XML-Spreadsheet 2003, naturally) and displaying its data on a web page. As the files are plain XML data, it's no problem to navigate within them and figure out how to extract the data you need.
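To give an idea of what reading such a file looks like, here is a minimal sketch for the XML-Spreadsheet 2003 case. The function name spreadsheetRows and the sample workbook are made up for illustration. The sketch assumes the usual Workbook/Worksheet/Table/Row/Cell/Data layout and ignores sparse rows (cells carrying an ss:Index attribute), so treat it as a starting point rather than a complete reader.

```php
<?php
// Namespace URI used by the XML-Spreadsheet 2003 format.
const SPREADSHEET_2003_NS = 'urn:schemas-microsoft-com:office:spreadsheet';

/**
 * Return the rows of the first worksheet as arrays of cell strings.
 * Hypothetical helper, not part of any library.
 */
function spreadsheetRows(string $xml): array
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xml, LIBXML_NONET)) {
        return [];
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('ss', SPREADSHEET_2003_NS);

    $rows = [];
    foreach ($xpath->query('//ss:Worksheet[1]/ss:Table/ss:Row') as $row) {
        $cells = [];
        foreach ($xpath->query('ss:Cell/ss:Data', $row) as $data) {
            $cells[] = $data->textContent;
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Tiny hand-written sample standing in for a file saved from Excel.
$sample = '<Workbook xmlns="' . SPREADSHEET_2003_NS . '"'
    . ' xmlns:ss="' . SPREADSHEET_2003_NS . '">'
    . '<Worksheet ss:Name="Sheet1"><Table>'
    . '<Row><Cell><Data ss:Type="String">Name</Data></Cell>'
    . '<Cell><Data ss:Type="String">Qty</Data></Cell></Row>'
    . '<Row><Cell><Data ss:Type="String">Apples</Data></Cell>'
    . '<Cell><Data ss:Type="Number">3</Data></Cell></Row>'
    . '</Table></Worksheet></Workbook>';

print_r(spreadsheetRows($sample));
```

All values come back as strings. If you need numbers or dates, you would have to look at the ss:Type attribute of each Data element yourself.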

The other option - a Word 2007 only option (if the OpenXML file formats are not installed in your Word 2003) - would be to resort to OpenXML. As databyss pointed out, the DOCX file format is just a ZIP archive containing XML files. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated, I think - it just depends on how much time you are willing to invest.
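As a rough illustration of the "ZIP archive with XML files" point, the sketch below opens a .docx with ZipArchive, reads word/document.xml and collects the text of each paragraph (w:p) from its text runs (w:t). The function name docxParagraphs and the file name example.docx are placeholders of mine. It needs the zip and dom extensions, and it ignores headers, footers, footnotes and anything else stored outside the main document part.

```php
<?php
// Main WordprocessingML namespace used inside word/document.xml.
const DOCX_MAIN_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

/**
 * Return the plain text of every paragraph in the main document part.
 * Hypothetical helper, not part of any library.
 */
function docxParagraphs(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return []; // not a readable ZIP archive
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return []; // archive has no main document part
    }

    $dom = new DOMDocument();
    if (!$dom->loadXML($xml, LIBXML_NONET)) {
        return [];
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', DOCX_MAIN_NS);

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}

// example.docx is a placeholder; an unreadable path simply yields [].
print_r(docxParagraphs('example.docx'));
```

Going through DOMXPath rather than strip_tags keeps the paragraph boundaries, which is usually what you want when displaying the text.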

Perhaps you can have a look at PHPExcel (since superseded by PhpSpreadsheet), a library that can read and write Excel 2007 files using the OpenXML standard. It should give you an idea of the work involved in reading and writing OpenXML Word documents.

Read MS word document with PHP Word

The text is stored in [text] properties, which are in turn nested inside [elements] properties. To locate them, dump the object you get from PHPWord to the browser and use the browser's find-in-page function to search for a piece of text you know is in the document.

Both properties are protected, so you will have to make them public in order to access and extract their values.

Where these properties are defined within the PHPWord library is described here: https://stackoverflow.com/a/50989007/8510094

Once you have made them public, you can strip away the outer layers of the object you received, one at a time, until you reach the object in which the [elements]->[text] properties are just one level down the 'tree'.

So, the algorithm is: 1) find these [text] properties, 2) note the path to the object holding them, 3) cut off the higher-level objects and arrays level by level, 4) end up with an object where the [elements]->[text] properties are on the 2nd level, 5) gather all the values of the [text] properties in, say, an array.
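If you would rather not edit the library files, steps 4 and 5 can also be sketched without making anything public, by reading the protected properties through a closure bound to the object. The classes StubText and StubContainer below are hypothetical stand-ins that only mimic the [elements]->[text] nesting, and readProperty and collectTexts are names I made up. With a real PHPWord object you would pass in the container you arrived at in step 4. PHPWord also has public getters such as getElements() and getText() on its element classes, and where your version provides them they are probably the cleaner route.

```php
<?php
/**
 * Read a non-public property by running a closure in the object's own scope.
 * Returns null when the property does not exist or is not set.
 */
function readProperty(object $object, string $name)
{
    $reader = function () use ($name) {
        return $this->$name ?? null;
    };
    return $reader->call($object);
}

// Hypothetical stand-ins for the objects PHPWord hands back.
class StubText
{
    protected $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }
}

class StubContainer
{
    protected $elements = [];

    public function __construct(array $elements)
    {
        $this->elements = $elements;
    }
}

/**
 * Gather the [text] values that sit one level below [elements].
 * This is a single shallow loop, not a walk over the whole object.
 */
function collectTexts(object $container): array
{
    $texts = [];
    foreach (readProperty($container, 'elements') ?? [] as $element) {
        if (!is_object($element)) {
            continue;
        }
        $text = readProperty($element, 'text');
        if (is_string($text)) {
            $texts[] = $text;
        }
    }
    return $texts;
}

$container = new StubContainer([new StubText('First line'), new StubText('Second line')]);
print_r(collectTexts($container));
```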

Don't try to use foreach loops, recursive functions, etc. on the full object to get at the text. The resulting object is enormous, and you are unlikely to have enough memory or execution time to iterate over, flatten, or reduce such a large nested structure.

Alternatively, you can modify the PHPWord library files so that the object you get when you load your Word file into PHPWord does not contain the properties and values you don't need (styles, paragraph info, etc.).

PhpSpreadsheet implements a way to read only the actual data from Excel files (stripped of formatting, style info, etc.). PHPWord also declares a $readDataOnly property, but it stops there: for some reason, the mechanism for reading only the actual textual data was never implemented.

How can I view/open a Word document in my browser using PHP or HTML

Two options: the first is to just link to it, e.g. <a href="MyWordDocument.doc">My Word Document</a>; the second is to use an iframe and point it at the document. For this to work, however, most browsers require that the server send a Content-Disposition: inline header with the document. If you cannot configure your web server to do this, you can wrap the document in a bit of PHP:

<?php
header('Content-disposition: inline');
header('Content-type: application/msword'); // the registered MIME type for .doc files
readfile('MyWordDocument.doc');
exit;

And then link to that script instead of your word document.

This isn't guaranteed to work though; the content-disposition header is just a hint, and any browser may choose to treat it as an attachment anyway.
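If the wrapper script has to serve both .doc and .docx files, the content type needs to follow the extension. Here is a small sketch of that. The helper names wordContentType and inlineHeaders are mine, and the fallback to application/octet-stream for unknown extensions is just one reasonable choice.

```php
<?php
/**
 * Pick a Content-Type for a Word file based on its extension.
 */
function wordContentType(string $filename): string
{
    $types = [
        'doc'  => 'application/msword',
        'docx' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    ];
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    return $types[$ext] ?? 'application/octet-stream';
}

/**
 * Build the header lines the wrapper script would send.
 */
function inlineHeaders(string $filename): array
{
    // Strip double quotes so the quoted filename parameter stays well-formed.
    $name = str_replace('"', '', basename($filename));
    return [
        'Content-Type: ' . wordContentType($filename),
        'Content-Disposition: inline; filename="' . $name . '"',
    ];
}

// In the wrapper script you would send them and then stream the file:
//   foreach (inlineHeaders($file) as $line) { header($line); }
//   readfile($file);
print_r(inlineHeaders('MyWordDocument.doc'));
```

Whether the browser actually renders the document inline still depends on the browser and its plugins, as noted above.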

Also, note that .doc isn't exactly portable; basically, you need Word to display it properly (Open Office and a few other Open Source applications do kind of a decent job, but they're not quite there yet), and the browser must support opening Word as a plugin.

If the .doc file format requirement isn't set in stone, PDF would be a better choice (the conversion is usually as simple as printing it on a PDF printer, say, CutePDF, from inside Word), or maybe you can even convert the document to HTML (mileage may vary though).

How to extract text from .doc, .docx, .xlsx and .pptx files in PHP

Here is a simple class that does the job for .doc/.docx files (see
"PHP docx reader: Convert MS Word Docx files to text"):

class DocxConversion
{
    private $filename;

    public function __construct($filePath)
    {
        $this->filename = $filePath;
    }

    // Crude extraction from a binary .doc file: split the raw bytes on
    // carriage returns and keep only the chunks that contain no NUL bytes.
    private function read_doc()
    {
        $size = filesize($this->filename);
        $fileHandle = fopen($this->filename, "r");
        if ($fileHandle === false || !$size) {
            return false;
        }
        $line = fread($fileHandle, $size);
        fclose($fileHandle);

        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            $pos = strpos($thisline, chr(0x00));
            if ($pos === false && strlen($thisline) > 0) {
                $outtext .= $thisline . " ";
            }
        }
        // Drop everything except a small whitelist of ASCII characters.
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        return $outtext;
    }

    // A .docx file is a ZIP archive; the body text lives in word/document.xml.
    // ZipArchive is used here because the zip_open() family is deprecated
    // as of PHP 8.0.
    private function read_docx()
    {
        $zip = new ZipArchive;
        if ($zip->open($this->filename) !== true) {
            return false;
        }
        $content = $zip->getFromName("word/document.xml");
        $zip->close();
        if ($content === false) {
            return "";
        }

        // Turn table cell boundaries into spaces and paragraph ends into line breaks.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        return strip_tags($content);
    }

    /************************excel sheet************************************/

    // Only the shared string table is read, so numeric cell values are not included.
    public function xlsx_to_text($input_file)
    {
        $xml_filename = "xl/sharedStrings.xml"; // content file name
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            if (($xml_index = $zip_handle->locateName($xml_filename)) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() must be called on an instance; the static call fails in PHP 8.
                // LIBXML_NOENT is left out so external entities are not expanded.
                $xml_handle = new DOMDocument();
                if ($xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                    $output_text = strip_tags($xml_handle->saveXML());
                }
            }
            $zip_handle->close();
        }
        return $output_text;
    }

    /*************************power point files*****************************/

    public function pptx_to_text($input_file)
    {
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            $slide_number = 1; // loop through slide files
            while (($xml_index = $zip_handle->locateName("ppt/slides/slide" . $slide_number . ".xml")) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                $xml_handle = new DOMDocument();
                if ($xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                    $output_text .= strip_tags($xml_handle->saveXML());
                }
                $slide_number++;
            }
            $zip_handle->close();
        }
        return $output_text;
    }

    public function convertToText()
    {
        if (!isset($this->filename) || !file_exists($this->filename)) {
            return "File Not exists";
        }

        // Compare the extension case-insensitively; a missing extension yields "".
        $file_ext = strtolower(pathinfo($this->filename, PATHINFO_EXTENSION));
        if ($file_ext == "doc") {
            return $this->read_doc();
        } elseif ($file_ext == "docx") {
            return $this->read_docx();
        } elseif ($file_ext == "xlsx") {
            return $this->xlsx_to_text($this->filename);
        } elseif ($file_ext == "pptx") {
            return $this->pptx_to_text($this->filename);
        }
        return "Invalid File Type";
    }
}

Document file formats: .doc files are binary blobs, and they can be read using fopen. .docx files, on the other hand, are just XML files in a ZIP container (source: Wikipedia); you can read them using zip_open or, since the zip_* functions are deprecated as of PHP 8.0, the ZipArchive class.

Usage of the above class:

$docObj = new DocxConversion("test.doc");
//$docObj = new DocxConversion("test.docx");
//$docObj = new DocxConversion("test.xlsx");
//$docObj = new DocxConversion("test.pptx");
echo $docText= $docObj->convertToText();

How to read a Word document and add annotations using PHP

So far, I only have this:

function read_docx($filename){
    $striped_content = '';
    $content = '';
    if(!$filename || !file_exists($filename)) return false;

    $zip = zip_open($filename);
    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }
    zip_close($zip);
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);

    return $striped_content;
}

But it returns only the text content. I am looking for a way to get the images and the same formatting as in Word.
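For the images part, one possible starting point: a .docx normally keeps its embedded images as separate files under word/media/ inside the ZIP, while the formatting sits in the run properties (w:rPr, e.g. w:b for bold) in word/document.xml, which is exactly what strip_tags throws away. The sketch below only lists the media entries. The function name docxMediaEntries and the file name example.docx are placeholders, and mapping each image back to its position in the text would additionally require reading the relationship parts, which is not shown here.

```php
<?php
/**
 * List the archive entries stored under word/media/ in a .docx file.
 * Hypothetical helper, not part of any library.
 */
function docxMediaEntries(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return []; // not a readable ZIP archive
    }

    $media = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if ($name !== false && strpos($name, 'word/media/') === 0) {
            $media[] = $name;
        }
    }
    $zip->close();
    return $media;
}

// example.docx is a placeholder; an unreadable path simply yields [].
print_r(docxMediaEntries('example.docx'));
```

The raw bytes of any listed entry can then be fetched with ZipArchive::getFromName() and written to disk or sent to the browser.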


