Reading/Writing a MS Word File in PHP

Reading binary Word documents would involve writing a parser according to the published file format specifications for the DOC format. I don't think this is a feasible solution.

You could use the Microsoft Office XML formats for reading and writing Word files; they are compatible with the 2003 and 2007 versions of Word. For reading, you have to ensure that the Word documents are saved in the correct format (called "Word 2003 XML Document" in Word 2007). For writing, you just have to follow the openly available XML schema. I've never used this format for writing Office documents from PHP, but I do use it to read an Excel worksheet (saved as "XML Spreadsheet 2003") and display its data on a web page. Because the files are plain XML, it's easy to navigate them and work out how to extract the data you need.
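As a rough illustration of reading such a file, here is a minimal sketch that pulls paragraph text out of a Word 2003 XML document with PHP's DOM extension. The function name `wordMlParagraphs` and the hand-written sample are made up for this example; a real file saved from Word contains far more markup, and this sketch ignores everything except the text runs.

```php
<?php
// Namespace used by Word 2003 XML (WordprocessingML 2003) documents.
const WORDML_2003_NS = 'http://schemas.microsoft.com/office/word/2003/wordml';

/**
 * Return the text of each paragraph in a Word 2003 XML document.
 * (Hypothetical helper, not part of any library.)
 *
 * @param string $xml Contents of a file saved as "Word 2003 XML Document".
 * @return string[] One entry per paragraph (w:p element).
 */
function wordMlParagraphs(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('Input is not well-formed XML');
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', WORDML_2003_NS);

    $paragraphs = [];
    foreach ($xpath->query('//w:body//w:p') as $p) {
        $text = '';
        // A paragraph's text is split across runs; each w:t holds one piece.
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}

// Tiny hand-written sample standing in for a real saved document.
$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>
XML;

$paragraphs = wordMlParagraphs($sample);
echo implode(PHP_EOL, $paragraphs), PHP_EOL;
```

For a file on disk you would pass `file_get_contents($path)` instead of the inline sample.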

The other option, which works only with Word 2007 (or with Word 2003 if support for the OpenXML file formats is installed), is to resort to OpenXML. As databyss pointed out, the DOCX file format is just a ZIP archive containing XML files. There are a lot of resources on MSDN about the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated, I think; it depends on how much time you are willing to invest.
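To make the "ZIP archive with XML files" point concrete, here is a minimal sketch that opens a .docx with PHP's `ZipArchive` and extracts paragraph text. The function name `docxParagraphs` is invented for this example, and the sketch assumes the main document part is stored at `word/document.xml`, which is where Word puts it; strictly speaking the package relationships decide that location.

```php
<?php
// Main WordprocessingML namespace used inside .docx packages.
const DOCX_MAIN_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

/**
 * Return the text of each paragraph in a .docx file.
 * (Hypothetical helper; ignores headers, footers, footnotes and formatting.)
 *
 * @return string[] One entry per paragraph (w:p element).
 */
function docxParagraphs(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    // Assumption: the body lives in word/document.xml, as in files written by Word.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found in archive');
    }

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('word/document.xml is not well-formed XML');
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', DOCX_MAIN_NS);

    $paragraphs = [];
    foreach ($xpath->query('//w:body//w:p') as $p) {
        $text = '';
        // Text is split across runs; concatenate every w:t inside the paragraph.
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}

// Usage (the file name is a placeholder):
// print_r(docxParagraphs('example.docx'));
```

This needs the zip and dom extensions, both of which ship with most PHP builds.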

Perhaps you can have a look at PHPExcel, a library that can read and write Excel 2007 files using the OpenXML standard. It should give you an idea of the work involved in reading and writing OpenXML Word documents.

PHP Read and Write in MS WORD

It's worth noting that Microsoft advises against the automation of Office documents via COM objects:

Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment.

You can, however, create .docx files without COM objects because of their XML foundations (you can use PHPDOCX for this). An added advantage of this method is that you don't need a local copy of Word installed (for .docx files), and you can also use it on a Linux server (in theory, although I'm not sure the PHPDOCX product supports that).
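To show what "XML foundations" means in practice, here is a hand-rolled sketch that writes a bare-bones .docx using only PHP's `ZipArchive`. It is not PHPDOCX and does not use its API; the function name `writeMinimalDocx` is made up for this example. It writes only the three parts a minimal package needs (content types, package relationships, main document) and supports plain paragraphs, nothing else.

```php
<?php
/**
 * Write a minimal .docx containing one plain paragraph per input string.
 * (Hypothetical helper; no styles, tables, images or metadata.)
 *
 * @param string   $path       Where to write the file.
 * @param string[] $paragraphs Paragraph texts.
 */
function writeMinimalDocx(string $path, array $paragraphs): void
{
    $body = '';
    foreach ($paragraphs as $text) {
        // Escape &, <, > and quotes so the text cannot break the XML.
        $escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $body .= '<w:p><w:r><w:t xml:space="preserve">' . $escaped . '</w:t></w:r></w:p>';
    }

    $header = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';

    // Declares the content type of each part in the package.
    $contentTypes = $header
        . '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        . '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        . '<Default Extension="xml" ContentType="application/xml"/>'
        . '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
        . '</Types>';

    // Points the package at its main document part.
    $rels = $header
        . '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        . '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
        . '</Relationships>';

    $document = $header
        . '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        . '<w:body>' . $body . '</w:body>'
        . '</w:document>';

    $zip = new ZipArchive();
    if ($zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        throw new RuntimeException("Cannot create $path");
    }
    $zip->addFromString('[Content_Types].xml', $contentTypes);
    $zip->addFromString('_rels/.rels', $rels);
    $zip->addFromString('word/document.xml', $document);
    $zip->close();
}

// Usage (the file name is a placeholder):
// writeMinimalDocx('hello.docx', ['Hello from PHP', 'No COM objects involved']);
```

Because this only needs the zip extension, it runs the same way on Linux and Windows. For anything beyond plain paragraphs, a dedicated library is the more practical route.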

Reading DOC file in php

DOC files are not plain text.

Try a library such as PHPWord (old CodePlex site).

NB: This answer has been updated multiple times as PHPWord has changed hosting and functionality.

Read MS Word document with PHPWord

Textual info is located in [text] properties, which in turn are nested in [elements] properties. Dump the object you get from PHPWord to your browser and use the browser's "find in page" function to search for a piece of text you know is in the document; that shows you where these properties sit.

These two properties are protected, so you will have to make them public, in order to access/extract them.

Where these properties are defined within the PHPWord library: https://stackoverflow.com/a/50989007/8510094
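If you would rather not edit the library files, a different technique gives the same access: bind a closure to the object so that it runs with the class's own scope. The sketch below uses made-up stand-in classes (`DemoText`, `DemoContainer`) rather than PHPWord's real ones, and the helper name `readHiddenProperty` is hypothetical; only the property names `text` and `elements` are taken from the description above.

```php
<?php
/**
 * Read a protected property without changing the class that declares it.
 * (Hypothetical helper.) Private properties declared in a parent class are
 * not reachable this way; protected ones are.
 *
 * @return mixed The property's current value.
 */
function readHiddenProperty(object $object, string $name)
{
    $reader = function () use ($name) {
        return $this->$name;
    };
    // Binding to the object and its class gives the closure the same
    // visibility as a method defined inside that class.
    return Closure::bind($reader, $object, get_class($object))();
}

// Stand-ins for the library's classes, for illustration only.
class DemoText
{
    protected $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }
}

class DemoContainer
{
    protected $elements = [];

    public function __construct(array $elements)
    {
        $this->elements = $elements;
    }
}

$container = new DemoContainer([new DemoText('alpha'), new DemoText('beta')]);

// Read the protected [elements] array, then each element's protected [text].
$texts = [];
foreach (readHiddenProperty($container, 'elements') as $element) {
    $texts[] = readHiddenProperty($element, 'text');
}
echo implode(', ', $texts), PHP_EOL;
```

The stand-in object here is tiny; the size concerns described for real documents still apply when you walk the full structure.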

Once you have made them public, you can start cutting off every layer of the object you have received and thus access the object where [elements]->[text] properties are just one layer down the 'tree'.

So, the algorithm is to 1) find these [text] properties, 2) see the path to the object holding these properties, 3) cut off higher-level objects and arrays level by level, 4) get an object where [elements]->[text] properties are just the 2nd level, 5) gather all the values of [text] properties in, say, an array.

Don't try to use foreach loops, recursive functions, etc. to access the text. The resulting object is enormous, and you won't have enough memory or execution time to iterate over, flatten, or reduce such large multidimensional associative arrays of data.

Alternatively, you can make certain changes to the PHPWord library files so that the object you get when you load your Word file into PHPWord doesn't contain unnecessary properties and values (styles, paragraph info, etc.).

In PhpSpreadsheet, they implemented a method to get only the actual data from Excel files (stripped of formatting, style info, etc.). PHPWord also declares a $readDataOnly property, but it stops there: for some reason the mechanism to read only the actual textual data was never implemented.


