How to Read a Doc or Docx File in Java

Read a Word (.docx) file in Java

Your docx contains altChunks of type docx.

They are there because whoever generated the file added them explicitly, presumably with docx4j and code along the lines of https://github.com/plutext/docx4j/blob/VERSION_11_4_7/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/AltChunkAddOfTypeDocx.java

Ordinarily you wouldn't do that.

Generally, if you want to handle such a docx using approaches like XPath, you'd first convert those altChunks into normal content. Word can do this, as can Docx4j Enterprise.

But if you control the generating application, the best approach is to revisit it and change it so that it doesn't create altChunks. At the very least, find out why it was written that way.
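
Before deciding how to process such a file, it can help to check whether it contains altChunks at all. The following is a minimal sketch that uses only the JDK, not docx4j: AltChunkCounter and countAltChunks are hypothetical names, and it assumes the main document part sits at the conventional path word/document.xml and uses the transitional WordprocessingML namespace.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: counts w:altChunk elements in a docx package. */
final class AltChunkCounter {

    // Transitional WordprocessingML namespace (assumption: not a Strict OOXML file).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the number of w:altChunk elements, or 0 if word/document.xml is missing. */
    static int countAltChunks(InputStream docx) throws IOException, XMLStreamException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main part uses the conventional name.
                if ("word/document.xml".equals(entry.getName())) {
                    return countIn(zip);
                }
            }
        }
        return 0;
    }

    // Streams through the XML and counts start tags named altChunk in the w: namespace.
    private static int countIn(InputStream xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs expected in a docx
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "altChunk".equals(reader.getLocalName())
                    && W_NS.equals(reader.getNamespaceURI())) {
                count++;
            }
        }
        reader.close();
        return count;
    }
}
```

A non-zero count means that an XPath query over word/document.xml will not see the embedded content until the altChunks have been converted.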

How to judge if the file is doc or docx in POI

As of Apache POI 3.17 you can use FileMagic. Internally it still has to look at the leading bytes of the file, of course.

Example:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

    // The stream must support mark/reset (for example a BufferedInputStream),
    // because FileMagic.valueOf peeks at the header bytes and then rewinds.
    static String read(InputStream is) throws Exception {
        FileMagic magic = FileMagic.valueOf(is); // detect once, reuse below
        System.out.println(magic);

        String text = "";
        if (magic == FileMagic.OLE2) {
            // binary OLE2 container: treat it as a Word 97-2003 file (HWPF)
            try (WordExtractor ex = new WordExtractor(is)) {
                text = ex.getText();
            }
        } else if (magic == FileMagic.OOXML) {
            // ZIP-based OOXML container: treat it as a .docx file (XWPF)
            try (XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(is))) {
                text = extractor.getText();
            }
        }
        return text; // stays empty for any other format
    }

    public static void main(String[] args) throws Exception {
        String[] files = {
            "ExampleOLE.doc",    // really a binary OLE2 Word file
            "ExampleOOXML.doc",  // an OOXML Word file named *.doc
            "ExampleOOXML.docx"  // really an OOXML Word file
        };
        for (String file : files) {
            try (InputStream is = new BufferedInputStream(new FileInputStream(file))) {
                System.out.println(read(is));
            }
        }
    }
}
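
Since a check like FileMagic works by inspecting the header, the same idea can be sketched with the JDK alone. This is a rough illustration, not POI's actual implementation: HeaderSniffer, Kind, classify, and sniff are hypothetical names. It only tells the OLE2 compound-file signature apart from the ZIP local-file-header signature, so a ZIP result means "possibly OOXML" and nothing more.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: guesses the container format from the first bytes of a stream. */
final class HeaderSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound file (legacy .doc, .xls, .ppt).
    private static final byte[] OLE2_SIG = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };

    // Local file header signature of a ZIP archive (.docx, .xlsx, .jar, ...).
    private static final byte[] ZIP_SIG = { 'P', 'K', 0x03, 0x04 };

    /** Classifies a header that has already been read; len is the number of valid bytes. */
    static Kind classify(byte[] head, int len) {
        if (startsWith(head, len, OLE2_SIG)) return Kind.OLE2;
        if (startsWith(head, len, ZIP_SIG)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    /** Peeks at up to 8 bytes and rewinds, so the caller can still parse the stream. */
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        byte[] head = new byte[8];
        in.mark(head.length);
        int len = 0;
        while (len < head.length) {
            int n = in.read(head, len, head.length - len);
            if (n < 0) break; // stream is shorter than the header
            len += n;
        }
        in.reset();
        return classify(head, len);
    }

    private static boolean startsWith(byte[] data, int len, byte[] sig) {
        if (len < sig.length) return false;
        for (int i = 0; i < sig.length; i++) {
            if (data[i] != sig[i]) return false;
        }
        return true;
    }
}
```

Under these assumptions, an OOXML file that was renamed to *.doc would still be reported as ZIP, because the file name plays no part in the check.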

How to read a doc file using POI?

You are trying to open a .docx file (XWPF) with code for .doc (HWPF) files. You can use XWPFWordExtractor for .docx files.

POI also provides an ExtractorFactory, which inspects the file and picks the matching extractor class for you. The trade-off is that you get back a generic POITextExtractor, so only getText() is available and you can no longer iterate by page.

Use it like this:

POITextExtractor extractor = ExtractorFactory.createExtractor(file);
String text = extractor.getText();

How to open a .doc or .docx file and check text format using Java

You can take a look at Apache POI. It is a powerful library for creating and editing Microsoft Office documents. If you only need to check a few properties of a document, docx4j is an alternative, though it is built primarily around the OOXML formats such as .docx.

How to read doc and docx in Java

Tika supports the Microsoft Office formats as well as many others. It gives you a common interface for all of them and hides the complexity of learning and maintaining several different libraries. Text extraction typically comes down to a single call, for example parseToString on the Tika facade class. You could also use OfficeParser and OOXMLParser directly.

How to know whether a file is .docx or .doc format from Apache POI

If it is just a matter of deciding between .doc and .docx for a collection of files that are known to be one or the other but lack a reliable extension, you can use the fact that a .docx file is a zipped collection of files. Something along the following lines might help:

boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;

where fileStream is whatever file or other input stream you wish to evaluate. Note that this check consumes the stream, so reopen the file before parsing it, and that any ZIP archive passes it, not only .docx. You could further evaluate a zipped file by looking for key .docx entries. A good starting reference is Word Document (DOCX). Likewise, if you know it is a binary file, you can test for Word's File Information Block (see Word (.doc) Binary File Format).
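
Looking for a key .docx entry can be sketched with the JDK's ZipInputStream. This is a minimal illustration under one assumption: the main document part sits at the conventional path word/document.xml (strictly speaking, the package relationships decide where it lives). DocxProbe and looksLikeDocx are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: checks whether a stream is a ZIP that looks like a Word docx. */
final class DocxProbe {

    /**
     * Returns true if the stream is a ZIP archive containing word/document.xml.
     * The stream is consumed and closed, so reopen the file before parsing it.
     */
    static boolean looksLikeDocx(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null right away when the data is not a ZIP archive.
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Unlike the one-line ZIP test, this rejects other ZIP-based files such as .xlsx workbooks or plain archives, at the cost of scanning entries until the document part is found.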


