Read a Word (.docx) file in Java
Your docx contains altChunks of type docx.
They would have been added explicitly by whoever created the file with docx4j, using code such as https://github.com/plutext/docx4j/blob/VERSION_11_4_7/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/AltChunkAddOfTypeDocx.java
Ordinarily you wouldn't do that.
Generally, if you want to handle such a docx using approaches like XPath, you'd first convert those altChunks into normal content. Word can do this, as can Docx4j Enterprise.
But if you control the generating application, the best approach would be to revisit it and change it so that it doesn't create altChunks. At the very least, find out why it was written that way.
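To check whether a given docx is affected at all, one option is to count the w:altChunk elements in its main document part. The following is only a sketch using the JDK alone: it assumes the main part is stored at word/document.xml and that the file uses the transitional WordprocessingML namespace. The names AltChunkCounter and countAltChunks are made up for illustration and are not part of docx4j.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of docx4j or POI.
final class AltChunkCounter {
    // Assumption: transitional (non-strict) WordprocessingML namespace.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the number of w:altChunk elements, or -1 if word/document.xml is missing. */
    static int countAltChunks(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the part into memory so the XML parser cannot close the zip stream.
                ByteArrayOutputStream part = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    part.write(buffer, 0, read);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(part.toByteArray()));
                return xml.getElementsByTagNameNS(W_NS, "altChunk").getLength();
            }
        }
        return -1;
    }
}
```

A result greater than zero means the altChunks still need to be converted into normal content before XPath-style processing will see the imported text.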
How to judge if the file is doc or docx in POI
With the current stable Apache POI version, 3.17, you can use FileMagic. Internally, of course, this also has to look inside the file.
Example:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

    // The stream must support mark/reset (e.g. a BufferedInputStream),
    // because FileMagic.valueOf peeks at the leading bytes and then rewinds.
    static String read(InputStream is) throws Exception {
        // Detect the format once and reuse the result.
        FileMagic magic = FileMagic.valueOf(is);
        System.out.println(magic);
        String text = "";
        if (magic == FileMagic.OLE2) {
            // Binary Word format (HWPF)
            WordExtractor ex = new WordExtractor(is);
            text = ex.getText();
            ex.close();
        } else if (magic == FileMagic.OOXML) {
            // Office Open XML format (XWPF)
            XWPFDocument doc = new XWPFDocument(is);
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            text = extractor.getText();
            extractor.close();
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // really a binary OLE2 Word file
        try (InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc"))) {
            System.out.println(read(is));
        }
        // an OOXML Word file named *.doc
        try (InputStream is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc"))) {
            System.out.println(read(is));
        }
        // really an OOXML Word file
        try (InputStream is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx"))) {
            System.out.println(read(is));
        }
    }
}
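To illustrate what "having a look into the file" means, here is a rough JDK-only sketch of the same idea: compare the leading bytes against the OLE2 compound-file signature and the ZIP local-file-header signature. This is not POI's implementation, and the names MagicSniffer, Kind and sniff are hypothetical. Note also that a ZIP signature only says the file is a ZIP archive, not that it is a Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper that mimics the idea behind FileMagic; it is not POI code.
final class MagicSniffer {
    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file signature (binary .doc, .xls, .ppt).
    private static final byte[] OLE2_SIG = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature "PK\003\004" (.docx and other OOXML files).
    private static final byte[] ZIP_SIG = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a file by its leading bytes. */
    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_SIG)) {
            return Kind.OLE2;
        }
        if (startsWith(header, ZIP_SIG)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    /** Peeks at up to 8 bytes of the stream and rewinds it afterwards. */
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8);
        byte[] buf = new byte[8];
        int n = 0;
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        in.reset();
        return sniff(Arrays.copyOf(buf, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Because the stream is rewound after the peek, it can still be handed to an extractor afterwards, which is the same reason the POI example wraps its files in a BufferedInputStream.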
How to read doc file using Poi?
You are trying to open a .docx file (XWPF) with code for .doc (HWPF) files. For .docx files, use XWPFWordExtractor instead.
There is also an ExtractorFactory, which lets POI decide which of the two applies and picks the correct class to open the file. However, you then cannot iterate by page, because only a generic getText() method is available.
Use it like this:
POITextExtractor extractor = ExtractorFactory.createExtractor(file);
String text = extractor.getText();
How to open .doc or .docx file and check text format using java
You can take a look at Apache POI. It is a powerful library for creating and editing Microsoft Office documents. But if you only need to check some parameters in a doc or docx file, you can use docx4j.
How to read doc and docx in java
Tika supports Microsoft Office formats as well as many others. It gives you a common interface for all of them and hides the complexity of learning and maintaining many different libraries. Extracting text is as easy as calling a single function. You could also use the OfficeParser and OOXMLParser directly.
How to know whether a file is .docx or .doc format from Apache POI
If it is just a matter of deciding which format each file has, in a collection of files known to be either .doc or .docx but not marked with the matching extension, you can use the fact that a .docx file is a zipped collection of files. Something along these lines might help:
boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;
where fileStream is whatever file or other input stream you wish to evaluate. You could further evaluate a zipped file by looking for key .docx entries. A good starting reference is Word Document (DOCX). Likewise, if you know it is just a binary file, you can test for Word's File Information Block (see Word (.doc) Binary File Format).
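Looking for key .docx entries could be sketched as follows, using only java.util.zip. The two entry names checked here, [Content_Types].xml and word/document.xml, are the ones a typical .docx package contains; treating them as required is an assumption of this sketch (strictly speaking, the main document part is located through the package relationships), and DocxProbe and looksLikeDocx are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: a heuristic check, not a full OPC package validation.
final class DocxProbe {

    /** Returns true if the stream is a ZIP archive holding the usual .docx entries. */
    static boolean looksLikeDocx(InputStream in) throws IOException {
        boolean hasContentTypes = false;
        boolean hasMainDocument = false;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // For input that is not a ZIP archive, getNextEntry() returns null right away.
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("[Content_Types].xml")) {
                    hasContentTypes = true;
                } else if (name.equals("word/document.xml")) {
                    hasMainDocument = true;
                }
            }
        }
        return hasContentTypes && hasMainDocument;
    }
}
```

This walks the whole archive, so it is slower than a signature check, but it tells a .docx apart from other zipped formats such as .xlsx or a plain .zip.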