Convert Word Doc to HTML Programmatically in Java

We use tm-extractors (http://mvnrepository.com/artifact/org.textmining/tm-extractors), and fall back to the commercial Aspose (http://www.aspose.com/). Both have native Java APIs.

Convert .docx to HTML using Java

This code worked for me to convert a .docx file to HTML:

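// Convert a .docx file to an HTML string.
// XWPFDocument is Apache POI (org.apache.poi.xwpf.usermodel); XHTMLConverter, XHTMLOptions
// and FileURIResolver come from the xdocreport XHTML converter (package name depends on the version).
// Also needs java.io.*; `path` is the location of the .docx file.
try (InputStream in = new FileInputStream(new File(path));
     ByteArrayOutputStream out = new ByteArrayOutputStream()) {
    XWPFDocument document = new XWPFDocument(in);

    // Image URIs in the generated HTML are resolved against this folder
    XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(new File("word/media")));

    XHTMLConverter.getInstance().convert(document, out, options);

    // toString() decodes the bytes using the platform default charset
    String html = out.toString();
    System.out.println(html);
}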

Convert Word to HTML with Apache POI

This code is now working for me!

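// Imports: java.io.*, java.nio.charset.StandardCharsets, javax.xml.parsers.DocumentBuilderFactory,
// javax.xml.transform.*, javax.xml.transform.dom.DOMSource, javax.xml.transform.stream.StreamResult,
// org.w3c.dom.Document, org.apache.poi.hwpf.HWPFDocumentCore,
// org.apache.poi.hwpf.converter.WordToHtmlConverter, org.apache.poi.hwpf.converter.WordToHtmlUtils

// Load the binary .doc file
HWPFDocumentCore wordDocument;
try (InputStream in = new FileInputStream("D:\\temp\\seo\\1.doc")) {
    wordDocument = WordToHtmlUtils.loadDoc(in);
}

// Convert it into an in-memory DOM tree that holds the HTML
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();

// Serialize the DOM as indented UTF-8 HTML
ByteArrayOutputStream out = new ByteArrayOutputStream();
Transformer serializer = TransformerFactory.newInstance().newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(new DOMSource(htmlDocument), new StreamResult(out));

// Decode with the same charset the serializer wrote, not the platform default
String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
System.out.println(result);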

Convert HTML/MXML file to Word doc programmatically in Java

I've found that by far the best (free) option to do conversions like this is to use the OpenOffice API. It has a very robust conversion facility. It's a bit of a pain to initially get working because of how abstract the API is, but once you do, it's powerful. This API wrapper helps to simplify it somewhat.

Apache POI - converting *.doc to *.html with images

Your best bet in this case is to use Apache Tika, and let it wrap Apache POI for you. Apache Tika will generate HTML for your document (or plain text, but you want the HTML for your case). Along with that, it'll put in placeholders for embedded resources, img tags for embedded images, and provide you with a way to get at the contents of the embedded resources and images.

There's a very good example of doing this included in Alfresco: HTMLRenderingEngine. You'll likely want to review the code there, then write your own to do something very similar. The code there includes a custom ContentHandler which allows editing of the img tags in order to rewrite their src attributes. You may or may not need that, depending on where you're going to write the images out to.
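
To illustrate the kind of ContentHandler described above, here is a minimal sketch that uses only the JDK's SAX classes, with no Tika or Alfresco code involved. The class name ImgSrcRewriter, the images directory, and the "embedded:" prefix on src values are all assumptions made for the example; check the HTML your own parser actually emits before relying on the prefix. XMLFilterImpl implements ContentHandler, so a filter like this can in principle be placed in front of whatever handler writes the final HTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter that points embedded image references at a directory. */
class ImgSrcRewriter extends XMLFilterImpl {
    // Assumed marker for embedded images; not taken from the original answer.
    private static final String EMBEDDED = "embedded:";
    private final String imageDir;

    ImgSrcRewriter(String imageDir) {
        this.imageDir = imageDir;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(localName)) {
            // Copy the attributes so the src value can be modified
            AttributesImpl copy = new AttributesImpl(atts);
            int i = copy.getIndex("", "src");
            if (i >= 0 && copy.getValue(i).startsWith(EMBEDDED)) {
                String name = copy.getValue(i).substring(EMBEDDED.length());
                copy.setValue(i, imageDir + "/" + name);
            }
            atts = copy;
        }
        // Forward the (possibly modified) event to the downstream handler
        super.startElement(uri, localName, qName, atts);
    }

    /** Parses well-formed XHTML, rewrites img src values, and returns the serialized result. */
    static String rewrite(String xhtml, String imageDir) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // needed so localName is populated
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // Identity transformer that serializes the SAX events it receives
        SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = stf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        ImgSrcRewriter filter = new ImgSrcRewriter(imageDir);
        filter.setParent(reader);
        filter.setContentHandler(serializer);
        filter.parse(new InputSource(new StringReader(xhtml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String in = "<p><img src=\"embedded:image1.png\" alt=\"x\"/></p>";
        System.out.println(rewrite(in, "images"));
    }
}
```

Only src values that start with the assumed prefix are touched, so ordinary links to external images pass through unchanged.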

Converting a .docx to html and I am getting unreadable text

No.

You are reading the raw content of a docx file. That content is not HTML but zipped XML, so you need something to translate the docx to HTML. The two formats are very different.
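
To see the "zipped XML" point for yourself, the sketch below opens a container with java.util.zip and reads one entry as text. The class name DocxPeek and its helper methods are made up for the example, and fakeDocx() builds a tiny stand-in archive rather than a valid Word document, so that the code runs without any input file. It also assumes the entry is UTF-8 encoded. With a real .docx you would pass a FileInputStream instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper showing that a .docx is a ZIP archive of XML parts. */
class DocxPeek {

    /** Lists the entry names of a ZIP container such as a .docx file. */
    static List<String> listEntries(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    /** Returns one entry decoded as UTF-8 (an assumption), or null if it is absent. */
    static String readEntry(InputStream docx, String entryName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().equals(entryName)) {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    return new String(buf.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    /** Builds a tiny stand-in archive for the demo; it is NOT a valid Word document. */
    static byte[] fakeDocx() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:document><w:t>Hello</w:t></w:document>"
                    .getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = fakeDocx();
        System.out.println(listEntries(new ByteArrayInputStream(docx)));
        System.out.println(readEntry(new ByteArrayInputStream(docx), "word/document.xml"));
    }
}
```

What comes out of word/document.xml is WordprocessingML markup, not HTML, which is why a converter such as the ones discussed above is still needed.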

Convert docx to html in Android

I was not able to get Apache XWPF working, but I was able to use Docx4j (sample code for Android here), which worked for my purposes. I just had to include the libraries found in that project.


