Java, Spring and Web Development tutorials
1. Introduction
Once a complex, niche task, document conversion is now commonly supported by tools, libraries, and even the built-in functionality of some programming languages.
In this tutorial, we’ll learn how to convert a Word document into an HTML page that can be rendered inside a browser. Specifically, we’ll learn two ways to convert documents programmatically using Apache POI. First, we’ll convert modern docx files. After that, we’ll look at the legacy doc format. In general, this use case is common in enterprise applications.
2. Differences Between doc and docx
Before Word 2007, Microsoft Word used the legacy doc format, which relies on a binary representation. As a consequence, interoperability and preservation of formatting were harder when working across different tools.
Starting with Word 2007, the default became the Office Open XML-based docx format. This format is structured, standardized, and usually much easier to process programmatically.
Because of that, converting Word documents requires a different approach depending on the format. To that end, we start with docx, but then also cover doc for backward compatibility.
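Since the two formats need different code paths, it can help to check which one we actually received instead of trusting the file extension. The following is a minimal sketch that uses only the JDK: WordFormatSniffer and its WordFormat enum are hypothetical names introduced for illustration. It relies on the well-known container signatures, so it can only tell us that a file is an OLE2 or a ZIP container, not that it's really a Word document:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: guesses the Word format from the leading bytes of a file. */
final class WordFormatSniffer {

    enum WordFormat { DOC, DOCX, UNKNOWN }

    // OLE2 compound file signature, the container used by legacy binary files such as doc
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature ("PK" 0x03 0x04), the container used by docx
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    static WordFormat detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return WordFormat.DOC;   // assumption: an OLE2 container is a doc file
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return WordFormat.DOCX;  // assumption: a ZIP container is a docx file
        }
        return WordFormat.UNKNOWN;
    }

    static WordFormat detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(8)); // readNBytes(int) requires Java 11+
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A caller could switch on the result to pick the docx or doc conversion path. Apache POI also ships its own detection utility, FileMagic, which may be a better fit when POI is already on the classpath.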
3. Maven Dependencies
To support both formats, we need two Apache POI modules, poi-ooxml for docx and poi-scratchpad for doc, plus the XHTML converter provided by XDocReport:
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.5.1</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>5.5.1</version>
</dependency>
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
    <version>2.1.0</version>
</dependency>
The converter can also be configured with an ImageManager so that embedded images are written to secondary storage and referenced from the generated HTML.
4. Converting docx Documents
A docx file is essentially a ZIP archive containing XML parts. Apache POI hides that complexity behind the XWPFDocument API, which gives us a much cleaner way to work with Word content.
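To see the ZIP structure for ourselves, we can list the archive entries with the JDK alone. This is only an illustrative sketch, and DocxPartLister is a hypothetical name; the conversion code in this tutorial never needs to do this, because POI reads the parts for us:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: lists the part names stored inside a docx (ZIP) stream. */
final class DocxPartLister {

    static List<String> listParts(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        // ZipInputStream walks the archive entry by entry without extracting anything to disk
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

For a typical docx file, the list includes entries such as [Content_Types].xml and word/document.xml, the latter holding the main document body.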
4.1. Using Apache POI to Convert Documents
Apache POI represents docx files with the XWPFDocument class.
First, let’s load the document from storage:
public XWPFDocument loadDocxFromPath(String path) {
    Path file = Paths.get(path);
    // Close the input stream as soon as POI has read the package
    try (InputStream in = Files.newInputStream(file)) {
        if (!Files.exists(file)) {
            throw new FileNotFoundException("File not found: " + path);
        }
        XWPFDocument document = new XWPFDocument(in);
        boolean hasParagraphs = !document.getParagraphs().isEmpty();
        boolean hasTables = !document.getTables().isEmpty();
        if (!hasParagraphs && !hasTables) {
            // Release the document before rejecting it
            document.close();
            throw new IllegalArgumentException("Document is empty: " + path);
        }
        return document;
    } catch (IOException ex) {
        throw new UncheckedIOException("Cannot load document: " + path, ex);
    }
}
In the code above, we load the docx file from a path and reject empty documents.
Next, we configure XHTMLOptions for the generated HTML. XDocReport provides ImageManager, which writes extracted images to an images subdirectory of the directory that holds the final HTML output:
private XHTMLOptions configureHtmlOptions(Path outputDir) {
    XHTMLOptions options = XHTMLOptions.create();
    options.setImageManager(new ImageManager(outputDir.toFile(), "images"));
    return options;
}
Now, we can convert the document and save the HTML file next to the input document:
public void convertDocxToHtml(String docxPath) throws IOException {
    // Use an absolute path so that getParent() is never null for a bare file name
    Path input = Paths.get(docxPath).toAbsolutePath();
    String htmlFileName = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
    Path output = input.resolveSibling(htmlFileName);
    try (XWPFDocument document = loadDocxFromPath(docxPath);
         OutputStream out = Files.newOutputStream(output)) {
        XHTMLConverter.getInstance().convert(document, out, configureHtmlOptions(output.getParent()));
    }
}
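The output file name comes from a small regular expression that strips the last extension. As a sketch of how that expression behaves, here it is isolated in a hypothetical HtmlOutputPaths helper that isn't part of the converter class:

```java
import java.nio.file.Path;

/** Hypothetical helper: derives the HTML output path that sits next to an input document. */
final class HtmlOutputPaths {

    static Path htmlSiblingOf(Path input) {
        String name = input.getFileName().toString();
        // Remove only the last ".something" suffix, then append ".html"
        String base = name.replaceFirst("\\.[^.]+$", "");
        return input.resolveSibling(base + ".html");
    }
}
```

Only the final extension is replaced, so a name like report.final.docx keeps its inner dot, and a name without any extension simply gains the .html suffix.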
Next, let’s write a test to verify the conversion:
@Test
void givenSimpleDocx_whenConverting_thenHtmlFileIsCreated() throws Exception {
    DocxToHtml converter = new DocxToHtml();
    // toURI() decodes escaped characters, such as spaces, in the resource location
    Path docx = Paths.get(this.getClass().getResource("/sample.docx").toURI());
    converter.convertDocxToHtml(docx.toString());
    Path html = docx.resolveSibling("sample.html");
    assertTrue(Files.exists(html));
    // Read the whole file at once so that no stream is left open
    String content = new String(Files.readAllBytes(html), StandardCharsets.UTF_8);
    assertTrue(content.contains("<html"));
}
As we can see, the test reads the generated file as UTF-8, which is the encoding the converter used to write it, and checks that it contains an html element. This only confirms that the conversion produced output, so there are a few other things to consider as well.
4.2. Handling Large Documents
For large documents, resource management matters. Using try-with-resources helps release streams and document data as soon as the conversion finishes.
If needed, run conversions asynchronously to prevent large files from blocking request threads.
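One way to keep conversions off the request threads is to hand them to a dedicated executor. The sketch below assumes a hypothetical Converter interface that wraps a blocking call such as convertDocxToHtml; AsyncConversion is likewise an illustrative name, and the pool size is left to the caller:

```java
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;

/** Hypothetical wrapper: runs blocking conversions on a caller-supplied executor. */
final class AsyncConversion {

    /** Hypothetical abstraction over a blocking conversion, for example convertDocxToHtml. */
    @FunctionalInterface
    interface Converter {
        void convert(Path input) throws Exception;
    }

    private final ExecutorService pool;
    private final Converter converter;

    AsyncConversion(ExecutorService pool, Converter converter) {
        this.pool = pool;
        this.converter = converter;
    }

    /** Returns immediately; the future completes with the input path once conversion is done. */
    CompletableFuture<Path> submit(Path input) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                converter.convert(input);
                return input;
            } catch (Exception e) {
                // Surface checked failures through the future instead of losing them
                throw new CompletionException(e);
            }
        }, pool);
    }
}
```

With this in place, a request handler could call submit() and return right away, while a bounded pool limits how many documents are held in memory at the same time.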
Lastly, this example doesn’t account for nested tables or complex layouts, so the generated HTML may not match the Word document visually. The conversion works best with regular paragraphs, tables, and basic formatting. For example, the sample docx file contains a graph that isn’t carried over to the output document. To keep things simple, we won’t handle it here.
Still, in production systems, it’s a good idea to add regression tests with real sample documents that reflect the layouts that the application might have to support.
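A simple form of such a regression test compares the generated HTML with a stored reference ("golden") file while ignoring insignificant whitespace. The helper below is a rough sketch under that assumption; HtmlGoldenCheck is a hypothetical name, and a real project might prefer a proper HTML parser over regular expressions:

```java
/** Hypothetical helper: whitespace-insensitive comparison of generated and reference HTML. */
final class HtmlGoldenCheck {

    static String normalize(String html) {
        return html
            .replaceAll("\\s+", " ")     // collapse whitespace runs, including line breaks
            .replaceAll(">\\s+<", "><")  // drop whitespace that only separates tags
            .trim();
    }

    static boolean matchesGolden(String generated, String golden) {
        return normalize(generated).equals(normalize(golden));
    }
}
```

A test could read both files as UTF-8 strings and assert that matchesGolden returns true, which makes layout regressions visible as soon as a library upgrade changes the output.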
5. Legacy doc Conversion
The older doc format is handled by POI’s HWPF APIs rather than XWPF. Apache POI provides WordToHtmlConverter for this use case:
public void convertDocToHtml(String docPath) throws Exception {
    Path input = Paths.get(docPath);
    String htmlFileName = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
    Path output = input.resolveSibling(htmlFileName);
    Path imagesDir = input.resolveSibling("images");
    Files.createDirectories(imagesDir);
    // The document is closed together with the streams when the block exits
    try (InputStream in = Files.newInputStream(input);
         HWPFDocumentCore document = WordToHtmlUtils.loadDoc(in);
         OutputStream out = Files.newOutputStream(output)) {
        Document htmlDocument = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument();
        WordToHtmlConverter converter = new WordToHtmlConverter(htmlDocument);
        // Write each embedded picture to disk and return the path used in the HTML
        converter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
            Path imageFile = imagesDir.resolve(suggestedName);
            try {
                Files.write(imageFile, content);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return "images/" + suggestedName;
        });
        converter.processDocument(document);
        // Serialize the DOM built by the converter as UTF-8 HTML
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(new DOMSource(converter.getDocument()), new StreamResult(out));
    }
}
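This flow is different internally, but the overall idea is the same: load the Word document, convert it, and write HTML to storage. Because WordToHtmlConverter builds a DOM document instead of writing to a stream, we serialize it ourselves with a Transformer and set the output method and encoding explicitly.
The image handling also takes a little more configuration: we supply a PicturesManager that writes each picture to the images directory and returns the relative path referenced from the HTML.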
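The serialization step can be tried in isolation with the JDK alone. The following sketch builds a tiny DOM by hand, which stands in for the document that the converter would produce, and writes it out as UTF-8 HTML; DomToHtml and sampleDocument are hypothetical names used only for this illustration:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: serializes a DOM document as HTML with an explicit encoding. */
final class DomToHtml {

    static String serialize(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Without these properties, the transformer falls back to its own defaults
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString(StandardCharsets.UTF_8); // toString(Charset) requires Java 10+
    }

    /** Builds a minimal stand-in for the DOM that a converter would populate. */
    static Document sampleDocument() throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument();
        Element html = document.createElement("html");
        Element body = document.createElement("body");
        Element paragraph = document.createElement("p");
        paragraph.setTextContent("Hello");
        body.appendChild(paragraph);
        html.appendChild(body);
        document.appendChild(html);
        return document;
    }
}
```

Decoding the bytes with the same charset that was passed to OutputKeys.ENCODING keeps the written and the read encoding in sync, which is the same concern the file-based conversion has to respect.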
6. Conclusion
In this tutorial, we learned how to convert Word documents to HTML using Apache POI.
We covered modern docx files with XWPFDocument and XHTMLConverter, then looked at legacy doc files with WordToHtmlConverter.
The post Convert Word Document to HTML Programmatically in Java first appeared on Baeldung.