Brian’s Brain

I Need a Tagline

Converting Org-mode Files to DOCX

I’ve found org-mode great for early phases of writing. Just like in the days when I used LaTeX, working in plain text keeps details of pagination or formatting from distracting me. It enforces a really clear boundary between “write your words” and “make the output look pretty.”

Two things are even better about working in org-mode than in LaTeX. First, it’s easy in org-mode to clear away all other sections and really zoom in on a part of the document. Focus, focus, focus! Second, I can embed lots of metadata / comments into the org-mode file that will not show up in the production document.

While org-mode comes with many publishing options (HTML, PDF, DocBook), converting to Microsoft Word (.docx) is not one of them. So, when I decided to use org-mode to author a longish document for work, I was faced with a challenge. How do I get the resulting content into a Word document? I looked at a few options:

  • Cut & paste.
  • Convert to HTML and open the HTML in Word.
  • Paste HTML into OneNote, and then paste to Word. (OneNote preserves HTML formatting much better.)

None of these avenues worked. Word does a bad job preserving HTML formatting. Going through OneNote first preserved the formatting but lost all of the styles. This last bit is a big deal: one of the biggest advantages of working in a lightweight markup language is that your text winds up very consistently styled at the end, making it easier to get a consistent look and feel. But that only holds if style information is preserved. If it’s lost, then good luck trying to adjust the formatting for all of your level two headings, or your block quotes, etc.

I turned to code. Because of its regular XML syntax, I figured that DocBook would be a good intermediate format for conversion, much easier to parse than org files. On SourceForge, I found some XSLT files for manipulating DocBook files, including one that tries to round-trip DocBook with Word. I used the version here.

Changing the DocBook XSLT

The XSLT targeted earlier versions of Word, and out of the box it didn’t help with my goal of getting Word output that was easy to adjust with Word 2007 themes. (In particular, the heading styles were all wrong.) But this turned out to be pretty easy to change, and I picked up some pro XSLT tips along the way.

Changes:

  • Changed the output XML schema to target Word 2007 instead of Word 2003.
  • Changed the XML output so DocBook section headings get standard Word Heading styles instead of DocBook section styles. This was the key to letting Word themes work out of the box, as the themes describe the look of heading styles.
  • When converting to Word 2007 format, you no longer need to move over a bunch of template “goo,” so I got rid of the dependency on the templatedoc in the XSLT. I also got to remove a bunch of boilerplate XML from the output (document properties and the like).
  • Hacked the table output to work with Word 2007.
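
To make the heading-style change concrete, here is a minimal sketch. It is not the actual dbk2wordml.xsl: the stylesheet, the sample input, and the variable names are all invented for illustration, and the input assumes DocBook without a namespace. It maps the title of a top-level section to a WordprocessingML paragraph that carries Word's built-in Heading1 style, which is the style that themes know how to restyle.

```powershell
# Hypothetical stylesheet: emits a paragraph styled "Heading1" for each
# top-level DocBook section title. Not taken from dbk2wordml.xsl.
$stylesheet = @'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/article">
    <w:body><xsl:apply-templates select="section/title"/></w:body>
  </xsl:template>
  <xsl:template match="section/title">
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t><xsl:value-of select="."/></w:t></w:r>
    </w:p>
  </xsl:template>
</xsl:stylesheet>
'@

# Sample DocBook input, invented for this example.
$docbook = [xml] '<article><section><title>Introduction</title><para>Hi.</para></section></article>'

# Compile the stylesheet from the in-memory string.
$xslt = New-Object System.Xml.Xsl.XslCompiledTransform
$xslt.Load([System.Xml.XmlReader]::Create((New-Object System.IO.StringReader $stylesheet)))

# Run the transform into a string so the result can be inspected.
$output = New-Object System.IO.StringWriter
$writer = [System.Xml.XmlWriter]::Create($output, $xslt.OutputSettings)
$xsltArgs = New-Object System.Xml.Xsl.XsltArgumentList
$xslt.Transform($docbook, $xsltArgs, $writer)
$writer.Close()
$result = $output.ToString()
$result
```

The real stylesheet has to handle nested sections (Heading2, Heading3, and so on) and every other DocBook element; the point here is only that the paragraph style name is what ties the output to Word's themes.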

Automate XSLT transform and DOCX creation with PowerShell

With a working XSLT transform, it’s a pretty simple matter to automate the creation of a valid DOCX file. The rough strategy is to start with a valid DOCX file that contains the styling/formatting you want in the output. Using the System.IO.Packaging APIs, we crack open the DOCX file and replace the document content XML with the transformed DocBook content.

Breaking it down a step at a time, here are the key parts of the script with color commentary of issues I solved along the way.

First, I load the DocBook XML file. Mostly straightforward, but for reasons I no longer remember, I got tripped up by whitespace between para tags, so I strip it before converting the text into an XML object and navigator.

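# Read the whole DocBook file as text. Pass a full path: .NET resolves
# relative paths against the process directory, not the PowerShell location.
$text = [System.IO.File]::ReadAllText($file)
# Strip the whitespace just inside <para> tags that caused trouble.
$text = [regex]::Replace($text, "<para>\s*", "<para>")
$text = [regex]::Replace($text, "\s*</para>", "</para>")
# Parse the cleaned text and get an XPath navigator to feed the transform.
$xml = [xml] $text
$nav = $xml.CreateNavigator()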

Note how I get the XsltPath by loading the XSL file from the same directory as the PowerShell script…

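# XslTransform is the original .NET XSLT 1.0 processor; it is marked obsolete
# in favor of XslCompiledTransform.
$xslt = New-Object System.Xml.Xsl.XslTransform
# Look for the stylesheet next to this script.
$XsltPath = Join-Path (Split-Path $MyInvocation.MyCommand.Path -Parent) "dbk2wordml.xsl"
$xslt.Load($XsltPath)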

The next bit is the magic. A Word DOCX file is just a ZIP file with several XML parts. The main content of the document lives in the part /word/document.xml. The System.IO.Packaging namespace gives you all the classes you need to manipulate the parts of a document; you can get a Stream object to any part. And that’s what I do—I get a Stream to the /word/document.xml part and set up an XmlWriter to write to that stream. As it took a few iterations to debug the XSLT transforms, I took care to have my XmlWriter output nicely indented XML.

# System.IO.Packaging lives in the WindowsBase assembly on Windows PowerShell.
# Loading it again is harmless if the module has already done so.
Add-Type -AssemblyName WindowsBase

# The script will edit a copy of the $Template file.
$OutputFile = $pscmdlet.GetUnresolvedProviderPathFromPSPath($OutputFile)
Copy-Item $Template $OutputFile
$package = [System.IO.Packaging.Package]::Open($OutputFile)
$part = $package.GetPart("/word/document.xml")
$stream = $part.GetStream()
# Truncate the template's existing body so no stale bytes trail the new XML.
$stream.SetLength(0)
$textWriter = New-Object System.IO.StreamWriter $stream

# Use an XmlWriter so the output can be indented. Very helpful for debugging.
$settings = New-Object System.Xml.XmlWriterSettings
$settings.Indent = $true
$writer = [System.Xml.XmlWriter]::Create($textWriter, $settings)
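
Because a DOCX really is an ordinary ZIP archive, it is easy to check which parts a template contains before overwriting any of them. The sketch below assumes Windows PowerShell and uses a throwaway package with a single part (the temp file name and its contents are made up for illustration); point `$path` at a real template to inspect that instead.

```powershell
# Assumes Windows PowerShell, where System.IO.Packaging lives in WindowsBase.
Add-Type -AssemblyName WindowsBase
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Build a tiny stand-in package; a real template would come out of Word.
$path = Join-Path ([System.IO.Path]::GetTempPath()) "zip-demo.docx"
if (Test-Path $path) { Remove-Item $path }
$package = [System.IO.Packaging.Package]::Open($path, [System.IO.FileMode]::Create)
$uri = New-Object System.Uri("/word/document.xml", [System.UriKind]::Relative)
$part = $package.CreatePart($uri, "application/xml")
$bytes = [System.Text.Encoding]::UTF8.GetBytes("<doc/>")
$stream = $part.GetStream()
$stream.Write($bytes, 0, $bytes.Length)
$stream.Close()
$package.Close()

# Read the same file back as a plain ZIP archive and list its entries.
$zip = [System.IO.Compression.ZipFile]::OpenRead($path)
$entries = $zip.Entries | ForEach-Object { $_.FullName }
$zip.Dispose()
$entries
```

Alongside the part itself, the listing includes a `[Content_Types].xml` entry, which the packaging API maintains for you. That bookkeeping is one reason to go through System.IO.Packaging rather than editing the ZIP by hand.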

With that setup, you’re just about done. Run the XSLT transform, close the files, call it a day.

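# $args is an automatic variable in PowerShell, so use a different name.
$xsltArgs = New-Object System.Xml.Xsl.XsltArgumentList
$xslt.Transform($nav, $xsltArgs, $writer)
# Close from the inside out so everything is flushed into the package.
$writer.Close()
$textWriter.Close()
$package.Close()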
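
Since a bad transform can leave you with a file Word refuses to open, a quick sanity check after closing the package can save a round trip. The helper below is hypothetical (its name and behavior are not part of the original module) and assumes Windows PowerShell. It only confirms that the main document part exists and parses as XML; it does not validate the content against the WordprocessingML schema.

```powershell
# Assumes Windows PowerShell, where System.IO.Packaging lives in WindowsBase.
Add-Type -AssemblyName WindowsBase

# Hypothetical helper, not part of the original module: returns $true when the
# package has a /word/document.xml part that parses as well-formed XML.
function Test-DocxDocumentPart {
    param([Parameter(Mandatory = $true)][string] $Path)

    $uri = New-Object System.Uri("/word/document.xml", [System.UriKind]::Relative)
    $package = [System.IO.Packaging.Package]::Open(
        $Path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read)
    try {
        if (-not $package.PartExists($uri)) { return $false }
        $stream = $package.GetPart($uri).GetStream()
        try {
            $doc = New-Object System.Xml.XmlDocument
            $doc.Load($stream)   # throws if the XML is not well-formed
            return $true
        }
        catch {
            # Any load failure is reported as "not valid" in this sketch.
            return $false
        }
        finally { $stream.Close() }
    }
    finally { $package.Close() }
}
```

Usage would look like `Test-DocxDocumentPart -Path $OutputFile`, with a full path for the same reason as elsewhere: .NET does not follow the PowerShell location.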

Stuff I never got around to fixing

  • Hyperlinks. Word 2007 and later use a really strange format for storing links, and I don’t know how I could (easily) replicate that with an XSLT transform.
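
The obstacle with hyperlinks is that the URL is not stored in document.xml at all. The body holds a `w:hyperlink` element with an `r:id` attribute, and the target lives in the part's relationships, so a transform that only writes document.xml cannot produce a working link by itself. One possible approach, untested against this stylesheet, is to create the relationship from PowerShell and have the transform emit the matching ID. Below is a minimal sketch of the relationship half only, assuming Windows PowerShell, with a made-up URL and a throwaway temp file:

```powershell
# Assumes Windows PowerShell, where System.IO.Packaging lives in WindowsBase.
Add-Type -AssemblyName WindowsBase

# Throwaway package for the demonstration; the real script would use $OutputFile.
$path = Join-Path ([System.IO.Path]::GetTempPath()) "link-demo.docx"
if (Test-Path $path) { Remove-Item $path }
$package = [System.IO.Packaging.Package]::Open($path, [System.IO.FileMode]::Create)
$partUri = New-Object System.Uri("/word/document.xml", [System.UriKind]::Relative)
$contentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
$part = $package.CreatePart($partUri, $contentType)

# Placeholder body so the part is not empty.
$bytes = [System.Text.Encoding]::UTF8.GetBytes("<placeholder/>")
$stream = $part.GetStream()
$stream.Write($bytes, 0, $bytes.Length)
$stream.Close()

# Register an external hyperlink target on the document part.
$hyperlinkType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
$target = New-Object System.Uri("https://example.com/")
$rel = $part.CreateRelationship($target, [System.IO.Packaging.TargetMode]::External, $hyperlinkType)

# The body XML would then reference the link as <w:hyperlink r:id="...">.
$relId = $rel.Id
$package.Close()
$relId
```

The generated ID would still have to be passed into the stylesheet somehow, for example through the XsltArgumentList, and that wiring is the part that remains unsolved here.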

Download

You can download everything here. Extract the files into your PowerShell modules directory, then Import-Module DocBookToWordML and you’re good to go.