Convert LaTeX to docx using Pandoc
Challenges and solutions
Format
Prepare a reference .docx file and apply its formatting with the --reference-doc option. To create one, first convert the document without a reference file, adjust the styles of the resulting .docx in a word processor, and then rerun the conversion with that file as the reference.
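The two-pass workflow can be sketched in Python as follows. This is only an illustration: pandoc_args and run are hypothetical helpers, draft.docx and main.docx are assumed file names, and only --reference-doc, -t and -o are actual Pandoc options.

```python
import subprocess


def pandoc_args(source, output, reference_doc=None):
    """Build the argument list for one pandoc run (hypothetical helper)."""
    args = ["pandoc", source, "-t", "docx", "-o", output]
    if reference_doc is not None:
        args.append(f"--reference-doc={reference_doc}")
    return args


def run(args):
    """Execute one pass; requires pandoc on PATH."""
    subprocess.run(args, check=True)


# Pass 1: plain conversion. Open draft.docx in a word processor, adjust the
# styles, and save the result as main-ref.docx.
first_pass = pandoc_args("main.tex", "draft.docx")

# Pass 2: regenerate the document with the tuned styles applied.
second_pass = pandoc_args("main.tex", "main.docx", reference_doc="main-ref.docx")
```

Calling run(first_pass), editing the styles by hand, and then calling run(second_pass) reproduces the steps described above.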
Citation references
See the relevant sections of the Pandoc User’s Guide. Citation styles (.csl files) can be downloaded from the Zotero Style Repository. The options involved are:
--citeproc
--bibliography=xxx.bib
--csl=xxx.csl
Cross references
Pandoc converts section references to section numbers, but it does not correctly resolve references to tables and figures.
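Inspired by the xr package, one workaround is to look the numbers up in the .aux file that LaTeX writes during compilation. When the hyperref package is loaded, each label definition in the .aux file starts with \newlabel. The regular expression for parsing the \newlabel lines is: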
^\\newlabel{(.+)}{{(.+)}{(.*)}{(.*)}{(.*)}{(.*)}}$
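To see what the expression captures, the following sketch applies it to a few made-up .aux lines. The labels, numbers and titles are invented for illustration, and the field names (number, page, title, anchor) are an assumed reading of the five-field layout that hyperref writes; parse_labels is a hypothetical helper.

```python
import re

REF_REGEX = re.compile(r'^\\newlabel{(.+)}{{(.+)}{(.*)}{(.*)}{(.*)}{(.*)}}$')

# Made-up .aux lines in the five-field layout written when hyperref is loaded
aux_lines = [
    r"\relax",
    r"\newlabel{sec:intro}{{1}{1}{Introduction}{section.1}{}}",
    r"\newlabel{fig:setup}{{2}{3}{Experimental setup}{figure.2}{}}",
    r"\newlabel{tab:results}{{1}{4}{Summary of results}{table.1}{}}",
]


def parse_labels(lines):
    """Return {label: fields} for every line that matches REF_REGEX."""
    labels = {}
    for line in lines:
        m = REF_REGEX.search(line)
        if m:
            label, number, page, title, anchor, _ = m.groups()
            labels[label] = {
                "number": number,
                "page": page,
                "title": title,
                "anchor": anchor,
            }
    return labels


labels = parse_labels(aux_lines)
for name, info in labels.items():
    print(name, info["number"], info["anchor"])
```

Only the first two captures, the label and its printed number, are needed to resolve a reference; the remaining fields are kept here just to show what each group holds.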
Here is an example filter in Python:
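#!/usr/bin/env python
"""Pandoc filter: replace [label] link text with the number stored in main.aux."""
import re
from functools import lru_cache

from pandocfilters import toJSONFilter

# \newlabel{label}{{number}{page}{title}{anchor}{}}, as written when hyperref is loaded
REF_REGEX = re.compile(r'^\\newlabel{(.+)}{{(.+)}{(.*)}{(.*)}{(.*)}{(.*)}}$')


@lru_cache(maxsize=None)
def load_aux(fname: str):
    """Map each label in the .aux file to its printed number (parsed once, then cached)."""
    refs = {}
    with open(fname) as fp:
        for line in fp:
            res = REF_REGEX.search(line)
            if res:
                refs[res.group(1)] = res.group(2)
    return refs


def resolveRef(key, value, format, meta):
    refs = load_aux('main.aux')
    if key == "Link":
        try:
            # The first inline of a cross-reference link is expected to read "[label]"
            res = re.search(r'^\[(.*)\]$', value[1][0]['c'])
            if res:
                # Modify the node in place; returning None keeps the element
                value[1][0]['c'] = refs[res.group(1)]
        except (KeyError, IndexError, TypeError):
            # Not a cross-reference, or the label is missing from the .aux file
            pass


if __name__ == "__main__":
    toJSONFilter(resolveRef)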
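The rewriting step can be tried without Pandoc on a hand-written node. The sketch below assumes the shape of the Link element that Pandoc emits for \ref{fig:setup} (an attribute list, a list of inlines, and a target); the actual shape for a given document can be inspected with pandoc main.tex -t native. The refs table and the resolve_link helper are hypothetical stand-ins for the parsed .aux data and the filter action.

```python
import re

# Hand-written stand-in for the parsed .aux table (label -> printed number)
refs = {"fig:setup": "1", "tab:results": "2"}


def resolve_link(value, refs):
    """Rewrite the visible text of a Link node in place; return True if changed."""
    inlines = value[1]  # Link content is [attr, inlines, target]
    if not inlines or inlines[0].get("t") != "Str":
        return False
    m = re.fullmatch(r"\[(.*)\]", inlines[0]["c"])
    if m and m.group(1) in refs:
        inlines[0]["c"] = refs[m.group(1)]
        return True
    return False


# Assumed shape of the node produced for \ref{fig:setup}
link = [
    ["", [], [["reference-type", "ref"], ["reference", "fig:setup"]]],
    [{"t": "Str", "c": "[fig:setup]"}],
    ["#fig:setup", ""],
]

changed = resolve_link(link, refs)
print(changed, link[1][0]["c"])
```

Labels that are absent from the table are left untouched, so an unresolved reference stays visible as [label] in the output instead of silently disappearing.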
Undefined math command
Some math commands, such as \tiny, \large, and \text (from amsmath), are not supported by Pandoc. Filters can again be used to strip them or replace them with alternatives if necessary.
Here is an example filter in Python:
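#!/usr/bin/env python
import re

from pandocfilters import toJSONFilter

# (?![A-Za-z]) keeps longer names such as \textbf or \larger from being matched
SIZE_CMD = re.compile(r"\\(?:tiny|large)(?![A-Za-z])")
TEXT_CMD = re.compile(r"\\text(?![A-Za-z])")


def replaceCommands(key, value, format, meta):
    if key == "Math":
        # value is [math type, TeX source]; only the string entry is rewritten, in place
        for i in range(len(value)):
            if isinstance(value[i], str):
                value[i] = SIZE_CMD.sub("", value[i])
                value[i] = TEXT_CMD.sub(r"\\mathrm", value[i])


if __name__ == "__main__":
    toJSONFilter(replaceCommands)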
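One caveat when substituting command names: a plain str.replace on \text also hits longer names that share the prefix, such as \textbf. A regular expression with a negative lookahead avoids this. The sketch below compares the two on plain strings, outside Pandoc; naive_clean, clean_math and the sample strings are hypothetical.

```python
import re

SIZE_CMD = re.compile(r"\\(?:tiny|large)(?![A-Za-z])")
TEXT_CMD = re.compile(r"\\text(?![A-Za-z])")


def naive_clean(tex: str) -> str:
    """Plain substring replacement: also rewrites the prefix of \\textbf."""
    tex = tex.replace(r"\tiny", "").replace(r"\large", "")
    return tex.replace(r"\text", r"\mathrm")


def clean_math(tex: str) -> str:
    """Lookahead-based replacement: only whole command names are rewritten."""
    tex = SIZE_CMD.sub("", tex)
    return TEXT_CMD.sub(r"\\mathrm", tex)


samples = [r"v_{\text{max}}", r"\large E = mc^2", r"\textbf{x} + \tiny y"]
for s in samples:
    print(repr(s), "| naive:", repr(naive_clean(s)), "| regex:", repr(clean_math(s)))
```

Both variants agree on the first two samples; on the third, the naive version turns \textbf into \mathrmbf, while the regex version leaves it alone.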
Final command
Putting everything together, with the two filters above saved as resolve-ref.py and fix-texmath.py, the final command is:
pandoc main.tex \
  --citeproc \
  --bibliography=xxx.bib \
  --csl=physical-review-b.csl \
  --reference-doc=main-ref.docx \
  --filter fix-texmath.py \
  --filter resolve-ref.py \
  -t docx -o main-$(date +%Y%m%d%H%M).docx