PyDocX: Parsing Word documents to increase collaboration in collaborative writing

Authors: Portnow, Samuel, University of Virginia; Ward, Jason, PolicyStat LLC; Spies, Jeff, R, Center for Open Science

Track: Posters


Word is often the lowest common denominator in collaborative writing: if one member doesn't know LaTeX, everyone must use Word. That is not to say that Word lacks value: its low barrier to entry and track-changes functionality are highly attractive features. Other formats, however, are more conducive to scientific writing and to creating reproducible research. PyDocX is an open source Python package for parsing the Docx file format, including track changes markup. The package is written so that users can easily write parsers to and from their desired format. Users can also extend the parsers we will provide, which will include HTML, LaTeX, reStructuredText, and Markdown. We currently have a basic set of features implemented for translating from Docx, along with a robust test environment. We hope to rapidly expand the set of features we parse, as well as the functionality for translating to Docx.
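
To make the idea of extensible parsers concrete, here is a minimal sketch of how such a design could look. It does not use PyDocX's actual API: the names DocxExporter, HtmlExporter, and make_sample_docx are hypothetical, and the sketch relies only on the Python standard library. It assumes nothing beyond two facts about the format: a Docx file is a zip archive whose main content lives in word/document.xml, and tracked insertions and deletions appear there as w:ins and w:del elements.

```python
import html
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tag(name):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (W, name)


class DocxExporter:
    """Hypothetical base class: walks paragraphs and delegates formatting
    to small hooks that subclasses override. By default it emits plain
    text with all tracked changes accepted."""

    def text(self, s):
        return s

    def inserted(self, s):
        return s   # keep tracked insertions

    def deleted(self, s):
        return ""  # drop tracked deletions

    def paragraph(self, s):
        return s + "\n"

    def export(self, docx_file):
        with zipfile.ZipFile(docx_file) as zf:
            root = ET.fromstring(zf.read("word/document.xml"))
        # Simplification: paragraphs nested in other paragraphs
        # (e.g. text boxes) are not treated specially.
        return "".join(
            self.paragraph(self._walk(p, self.text))
            for p in root.iter(tag("p"))
        )

    def _walk(self, node, handler):
        parts = []
        for child in node:
            if child.tag == tag("ins"):
                parts.append(self._walk(child, self.inserted))
            elif child.tag == tag("del"):
                parts.append(self._walk(child, self.deleted))
            elif child.tag in (tag("t"), tag("delText")):
                parts.append(handler(child.text or ""))
            else:
                parts.append(self._walk(child, handler))
        return "".join(parts)


class HtmlExporter(DocxExporter):
    """Hypothetical subclass that preserves track changes as HTML markup."""

    def text(self, s):
        return html.escape(s)

    def inserted(self, s):
        return "<ins>%s</ins>" % html.escape(s)

    def deleted(self, s):
        return "<del>%s</del>" % html.escape(s)

    def paragraph(self, s):
        return "<p>%s</p>\n" % s


def make_sample_docx():
    """Build a tiny in-memory Docx with one insertion and one deletion."""
    body = (
        '<w:document xmlns:w="%s"><w:body><w:p>'
        '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
        '<w:ins w:id="1" w:author="A"><w:r><w:t>brave </w:t></w:r></w:ins>'
        '<w:del w:id="2" w:author="A"><w:r><w:delText>old </w:delText></w:r></w:del>'
        '<w:r><w:t>world</w:t></w:r>'
        '</w:p></w:body></w:document>'
    ) % W
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(DocxExporter().export(make_sample_docx()), end="")
    print(HtmlExporter().export(make_sample_docx()), end="")
```

In this sketch, supporting a new output format means overriding a handful of hooks rather than rewriting the document traversal, which is one plausible way to realize the extensibility described above. A LaTeX or Markdown exporter would follow the same pattern.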