Markup languages with Git, a document workflow

I’m currently writing this entry in a WYSIWYG text editor. This is so common that many people are conditioned to think that editing text and WYSIWYG go together. There are, however, a number of document markup languages that are not WYSIWYG: lightweight ones such as Markdown and reStructuredText are popular, along with more powerful languages such as LaTeX.

All these markup languages store your documents in plain text files. LaTeX, for example, stores them as “.tex” files, which then get compiled into whatever final output type you end up using (in my case this is usually PDF). At first I remember being put off by the need to compile documents with LaTeX; for someone conditioned to using packages like OpenOffice, this extra step felt alien and weird. So strong was the sense of strangeness of using a markup language for my regular documents that I abandoned LaTeX the first time I encountered it. I didn’t come to a favourable cost vs benefit breakdown for learning LaTeX, because I overestimated the learning curve and didn’t see how big the benefits were until later. The benefits of using a markup language clearly outweigh the negatives for a variety of situations and document types. Hopefully after reading this some of these benefits will be more obvious.

It is important to choose the right tool for the job. For complex typesetting problems LaTeX is extremely powerful: if you need to typeset mathematical formulae or have fine control over your formatting, it’s just about the only game in town. However, you don’t always need this power. If you are just dealing with simple documents with basic formatting you will probably do better using something like Markdown. This is especially the case if you are trying to convince team-mates to also use a markup language for their documents, because the benefits have to be clear to people for them to make the jump from whatever they currently use. For me the tipping point was when I got into using version control software effectively; at that point the power of writing documents in a markup language became really clear. Because I was working on a mathematics paper, LaTeX ended up being the best choice for that project.

Collaboration

Have you ever had to collaborate on a sizeable document with a team? If so, how did you manage merging all the individual changes into the final master document? I did this once for a technical report with OpenOffice Writer and the process was particularly painful: combining all the changes was slow and error prone. There was also a proliferation of similar files to deal with, as people made a lot of temporary copies, and questions such as which files were most recent and which changes had to be migrated came up again and again. A couple of times we lost changes too, because documents were simultaneously overwritten. All these problems are hard to solve in an ad-hoc manner, but thankfully this is mostly a solved problem. Because LaTeX documents are stored in plain text, you can put them under version control and use all the software out there that tracks and merges changes to text.

If you are not familiar with version control you really should stop and take some time to find out more. One of the big wins of version control comes when you need to collaborate on a document. The LaTeX + Git combination really shines here because it becomes straightforward to share your work by making commits and merging changes. (Note that any other plain-text markup language gets this same benefit.) Because Git is decentralized you don’t need to set up any servers. In fact you don’t even need a network or internet access: you can move changes around on a USB key if you need to. This makes backing up your files really easy and saves the massive hassle of dealing with a ton of similarly named backup directories or zip files.

Benefits of using Git

Once you have Git up and running you can just edit your documents and commit the changes as you go along. Aside from easier collaboration you get a lot of other major benefits essentially for free by doing it this way, including:

  • A reliable way to back up your files (via git push to another location)
  • A full history of changes people have made.
  • Ability to check differences between revisions of the file.
  • Find out who wrote what lines by using things such as “git blame”.
  • Find things like the breakdown of how many lines people wrote in files.
  • Ability to integrate scripting tools into your document workflow (for example, you could use Git hooks to build your document and place it in a common directory every time someone makes a push)
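
The hook idea in the last point can be sketched in Python, since a Git hook is just an executable script. This is only an illustration under some assumptions: the file name report.tex and the output directory are hypothetical, latexmk is assumed to be installed, and the hook is assumed to run somewhere with a checked-out working tree (a bare server-side repository would need a checkout step first).

```python
#!/usr/bin/env python3
"""Sketch of a Git hook that builds a LaTeX document and publishes the PDF."""
import shutil
import subprocess
from pathlib import Path


def build_command(tex_name):
    """Command line used to compile the document (assumes latexmk is installed)."""
    return ["latexmk", "-pdf", "-interaction=nonstopmode", tex_name]


def build_and_publish(tex_file, output_dir, run=subprocess.run):
    """Compile tex_file in its own directory and copy the PDF to output_dir.

    `run` is injectable so the function can be exercised without LaTeX.
    """
    tex_file = Path(tex_file)
    run(build_command(tex_file.name), cwd=tex_file.parent, check=True)
    pdf_file = tex_file.with_suffix(".pdf")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(pdf_file, output_dir / pdf_file.name))


# Hypothetical call a hook script might end with:
# build_and_publish("report.tex", "/srv/shared/reports")
```

Saved as an executable file in the repository's hooks directory (for example as post-merge), a script along these lines would rebuild the PDF each time the hook fires.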

Note that some of those benefits have nothing at all to do with collaboration. Even when I’m working on a project by myself I find Git worth using, because it gives me backups and a convenient way to look at what changes were made over time.

Here’s an example of some of those things, using a README Markdown file from a project I worked on. Here’s the log of commits for that file:

git log README.md

Which gives the following:

commit d6a32350c3261c00f761eb3e125d025186e04104
Author: Tim <tim@example.com>
Date:   Thu Oct 4 14:22:35 2012 +1000

    Made a simple addition to the readme about git usage

commit b38dcd7b7f2e03248d85d4fe9f6fbb51d44c816a
Author: Janis <janis@example.com>
Date:   Wed Oct 3 16:24:27 2012 -0400

    Created a TODO file to track feature requests and added some more information to the README

commit 1bff657c061d89545fbce9e2cacf4aa99f9f1f73
Author: Janis <janis@example.com>
Date:   Tue Oct 2 21:23:23 2012 -0400

    added a little text into the readme file

commit 40e58627a20f7cd14c06e3db5865e0d995fbedf4
Author: Bob <bob@example.com>
Date:   Tue Oct 2 18:02:07 2012 -0700

    Initial commit

You can also do things like get a quick breakdown of who wrote which lines:

# count the number of lines attributed to each author
git blame --line-porcelain file_name |
sed -n 's/^author //p' |
sort | uniq -c | sort -rn

For example, running that on README.md gives:

16 Janis
 3 Bob
 2 Tim
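
The same count can be produced from Python, which is handy if the numbers feed into a larger script. The following is a minimal sketch rather than a drop-in tool: the function names are made up for illustration, and it assumes Python 3.7+ (for capture_output) and that it is run inside a Git repository.

```python
import subprocess
from collections import Counter


def count_author_lines(porcelain_text):
    """Count lines per author in `git blame --line-porcelain` output."""
    counts = Counter()
    for line in porcelain_text.splitlines():
        # Header lines look like "author Name"; file content lines start
        # with a tab, and "author-mail"/"author-time" do not match.
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


def blame_line_counts(path):
    """Run git blame on path and return (author, line_count) pairs, largest first."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    )
    return count_author_lines(result.stdout).most_common()
```

Calling blame_line_counts("README.md") in a repository would return a list of (author, count) pairs equivalent to the shell pipeline's output.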

This just scratches the surface of the power that writing your documents in a markup language along with version control affords you.

Misc

Not every file that is generated in the process of making a document should be kept in source control. For example, with LaTeX I’d generally put the following file types in .gitignore:

*.aux
*.backup
*.dvi
*.log
*.pdf
*.ps
*.tex~

Using LaTeX for fast document generation

Many systems have some sort of report generation component. This is often some variation on extracting data from a database (or other sources) then doing some analysis on that data and outputting it in some readable form. Sometimes a requirement is for reports to be available in PDF format. I use a lot of Python for small tasks and many in-house report generation tasks fall into the category where developer time is much more expensive than processor time. Being able to make these reports quickly AND have the eventual typesetting look good is a big win, even if it’s not the most performant code in the world. This is especially the case if the report is a one-off report.

If the PDFs are being created for printing, I find LaTeX especially useful because it handles many of the annoying details of typesetting printed material. There are a ton of little typesetting things that LaTeX does. For example, it deals with excessive rivers in the text; I didn’t even realize it did this automatically, because I never noticed any in the documents it generated. Given that LaTeX does a good job of automated typesetting, it seemed like a natural candidate for making PDF files. The only tricky thing is automating the generation and compilation of the LaTeX documents from within code, which is what the rest of this tutorial covers.

Example problem

For the sake of this tutorial we look at a fairly common situation: we have some data to graph, along with some descriptive text saying when the data was generated. For the sake of the example, the data we wish to plot is generated by the following:

def create_data():
    """Example data for JaggedVerge LaTeX tutorial"""
    x_vals = list(range(0,10))
    y_vals = [x**2 for x in x_vals]
    return x_vals, y_vals

Using PyLaTeX

There happens to be a library specifically designed to generate LaTeX from Python, called PyLaTeX. For going direct to PDF this library solves a lot of problems. Specifically, you don’t need to manage an intermediate LaTeX file yourself: you can go from Python code to PDF in one call, which means fewer steps in building your PDF.
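
For contrast, here is roughly what the hand-rolled route looks like using only the standard library: render a .tex file from a template, then shell out to a compiler. This is a sketch, not a recommendation. The template, the function names and the choice of pdflatex are all assumptions made for illustration, and nothing here escapes LaTeX special characters in the inserted text.

```python
import subprocess
from pathlib import Path
from string import Template

# Hypothetical minimal template; $title and $body are filled in below.
TEX_TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
\section{$title}
$body
\end{document}
""")


def render_tex(title, body):
    """Return LaTeX source for a one-section document (no escaping is done)."""
    return TEX_TEMPLATE.substitute(title=title, body=body)


def compile_pdf(tex_source, basename, workdir="."):
    """Write the intermediate .tex file and run pdflatex on it."""
    workdir = Path(workdir)
    tex_path = workdir / f"{basename}.tex"
    tex_path.write_text(tex_source, encoding="utf-8")
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", tex_path.name],
        cwd=workdir, check=True,
    )
    return workdir / f"{basename}.pdf"
```

Every step here (templating, escaping, file handling, invoking the compiler, cleaning up) is something you would have to maintain yourself, which is the work a dedicated library takes off your hands.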

First we have to set up our Python virtual environment. Since we are using Python 3.3+ for this tutorial we can use the venv module that ships with the language. (If you are using an older version you will need a third-party tool such as virtualenv instead.) With the environment activated, install PyLaTeX:

pip install pylatex
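
Installing the Python package is not quite the whole story: to produce a PDF, PyLaTeX hands the generated source to a LaTeX compiler, so a TeX distribution has to be installed as well. A quick way to check from Python is sketched below. The helper name is made up, and the candidate list (latexmk, then pdflatex) is an assumption about which compilers you would want to look for.

```python
import shutil


def find_latex_compiler(candidates=("latexmk", "pdflatex")):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


compiler = find_latex_compiler()
if compiler is None:
    print("No LaTeX compiler found on PATH; install a TeX distribution first.")
else:
    print("Found LaTeX compiler at", compiler)
```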

PyLaTeX makes extensive use of context managers to handle the various LaTeX commands. Let’s start with a really simple example of creating a PDF:

from pylatex import Document, Section

doc = Document()
with doc.create(Section("Our section title")):
    doc.append("Simple example")

doc.generate_pdf('pylatex_example_output')

That’s ALL the code we need to generate a PDF. With no more boilerplate to deal with, let’s get to satisfying the rest of the requirements. Because PyLaTeX has support for TikZ we can create some simple graphs without needing any extra Python dependencies:

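from pylatex import Axis, Plot, TikZ

# Continues from the `doc` created above; create_data() is defined earlier.
with doc.create(TikZ()):
    plot_options = 'height=10cm, width=10cm, grid=major'
    with doc.create(Axis(options=plot_options)) as plot:
        x_coords, y_coords = create_data()
        # A list (unlike a bare zip iterator) can be iterated more than once
        coordinates = list(zip(x_coords, y_coords))

        plot.append(Plot(name="Our data", coordinates=coordinates))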

That gets us our plot. Now all we need to do is handle the time stamping and a few miscellaneous document issues. First let’s add a title to the document:

doc.preamble.append(Command('title', 'PyLaTeX example'))
doc.preamble.append(Command('author', 'JaggedVerge'))
doc.append(NoEscape(r'\maketitle'))

Note that PyLaTeX is a wrapper around LaTeX code, so just like in LaTeX, if you omit the \maketitle command the title will not be generated. We can create the time stamp with regular Python code:

import time
data_creation_time = time.gmtime()  # UTC, matching the literal +0000 below
formatted_timestamp = time.strftime("%a, %d %b %Y %H:%M:%S +0000", data_creation_time)
doc.append("The data in this plot example was created on {}".format(formatted_timestamp))

Once again we can fairly easily get from Python to LaTeX whenever we are dealing with text.
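
One detail worth spelling out is that the "+0000" in the format string is literal text, not something strftime computes, so it is only truthful when the time being formatted is in UTC (which time.gmtime() provides). A small standalone sketch, with a helper name chosen purely for illustration:

```python
import time

TIMESTAMP_FORMAT = "%a, %d %b %Y %H:%M:%S +0000"


def format_utc_timestamp(utc_struct_time):
    """Format a UTC struct_time; the +0000 offset is literal text."""
    return time.strftime(TIMESTAMP_FORMAT, utc_struct_time)


# time.gmtime(0) is the Unix epoch expressed in UTC.
print(format_utc_timestamp(time.gmtime(0)))
# prints "Thu, 01 Jan 1970 00:00:00 +0000" (with the default C locale)
```

Passing time.localtime() instead would still print "+0000", which would be wrong anywhere outside UTC.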

At this point we have the following:

from doc_gen import create_data
from pylatex import (
    Axis,
    Command,
    Document,
    Plot,
    Section,
    TikZ,
    NoEscape,
)

import time
data_creation_time = time.gmtime()

doc = Document()
doc.preamble.append(Command('title', 'PyLaTeX example'))
doc.preamble.append(Command('author', 'JaggedVerge'))
doc.append(NoEscape(r'\maketitle'))


with doc.create(Section("Data report")):
    formatted_timestamp = time.strftime("%a, %d %b %Y %H:%M:%S +0000", data_creation_time)
    doc.append("The data in this plot example was created on {}".format(formatted_timestamp))
    with doc.create(TikZ()):
        plot_options = 'height=10cm, width=10cm, grid=major'
        with doc.create(Axis(options=plot_options)) as plot:
            x_coords, y_coords = create_data()
            # A list (unlike a bare zip iterator) can be iterated more than once
            coordinates = list(zip(x_coords, y_coords))

            plot.append(Plot(name="Our data", coordinates=coordinates))

doc.generate_pdf('pylatex_example_output')

Which generates the following document:

[Figure: Example of PyLaTeX generated PDF document]

Just as with LaTeX, it’s probably a good idea to put this plot into a figure environment. If you know LaTeX already it’s fairly straightforward to generate documents using PyLaTeX. In the future I’ll write about manipulating existing LaTeX documents if there is interest. Please leave a comment if you have any questions or are interested in seeing more content like this in the future.