
(auto-research-outputs)=

Automating Research Outputs

In this chapter, you'll learn how to automate the inclusion of figures and tables in LaTeX-derived research outputs, including PDFs and slides, plus how to convert those outputs to Microsoft Word documents and more. Much of what you'll see in this chapter applies to a wide range of coding languages.

This chapter has some similarities with another chapter, on {ref}quarto. But this chapter puts the LaTeX typesetting language front and centre, because it's the de facto standard for preparing research outputs (most journals have a LaTeX template for submission, for example), and it gives you full control over every aspect of how your outputs look. However, if you don't already know LaTeX, there is a steep-ish learning curve and—if you're just looking to create some automated reports using code and text rather than write pre-prints, working papers, journal articles, or academic-talk style slide decks—the chapter on {ref}quarto is going to be a better and easier fit for you.

Automating the inclusion of figures and tables in your research outputs has many benefits:

  • once configured, it's clearly easier than manual updates
  • your paper can update at the touch of a button
  • it helps with creating a reproducible analytical pipeline (for more on these, see the {ref}wrkflow-rap chapter)
  • it enforces structure on your project
  • automation is complementary to other good practices such as version control

Let's now turn to the how.

Including research outputs in LaTeX documents and slides

Let's say you're writing a paper, using $\LaTeX$, or a presentation, using $\LaTeX$ and beamer. (Perhaps you'd like the final document or presentation to be in Word or PowerPoint; that's okay too, and we'll come to it shortly, but let's assume you're writing it in $\LaTeX$.)

Including code outputs is pretty simple, but is slightly different for figures and tables (the two main outputs you might include).

Figures

For figures, the $\LaTeX$ graphicx package is your friend, as it allows you to set a directory where your figures live, for example outputs/figures. You set this in the preamble of the document like this:

\usepackage{graphicx}
\graphicspath{{outputs/figures/}}

We're imagining here that we have a project structure like this:

code.py
paper.tex
outputs/
    figures/
        chart.pdf
    tables/
        reg_table.tex

Then, whenever you need to include a figure, say chart.pdf, you can always do it using

\begin{figure}
	\includegraphics[width=\textwidth]{chart.pdf}
	\caption{Example figure. \label{fig:example}}
\end{figure}

Let's pretend chart.pdf is generated by the most popular Python graphics library, matplotlib. The code in 'code.py' which puts the chart in the 'figures' folder could look something like this:

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(range(5), range(5), s=50, c='b')
plt.savefig("outputs/figures/chart.pdf")

The important line here is plt.savefig("outputs/figures/chart.pdf") because it says to save the figure in the 'figures' directory. When you re-run your code, the chart ends up in the right place. When you re-compile your $\LaTeX$ document or presentation, it can pick the chart up from the right place.
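One way to keep the two sides in sync is to check, before compiling, that every figure the paper asks for has actually been written to disk. The sketch below is an illustration rather than part of any library: the helper name missing_figures and the regular expression are assumptions made for this example, and it only handles simple \includegraphics calls that give the file name with its extension.

```python
import re
from pathlib import Path

# Matches \includegraphics{file} and \includegraphics[options]{file}.
# (Hypothetical pattern for illustration; it ignores commented-out lines.)
INCLUDEGRAPHICS = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")


def missing_figures(tex_source: str, figures_dir: Path) -> list[str]:
    """Return figure file names used in tex_source that are absent from figures_dir."""
    referenced = INCLUDEGRAPHICS.findall(tex_source)
    return [name for name in referenced if not (figures_dir / name).exists()]


if __name__ == "__main__":
    paper = Path("paper.tex")
    if paper.exists():
        missing = missing_figures(paper.read_text(), Path("outputs/figures"))
        print("Missing figures:", missing or "none")
```

Run from the project root after code.py, this lists any figure that paper.tex expects but that the code has not yet produced.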

Tables

Now let's imagine you've created a table of descriptive statistics such as the one below:

import seaborn as sns
import pandas as pd

tips = sns.load_dataset("tips")

table = tips.groupby(["smoker", "time"], observed=True)["tip"].mean().unstack().round(2)
table

This can be turned into a $\LaTeX$ table using the following command, which returns the table as a string of $\LaTeX$ code:

table.style.to_latex(caption='A Table', label='tab:descriptive')

Or perhaps you have a regression table, for example one built with statsmodels and stargazer:

import pandas as pd
from sklearn import datasets
import statsmodels.formula.api as smf
from stargazer.stargazer import Stargazer

diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data)
df.columns = ['Age', 'Sex', 'BMI', 'ABP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6']
df['target'] = diabetes.target

est = smf.ols('target ~ Age + Sex + BMI + ABP', data=df).fit()
est2 = smf.ols('target ~ Age + Sex + BMI + ABP + S1 + S2', data=df).fit()

reg_results = Stargazer([est, est2])
reg_results

or one built with the pyfixest package:

import numpy as np
import pandas as pd
#import pylatex as pl  # for the latex table; note: not a dependency of pyfixest - needs manual installation
from great_tables import loc, style
from IPython.display import FileLink, display

import pyfixest as pf

data = pf.get_data()

fit1 = pf.feols("Y ~ X1 + X2 | f1", data=data)
fit2 = pf.feols("Y ~ X1 + X2 | f1 + f2", data=data)
fit3 = pf.feols("Y2 ~ X1 + X2 | f1", data=data)
fit4 = pf.feols("Y2 ~ X1 + X2 | f1 + f2", data=data)

pf.etable([fit1, fit2, fit3, fit4,])

which can be cast into $\LaTeX$ using type="tex".

tab = pf.etable(
    [fit1, fit2, fit3, fit4],
    digits=2,
    type="tex",
    print_tex=True,
)

tab

We'd like to export tables like these into files that can be picked up by our $\LaTeX$ document. We must first save them to the right place from Python. Assuming you have the folders "outputs/tables" relative to your working directory, this would be

from pathlib import Path
with open(Path('outputs/tables/reg_table.tex'), 'w') as f:
    f.write(table.style.to_latex(caption='A Table', label='tab:descriptive'))

in the first example, and

from pathlib import Path
with open(Path('outputs/tables/reg_table.tex'), 'w') as f:
    f.write(tab)

in the second. Remember that Path, a class from the pathlib module in Python's standard library, builds file paths that work regardless of which operating system you happen to be using at the time. This is especially useful when you have co-authors on different systems!
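If you export several tables, it can be convenient to wrap the saving step in a small function that also creates the output folder when it is missing. This is a minimal sketch under that assumption: save_latex_table and its arguments are names invented for this example, not part of pandas or pyfixest.

```python
from pathlib import Path


def save_latex_table(
    latex: str, filename: str, tables_dir: Path = Path("outputs/tables")
) -> Path:
    """Write a LaTeX table string to tables_dir/filename and return the path."""
    # Create outputs/tables (and any parents) if it does not exist yet.
    tables_dir.mkdir(parents=True, exist_ok=True)
    out_path = tables_dir / filename
    # Open the file in write mode, replacing any previous version of the table.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(latex)
    return out_path
```

With this in place, the second example would become save_latex_table(tab, "reg_table.tex"), provided tab holds the table as a string.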

The code chunk above opens up a file in write mode in the right directory, relative to the working directory (here, the folder containing code.py), and writes the $\LaTeX$ code for the table into it. We now need to ensure that this $\LaTeX$ gets picked up in our paper. Inside the paper, you need a line:

\input{outputs/tables/reg_table.tex}

which picks up your table. If you don't want to have to add the full path to the tables directory each time, you can add this near the top of 'paper.tex':

\makeatletter
\providecommand*{\input@path}{}
\g@addto@macro\input@path{{outputs/tables/}}
\makeatother

With this in place, you need only write \input{reg_table.tex} in your $\LaTeX$ document.
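The "touch of a button" idea can be made concrete with a small build script that re-runs the analysis and then re-compiles the paper. What follows is only a sketch: it assumes that latexmk is installed and on your path (it ships with most $\LaTeX$ distributions) and that code.py and paper.tex sit side by side as in the project structure shown earlier. The names BUILD_STEPS and build, and the dry_run flag, are made up for illustration.

```python
import subprocess
import sys

# Steps to run in order: regenerate the outputs, then compile the paper.
BUILD_STEPS = [
    [sys.executable, "code.py"],       # writes figures and tables to outputs/
    ["latexmk", "-pdf", "paper.tex"],  # re-compiles the paper to PDF
]


def build(steps=BUILD_STEPS, dry_run=False):
    """Run each step in order, stopping at the first failure."""
    for cmd in steps:
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the step fails.
            subprocess.run(cmd, check=True)


# Print the commands without running them; call build() to run them for real.
build(dry_run=True)
```

Because the second step only starts if the first succeeds, a broken analysis script stops the build before a stale table or figure can end up in the compiled paper.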

Exporting papers and slides to other document types

When including your research outputs automatically, you may not want your final output to be a PDF (the standard output for $\LaTeX$), but to be one of a range of other document types. That's perfectly possible, and you can choose from a really wide range of output types, although input types will be limited to formats that can use file paths such as $\LaTeX$ and markdown.

To perform the magic conversion to other document types (and often between types), we'll use the command line tool pandoc, which is absolutely brilliant. It can translate $\LaTeX$ papers and beamer presentations into a whole variety of other formats, including Microsoft Word's .docx, OpenOffice's .odt, Microsoft PowerPoint's .pptx, HTML, plain text, markdown, and more. It can also convert from many of those formats (and more), in one direction only, to PDF, Microsoft PowerPoint, and $\LaTeX$ Beamer.

To use pandoc, first install it following the instructions on the website.

Converting Documents

To convert documents, the general syntax for pandoc looks like this:

pandoc mydoc.tex -o mydoc.docx

This is an example where the input is a .tex document and the output, -o, is a Microsoft Word docx file.

You can try this yourself using the following minimal tex file:

\documentclass{article}
\usepackage[margin=0.7in]{geometry}
\usepackage[parfill]{parskip}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amssymb,amsfonts,amsthm}

\begin{document}

This is some text

And an equation:
\[
    u'(c_{t})=\beta(1+r_{t+1})u'(c_{t+1})
\]

\section{Section Heading}

More text

\end{document}

Create a .tex file from the tex code above and convert it to a Word document using **pandoc**.

What's surprising is how effective the conversion to Word is, even if you have figures, equations, and other non-text features.

You can get quite fancy with pandoc. For example, you can translate a whole book's worth of $\LaTeX$ into a Word doc complete with a Word style, a bibliography via biblatex, equations, and figures. Nothing can save Word from being painful to use, but pandoc certainly helps. If you want to see a couple of examples, you could check out cookie-cutter-latex-book-manuscript.
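If you would rather trigger the conversion from Python, for instance at the end of the script that produces your outputs, you can call pandoc through the standard library's subprocess module. The sketch below assumes pandoc is installed and on your path. The function names pandoc_command and convert are invented for this example, while --reference-doc is pandoc's own option for borrowing styles from an existing .docx or .pptx file.

```python
import shutil
import subprocess


def pandoc_command(source, target, reference_doc=None):
    """Build the pandoc command line for converting source to target."""
    cmd = ["pandoc", source, "-o", target]
    if reference_doc is not None:
        # Copy styles from an existing, styled .docx or .pptx file.
        cmd.append(f"--reference-doc={reference_doc}")
    return cmd


def convert(source, target, reference_doc=None):
    """Run pandoc if it is installed; raise a clear error otherwise."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(source, target, reference_doc), check=True)


print(pandoc_command("mydoc.tex", "mydoc.docx"))
# ['pandoc', 'mydoc.tex', '-o', 'mydoc.docx']
```

Keeping the command as a list of arguments, rather than one long string, avoids shell-quoting problems when file names contain spaces.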

Converting Slides

Beamer slides can be converted in much the same way that documents can. Popular output formats for slides include PDF, HTML (via dzslides, slidy, or revealjs), and .pptx (PowerPoint).

For example, to create revealjs slides,

pandoc -f latex -t revealjs -s --self-contained -o presentation.html presentation.tex --mathjax

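where presentation.tex is the input file. (Self-contained just creates a single, large output HTML file; mathjax enables equations in the HTML. Recent versions of pandoc deprecate --self-contained in favour of --embed-resources combined with --standalone.) For PowerPoint, the equivalent is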

pandoc -f latex -t pptx -o presentation.pptx presentation.tex

As with Word documents, you can supply a reference PowerPoint file for style, using pandoc's --reference-doc option. Here is a minimal example of the tex code for a beamer presentation:

\documentclass[aspectratio=169]{beamer}
\usepackage[english]{babel}
\usepackage[utf8x]{inputenc}
\mode<presentation>
{
  \usetheme{default}
  \usecolortheme{default}
  \usefonttheme{default}
  \setbeamertemplate{caption}[numbered]
}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}

\title{Title for a minimal beamer presentation}
\author{Author One}
\institute{Name of institution}
\date{\today}

\begin{document}
\begin{frame}
  \titlepage
\end{frame}

\section{Section One}

\begin{frame}{Slide with bullet points}
    This is a bullet list of two points:
    \begin{itemize}
        \item Point one
        \item Point two
    \end{itemize}
\end{frame}
\section{Section Two}
\begin{frame}
Slide with an equation
\[
    u'(c_{t})=\beta(1+r_{t+1})u'(c_{t+1})
\]
\end{frame}

\end{document}

Create a .tex beamer file from the tex code above and convert it to a PowerPoint presentation using **pandoc**.