diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..11f6ef6
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,9 @@
+## LaTeX artifacts ##
+#####################
+*.aux
+*.bbl
+*.blg
+*.log
+*.out
+*.synctex.gz
+*.pdf
diff --git a/README.md b/README.md
index 7c51b24..8414893 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,5 @@ This repository accompanies the paper _[The Human Evaluation Datasheet v1.0: A T
 
 Initially the sheet was developed for the [ReproGen](https://reprogen.github.io/) shared task.
 
-## Repository Structure
-
-### Sheet
-the datasheet in LaTeX
+## Sheet Templates
+The datasheet [templates](./sheet/) in LaTeX and Markdown.
diff --git a/scripts/convert-md.sh b/scripts/convert-md.sh
new file mode 100755
index 0000000..e4b220b
--- /dev/null
+++ b/scripts/convert-md.sh
@@ -0,0 +1,15 @@
+#!/bin/sh
+# Convert the LaTeX datasheet to GitHub-flavored Markdown with pandoc (-C runs citeproc).
+# Run from within scripts/: the paths below are relative to this directory.
+
+pandoc --version
+pandoc -f latex \
+ -t gfm-raw_html-fenced_divs \
+ -s ../sheet/latex/human-evaluation-datasheet-no-boxes.tex \
+ -o ../sheet/markdown/human-evaluation-datasheet.md \
+ --bibliography ../sheet/latex/human-evaluation-datasheet.bib \
+ -C \
+ --csl ../sheet/latex/apa-annotated-bibliography.csl
+
+# Manual post-processing: add reference section,
+# delete divs in references
diff --git a/sheet/latex/acl2020.sty b/sheet/latex/acl2020.sty
new file mode 100644
index 0000000..d25add4
--- /dev/null
+++ b/sheet/latex/acl2020.sty
@@ -0,0 +1,566 @@
+% This is the LaTeX style file for ACL 2020, based off of ACL 2019.
+
+% Addressing bibtex issues mentioned in https://github.com/acl-org/acl-pub/issues/2
+% Other major modifications include
+% changing the color of the line numbers to a light gray; changing font size of abstract to be 10pt; changing caption font size to be 10pt.
+% -- M Mitchell and Stephanie Lukin
+
+% 2017: modified to support DOI links in bibliography. Now uses
+% natbib package rather than defining citation commands in this file.
+% Use with acl_natbib.bst bib style. -- Dan Gildea
+
+% This is the LaTeX style for ACL 2016. It contains Margaret Mitchell's
+% line number adaptations (ported by Hai Zhao and Yannick Versley).
+
+% It is nearly identical to the style files for ACL 2015,
+% ACL 2014, EACL 2006, ACL2005, ACL 2002, ACL 2001, ACL 2000,
+% EACL 95 and EACL 99.
+%
+% Changes made include: adapt layout to A4 and centimeters, widen abstract
+
+% This is the LaTeX style file for ACL 2000. It is nearly identical to the
+% style files for EACL 95 and EACL 99. Minor changes include editing the
+% instructions to reflect use of \documentclass rather than \documentstyle
+% and removing the white space before the title on the first page
+% -- John Chen, June 29, 2000
+
+% This is the LaTeX style file for EACL-95. It is identical to the
+% style file for ANLP '94 except that the margins are adjusted for A4
+% paper. -- abney 13 Dec 94
+
+% The ANLP '94 style file is a slightly modified
+% version of the style used for AAAI and IJCAI, using some changes
+% prepared by Fernando Pereira and others and some minor changes
+% by Paul Jacobs.
+
+% Papers prepared using the aclsub.sty file and acl.bst bibtex style
+% should be easily converted to final format using this style.
+% (1) Submission information (\wordcount, \subject, and \makeidpage)
+% should be removed.
+% (2) \summary should be removed. The summary material should come
+% after \maketitle and should be in the ``abstract'' environment
+% (between \begin{abstract} and \end{abstract}).
+% (3) Check all citations. This style should handle citations correctly
+% and also allows multiple citations separated by semicolons.
+% (4) Check figures and examples. Because the final format is double-
+% column, some adjustments may have to be made to fit text in the column
+% or to choose full-width (figure*) figures.
+
+% Place this in a file called aclap.sty in the TeX search path.
+% (Placing it in the same directory as the paper should also work.)
+
+% Prepared by Peter F. Patel-Schneider, liberally using the ideas of
+% other style hackers, including Barbara Beeton.
+% This style is NOT guaranteed to work. It is provided in the hope
+% that it will make the preparation of papers easier.
+%
+% There are undoubtedly bugs in this style. If you make bug fixes,
+% improvements, etc. please let me know. My e-mail address is:
+% pfps@research.att.com
+
+% Papers are to be prepared using the ``acl_natbib'' bibliography style,
+% as follows:
+% \documentclass[11pt]{article}
+% \usepackage{acl2020}
+% \title{Title}
+% \author{Author 1 \and Author 2 \\ Address line \\ Address line \And
+% Author 3 \\ Address line \\ Address line}
+% \begin{document}
+% ...
+% \bibliography{bibliography-file}
+% \bibliographystyle{acl_natbib}
+% \end{document}
+
+% Author information can be set in various styles:
+% For several authors from the same institution:
+% \author{Author 1 \and ... \and Author n \\
+% Address line \\ ... \\ Address line}
+% if the names do not fit well on one line use
+% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
+% For authors from different institutions:
+% \author{Author 1 \\ Address line \\ ... \\ Address line
+% \And ... \And
+% Author n \\ Address line \\ ... \\ Address line}
+% To start a separate ``row'' of authors use \AND, as in
+% \author{Author 1 \\ Address line \\ ... \\ Address line
+% \AND
+% Author 2 \\ Address line \\ ... \\ Address line \And
+% Author 3 \\ Address line \\ ... \\ Address line}
+
+% If the title and author information does not fit in the area allocated,
+% place \setlength\titlebox{<dim>} right after
+% \usepackage{acl2020}
+% where <dim> is something larger than 5cm (see the example further below).
+
+% include hyperref, unless user specifies nohyperref option like this:
+% \usepackage[nohyperref]{acl2020}
+\newif\ifacl@hyperref
+\DeclareOption{hyperref}{\acl@hyperreftrue}
+\DeclareOption{nohyperref}{\acl@hyperreffalse}
+\ExecuteOptions{hyperref} % default is to use hyperref
+\ProcessOptions\relax
+\ifacl@hyperref
+ \RequirePackage{hyperref}
+ \usepackage{xcolor} % make links dark blue
+ \definecolor{darkblue}{rgb}{0, 0, 0.5}
+ \hypersetup{colorlinks=true,citecolor=darkblue, linkcolor=darkblue, urlcolor=darkblue}
+\else
+ % This definition is used if the hyperref package is not loaded.
+ % It provides a backup, no-op definition of \href.
+ % This is necessary because the \href command is used in the acl_natbib.bst file.
+ \def\href#1#2{{#2}}
+ % We still need to load xcolor in this case because the lighter line numbers require it. (SC/KG/WL)
+ \usepackage{xcolor}
+\fi
+
+\typeout{Conference Style for ACL 2019}
+
+% NOTE: Some laser printers have a serious problem printing TeX output.
+% These printing devices, commonly known as ``write-white'' laser
+% printers, tend to make characters too light. To get around this
+% problem, a darker set of fonts must be created for these devices.
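+% For example, if a long author list overflows the default 5cm title box,
+% a submission might enlarge it right after loading this package (a sketch;
+% any length above 5cm works):
+% \setlength\titlebox{6.5cm}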
+% + +\newcommand{\Thanks}[1]{\thanks{\ #1}} + +% A4 modified by Eneko; again modified by Alexander for 5cm titlebox +\setlength{\paperwidth}{21cm} % A4 +\setlength{\paperheight}{29.7cm}% A4 +\setlength\topmargin{-0.5cm} +\setlength\oddsidemargin{0cm} +\setlength\textheight{24.7cm} +\setlength\textwidth{16.0cm} +\setlength\columnsep{0.6cm} +\newlength\titlebox +\setlength\titlebox{5cm} +\setlength\headheight{5pt} +\setlength\headsep{0pt} +\thispagestyle{empty} +\pagestyle{empty} + + +\flushbottom \twocolumn \sloppy + +% We're never going to need a table of contents, so just flush it to +% save space --- suggested by drstrip@sandia-2 +\def\addcontentsline#1#2#3{} + +\newif\ifaclfinal +\aclfinalfalse +\def\aclfinalcopy{\global\aclfinaltrue} + +%% ----- Set up hooks to repeat content on every page of the output doc, +%% necessary for the line numbers in the submitted version. --MM +%% +%% Copied from CVPR 2015's cvpr_eso.sty, which appears to be largely copied from everyshi.sty. +%% +%% Original cvpr_eso.sty available at: http://www.pamitc.org/cvpr15/author_guidelines.php +%% Original evershi.sty available at: https://www.ctan.org/pkg/everyshi +%% +%% Copyright (C) 2001 Martin Schr\"oder: +%% +%% Martin Schr"oder +%% Cr"usemannallee 3 +%% D-28213 Bremen +%% Martin.Schroeder@ACM.org +%% +%% This program may be redistributed and/or modified under the terms +%% of the LaTeX Project Public License, either version 1.0 of this +%% license, or (at your option) any later version. +%% The latest version of this license is in +%% CTAN:macros/latex/base/lppl.txt. +%% +%% Happy users are requested to send [Martin] a postcard. :-) +%% +\newcommand{\@EveryShipoutACL@Hook}{} +\newcommand{\@EveryShipoutACL@AtNextHook}{} +\newcommand*{\EveryShipoutACL}[1] + {\g@addto@macro\@EveryShipoutACL@Hook{#1}} +\newcommand*{\AtNextShipoutACL@}[1] + {\g@addto@macro\@EveryShipoutACL@AtNextHook{#1}} +\newcommand{\@EveryShipoutACL@Shipout}{% + \afterassignment\@EveryShipoutACL@Test + \global\setbox\@cclv= % + } +\newcommand{\@EveryShipoutACL@Test}{% + \ifvoid\@cclv\relax + \aftergroup\@EveryShipoutACL@Output + \else + \@EveryShipoutACL@Output + \fi% + } +\newcommand{\@EveryShipoutACL@Output}{% + \@EveryShipoutACL@Hook% + \@EveryShipoutACL@AtNextHook% + \gdef\@EveryShipoutACL@AtNextHook{}% + \@EveryShipoutACL@Org@Shipout\box\@cclv% + } +\newcommand{\@EveryShipoutACL@Org@Shipout}{} +\newcommand*{\@EveryShipoutACL@Init}{% + \message{ABD: EveryShipout initializing macros}% + \let\@EveryShipoutACL@Org@Shipout\shipout + \let\shipout\@EveryShipoutACL@Shipout + } +\AtBeginDocument{\@EveryShipoutACL@Init} + +%% ----- Set up for placing additional items into the submitted version --MM +%% +%% Based on eso-pic.sty +%% +%% Original available at: https://www.ctan.org/tex-archive/macros/latex/contrib/eso-pic +%% Copyright (C) 1998-2002 by Rolf Niepraschk +%% +%% Which may be distributed and/or modified under the conditions of +%% the LaTeX Project Public License, either version 1.2 of this license +%% or (at your option) any later version. The latest version of this +%% license is in: +%% +%% http://www.latex-project.org/lppl.txt +%% +%% and version 1.2 or later is part of all distributions of LaTeX version +%% 1999/12/01 or later. 
+%% +%% In contrast to the original, we do not include the definitions for/using: +%% gridpicture, div[2], isMEMOIR[1], gridSetup[6][], subgridstyle{dotted}, labelfactor{}, gap{}, gridunitname{}, gridunit{}, gridlines{\thinlines}, subgridlines{\thinlines}, the {keyval} package, evenside margin, nor any definitions with 'color'. +%% +%% These are beyond what is needed for the NAACL/ACL style. +%% +\newcommand\LenToUnit[1]{#1\@gobble} +\newcommand\AtPageUpperLeft[1]{% + \begingroup + \@tempdima=0pt\relax\@tempdimb=\ESO@yoffsetI\relax + \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}% + \endgroup +} +\newcommand\AtPageLowerLeft[1]{\AtPageUpperLeft{% + \put(0,\LenToUnit{-\paperheight}){#1}}} +\newcommand\AtPageCenter[1]{\AtPageUpperLeft{% + \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.5\paperheight}){#1}}} +\newcommand\AtPageLowerCenter[1]{\AtPageUpperLeft{% + \put(\LenToUnit{.5\paperwidth},\LenToUnit{-\paperheight}){#1}}}% +\newcommand\AtPageLowishCenter[1]{\AtPageUpperLeft{% + \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.96\paperheight}){#1}}} +\newcommand\AtTextUpperLeft[1]{% + \begingroup + \setlength\@tempdima{1in}% + \advance\@tempdima\oddsidemargin% + \@tempdimb=\ESO@yoffsetI\relax\advance\@tempdimb-1in\relax% + \advance\@tempdimb-\topmargin% + \advance\@tempdimb-\headheight\advance\@tempdimb-\headsep% + \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}% + \endgroup +} +\newcommand\AtTextLowerLeft[1]{\AtTextUpperLeft{% + \put(0,\LenToUnit{-\textheight}){#1}}} +\newcommand\AtTextCenter[1]{\AtTextUpperLeft{% + \put(\LenToUnit{.5\textwidth},\LenToUnit{-.5\textheight}){#1}}} +\newcommand{\ESO@HookI}{} \newcommand{\ESO@HookII}{} +\newcommand{\ESO@HookIII}{} +\newcommand{\AddToShipoutPicture}{% + \@ifstar{\g@addto@macro\ESO@HookII}{\g@addto@macro\ESO@HookI}} +\newcommand{\ClearShipoutPicture}{\global\let\ESO@HookI\@empty} +\newcommand{\@ShipoutPicture}{% + \bgroup + \@tempswafalse% + \ifx\ESO@HookI\@empty\else\@tempswatrue\fi% + \ifx\ESO@HookII\@empty\else\@tempswatrue\fi% + \ifx\ESO@HookIII\@empty\else\@tempswatrue\fi% + \if@tempswa% + \@tempdima=1in\@tempdimb=-\@tempdima% + \advance\@tempdimb\ESO@yoffsetI% + \unitlength=1pt% + \global\setbox\@cclv\vbox{% + \vbox{\let\protect\relax + \pictur@(0,0)(\strip@pt\@tempdima,\strip@pt\@tempdimb)% + \ESO@HookIII\ESO@HookI\ESO@HookII% + \global\let\ESO@HookII\@empty% + \endpicture}% + \nointerlineskip% + \box\@cclv}% + \fi + \egroup +} +\EveryShipoutACL{\@ShipoutPicture} +\newif\ifESO@dvips\ESO@dvipsfalse +\newif\ifESO@grid\ESO@gridfalse +\newif\ifESO@texcoord\ESO@texcoordfalse +\newcommand*\ESO@griddelta{}\newcommand*\ESO@griddeltaY{} +\newcommand*\ESO@gridDelta{}\newcommand*\ESO@gridDeltaY{} +\newcommand*\ESO@yoffsetI{}\newcommand*\ESO@yoffsetII{} +\ifESO@texcoord + \def\ESO@yoffsetI{0pt}\def\ESO@yoffsetII{-\paperheight} + \edef\ESO@griddeltaY{-\ESO@griddelta}\edef\ESO@gridDeltaY{-\ESO@gridDelta} +\else + \def\ESO@yoffsetI{\paperheight}\def\ESO@yoffsetII{0pt} + \edef\ESO@griddeltaY{\ESO@griddelta}\edef\ESO@gridDeltaY{\ESO@gridDelta} +\fi + + +%% ----- Submitted version markup: Page numbers, ruler, and confidentiality. Using ideas/code from cvpr.sty 2015. 
--MM + +\font\aclhv = phvb at 8pt + +%% Define vruler %% + +%\makeatletter +\newbox\aclrulerbox +\newcount\aclrulercount +\newdimen\aclruleroffset +\newdimen\cv@lineheight +\newdimen\cv@boxheight +\newbox\cv@tmpbox +\newcount\cv@refno +\newcount\cv@tot +% NUMBER with left flushed zeros \fillzeros[] +\newcount\cv@tmpc@ \newcount\cv@tmpc +\def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi +\cv@tmpc=1 % +\loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi + \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat +\ifnum#2<0\advance\cv@tmpc1\relax-\fi +\loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat +\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}% +% \makevruler[][][][][] +\def\makevruler[#1][#2][#3][#4][#5]{\begingroup\offinterlineskip +\textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt% +\global\setbox\aclrulerbox=\vbox to \textheight{% +{\parskip=0pt\hfuzz=150em\cv@boxheight=\textheight +\color{gray} +\cv@lineheight=#1\global\aclrulercount=#2% +\cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2% +\cv@refno1\vskip-\cv@lineheight\vskip1ex% +\loop\setbox\cv@tmpbox=\hbox to0cm{{\aclhv\hfil\fillzeros[#4]\aclrulercount}}% +\ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break +\advance\cv@refno1\global\advance\aclrulercount#3\relax +\ifnum\cv@refno<\cv@tot\repeat}}\endgroup}% +%\makeatother + + +\def\aclpaperid{***} +\def\confidential{\textcolor{black}{ACL 2020 Submission~\aclpaperid. Confidential Review Copy. DO NOT DISTRIBUTE.}} + +%% Page numbering, Vruler and Confidentiality %% +% \makevruler[][][][][] + +% SC/KG/WL - changed line numbering to gainsboro +\definecolor{gainsboro}{rgb}{0.8, 0.8, 0.8} +%\def\aclruler#1{\makevruler[14.17pt][#1][1][3][\textheight]\usebox{\aclrulerbox}} %% old line +\def\aclruler#1{\textcolor{gainsboro}{\makevruler[14.17pt][#1][1][3][\textheight]\usebox{\aclrulerbox}}} + +\def\leftoffset{-2.1cm} %original: -45pt +\def\rightoffset{17.5cm} %original: 500pt +%\ifaclfinal\else CHANGED AB +\pagenumbering{arabic} +\AddToShipoutPicture{% +%\ifaclfinal\else CHANGED to l. 368 AB +\AtPageLowishCenter{\textcolor{black}{\thepage}} +%\aclruleroffset=\textheight +%\advance\aclruleroffset4pt +% \AtTextUpperLeft{% +% \put(\LenToUnit{\leftoffset},\LenToUnit{-\aclruleroffset}){%left ruler +% \aclruler{\aclrulercount}} +% \put(\LenToUnit{\rightoffset},\LenToUnit{-\aclruleroffset}){%right ruler +% \aclruler{\aclrulercount}} +% } +% \AtTextUpperLeft{%confidential +% \put(0,\LenToUnit{1cm}){\parbox{\textwidth}{\centering\aclhv\confidential}} +% } +%\fi +} + +%%%% ----- End settings for placing additional items into the submitted version --MM ----- %%%% + +%%%% ----- Begin settings for both submitted and camera-ready version ----- %%%% + +%% Title and Authors %% + +\newcommand\outauthor{ + \begin{tabular}[t]{c} + \ifaclfinal + \bf\@author + \else + % Avoiding common accidental de-anonymization issue. --MM +% \bf Anonymous ACL submission + \bf\@author + \fi + \end{tabular}} + +% Changing the expanded titlebox for submissions to 2.5 in (rather than 6.5cm) +% and moving it to the style sheet, rather than within the example tex file. --MM +\ifaclfinal +\else + \addtolength\titlebox{.25in} +\fi +% Mostly taken from deproc. 
+\def\maketitle{\par + \begingroup + \def\thefootnote{\fnsymbol{footnote}} + \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}} + \twocolumn[\@maketitle] \@thanks + \endgroup + \setcounter{footnote}{0} + \let\maketitle\relax \let\@maketitle\relax + \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax} +\def\@maketitle{\vbox to \titlebox{\hsize\textwidth + \linewidth\hsize \vskip 0.125in minus 0.125in \centering + {\Large\bf \@title \par} \vskip 0.2in plus 1fil minus 0.1in + {\def\and{\unskip\enspace{\rm and}\enspace}% + \def\And{\end{tabular}\hss \egroup \hskip 1in plus 2fil + \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}% + \def\AND{\end{tabular}\hss\egroup \hfil\hfil\egroup + \vskip 0.25in plus 1fil minus 0.125in + \hbox to \linewidth\bgroup\large \hfil\hfil + \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf} + \hbox to \linewidth\bgroup\large \hfil\hfil + \hbox to 0pt\bgroup\hss + \outauthor + \hss\egroup + \hfil\hfil\egroup} + \vskip 0.3in plus 2fil minus 0.1in +}} + +% margins and font size for abstract +\renewenvironment{abstract}% + {\centerline{\large\bf Abstract}% + \begin{list}{}% + {\setlength{\rightmargin}{0.6cm}% + \setlength{\leftmargin}{0.6cm}}% + \item[]\ignorespaces% + \@setsize\normalsize{12pt}\xpt\@xpt + }% + {\unskip\end{list}} + +%\renewenvironment{abstract}{\centerline{\large\bf +% Abstract}\vspace{0.5ex}\begin{quote}}{\par\end{quote}\vskip 1ex} + +% Resizing figure and table captions - SL +\newcommand{\figcapfont}{\rm} +\newcommand{\tabcapfont}{\rm} +\renewcommand{\fnum@figure}{\figcapfont Figure \thefigure} +\renewcommand{\fnum@table}{\tabcapfont Table \thetable} +\renewcommand{\figcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt} +\renewcommand{\tabcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt} +% Support for interacting with the caption, subfigure, and subcaption packages - SL +\usepackage{caption} +\DeclareCaptionFont{10pt}{\fontsize{10pt}{12pt}\selectfont} +\captionsetup{font=10pt} + +\RequirePackage{natbib} +% for citation commands in the .tex, authors can use: +% \citep, \citet, and \citeyearpar for compatibility with natbib, or +% \cite, \newcite, and \shortcite for compatibility with older ACL .sty files +\renewcommand\cite{\citep} % to get "(Author Year)" with natbib +\newcommand\shortcite{\citeyearpar}% to get "(Year)" with natbib +\newcommand\newcite{\citet} % to get "Author (Year)" with natbib + +% DK/IV: Workaround for annoying hyperref pagewrap bug +\RequirePackage{etoolbox} +%\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{\errmessage{\noexpand patch failed}} + +% bibliography + +\def\@up#1{\raise.2ex\hbox{#1}} + +% Don't put a label in the bibliography at all. Just use the unlabeled format +% instead. 
+\def\thebibliography#1{\vskip\parskip%
+\vskip\baselineskip%
+\def\baselinestretch{1}%
+\ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
+\vskip-\parskip%
+\vskip-\baselineskip%
+\section*{References\@mkboth
+ {References}{References}}\list
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
+ \setlength{\itemindent}{-\parindent}}
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
+ \sloppy\clubpenalty4000\widowpenalty4000
+ \sfcode`\.=1000\relax}
+\let\endthebibliography=\endlist
+
+
+% Allow for a bibliography of sources of attested examples
+\def\thesourcebibliography#1{\vskip\parskip%
+\vskip\baselineskip%
+\def\baselinestretch{1}%
+\ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
+\vskip-\parskip%
+\vskip-\baselineskip%
+\section*{Sources of Attested Examples\@mkboth
+ {Sources of Attested Examples}{Sources of Attested Examples}}\list
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
+ \setlength{\itemindent}{-\parindent}}
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
+ \sloppy\clubpenalty4000\widowpenalty4000
+ \sfcode`\.=1000\relax}
+\let\endthesourcebibliography=\endlist
+
+% sections with less space
+\def\section{\@startsection {section}{1}{\z@}{-2.0ex plus
+ -0.5ex minus -.2ex}{1.5ex plus 0.3ex minus .2ex}{\large\bf\raggedright}}
+\def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus
+ -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}}
+%% changed by KO to - values to get the initial parindent right
+\def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex plus
+ -0.5ex minus -.2ex}{0.5ex plus .2ex}{\normalsize\bf\raggedright}}
+\def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
+\def\subparagraph{\@startsection{subparagraph}{5}{\parindent}{1.5ex plus
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
+
+% Footnotes
+\footnotesep 6.65pt %
+\skip\footins 9pt plus 4pt minus 2pt
+\def\footnoterule{\kern-3pt \hrule width 5pc \kern 2.6pt }
+\setcounter{footnote}{0}
+
+% Lists and paragraphs
+\parindent 1em
+\topsep 4pt plus 1pt minus 2pt
+\partopsep 1pt plus 0.5pt minus 0.5pt
+\itemsep 2pt plus 1pt minus 0.5pt
+\parsep 2pt plus 1pt minus 0.5pt
+
+\leftmargin 2em \leftmargini\leftmargin \leftmarginii 2em
+\leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em \leftmarginvi .5em
+\labelwidth\leftmargini\advance\labelwidth-\labelsep \labelsep 5pt
+
+\def\@listi{\leftmargin\leftmargini}
+\def\@listii{\leftmargin\leftmarginii
+ \labelwidth\leftmarginii\advance\labelwidth-\labelsep
+ \topsep 2pt plus 1pt minus 0.5pt
+ \parsep 1pt plus 0.5pt minus 0.5pt
+ \itemsep \parsep}
+\def\@listiii{\leftmargin\leftmarginiii
+ \labelwidth\leftmarginiii\advance\labelwidth-\labelsep
+ \topsep 1pt plus 0.5pt minus 0.5pt
+ \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt
+ \itemsep \topsep}
+\def\@listiv{\leftmargin\leftmarginiv
+ \labelwidth\leftmarginiv\advance\labelwidth-\labelsep}
+\def\@listv{\leftmargin\leftmarginv
+ \labelwidth\leftmarginv\advance\labelwidth-\labelsep}
+\def\@listvi{\leftmargin\leftmarginvi
+ \labelwidth\leftmarginvi\advance\labelwidth-\labelsep}
+
+\abovedisplayskip 7pt plus2pt minus5pt%
+\belowdisplayskip \abovedisplayskip
+\abovedisplayshortskip 0pt plus3pt%
+\belowdisplayshortskip 4pt plus3pt minus3pt%
+
+% Less leading in most fonts (due to the narrow columns)
+% The choices were between 1-pt and 1.5-pt leading
+\def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt}
+\def\small{\@setsize\small{10pt}\ixpt\@ixpt}
+\def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt}
+\def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt}
+\def\tiny{\@setsize\tiny{7pt}\vipt\@vipt}
+\def\large{\@setsize\large{14pt}\xiipt\@xiipt}
+\def\Large{\@setsize\Large{16pt}\xivpt\@xivpt}
+\def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt}
+\def\huge{\@setsize\huge{23pt}\xxpt\@xxpt}
+\def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt}
diff --git a/sheet/latex/acl_natbib.bst b/sheet/latex/acl_natbib.bst
new file mode 100644
index 0000000..821195d
--- /dev/null
+++ b/sheet/latex/acl_natbib.bst
@@ -0,0 +1,1991 @@
+%%% acl_natbib.bst
+%%% Modification of BibTeX style file acl_natbib_nourl.bst
+%%% ... by urlbst, version 0.7 (marked with "% urlbst")
+%%% See the urlbst distribution for details.
+%%% Added webpage entry type, and url and lastchecked fields.
+%%% Added eprint support.
+%%% Added DOI support.
+%%% Added PUBMED support.
+%%% Added hyperref support.
+%%% Original headers follow...
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
+% BibTeX style file acl_natbib_nourl.bst
+%
+% intended as input to urlbst script
+% $ ./urlbst --hyperref --inlinelinks acl_natbib_nourl.bst > acl_natbib.bst
+%
+% adapted from compling.bst
+% in order to mimic the style files for ACL conferences prior to 2017
+% by making the following three changes:
+% - for @incollection, page numbers now follow volume title.
+% - for @inproceedings, address now follows conference name.
+% (address is intended as location of conference,
+% not address of publisher.)
+% - for papers with three authors, use et al. in citation
+% Dan Gildea 2017/06/08
+% - fixed a bug with format.chapter - error given if chapter is empty
+% with inbook.
+% Shay Cohen 2018/02/16
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
+% BibTeX style file compling.bst
+%
+% Intended for the journal Computational Linguistics (ACL/MIT Press)
+% Created by Ron Artstein on 2005/08/22
+% For use with natbib for author-year citations.
+%
+% I created this file in order to allow submissions to the journal
+% Computational Linguistics using the natbib package for author-year
+% citations, which offers a lot more flexibility than CL's
+% official citation package. This file adheres strictly to the official
+% style guide available from the MIT Press:
+%
+% http://mitpress.mit.edu/journals/coli/compling_style.pdf
+%
+% This includes all the various quirks of the style guide, for example:
+% - a chapter from a monograph (@inbook) has no page numbers.
+% - an article from an edited volume (@incollection) has page numbers
+% after the publisher and address.
+% - an article from a proceedings volume (@inproceedings) has page
+% numbers before the publisher and address.
+%
+% Where the style guide was inconsistent or not specific enough I
+% looked at actual published articles and exercised my own judgment.
+% I noticed two inconsistencies in the style guide:
+%
+% - The style guide gives one example of an article from an edited
+% volume with the editor's name spelled out in full, and another
+% with the editors' names abbreviated. I chose to accept the first
+% one as correct, since the style guide generally shuns abbreviations,
+% and editors' names are also spelled out in some recently published
+% articles.
+%
+% - The style guide gives one example of a reference where the word
+% "and" between two authors is preceded by a comma. This is most
+% likely a typo, since in all other cases with just two authors or
+% editors there is no comma before the word ``and''.
+%
+% One case where the style guide is not being specific is the placement
+% of the edition number, for which no example is given. I chose to put
+% it immediately after the title, which I (subjectively) find natural,
+% and is also the place of the edition in a few recently published
+% articles.
+%
+% This file correctly reproduces all of the examples in the official
+% style guide, except for the two inconsistencies noted above. I even
+% managed to get it to correctly format the proceedings example which
+% has an organization, a publisher, and two addresses (the conference
+% location and the publisher's address), though I cheated a bit by
+% putting the conference location and month as part of the title field;
+% I feel that in this case the conference location and month can be
+% considered as part of the title, and that adding a location field
+% is not justified. Note also that a location field is not standard,
+% so entries made with this field would not port nicely to other styles.
+% However, if authors feel that there's a need for a location field
+% then tell me and I'll see what I can do.
+%
+% The file also produces to my satisfaction all the bibliographical
+% entries in my recent (joint) submission to CL (this was the original
+% motivation for creating the file). I also tested it by running it
+% on a larger set of entries and eyeballing the results. There may of
+% course still be errors, especially with combinations of fields that
+% are not that common, or with cross-references (which I seldom use).
+% If you find such errors please write to me.
+%
+% I hope people find this file useful. Please email me with comments
+% and suggestions.
+%
+% Ron Artstein
+% artstein [at] essex.ac.uk
+% August 22, 2005.
+%
+% Some technical notes.
+%
+% This file is based on a file generated with the custom-bib package
+% by Patrick W. Daly (see selected options below), which was then
+% manually customized to conform with certain CL requirements which
+% cannot be met by custom-bib. Departures from the generated file
+% include:
+%
+% Function inbook: moved publisher and address to the end; moved
+% edition after title; replaced function format.chapter.pages by
+% new function format.chapter to output chapter without pages.
+%
+% Function inproceedings: moved publisher and address to the end;
+% replaced function format.in.ed.booktitle by new function
+% format.in.booktitle to output the proceedings title without
+% the editor.
+%
+% Functions book, incollection, manual: moved edition after title.
+%
+% Function mastersthesis: formatted title as for articles (unlike
+% phdthesis which is formatted as book) and added month.
+%
+% Function proceedings: added new.sentence between organization and
+% publisher when both are present.
+%
+% Function format.lab.names: modified so that it gives all the
+% authors' surnames for in-text citations for one, two and three
+% authors and only uses "et al." for works with four authors or more
+% (thanks to Ken Shan for convincing me to go through the trouble of
+% modifying this function rather than using unreliable hacks).
+%
+% Changes:
+%
+% 2006-10-27: Changed function reverse.pass so that the extra label is
+% enclosed in parentheses when the year field ends in an uppercase or
+% lowercase letter (change modeled after Uli Sauerland's modification
+% of nals.bst). RA.
+% +% +% The preamble of the generated file begins below: +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%% +%% This is file `compling.bst', +%% generated with the docstrip utility. +%% +%% The original source files were: +%% +%% merlin.mbs (with options: `ay,nat,vonx,nm-revv1,jnrlst,keyxyr,blkyear,dt-beg,yr-per,note-yr,num-xser,pre-pub,xedn,nfss') +%% ---------------------------------------- +%% *** Intended for the journal Computational Linguistics *** +%% +%% Copyright 1994-2002 Patrick W Daly + % =============================================================== + % IMPORTANT NOTICE: + % This bibliographic style (bst) file has been generated from one or + % more master bibliographic style (mbs) files, listed above. + % + % This generated file can be redistributed and/or modified under the terms + % of the LaTeX Project Public License Distributed from CTAN + % archives in directory macros/latex/base/lppl.txt; either + % version 1 of the License, or any later version. + % =============================================================== + % Name and version information of the main mbs file: + % \ProvidesFile{merlin.mbs}[2002/10/21 4.05 (PWD, AO, DPC)] + % For use with BibTeX version 0.99a or later + %------------------------------------------------------------------- + % This bibliography style file is intended for texts in ENGLISH + % This is an author-year citation style bibliography. As such, it is + % non-standard LaTeX, and requires a special package file to function properly. + % Such a package is natbib.sty by Patrick W. Daly + % The form of the \bibitem entries is + % \bibitem[Jones et al.(1990)]{key}... + % \bibitem[Jones et al.(1990)Jones, Baker, and Smith]{key}... + % The essential feature is that the label (the part in brackets) consists + % of the author names, as they should appear in the citation, with the year + % in parentheses following. There must be no space before the opening + % parenthesis! + % With natbib v5.3, a full list of authors may also follow the year. + % In natbib.sty, it is possible to define the type of enclosures that is + % really wanted (brackets or parentheses), but in either case, there must + % be parentheses in the label. + % The \cite command functions as follows: + % \citet{key} ==>> Jones et al. (1990) + % \citet*{key} ==>> Jones, Baker, and Smith (1990) + % \citep{key} ==>> (Jones et al., 1990) + % \citep*{key} ==>> (Jones, Baker, and Smith, 1990) + % \citep[chap. 2]{key} ==>> (Jones et al., 1990, chap. 2) + % \citep[e.g.][]{key} ==>> (e.g. Jones et al., 1990) + % \citep[e.g.][p. 32]{key} ==>> (e.g. Jones et al., p. 32) + % \citeauthor{key} ==>> Jones et al. + % \citeauthor*{key} ==>> Jones, Baker, and Smith + % \citeyear{key} ==>> 1990 + %--------------------------------------------------------------------- + +ENTRY + { address + author + booktitle + chapter + edition + editor + howpublished + institution + journal + key + month + note + number + organization + pages + publisher + school + series + title + type + volume + year + eprint % urlbst + doi % urlbst + pubmed % urlbst + url % urlbst + lastchecked % urlbst + } + {} + { label extra.label sort.label short.list } +INTEGERS { output.state before.all mid.sentence after.sentence after.block } +% urlbst... 
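+% With the defaults chosen below (hrefform=2, inlinelinks=1), a reference
+% carrying e.g. doi = {10.1234/56789} (a made-up DOI) gets no separate
+% "doi:..." block in the output; instead its title is wrapped in an inline
+% link, roughly: \href {https://doi.org/10.1234/56789} {Title of the Paper}.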
+% urlbst constants and state variables +STRINGS { urlintro + eprinturl eprintprefix doiprefix doiurl pubmedprefix pubmedurl + citedstring onlinestring linktextstring + openinlinelink closeinlinelink } +INTEGERS { hrefform inlinelinks makeinlinelink + addeprints adddoiresolver addpubmedresolver } +FUNCTION {init.urlbst.variables} +{ + % The following constants may be adjusted by hand, if desired + + % The first set allow you to enable or disable certain functionality. + #1 'addeprints := % 0=no eprints; 1=include eprints + #1 'adddoiresolver := % 0=no DOI resolver; 1=include it + #1 'addpubmedresolver := % 0=no PUBMED resolver; 1=include it + #2 'hrefform := % 0=no crossrefs; 1=hypertex xrefs; 2=hyperref refs + #1 'inlinelinks := % 0=URLs explicit; 1=URLs attached to titles + + % String constants, which you _might_ want to tweak. + "URL: " 'urlintro := % prefix before URL; typically "Available from:" or "URL": + "online" 'onlinestring := % indication that resource is online; typically "online" + "cited " 'citedstring := % indicator of citation date; typically "cited " + "[link]" 'linktextstring := % dummy link text; typically "[link]" + "http://arxiv.org/abs/" 'eprinturl := % prefix to make URL from eprint ref + "arXiv:" 'eprintprefix := % text prefix printed before eprint ref; typically "arXiv:" + "https://doi.org/" 'doiurl := % prefix to make URL from DOI + "doi:" 'doiprefix := % text prefix printed before DOI ref; typically "doi:" + "http://www.ncbi.nlm.nih.gov/pubmed/" 'pubmedurl := % prefix to make URL from PUBMED + "PMID:" 'pubmedprefix := % text prefix printed before PUBMED ref; typically "PMID:" + + % The following are internal state variables, not configuration constants, + % so they shouldn't be fiddled with. + #0 'makeinlinelink := % state variable managed by possibly.setup.inlinelink + "" 'openinlinelink := % ditto + "" 'closeinlinelink := % ditto +} +INTEGERS { + bracket.state + outside.brackets + open.brackets + within.brackets + close.brackets +} +% ...urlbst to here +FUNCTION {init.state.consts} +{ #0 'outside.brackets := % urlbst... + #1 'open.brackets := + #2 'within.brackets := + #3 'close.brackets := % ...urlbst to here + + #0 'before.all := + #1 'mid.sentence := + #2 'after.sentence := + #3 'after.block := +} +STRINGS { s t} +% urlbst +FUNCTION {output.nonnull.original} +{ 's := + output.state mid.sentence = + { ", " * write$ } + { output.state after.block = + { add.period$ write$ + newline$ + "\newblock " write$ + } + { output.state before.all = + 'write$ + { add.period$ " " * write$ } + if$ + } + if$ + mid.sentence 'output.state := + } + if$ + s +} + +% urlbst... +% The following three functions are for handling inlinelink. They wrap +% a block of text which is potentially output with write$ by multiple +% other functions, so we don't know the content a priori. +% They communicate between each other using the variables makeinlinelink +% (which is true if a link should be made), and closeinlinelink (which holds +% the string which should close any current link. They can be called +% at any time, but start.inlinelink will be a no-op unless something has +% previously set makeinlinelink true, and the two ...end.inlinelink functions +% will only do their stuff if start.inlinelink has previously set +% closeinlinelink to be non-empty. 
+% (thanks to 'ijvm' for suggested code here)
+FUNCTION {uand}
+{ 'skip$ { pop$ #0 } if$ } % 'and' (which isn't defined at this point in the file)
+FUNCTION {possibly.setup.inlinelink}
+{ makeinlinelink hrefform #0 > uand
+    { doi empty$ adddoiresolver uand
+        { pubmed empty$ addpubmedresolver uand
+            { eprint empty$ addeprints uand
+                { url empty$
+                    { "" }
+                    { url }
+                  if$ }
+                { eprinturl eprint * }
+              if$ }
+            { pubmedurl pubmed * }
+          if$ }
+        { doiurl doi * }
+      if$
+      % an appropriately-formatted URL is now on the stack
+      hrefform #1 = % hypertex
+        { "\special {html:<a href=" quote$ * swap$ * quote$ * "> }{" * 'openinlinelink :=
+          "\special {html:</a>}" 'closeinlinelink := }
+        { "\href {" swap$ * "} {" * 'openinlinelink := % hrefform=#2 -- hyperref
+          % the space between "} {" matters: a URL of just the right length can cause "\% newline em"
+          "}" 'closeinlinelink := }
+      if$
+      #0 'makeinlinelink :=
+    }
+    'skip$
+  if$ % makeinlinelink
+}
+FUNCTION {add.inlinelink}
+{ openinlinelink empty$
+    'skip$
+    { openinlinelink swap$ * closeinlinelink *
+      "" 'openinlinelink :=
+    }
+  if$
+}
+FUNCTION {output.nonnull}
+{ % Save the thing we've been asked to output
+  's :=
+  % If the bracket-state is close.brackets, then add a close-bracket to
+  % what is currently at the top of the stack, and set bracket.state
+  % to outside.brackets
+  bracket.state close.brackets =
+    { "]" *
+      outside.brackets 'bracket.state :=
+    }
+    'skip$
+  if$
+  bracket.state outside.brackets =
+    { % We're outside all brackets -- this is the normal situation.
+      % Write out what's currently at the top of the stack, using the
+      % original output.nonnull function.
+      s
+      add.inlinelink
+      output.nonnull.original % invoke the original output.nonnull
+    }
+    { % Still in brackets. Add open-bracket or (continuation) comma, add the
+      % new text (in s) to the top of the stack, and move to the close-brackets
+      % state, ready for next time (unless inbrackets resets it). If we come
+      % into this branch, then output.state is carefully undisturbed.
+      bracket.state open.brackets =
+        { " [" * }
+        { ", " * } % bracket.state will be within.brackets
+      if$
+      s *
+      close.brackets 'bracket.state :=
+    }
+  if$
+}
+
+% Call this function just before adding something which should be presented in
+% brackets. bracket.state is handled specially within output.nonnull.
+FUNCTION {inbrackets} +{ bracket.state close.brackets = + { within.brackets 'bracket.state := } % reset the state: not open nor closed + { open.brackets 'bracket.state := } + if$ +} + +FUNCTION {format.lastchecked} +{ lastchecked empty$ + { "" } + { inbrackets citedstring lastchecked * } + if$ +} +% ...urlbst to here +FUNCTION {output} +{ duplicate$ empty$ + 'pop$ + 'output.nonnull + if$ +} +FUNCTION {output.check} +{ 't := + duplicate$ empty$ + { pop$ "empty " t * " in " * cite$ * warning$ } + 'output.nonnull + if$ +} +FUNCTION {fin.entry.original} % urlbst (renamed from fin.entry, so it can be wrapped below) +{ add.period$ + write$ + newline$ +} + +FUNCTION {new.block} +{ output.state before.all = + 'skip$ + { after.block 'output.state := } + if$ +} +FUNCTION {new.sentence} +{ output.state after.block = + 'skip$ + { output.state before.all = + 'skip$ + { after.sentence 'output.state := } + if$ + } + if$ +} +FUNCTION {add.blank} +{ " " * before.all 'output.state := +} + +FUNCTION {date.block} +{ + new.block +} + +FUNCTION {not} +{ { #0 } + { #1 } + if$ +} +FUNCTION {and} +{ 'skip$ + { pop$ #0 } + if$ +} +FUNCTION {or} +{ { pop$ #1 } + 'skip$ + if$ +} +FUNCTION {new.block.checkb} +{ empty$ + swap$ empty$ + and + 'skip$ + 'new.block + if$ +} +FUNCTION {field.or.null} +{ duplicate$ empty$ + { pop$ "" } + 'skip$ + if$ +} +FUNCTION {emphasize} +{ duplicate$ empty$ + { pop$ "" } + { "\emph{" swap$ * "}" * } + if$ +} +FUNCTION {tie.or.space.prefix} +{ duplicate$ text.length$ #3 < + { "~" } + { " " } + if$ + swap$ +} + +FUNCTION {capitalize} +{ "u" change.case$ "t" change.case$ } + +FUNCTION {space.word} +{ " " swap$ * " " * } + % Here are the language-specific definitions for explicit words. + % Each function has a name bbl.xxx where xxx is the English word. + % The language selected here is ENGLISH +FUNCTION {bbl.and} +{ "and"} + +FUNCTION {bbl.etal} +{ "et~al." } + +FUNCTION {bbl.editors} +{ "editors" } + +FUNCTION {bbl.editor} +{ "editor" } + +FUNCTION {bbl.edby} +{ "edited by" } + +FUNCTION {bbl.edition} +{ "edition" } + +FUNCTION {bbl.volume} +{ "volume" } + +FUNCTION {bbl.of} +{ "of" } + +FUNCTION {bbl.number} +{ "number" } + +FUNCTION {bbl.nr} +{ "no." } + +FUNCTION {bbl.in} +{ "in" } + +FUNCTION {bbl.pages} +{ "pages" } + +FUNCTION {bbl.page} +{ "page" } + +FUNCTION {bbl.chapter} +{ "chapter" } + +FUNCTION {bbl.techrep} +{ "Technical Report" } + +FUNCTION {bbl.mthesis} +{ "Master's thesis" } + +FUNCTION {bbl.phdthesis} +{ "Ph.D. 
thesis" } + +MACRO {jan} {"January"} + +MACRO {feb} {"February"} + +MACRO {mar} {"March"} + +MACRO {apr} {"April"} + +MACRO {may} {"May"} + +MACRO {jun} {"June"} + +MACRO {jul} {"July"} + +MACRO {aug} {"August"} + +MACRO {sep} {"September"} + +MACRO {oct} {"October"} + +MACRO {nov} {"November"} + +MACRO {dec} {"December"} + +MACRO {acmcs} {"ACM Computing Surveys"} + +MACRO {acta} {"Acta Informatica"} + +MACRO {cacm} {"Communications of the ACM"} + +MACRO {ibmjrd} {"IBM Journal of Research and Development"} + +MACRO {ibmsj} {"IBM Systems Journal"} + +MACRO {ieeese} {"IEEE Transactions on Software Engineering"} + +MACRO {ieeetc} {"IEEE Transactions on Computers"} + +MACRO {ieeetcad} + {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"} + +MACRO {ipl} {"Information Processing Letters"} + +MACRO {jacm} {"Journal of the ACM"} + +MACRO {jcss} {"Journal of Computer and System Sciences"} + +MACRO {scp} {"Science of Computer Programming"} + +MACRO {sicomp} {"SIAM Journal on Computing"} + +MACRO {tocs} {"ACM Transactions on Computer Systems"} + +MACRO {tods} {"ACM Transactions on Database Systems"} + +MACRO {tog} {"ACM Transactions on Graphics"} + +MACRO {toms} {"ACM Transactions on Mathematical Software"} + +MACRO {toois} {"ACM Transactions on Office Information Systems"} + +MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"} + +MACRO {tcs} {"Theoretical Computer Science"} +FUNCTION {bibinfo.check} +{ swap$ + duplicate$ missing$ + { + pop$ pop$ + "" + } + { duplicate$ empty$ + { + swap$ pop$ + } + { swap$ + pop$ + } + if$ + } + if$ +} +FUNCTION {bibinfo.warn} +{ swap$ + duplicate$ missing$ + { + swap$ "missing " swap$ * " in " * cite$ * warning$ pop$ + "" + } + { duplicate$ empty$ + { + swap$ "empty " swap$ * " in " * cite$ * warning$ + } + { swap$ + pop$ + } + if$ + } + if$ +} +STRINGS { bibinfo} +INTEGERS { nameptr namesleft numnames } + +FUNCTION {format.names} +{ 'bibinfo := + duplicate$ empty$ 'skip$ { + 's := + "" 't := + #1 'nameptr := + s num.names$ 'numnames := + numnames 'namesleft := + { namesleft #0 > } + { s nameptr + duplicate$ #1 > + { "{ff~}{vv~}{ll}{, jj}" } + { "{ff~}{vv~}{ll}{, jj}" } % first name first for first author +% { "{vv~}{ll}{, ff}{, jj}" } % last name first for first author + if$ + format.name$ + bibinfo bibinfo.check + 't := + nameptr #1 > + { + namesleft #1 > + { ", " * t * } + { + numnames #2 > + { "," * } + 'skip$ + if$ + s nameptr "{ll}" format.name$ duplicate$ "others" = + { 't := } + { pop$ } + if$ + t "others" = + { + " " * bbl.etal * + } + { + bbl.and + space.word * t * + } + if$ + } + if$ + } + 't + if$ + nameptr #1 + 'nameptr := + namesleft #1 - 'namesleft := + } + while$ + } if$ +} +FUNCTION {format.names.ed} +{ + 'bibinfo := + duplicate$ empty$ 'skip$ { + 's := + "" 't := + #1 'nameptr := + s num.names$ 'numnames := + numnames 'namesleft := + { namesleft #0 > } + { s nameptr + "{ff~}{vv~}{ll}{, jj}" + format.name$ + bibinfo bibinfo.check + 't := + nameptr #1 > + { + namesleft #1 > + { ", " * t * } + { + numnames #2 > + { "," * } + 'skip$ + if$ + s nameptr "{ll}" format.name$ duplicate$ "others" = + { 't := } + { pop$ } + if$ + t "others" = + { + + " " * bbl.etal * + } + { + bbl.and + space.word * t * + } + if$ + } + if$ + } + 't + if$ + nameptr #1 + 'nameptr := + namesleft #1 - 'namesleft := + } + while$ + } if$ +} +FUNCTION {format.key} +{ empty$ + { key field.or.null } + { "" } + if$ +} + +FUNCTION {format.authors} +{ author "author" format.names +} +FUNCTION {get.bbl.editor} +{ editor num.names$ #1 > 'bbl.editors 
'bbl.editor if$ } + +FUNCTION {format.editors} +{ editor "editor" format.names duplicate$ empty$ 'skip$ + { + "," * + " " * + get.bbl.editor + * + } + if$ +} +FUNCTION {format.note} +{ + note empty$ + { "" } + { note #1 #1 substring$ + duplicate$ "{" = + 'skip$ + { output.state mid.sentence = + { "l" } + { "u" } + if$ + change.case$ + } + if$ + note #2 global.max$ substring$ * "note" bibinfo.check + } + if$ +} + +FUNCTION {format.title} +{ title + duplicate$ empty$ 'skip$ + { "t" change.case$ } + if$ + "title" bibinfo.check +} +FUNCTION {format.full.names} +{'s := + "" 't := + #1 'nameptr := + s num.names$ 'numnames := + numnames 'namesleft := + { namesleft #0 > } + { s nameptr + "{vv~}{ll}" format.name$ + 't := + nameptr #1 > + { + namesleft #1 > + { ", " * t * } + { + s nameptr "{ll}" format.name$ duplicate$ "others" = + { 't := } + { pop$ } + if$ + t "others" = + { + " " * bbl.etal * + } + { + numnames #2 > + { "," * } + 'skip$ + if$ + bbl.and + space.word * t * + } + if$ + } + if$ + } + 't + if$ + nameptr #1 + 'nameptr := + namesleft #1 - 'namesleft := + } + while$ +} + +FUNCTION {author.editor.key.full} +{ author empty$ + { editor empty$ + { key empty$ + { cite$ #1 #3 substring$ } + 'key + if$ + } + { editor format.full.names } + if$ + } + { author format.full.names } + if$ +} + +FUNCTION {author.key.full} +{ author empty$ + { key empty$ + { cite$ #1 #3 substring$ } + 'key + if$ + } + { author format.full.names } + if$ +} + +FUNCTION {editor.key.full} +{ editor empty$ + { key empty$ + { cite$ #1 #3 substring$ } + 'key + if$ + } + { editor format.full.names } + if$ +} + +FUNCTION {make.full.names} +{ type$ "book" = + type$ "inbook" = + or + 'author.editor.key.full + { type$ "proceedings" = + 'editor.key.full + 'author.key.full + if$ + } + if$ +} + +FUNCTION {output.bibitem.original} % urlbst (renamed from output.bibitem, so it can be wrapped below) +{ newline$ + "\bibitem[{" write$ + label write$ + ")" make.full.names duplicate$ short.list = + { pop$ } + { * } + if$ + "}]{" * write$ + cite$ write$ + "}" write$ + newline$ + "" + before.all 'output.state := +} + +FUNCTION {n.dashify} +{ + 't := + "" + { t empty$ not } + { t #1 #1 substring$ "-" = + { t #1 #2 substring$ "--" = not + { "--" * + t #2 global.max$ substring$ 't := + } + { { t #1 #1 substring$ "-" = } + { "-" * + t #2 global.max$ substring$ 't := + } + while$ + } + if$ + } + { t #1 #1 substring$ * + t #2 global.max$ substring$ 't := + } + if$ + } + while$ +} + +FUNCTION {word.in} +{ bbl.in capitalize + " " * } + +FUNCTION {format.date} +{ year "year" bibinfo.check duplicate$ empty$ + { + } + 'skip$ + if$ + extra.label * + before.all 'output.state := + after.sentence 'output.state := +} +FUNCTION {format.btitle} +{ title "title" bibinfo.check + duplicate$ empty$ 'skip$ + { + emphasize + } + if$ +} +FUNCTION {either.or.check} +{ empty$ + 'pop$ + { "can't use both " swap$ * " fields in " * cite$ * warning$ } + if$ +} +FUNCTION {format.bvolume} +{ volume empty$ + { "" } + { bbl.volume volume tie.or.space.prefix + "volume" bibinfo.check * * + series "series" bibinfo.check + duplicate$ empty$ 'pop$ + { swap$ bbl.of space.word * swap$ + emphasize * } + if$ + "volume and number" number either.or.check + } + if$ +} +FUNCTION {format.number.series} +{ volume empty$ + { number empty$ + { series field.or.null } + { series empty$ + { number "number" bibinfo.check } + { output.state mid.sentence = + { bbl.number } + { bbl.number capitalize } + if$ + number tie.or.space.prefix "number" bibinfo.check * * + bbl.in space.word * + series 
"series" bibinfo.check * + } + if$ + } + if$ + } + { "" } + if$ +} + +FUNCTION {format.edition} +{ edition duplicate$ empty$ 'skip$ + { + output.state mid.sentence = + { "l" } + { "t" } + if$ change.case$ + "edition" bibinfo.check + " " * bbl.edition * + } + if$ +} +INTEGERS { multiresult } +FUNCTION {multi.page.check} +{ 't := + #0 'multiresult := + { multiresult not + t empty$ not + and + } + { t #1 #1 substring$ + duplicate$ "-" = + swap$ duplicate$ "," = + swap$ "+" = + or or + { #1 'multiresult := } + { t #2 global.max$ substring$ 't := } + if$ + } + while$ + multiresult +} +FUNCTION {format.pages} +{ pages duplicate$ empty$ 'skip$ + { duplicate$ multi.page.check + { + bbl.pages swap$ + n.dashify + } + { + bbl.page swap$ + } + if$ + tie.or.space.prefix + "pages" bibinfo.check + * * + } + if$ +} +FUNCTION {format.journal.pages} +{ pages duplicate$ empty$ 'pop$ + { swap$ duplicate$ empty$ + { pop$ pop$ format.pages } + { + ":" * + swap$ + n.dashify + "pages" bibinfo.check + * + } + if$ + } + if$ +} +FUNCTION {format.vol.num.pages} +{ volume field.or.null + duplicate$ empty$ 'skip$ + { + "volume" bibinfo.check + } + if$ + number "number" bibinfo.check duplicate$ empty$ 'skip$ + { + swap$ duplicate$ empty$ + { "there's a number but no volume in " cite$ * warning$ } + 'skip$ + if$ + swap$ + "(" swap$ * ")" * + } + if$ * + format.journal.pages +} + +FUNCTION {format.chapter} +{ chapter empty$ + 'format.pages + { type empty$ + { bbl.chapter } + { type "l" change.case$ + "type" bibinfo.check + } + if$ + chapter tie.or.space.prefix + "chapter" bibinfo.check + * * + } + if$ +} + +FUNCTION {format.chapter.pages} +{ chapter empty$ + 'format.pages + { type empty$ + { bbl.chapter } + { type "l" change.case$ + "type" bibinfo.check + } + if$ + chapter tie.or.space.prefix + "chapter" bibinfo.check + * * + pages empty$ + 'skip$ + { ", " * format.pages * } + if$ + } + if$ +} + +FUNCTION {format.booktitle} +{ + booktitle "booktitle" bibinfo.check + emphasize +} +FUNCTION {format.in.booktitle} +{ format.booktitle duplicate$ empty$ 'skip$ + { + word.in swap$ * + } + if$ +} +FUNCTION {format.in.ed.booktitle} +{ format.booktitle duplicate$ empty$ 'skip$ + { + editor "editor" format.names.ed duplicate$ empty$ 'pop$ + { + "," * + " " * + get.bbl.editor + ", " * + * swap$ + * } + if$ + word.in swap$ * + } + if$ +} +FUNCTION {format.thesis.type} +{ type duplicate$ empty$ + 'pop$ + { swap$ pop$ + "t" change.case$ "type" bibinfo.check + } + if$ +} +FUNCTION {format.tr.number} +{ number "number" bibinfo.check + type duplicate$ empty$ + { pop$ bbl.techrep } + 'skip$ + if$ + "type" bibinfo.check + swap$ duplicate$ empty$ + { pop$ "t" change.case$ } + { tie.or.space.prefix * * } + if$ +} +FUNCTION {format.article.crossref} +{ + word.in + " \cite{" * crossref * "}" * +} +FUNCTION {format.book.crossref} +{ volume duplicate$ empty$ + { "empty volume in " cite$ * "'s crossref of " * crossref * warning$ + pop$ word.in + } + { bbl.volume + capitalize + swap$ tie.or.space.prefix "volume" bibinfo.check * * bbl.of space.word * + } + if$ + " \cite{" * crossref * "}" * +} +FUNCTION {format.incoll.inproc.crossref} +{ + word.in + " \cite{" * crossref * "}" * +} +FUNCTION {format.org.or.pub} +{ 't := + "" + address empty$ t empty$ and + 'skip$ + { + t empty$ + { address "address" bibinfo.check * + } + { t * + address empty$ + 'skip$ + { ", " * address "address" bibinfo.check * } + if$ + } + if$ + } + if$ +} +FUNCTION {format.publisher.address} +{ publisher "publisher" bibinfo.warn format.org.or.pub +} + +FUNCTION 
{format.organization.address}
+{ organization "organization" bibinfo.check format.org.or.pub
+}
+
+% urlbst...
+% Functions for making hypertext links.
+% In all cases, the stack has (link-text href-url)
+%
+% make 'null' specials
+FUNCTION {make.href.null}
+{
+  pop$
+}
+% make hypertex specials
+FUNCTION {make.href.hypertex}
+{
+  "\special {html:<a href=" quote$ * swap$ * quote$ * "> }" * swap$ *
+  "\special {html:</a>}" *
+}
+% make hyperref specials
+FUNCTION {make.href.hyperref}
+{
+  "\href {" swap$ * "} {\path{" * swap$ * "}}" *
+}
+FUNCTION {make.href}
+{ hrefform #2 =
+    'make.href.hyperref % hrefform = 2
+    { hrefform #1 =
+        'make.href.hypertex % hrefform = 1
+        'make.href.null % hrefform = 0 (or anything else)
+      if$
+    }
+  if$
+}
+
+% If inlinelinks is true, then format.url should be a no-op, since it's
+% (a) redundant, and (b) could end up as a link-within-a-link.
+FUNCTION {format.url}
+{ inlinelinks #1 = url empty$ or
+    { "" }
+    { hrefform #1 =
+        { % special case -- add HyperTeX specials
+          urlintro "\url{" url * "}" * url make.href.hypertex * }
+        { urlintro "\url{" * url * "}" * }
+      if$
+    }
+  if$
+}
+
+FUNCTION {format.eprint}
+{ eprint empty$
+    { "" }
+    { eprintprefix eprint * eprinturl eprint * make.href }
+  if$
+}
+
+FUNCTION {format.doi}
+{ doi empty$
+    { "" }
+    { doiprefix doi * doiurl doi * make.href }
+  if$
+}
+
+FUNCTION {format.pubmed}
+{ pubmed empty$
+    { "" }
+    { pubmedprefix pubmed * pubmedurl pubmed * make.href }
+  if$
+}
+
+% Output a URL. We can't use the more normal idiom (something like
+% `format.url output'), because the `inbrackets' within
+% format.lastchecked applies to everything between calls to `output',
+% so that `format.url format.lastchecked * output' ends up with both
+% the URL and the lastchecked in brackets.
+FUNCTION {output.url}
+{ url empty$
+    'skip$
+    { new.block
+      format.url output
+      format.lastchecked output
+    }
+  if$
+}
+
+FUNCTION {output.web.refs}
+{
+  new.block
+  inlinelinks
+    'skip$ % links were inline -- don't repeat them
+    {
+      output.url
+      addeprints eprint empty$ not and
+        { format.eprint output.nonnull }
+        'skip$
+      if$
+      adddoiresolver doi empty$ not and
+        { format.doi output.nonnull }
+        'skip$
+      if$
+      addpubmedresolver pubmed empty$ not and
+        { format.pubmed output.nonnull }
+        'skip$
+      if$
+    }
+  if$
+}
+
+% Wrapper for output.bibitem.original.
+% If the URL field is not empty, set makeinlinelink to be true,
+% so that an inline link will be started at the next opportunity
+FUNCTION {output.bibitem}
+{ outside.brackets 'bracket.state :=
+  output.bibitem.original
+  inlinelinks url empty$ not doi empty$ not or pubmed empty$ not or eprint empty$ not or and
+    { #1 'makeinlinelink := }
+    { #0 'makeinlinelink := }
+  if$
+}
+
+% Wrapper for fin.entry.original
+FUNCTION {fin.entry}
+{ output.web.refs % urlbst
+  makeinlinelink % ooops, it appears we didn't have a title for inlinelink
+    { possibly.setup.inlinelink % add some artificial link text here, as a fallback
+      linktextstring output.nonnull }
+    'skip$
+  if$
+  bracket.state close.brackets = % urlbst
+    { "]" * }
+    'skip$
+  if$
+  fin.entry.original
+}
+
+% Webpage entry type.
+% Title and url fields required;
+% author, note, year, month, and lastchecked fields optional
+% See references
+% ISO 690-2 http://www.nlc-bnc.ca/iso/tc46sc9/standard/690-2e.htm
+% http://www.classroom.net/classroom/CitingNetResources.html
+% http://neal.ctstateu.edu/history/cite.html
+% http://www.cas.usf.edu/english/walker/mla.html
+% for citation formats for web pages.
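+% For instance, an entry like the following (all values invented for
+% illustration) is rendered with its title linked to the url and an
+% "[online]" marker after it:
+% @webpage{example21,
+%   author = {An Author},
+%   title = {Example Page Title},
+%   url = {https://example.org/page},
+%   year = {2021},
+% }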
+FUNCTION {webpage} +{ output.bibitem + author empty$ + { editor empty$ + 'skip$ % author and editor both optional + { format.editors output.nonnull } + if$ + } + { editor empty$ + { format.authors output.nonnull } + { "can't use both author and editor fields in " cite$ * warning$ } + if$ + } + if$ + new.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ + format.title "title" output.check + inbrackets onlinestring output + new.block + year empty$ + 'skip$ + { format.date "year" output.check } + if$ + % We don't need to output the URL details ('lastchecked' and 'url'), + % because fin.entry does that for us, using output.web.refs. The only + % reason we would want to put them here is if we were to decide that + % they should go in front of the rather miscellaneous information in 'note'. + new.block + note output + fin.entry +} +% ...urlbst to here + + +FUNCTION {article} +{ output.bibitem + format.authors "author" output.check + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.title "title" output.check + new.block + crossref missing$ + { + journal + "journal" bibinfo.check + emphasize + "journal" output.check + possibly.setup.inlinelink format.vol.num.pages output% urlbst + } + { format.article.crossref output.nonnull + format.pages output + } + if$ + new.block + format.note output + fin.entry +} +FUNCTION {book} +{ output.bibitem + author empty$ + { format.editors "author and editor" output.check + editor format.key output + } + { format.authors output.nonnull + crossref missing$ + { "author and editor" editor either.or.check } + 'skip$ + if$ + } + if$ + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.btitle "title" output.check + format.edition output + crossref missing$ + { format.bvolume output + new.block + format.number.series output + new.sentence + format.publisher.address output + } + { + new.block + format.book.crossref output.nonnull + } + if$ + new.block + format.note output + fin.entry +} +FUNCTION {booklet} +{ output.bibitem + format.authors output + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.title "title" output.check + new.block + howpublished "howpublished" bibinfo.check output + address "address" bibinfo.check output + new.block + format.note output + fin.entry +} + +FUNCTION {inbook} +{ output.bibitem + author empty$ + { format.editors "author and editor" output.check + editor format.key output + } + { format.authors output.nonnull + crossref missing$ + { "author and editor" editor either.or.check } + 'skip$ + if$ + } + if$ + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.btitle "title" output.check + format.edition output + crossref missing$ + { + format.bvolume output + format.number.series output + format.chapter "chapter" output.check + new.sentence + format.publisher.address output + new.block + } + { + format.chapter "chapter" output.check + new.block + format.book.crossref output.nonnull + } + if$ + new.block + format.note output + fin.entry +} + +FUNCTION {incollection} +{ output.bibitem + format.authors "author" output.check + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.title "title" output.check + new.block + crossref missing$ + { 
format.in.ed.booktitle "booktitle" output.check + format.edition output + format.bvolume output + format.number.series output + format.chapter.pages output + new.sentence + format.publisher.address output + } + { format.incoll.inproc.crossref output.nonnull + format.chapter.pages output + } + if$ + new.block + format.note output + fin.entry +} +FUNCTION {inproceedings} +{ output.bibitem + format.authors "author" output.check + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.title "title" output.check + new.block + crossref missing$ + { format.in.booktitle "booktitle" output.check + format.bvolume output + format.number.series output + format.pages output + address "address" bibinfo.check output + new.sentence + organization "organization" bibinfo.check output + publisher "publisher" bibinfo.check output + } + { format.incoll.inproc.crossref output.nonnull + format.pages output + } + if$ + new.block + format.note output + fin.entry +} +FUNCTION {conference} { inproceedings } +FUNCTION {manual} +{ output.bibitem + format.authors output + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.btitle "title" output.check + format.edition output + organization address new.block.checkb + organization "organization" bibinfo.check output + address "address" bibinfo.check output + new.block + format.note output + fin.entry +} + +FUNCTION {mastersthesis} +{ output.bibitem + format.authors "author" output.check + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.title + "title" output.check + new.block + bbl.mthesis format.thesis.type output.nonnull + school "school" bibinfo.warn output + address "address" bibinfo.check output + month "month" bibinfo.check output + new.block + format.note output + fin.entry +} + +FUNCTION {misc} +{ output.bibitem + format.authors output + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.title output + new.block + howpublished "howpublished" bibinfo.check output + new.block + format.note output + fin.entry +} +FUNCTION {phdthesis} +{ output.bibitem + format.authors "author" output.check + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.btitle + "title" output.check + new.block + bbl.phdthesis format.thesis.type output.nonnull + school "school" bibinfo.warn output + address "address" bibinfo.check output + new.block + format.note output + fin.entry +} + +FUNCTION {proceedings} +{ output.bibitem + format.editors output + editor format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.btitle "title" output.check + format.bvolume output + format.number.series output + new.sentence + publisher empty$ + { format.organization.address output } + { organization "organization" bibinfo.check output + new.sentence + format.publisher.address output + } + if$ + new.block + format.note output + fin.entry +} + +FUNCTION {techreport} +{ output.bibitem + format.authors "author" output.check + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.title + "title" 
output.check + new.block + format.tr.number output.nonnull + institution "institution" bibinfo.warn output + address "address" bibinfo.check output + new.block + format.note output + fin.entry +} + +FUNCTION {unpublished} +{ output.bibitem + format.authors "author" output.check + author format.key output + format.date "year" output.check + date.block + title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst + format.title "title" output.check + new.block + format.note "note" output.check + fin.entry +} + +FUNCTION {default.type} { misc } +READ +FUNCTION {sortify} +{ purify$ + "l" change.case$ +} +INTEGERS { len } +FUNCTION {chop.word} +{ 's := + 'len := + s #1 len substring$ = + { s len #1 + global.max$ substring$ } + 's + if$ +} +FUNCTION {format.lab.names} +{ 's := + "" 't := + s #1 "{vv~}{ll}" format.name$ + s num.names$ duplicate$ + #2 > + { pop$ + " " * bbl.etal * + } + { #2 < + 'skip$ + { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" = + { + " " * bbl.etal * + } + { bbl.and space.word * s #2 "{vv~}{ll}" format.name$ + * } + if$ + } + if$ + } + if$ +} + +FUNCTION {author.key.label} +{ author empty$ + { key empty$ + { cite$ #1 #3 substring$ } + 'key + if$ + } + { author format.lab.names } + if$ +} + +FUNCTION {author.editor.key.label} +{ author empty$ + { editor empty$ + { key empty$ + { cite$ #1 #3 substring$ } + 'key + if$ + } + { editor format.lab.names } + if$ + } + { author format.lab.names } + if$ +} + +FUNCTION {editor.key.label} +{ editor empty$ + { key empty$ + { cite$ #1 #3 substring$ } + 'key + if$ + } + { editor format.lab.names } + if$ +} + +FUNCTION {calc.short.authors} +{ type$ "book" = + type$ "inbook" = + or + 'author.editor.key.label + { type$ "proceedings" = + 'editor.key.label + 'author.key.label + if$ + } + if$ + 'short.list := +} + +FUNCTION {calc.label} +{ calc.short.authors + short.list + "(" + * + year duplicate$ empty$ + short.list key field.or.null = or + { pop$ "" } + 'skip$ + if$ + * + 'label := +} + +FUNCTION {sort.format.names} +{ 's := + #1 'nameptr := + "" + s num.names$ 'numnames := + numnames 'namesleft := + { namesleft #0 > } + { s nameptr + "{ll{ }}{ ff{ }}{ jj{ }}" + format.name$ 't := + nameptr #1 > + { + " " * + namesleft #1 = t "others" = and + { "zzzzz" * } + { t sortify * } + if$ + } + { t sortify * } + if$ + nameptr #1 + 'nameptr := + namesleft #1 - 'namesleft := + } + while$ +} + +FUNCTION {sort.format.title} +{ 't := + "A " #2 + "An " #3 + "The " #4 t chop.word + chop.word + chop.word + sortify + #1 global.max$ substring$ +} +FUNCTION {author.sort} +{ author empty$ + { key empty$ + { "to sort, need author or key in " cite$ * warning$ + "" + } + { key sortify } + if$ + } + { author sort.format.names } + if$ +} +FUNCTION {author.editor.sort} +{ author empty$ + { editor empty$ + { key empty$ + { "to sort, need author, editor, or key in " cite$ * warning$ + "" + } + { key sortify } + if$ + } + { editor sort.format.names } + if$ + } + { author sort.format.names } + if$ +} +FUNCTION {editor.sort} +{ editor empty$ + { key empty$ + { "to sort, need editor or key in " cite$ * warning$ + "" + } + { key sortify } + if$ + } + { editor sort.format.names } + if$ +} +FUNCTION {presort} +{ calc.label + label sortify + " " + * + type$ "book" = + type$ "inbook" = + or + 'author.editor.sort + { type$ "proceedings" = + 'editor.sort + 'author.sort + if$ + } + if$ + #1 entry.max$ substring$ + 'sort.label := + sort.label + * + " " + * + title field.or.null + sort.format.title + * + #1 entry.max$ substring$ + 'sort.key$ := +} + +ITERATE {presort} +SORT 
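+% Added explanatory comment: presort (above) builds sort.key$ from the
+% sortified citation label, then the sorted author/editor name list, then
+% the title with any leading "A "/"An "/"The " chopped off by
+% sort.format.title. As an illustration (hypothetical entry, and assuming
+% bbl.etal is defined as "et~al." elsewhere in this file), an entry with
+% author = {Ann Smith and Bo Lee and Cy Ray} and year = {2020} gets
+% "Smith et~al." from format.lab.names and "Smith et~al.(2020" from
+% calc.label, so entries sort by label, then full names, then title.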
+STRINGS { last.label next.extra } +INTEGERS { last.extra.num number.label } +FUNCTION {initialize.extra.label.stuff} +{ #0 int.to.chr$ 'last.label := + "" 'next.extra := + #0 'last.extra.num := + #0 'number.label := +} +FUNCTION {forward.pass} +{ last.label label = + { last.extra.num #1 + 'last.extra.num := + last.extra.num int.to.chr$ 'extra.label := + } + { "a" chr.to.int$ 'last.extra.num := + "" 'extra.label := + label 'last.label := + } + if$ + number.label #1 + 'number.label := +} +FUNCTION {reverse.pass} +{ next.extra "b" = + { "a" 'extra.label := } + 'skip$ + if$ + extra.label 'next.extra := + extra.label + duplicate$ empty$ + 'skip$ + { year field.or.null #-1 #1 substring$ chr.to.int$ #65 < + { "{\natexlab{" swap$ * "}}" * } + { "{(\natexlab{" swap$ * "})}" * } + if$ } + if$ + 'extra.label := + label extra.label * 'label := +} +EXECUTE {initialize.extra.label.stuff} +ITERATE {forward.pass} +REVERSE {reverse.pass} +FUNCTION {bib.sort.order} +{ sort.label + " " + * + year field.or.null sortify + * + " " + * + title field.or.null + sort.format.title + * + #1 entry.max$ substring$ + 'sort.key$ := +} +ITERATE {bib.sort.order} +SORT +FUNCTION {begin.bib} +{ preamble$ empty$ + 'skip$ + { preamble$ write$ newline$ } + if$ + "\begin{thebibliography}{" number.label int.to.str$ * "}" * + write$ newline$ + "\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi" + write$ newline$ +} +EXECUTE {begin.bib} +EXECUTE {init.urlbst.variables} % urlbst +EXECUTE {init.state.consts} +ITERATE {call.type$} +FUNCTION {end.bib} +{ newline$ + "\end{thebibliography}" write$ newline$ +} +EXECUTE {end.bib} +%% End of customized bst file +%% +%% End of file `compling.bst'. diff --git a/sheet/latex/apa-annotated-bibliography.csl b/sheet/latex/apa-annotated-bibliography.csl new file mode 100644 index 0000000..f6fe45a --- /dev/null +++ b/sheet/latex/apa-annotated-bibliography.csl @@ -0,0 +1,1716 @@ + + diff --git a/sheet/latex/human-evaluation-datasheet-no-boxes.tex b/sheet/latex/human-evaluation-datasheet-no-boxes.tex new file mode 100644 index 0000000..b7c887c --- /dev/null +++ b/sheet/latex/human-evaluation-datasheet-no-boxes.tex @@ -0,0 +1,718 @@ +% +% File acl2020.tex +% +%% Based on the style files for ACL 2020, which were +%% Based on the style files for ACL 2018, NAACL 2018/19, which were +%% Based on the style files for ACL-2015, with some improvements +%% taken from the NAACL-2016 style +%% Based on the style files for ACL-2014, which were, in turn, +%% based on ACL-2013, ACL-2012, ACL-2011, ACL-2010, ACL-IJCNLP-2009, +%% EACL-2009, IJCNLP-2008... +%% Based on the style files for EACL 2006 by +%%e.agirre@ehu.es or Sergi.Balari@uab.es +%% and that of ACL 08 by Joakim Nivre and Noah Smith + +\documentclass[11pt,a4paper]{article} +\pdfoutput=1 % forces arxiv to use pdflatex for compilation +\usepackage[hyperref]{acl2020} +\usepackage{times} +\usepackage{latexsym} +\renewcommand{\UrlFont}{\ttfamily\small} +\newcommand{\egcattribute}[1]{\textsc{#1}} +\newcommand{\egcvalue}[1]{\textbf{\textit{#1}}} +\usepackage{enumitem} +\usepackage{color} +\usepackage{tcolorbox} +\usepackage{tikz} +\usepackage{amssymb} + +\def\UrlBreaks{\do\/\do-} % allow for breaks in urls + +% This is not strictly necessary, and may be commented out, +% but it will improve the layout of the manuscript, +% and will typically save some space. 
+\usepackage{microtype}
+
+\aclfinalcopy % Uncomment this line for the final submission
+%\def\aclpaperid{***} % Enter the acl Paper ID here
+
+\setlength\titlebox{5cm}
+% You can expand the titlebox if you need extra space
+% to show all the authors. Please do not make the titlebox
+% smaller than 5cm (the original size); we will check this
+% in the camera-ready version and ask you to change it back.
+
+\newcommand\BibTeX{B\textsc{ib}\TeX}
+
+\definecolor{azure}{rgb}{0.0, 0.5, 1.0}
+\tcbuselibrary{skins}
+\tcbset{enhanced}
+\newcommand{\qsecbox}[1]{\begin{tcolorbox}[left=1mm,right=1mm,boxrule=0.2mm,leftrule=2mm,drop fuzzy shadow,colframe=lightgray,frame style={left color=azure!90!lightgray}]#1\end{tcolorbox}}
+
+\makeatletter
+\newcommand\footnoteref[1]{\protected@xdef\@thefnmark{\ref{#1}}\@footnotemark}
+\makeatother
+
+
+\title{The Human Evaluation Datasheet 1.0: A Template for Recording\\Details of Human Evaluation Experiments in NLP\\
+\normalsize{(described in \citet{shimorina-belz-2021-heds})}
+}
+%\author{\normalsize{(described in \citet{shimorina-belz-2021-heds})}}
+\begin{document}
+\maketitle
+
+
+\section{Paper and Supplementary Resources (Questions 1.1--1.3)}\label{sec:paper-resources}
+
+Questions 1.1--1.3 record bibliographic and related information. These are straightforward and don't warrant much in-depth explanation.
+
+\vspace{-.3cm}
+\subsection*{Question 1.1: Link to paper reporting the evaluation experiment. If the paper reports more than one experiment, state which experiment you're completing this sheet for. Or, if applicable, enter `for preregistration.'}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: a link to an online copy of the main reference for the human evaluation experiment, identifying which of the experiments the form is being completed for if there are several. If the experiment hasn't been run yet, and the form is being completed for the purpose of submitting it for preregistration, simply enter `for preregistration'.
+
+\vspace{-.3cm}
+\subsection*{Question 1.2: Link to website providing resources used in the evaluation experiment (e.g.\ system outputs, evaluation tools, etc.). If there isn't one, enter `N/A'.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: link(s) to any resources used in the evaluation experiment, such as system outputs, evaluation tools, etc.\ If there aren't any publicly shared resources (yet), enter `N/A'.
+
+\vspace{-.3cm}
+\subsection*{Question 1.3: Name, affiliation and email address of person completing this sheet, and of contact author if different.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: names, affiliations and email addresses as appropriate.
+
+
+
+\section{System (Questions 2.1--2.5)}\label{sec:system}
+
+Questions 2.1--2.5 record information about the system(s) (or human-authored stand-ins) whose outputs are evaluated in the evaluation experiment that this sheet is being completed for.
+
+The input, output, and task questions in this section are closely interrelated: the value for one partially determines the others, as indicated for some combinations in Question 2.3.
+
+
+\vspace{-.3cm}
+\subsection*{Question 2.1: What type of input do the evaluated system(s) take? Select all that apply. If none match, select `Other' and describe.}\label{sec:input}
+\vspace{-.1cm}
+
+Describe the type of input, where input refers to the representations and/or data structures shared by all evaluated systems.
+
+This question is about input type, regardless of number. E.g.\ if the input is a set of documents, you would still select \textit{text: document} below.
+
+\vspace{.3cm}
+\noindent\textit{Check-box options (select all that apply)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}]
+    \item \egcvalue{raw/structured data}: numerical, symbolic, and other data, possibly structured into trees, graphs, graphical models, etc. May be the input e.g.\ to Referring Expression Generation (REG), end-to-end text generation, etc. {NB}: excludes linguistic structures.
+
+    \item \egcvalue{deep linguistic representation (DLR)}: any of a variety of deep, underspecified, semantic representations, such as abstract meaning representations \citep[AMRs;][]{banarescu-etal-2013-abstract} or discourse representation structures \citep[DRSs;][]{kamp-reyle2013discourse}.
+
+    \item \egcvalue{shallow linguistic representation (SLR)}: any of a variety of shallow, syntactic representations, e.g.\ Universal Dependency (UD) structures; typically the input to surface realisation.
+
+    \item \egcvalue{text: subsentential unit of text}: a unit of text shorter than a sentence, e.g.\ Referring Expressions (REs), verb phrase, text fragment of any length; includes titles/headlines.
+
+    \item \egcvalue{text: sentence}: a single sentence (or set of sentences).
+
+    \item \egcvalue{text: multiple sentences}: a sequence of multiple sentences, without any document structure (or a set of such sequences).
+
+    \item \egcvalue{text: document}: a text with document structure, such as a title, paragraph breaks or sections, e.g.\ a set of news reports for summarisation.
+
+    \item \egcvalue{text: dialogue}: a dialogue of any length, excluding a single turn which would come under one of the other text types.
+
+    \item \egcvalue{text: other}: input is text but doesn't match any of the above \textit{text:*} categories.
+
+    \item \egcvalue{speech}: a recording of speech.
+
+    \item \egcvalue{visual}: an image or video.
+
+    \item \egcvalue{multi-modal}: catch-all value for any combination of data and/or linguistic representation and/or visual data etc.
+
+    \item \egcvalue{control feature}: a feature or parameter specifically present to control a property of the output text, e.g.\ positive stance, formality, author style.
+
+    \item \egcvalue{no input (human generation)}: human generation\footnote{\label{human-generation}We use the term `human generation' where the items being evaluated have been created manually, rather than generated by an automatic system.}, therefore no system inputs.
+
+    \item \egcvalue{other (please specify)}: if input is none of the above, choose this option and describe it.
+
+\end{enumerate}
+
+\vspace{-.3cm}
+\subsection*{Question 2.2: What type of output do the evaluated system(s) generate? Select all that apply. If none match, select `Other' and describe.}\label{sec:output}
+
+Describe the type of output, where output refers to the representations and/or data structures shared by all evaluated systems.
+
+This question is about output type, regardless of number. E.g.\ if the output is a set of documents, you would still select \textit{text: document} below.
+
+Note that the options for outputs are the same as for inputs minus the \textit{control feature} option.
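+% Added worked example (source comment only, not rendered; the systems named
+% here are hypothetical illustrations): a multi-document news summariser
+% would select `text: document' as input type in Q2.1 and, e.g.,
+% `text: multiple sentences' as output type in Q2.2; a surface realiser
+% would select `shallow linguistic representation (SLR)' as input type and,
+% e.g., `text: sentence' as output type.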
+ + +\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{raw/structured data}: numerical, symbolic, and other data, possibly structured into trees, graphs, graphical models, etc. May be the input e.g.\ to Referring Expression Generation (REG), end-to-end text generation, etc. {NB}: excludes linguistic structures. + + \item \egcvalue{deep linguistic representation (DLR)}: any of a variety of deep, underspecified, semantic representations, such as abstract meaning representations \citep[AMRs;][]{banarescu-etal-2013-abstract} or discourse representation structures \citep[DRSs;][]{kamp-reyle2013discourse}. + + \item \egcvalue{shallow linguistic representation (SLR)}: any of a variety of shallow, syntactic representations, e.g.\ Universal Dependency (UD) structures; typically the input to surface realisation. + + \item \egcvalue{text: subsentential unit of text}: a unit of text shorter than a sentence, e.g.\ Referring Expressions (REs), verb phrase, text fragment of any length; includes titles/headlines. + + \item \egcvalue{text: sentence}: a single sentence (or set of sentences). + + \item \egcvalue{text: multiple sentences}: a sequence of multiple sentences, without any document structure (or a set of such sequences). + + \item \egcvalue{text: document}: a text with document structure, such as a title, paragraph breaks or sections, e.g.\ a set of news reports for summarisation. + + \item \egcvalue{text: dialogue}: a dialogue of any length, excluding a single turn which would come under one of the other text types. + + \item \egcvalue{text: other}: select if output is text but doesn't match any of the above \textit{text:*} categories. + + \item \egcvalue{speech}: a recording of speech. + + \item \egcvalue{visual}: an image or video. + + \item \egcvalue{multi-modal}: catch-all value for any combination of data and/or linguistic representation and/or visual data etc. + + \item \egcvalue{human-generated `outputs'}: manually created stand-ins exemplifying outputs.\footnoteref{human-generation} + + \item \egcvalue{other (please specify)}: if output is none of the above, choose this option and describe it. + + \end{enumerate} + + +\vspace{-.3cm} +\subsection*{Question 2.3: How would you describe the task that the evaluated system(s) perform in mapping the inputs in Q2.1 to the outputs in Q2.2? Occasionally, more than one of the options below may apply. If none match, select `Other' and describe.}\label{sec:task} +\vspace{-.1cm} + +This field records the task performed by the system(s) being evaluated. This is independent of the application domain (financial reporting, weather forecasting, etc.), or the specific method (rule-based, neural, etc.) implemented in the system. We indicate mutual constraints between inputs, outputs and task for some of the options below. + + +\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{content selection/determination}: selecting the specific content that will be expressed in the generated text from a representation of possible content. This could be attribute selection for REG (without the surface realisation step). Note that the output here is not text. + + \item \egcvalue{content ordering/structuring}: assigning an order and/or structure to content to be included in generated text. 
Note that the output here is not text. + + \item \egcvalue{aggregation}: converting inputs (typically \textit{deep linguistic representations} or \textit{shallow linguistic representations}) in some way in order to reduce redundancy (e.g.\ representations for `they like swimming', `they like running' $\rightarrow$ representation for `they like swimming and running'). + + \item \egcvalue{referring expression generation}: generating \textit{text} to refer to a given referent, typically represented in the input as a set of attributes or a linguistic representation. + + \item \egcvalue{lexicalisation}: associating (parts of) an input representation with specific lexical items to be used in their realisation. + + \item \egcvalue{deep generation}: one-step text generation from \textit{raw/structured data} or \textit{deep linguistic representations}. One-step means that no intermediate representations are passed from one independently run module to another. + + \item \egcvalue{surface realisation (SLR to text)}: one-step text generation from \textit{shallow linguistic representations}. One-step means that no intermediate representations are passed from one independently run module to another. + + \item \egcvalue{feature-controlled text generation}: generation of text that varies along specific dimensions where the variation is controlled via \textit{control feature}s specified as part of the input. Input is a non-textual representation (for feature-controlled text-to-text generation select the matching text-to-text task). + + \item \egcvalue{data-to-text generation}: generation from \textit{raw/structured data} which may or may not include some amount of content selection as part of the generation process. Output is likely to be \textit{text:*} or \textit{multi-modal}. + + \item \egcvalue{dialogue turn generation}: generating a dialogue turn (can be a greeting or closing) from a representation of dialogue state and/or last turn(s), etc. + + \item \egcvalue{question generation}: generation of questions from given input text and/or knowledge base such that the question can be answered from the input. + + \item \egcvalue{question answering}: input is a question plus optionally a set of reference texts and/or knowledge base, and the output is the answer to the question. + + \item \egcvalue{paraphrasing/lossless simplification}: text-to-text generation where the aim is to preserve the meaning of the input while changing its wording. This can include the aim of changing the text on a given dimension, e.g.\ making it simpler, changing its stance or sentiment, etc., which may be controllable via input features. Note that this task type includes meaning-preserving text simplification (non-meaning preserving simplification comes under \textit{compression/lossy simplification} below). + + \item \egcvalue{compression/lossy simplification}: text-to-text generation that has the aim to generate a shorter, or shorter and simpler, version of the input text. This will normally affect meaning to some extent, but as a side effect, rather than the primary aim, as is the case in \textit{summarisation}. + + \item \egcvalue{machine translation}: translating text in a source language to text in a target language while maximally preserving the meaning. + + \item \egcvalue{summarisation (text-to-text)}: output is an extractive or abstractive summary of the important/relevant/salient content of the input document(s). 
+
+
+    \item \egcvalue{end-to-end text generation}: use this option if the single system task corresponds to more than one of the tasks above, implemented either as separate modules pipelined together, or as one-step generation, other than \textit{deep generation} and \textit{surface realisation}.
+
+    \item \egcvalue{image/video description}: input includes \textit{visual}, and the output describes it in some way.
+
+    \item \egcvalue{post-editing/correction}: system edits and/or corrects the input text (typically itself the textual output from another system) to yield an improved version of the text.
+
+    \item \egcvalue{other (please specify)}: if task is none of the above, choose this option and describe it.
+  \end{enumerate}
+
+
+\vspace{-.3cm}
+\subsection*{Question 2.4: Input Language(s), or `N/A'.}
+\vspace{-.1cm}
+
+This field records the language(s) of the inputs accepted by the system(s) being evaluated.
+
+\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: any language name(s) that apply, mapped to standardised full language names in ISO 639-1\footnote{\label{iso}\url{https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes}}. E.g.\ English, Herero, Hindi.
+If no language is accepted as (part of) the input, enter `N/A'.
+
+\vspace{-.3cm}
+\subsection*{Question 2.5: Output Language(s), or `N/A'.}
+\vspace{-.1cm}
+
+This field records the language(s) of the outputs generated by the system(s) being evaluated.
+
+\vspace{.2cm}
+\noindent\textit{What to enter in the text box}: any language name(s) that apply, mapped to standardised full language names in ISO 639-1 (2019)\footnoteref{iso}. E.g.\ English, Herero, Hindi.
+If no language is generated, enter `N/A'.
+
+
+\section{Output Sample, Evaluators, Experimental Design}\label{sec:design}
+
+\subsection{Sample of system outputs (or human-authored stand-ins) evaluated (Questions 3.1.1--3.1.3)}
+
+Questions 3.1.1--3.1.3 record information about the size of the sample of outputs (or human-authored stand-ins) evaluated per system, how the sample was selected, and what its statistical power is.
+
+\vspace{-.3cm}
+\subsection*{Question 3.1.1: How many system outputs (or other evaluation items) are evaluated per system in the evaluation experiment? Answer should be an integer.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: The number of system outputs (or other evaluation items) that are evaluated per system by at least one evaluator in the experiment, as an integer.
+
+\vspace{-.3cm}
+\subsection*{Question 3.1.2: How are system outputs (or other evaluation items) selected for inclusion in the evaluation experiment? If none match, select `Other' and describe.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{by an automatic random process from a larger set}: outputs were selected for inclusion in the experiment by a script using a pseudo-random number generator; don't use this option if the script selects every $n$th output (which is not random).
+    \item \egcvalue{by an automatic random process but using stratified sampling over given properties}: use this option if selection was by a random script as above, but with added constraints ensuring that the sample is representative of the set of outputs it was selected from, in terms of given properties, such as sentence length, positive/negative stance, etc.
+    \item \egcvalue{by manual, arbitrary selection}: output sample was selected by hand, or automatically from a manually compiled list, without a specific selection criterion.
+    \item \egcvalue{by manual selection aimed at achieving balance or variety relative to given properties}: selection by hand as above, but with specific selection criteria, e.g.\ same number of outputs from each time period.
+    \item \egcvalue{Other (please specify)}: if selection method is none of the above, choose this option and describe it.
+\end{enumerate}
+
+\vspace{-.3cm}
+\subsection*{Question 3.1.3: What is the statistical power of the sample size?}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: The results of a statistical power calculation on the output sample: provide numerical results and a link to the script used (or another way of identifying the script). See, e.g., \citet{card-etal-2020-little}.
+
+
+
+\subsection{Evaluators (Questions 3.2.1--3.2.5)}
+
+Questions 3.2.1--3.2.5 record information about the evaluators participating in the experiment.
+
+\vspace{-.3cm}
+\subsection*{Question 3.2.1: How many evaluators are there in this experiment? Answer should be an integer.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: the total number of evaluators participating in the experiment, as an integer.
+
+\vspace{-.3cm}
+\subsection*{Question 3.2.2: What kind of evaluators are in this experiment? Select all that apply. If none match, select `Other' and describe. In all cases, provide details in the text box under `Other'.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Check-box options (select all that apply)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}]
+    \item \egcvalue{experts}: participants are considered domain experts, e.g.\ meteorologists evaluating a weather forecast generator, or nurses evaluating an ICU report generator.
+    \item \egcvalue{non-experts}: participants are not domain experts.
+    \item \egcvalue{paid (including non-monetary compensation such as course credits)}: participants were given some form of compensation for their participation, including vouchers, course credits, and reimbursement for travel unless based on receipts.
+    \item \egcvalue{not paid}: participants were not given compensation of any kind.
+    \item \egcvalue{previously known to authors}: (one of the) researchers running the experiment knew some or all of the participants before recruiting them for the experiment.
+    \item \egcvalue{not previously known to authors}: none of the researchers running the experiment knew any of the participants before recruiting them for the experiment.
+    \item \egcvalue{evaluators include one or more of the authors}: one or more researchers running the experiment were among the participants.
+    \item \egcvalue{evaluators do not include any of the authors}: none of the researchers running the experiment were among the participants.
+    \item \egcvalue{Other} (fewer than 4 of the above apply): we believe you should be able to tick 4 of the options above. If that's not the case, use this box to explain.
+\end{enumerate}
+
+\vspace{-.3cm}
+\subsection*{Question 3.2.3: How are evaluators recruited?}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: Please explain how your evaluators are recruited. Do you send emails to a given list? Do you post invitations on social media? Posters on university walls? Were there any gatekeepers involved?
What are the exclusion/inclusion criteria? + +\vspace{-.3cm} +\subsection*{Question 3.2.4: What training and/or practice are evaluators given before starting on the evaluation itself?} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Use this space to describe any training evaluators were given as part of the experiment to prepare them for the evaluation task, including any practice evaluations they did. This includes any introductory explanations they're given, e.g.\ on the start page of an online evaluation tool. + +\vspace{-.3cm} +\subsection*{Question 3.2.5: What other characteristics do the evaluators have, known either because these were qualifying criteria, or from information gathered as part of the evaluation?} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Use this space to list any characteristics not covered in previous questions that the evaluators are known to have, either because evaluators were selected on the basis of a characteristic, or because information about a characteristic was collected as part of the evaluation. This might include geographic location of IP address, educational level, or demographic information such as gender, age, etc. Where characteristics differ among evaluators (e.g.\ gender, age, location etc.), also give numbers for each subgroup. + + +\subsection{Experimental design (Questions 3.3.1--3.3.8)} + +Questions~3.3.1--3.3.8 record information about the experimental design of the evaluation experiment. + +\vspace{-.3cm} +\subsection*{Question 3.3.1: Has the experimental design been preregistered? If yes, on which registry?} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: State `Yes' or `No'; if `Yes' also give the name of the registry and a link to the registration page for the experiment. + +%\vspace{-.3cm} +\subsection*{Question 3.3.2: How are responses collected? E.g.\ paper forms, online survey tool, etc.} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Use this space to describe how you collected responses, e.g.\ paper forms, Google forms, SurveyMonkey, Mechanical Turk, CrowdFlower, audio/video recording, etc. + +%\vspace{-.3cm} +\subsection*{Question 3.3.3: What quality assurance methods are used? Select all that apply. If none match, select `Other' and describe. In all cases, provide details in the text box under `Other'.} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{evaluators are required to be native speakers of the language they evaluate}: mechanisms are in place to ensure all participants are native speakers of the language they evaluate. + \item \egcvalue{automatic quality checking methods are used during/post evaluation}: evaluations are checked for quality by automatic scripts during or after evaluations, e.g.\ evaluators are given known bad/good outputs to check they're given bad/good scores on MTurk. + \item \egcvalue{manual quality checking methods are used during/post evaluation}: evaluations are checked for quality by a manual process during or after evaluations, e.g.\ scores assigned by evaluators are monitored by researchers conducting the experiment. 
+    \item \egcvalue{evaluators are excluded if they fail quality checks (often or badly enough)}: there are conditions under which evaluations produced by participants are not included in the final results due to quality issues.
+    \item \egcvalue{some evaluations are excluded because of failed quality checks}: there are conditions under which some (but not all) of the evaluations produced by some participants are not included in the final results due to quality issues.
+    \item \egcvalue{none of the above}: tick this box if none of the above apply.
+    \item \egcvalue{Other (please specify)}: use this box to describe any other quality assurance methods used during or after evaluations, and to provide additional details for any of the options selected above.
+\end{enumerate}
+
+%\vspace{-.3cm}
+\subsection*{Question 3.3.4: What do evaluators see when carrying out evaluations? Link to screenshot(s) and/or describe the evaluation interface(s).}
+\vspace{-.1cm}
+
+\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: Use this space to describe the interface, paper form, etc.\ that evaluators see when they carry out the evaluation. Link to a screenshot/copy if possible. If there is a separate introductory interface/page, include it under Question 3.2.4.
+
+%\vspace{-.3cm}
+\subsection*{Question 3.3.5: How free are evaluators regarding when and how quickly to carry out evaluations? Select all that apply. In all cases, provide details in the text box under `Other'.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Check-box options (select all that apply)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}]
+    \item \egcvalue{evaluators have to complete each individual assessment within a set time}: evaluators are timed while carrying out each assessment and cannot complete the assessment once time has run out.
+    \item \egcvalue{evaluators have to complete the whole evaluation in one sitting}: partial progress cannot be saved and the evaluation returned to on a later occasion.
+    \item \egcvalue{neither of the above}: Choose this option if neither of the above are the case in the experiment.
+    \item \egcvalue{Other (please specify)}: Use this space to describe any other way in which time taken or number of sessions used by evaluators is controlled in the experiment, and to provide additional details for any of the options selected above.
+\end{enumerate}
+
+%\vspace{-.3cm}
+\subsection*{Question 3.3.6: Are evaluators told they can ask questions about the evaluation and/or provide feedback? Select all that apply. In all cases, provide details in the text box under `Other'.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Check-box options (select all that apply)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}]
+    \item \egcvalue{evaluators are told they can ask any questions during/after receiving initial training/instructions, and before the start of the evaluation}: evaluators are told explicitly that they can ask questions about the evaluation experiment \textit{before} starting on their assessments, either during or after training.
+    \item \egcvalue{evaluators are told they can ask any questions during the evaluation}: evaluators are told explicitly that they can ask questions about the evaluation experiment \textit{during} their assessments.
+    \item \egcvalue{evaluators are asked for feedback and/or comments after the evaluation, e.g.\ via an exit questionnaire or a comment box}: evaluators are explicitly asked to provide feedback and/or comments about the experiment \textit{after} their assessments, either verbally or in written form.
+    \item \egcvalue{None of the above}: Choose this option if none of the above are the case in the experiment.
+    \item \egcvalue{Other (please specify)}: use this space to describe any other ways you provide for evaluators to ask questions or provide feedback.
+\end{enumerate}
+
+%\vspace{-.3cm}
+\subsection*{Question 3.3.7: What are the experimental conditions in which evaluators carry out the evaluations? If none match, select `Other' and describe.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{evaluation carried out by evaluators at a place of their own choosing, e.g.\ online, using a paper form, etc.}: evaluators are given access to the tool or form specified in Question 3.3.2, and subsequently choose where to carry out their evaluations.
+    \item \egcvalue{evaluation carried out in a lab, and conditions are the same for each evaluator}: evaluations are carried out in a lab, and conditions in which evaluations are carried out \textit{are} controlled to be the same, i.e.\ the different evaluators all carry out the evaluations in identical conditions of quietness, same type of computer, same room, etc. Note we're not after very fine-grained differences here, such as time of day or temperature, but the line is difficult to draw, so some judgment is involved here.
+    \item \egcvalue{evaluation carried out in a lab, and conditions vary for different evaluators}: choose this option if evaluations are carried out in a lab, but the preceding option does not apply, i.e.\ conditions in which evaluations are carried out are \textit{not} controlled to be the same.
+    \item \egcvalue{evaluation carried out in a real-life situation, and conditions are the same for each evaluator}: evaluations are carried out in a real-life situation, i.e.\ one that would occur whether or not the evaluation was carried out (e.g.\ evaluating a dialogue system deployed in a live chat function on a website), and conditions in which evaluations are carried out \textit{are} controlled to be the same.
+    \item \egcvalue{evaluation carried out in a real-life situation, and conditions vary for different evaluators}: choose this option if evaluations are carried out in a real-life situation, but the preceding option does not apply, i.e.\ conditions in which evaluations are carried out are \textit{not} controlled to be the same.
+    \item \egcvalue{evaluation carried out outside of the lab, in a situation designed to resemble a real-life situation, and conditions are the same for each evaluator}: evaluations are carried out outside of the lab, in a situation intentionally similar to a real-life situation (but not actually a real-life situation), e.g.\ user-testing a navigation system where the destination is part of the evaluation design, rather than chosen by the user. Conditions in which evaluations are carried out \textit{are} controlled to be the same.
+    \item \egcvalue{evaluation carried out outside of the lab, in a situation designed to resemble a real-life situation, and conditions vary for different evaluators}: choose this option if evaluations are carried out outside of the lab, in a situation intentionally similar to a real-life situation, but the preceding option does not apply, i.e.\ conditions in which evaluations are carried out are \textit{not} controlled to be the same.
+    \item \egcvalue{Other (please specify)}: Use this space to provide additional, or alternative, information about the conditions in which evaluators carry out assessments, not covered by the options above.
+\end{enumerate}
+
+\vspace{-.3cm}
+\subsection*{Question 3.3.8: Unless the evaluation is carried out at a place of the evaluators' own choosing, briefly describe the (range of different) conditions in which evaluators carry out the evaluations.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: use this space to describe the variations in the conditions in which evaluators carry out the evaluation, for both situations where those variations are controlled, and situations where they are not controlled.
+
+\vspace{.3cm}
+
+\section{Quality Criterion \textit{n} -- Definition and Operationalisation}
+\label{sec:criteria}
+
+Questions in this section collect information about the $n$th quality criterion assessed in the single human evaluation experiment that this sheet is being completed for. The HEDS 1.0 form allows this section to be completed repeatedly, for up to 10 different quality criteria (see further explanation at the end of the section).
+
+For more information, in particular about quality criterion properties and evaluation mode properties, see \citet{belz-etal-2020-disentangling}.
+
+
+\subsection{Quality criterion properties (Questions 4.1.1--4.1.3)}
+
+Questions 4.1.1--4.1.3 capture the aspect of quality that is assessed by a given quality criterion in terms of three orthogonal properties. They help determine e.g.\ whether or not the same aspect of quality is being evaluated in different evaluation experiments. The three properties characterise quality criteria in terms of (i) what type of quality is being assessed; (ii) what aspect of the system output is being assessed; and (iii) whether system outputs are assessed in their own right or with reference to some system-internal or system-external frame of reference.
+
+\vspace{-.3cm}
+\subsection*{Question 4.1.1: What type of quality is assessed by the quality criterion?}
+\vspace{-.1cm}
+
+\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+%%% AB: wording same as in INLG paper, keep
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{Correctness}: select this option if it is possible to state, generally for all outputs, the conditions under which outputs are maximally correct (hence of maximal quality). E.g.\ for Grammaticality, outputs are (maximally) correct if they contain no grammatical errors; for Semantic Completeness, outputs are correct if they express all the content in the input.
+    \item \egcvalue{Goodness}: select this option if, in contrast to correctness criteria, there is no single, general mechanism for deciding when outputs are maximally good, only for deciding for two outputs which is better and which is worse. E.g.\ for Fluency, even if outputs contain no disfluencies, there may be other ways in which any given output could be more fluent.
+ \item \egcvalue{Features}: choose this option if, in terms of property $X$ captured by the criterion, outputs are not generally better if they are more $X$, but instead, depending on evaluation context, more $X$ may be better or less $X$ may be better. E.g.\ outputs can be more specific or less specific, but it’s not the case that outputs are, in the general case, better when they are more specific. +\end{enumerate} + + +\subsection*{Question 4.1.2: Which aspect of system outputs is assessed by the quality criterion?} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{Multiple-choice options (select one)}: +\vspace{-.1cm} + +%%% AB: wording same as in INLG paper, keep +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}] + \item \egcvalue{Form of output}: choose this option if the criterion assesses the form of outputs alone, e.g.\ Grammaticality is only about the form, a sentence can be grammatical yet be wrong or nonsensical in terms of content. + \item \egcvalue{Content of output}: choose this option if the criterion assesses the content/meaning of the output alone, e.g.\ Meaning Preservation only assesses output content; two sentences can be considered to have the same meaning, but differ in form. + \item \egcvalue{Both form and content of output}: choose this option if the criterion assesses outputs as a whole, not just form or just content. E.g.\ Coherence is a property of outputs as a whole, either form or meaning can detract from it. +\end{enumerate} + + +%\vspace{-.3cm} +\subsection*{Question 4.1.3: Is each output assessed for quality in its own right, or with reference to a system-internal or external frame of reference?} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{Multiple-choice options (select one)}: +\vspace{-.1cm} + +%%% AB: wording same as in INLG paper, keep +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}] + \item \egcvalue{Quality of output in its own right}: choose this option if output quality is assessed without referring to anything other than the output itself, i.e.\ no system-internal or external frame of reference. E.g.\ Poeticness is assessed by considering (just) the output and how poetic it is. + \item \egcvalue{Quality of output relative to the input}: choose this option if output quality is assessed relative to the input. E.g.\ Answerability is the degree to which the output question can be answered from information in the input. + \item \egcvalue{Quality of output relative to a system-external frame of reference}: choose this option if output quality is assessed with reference to system-external information, such as a knowledge base, a person’s individual writing style, or the performance of an embedding system. E.g.\ Factual Accuracy assesses outputs relative to a source of real-world knowledge. +\end{enumerate} + + + +\subsection{Evaluation mode properties (Questions 4.2.1--4.2.3)} + +Questions 4.2.1--4.2.3 record properties that are orthogonal to quality criteria, i.e.\ any given quality criterion can in principle be combined with any of the modes (although some combinations are more common than others). 
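+% Added worked example (source comment only, not rendered): the same
+% criterion can in principle be operationalised in different modes. E.g.\
+% Fluency judged on a 1--5 scale, one output at a time, is subjective,
+% absolute and intrinsic; counting disfluencies in outputs from two systems
+% shown side by side would be objective, relative and (still) intrinsic.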
+
+\vspace{-.3cm}
+\subsection*{Question 4.2.1: Does an individual assessment involve an objective or a subjective judgment?}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+%%% AB: wording same as in INLG paper, keep
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{Objective}: Examples of objective assessment include any automatically counted or otherwise quantified measurements such as mouse-clicks, occurrences in text, etc. Repeated assessments of the same output with an objective-mode evaluation method always yield the same score/result.
+    \item \egcvalue{Subjective}: Subjective assessments involve ratings, opinions and preferences by evaluators. Some criteria lend themselves more readily to subjective assessments, e.g.\ Friendliness of a conversational agent, but an objective measure e.g.\ based on lexical markers is also conceivable.
+\end{enumerate}
+
+%\vspace{-.3cm}
+\subsection*{Question 4.2.2: Are outputs assessed in absolute or relative terms?}
+\vspace{-.1cm}
+
+\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{Absolute}: choose this option if evaluators are shown outputs from a single system during each individual assessment.
+    \item \egcvalue{Relative}: choose this option if evaluators are shown outputs from multiple systems at the same time during assessments, typically ranking or preference-judging them.
+\end{enumerate}
+
+\vspace{-.3cm}
+\subsection*{Question 4.2.3: Is the evaluation intrinsic or extrinsic?}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+%%% AB: wording same as in INLG paper, keep
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{Intrinsic}: Choose this option if quality of outputs is assessed \textit{without} considering their \textit{effect} on something external to the system, e.g.\ the performance of an embedding system or of a user at a task.
+    \item \egcvalue{Extrinsic}: Choose this option if quality of outputs is assessed in terms of their \textit{effect} on something external to the system such as the performance of an embedding system or of a user at a task.
+\end{enumerate}
+
+
+\subsection{Response elicitation (Questions 4.3.1--4.3.11)}
+
+Questions 4.3.1--4.3.11 record information about how responses are elicited for the quality criterion this section is being completed for.
+
+\vspace{-.3cm}
+\subsection*{Question 4.3.1: What do you call the quality criterion in explanations/interfaces to evaluators? Enter `N/A' if criterion not named.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: the name you use to refer to the quality criterion in explanations and/or interfaces created for evaluators. Examples of quality criterion names include Fluency, Clarity, Meaning Preservation. If no name is used, state `N/A'.
+
+%\vspace{-.3cm}
+\subsection*{Question 4.3.2: What definition do you give for the quality criterion in explanations/interfaces to evaluators? Enter `N/A' if no definition given.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: Copy and paste the verbatim definition you give to evaluators to explain the quality criterion they're assessing.
+If you don't explicitly call it a definition, enter the nearest thing to a definition you give them. If you don't give any definition, state `N/A'.
+
+%\vspace{-.3cm}
+\subsection*{Question 4.3.3: Size of scale or other rating instrument (i.e.\ how many different possible values there are). Answer should be an integer or `continuous' (if it's not possible to state how many possible responses there are). Enter `N/A' if there is no rating instrument.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: The number of different response values for this quality criterion. E.g.\ for a 5-point Likert scale, the size to enter is 5. For two-way forced-choice preference judgments, it is 2; if there's also a no-preference option, enter 3. For a slider that is mapped to 100 different values for the purpose of recording assessments, the size to enter is 100. If no rating instrument is used (e.g.\ when evaluation gathers post-edits or qualitative feedback only), enter `N/A'.
+
+%\vspace{-.3cm}
+\subsection*{Question 4.3.4: List or range of possible values of the scale or other rating instrument. Enter `N/A' if there is no rating instrument.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: list, or give the range of, the possible values of the rating instrument. The list or range should be of the size specified in Question 4.3.3. If there are too many to list, use a range. E.g.\ for two-way forced-choice preference judgments, the list entered might be \textit{A better, B better}; if there's also a no-preference option, the list
+might be \textit{A better, B better, neither}. For a slider that is mapped to 100 different values for the purpose of recording assessments, the range \textit{1--100} might be entered. If no rating instrument is used (e.g.\ when evaluation gathers post-edits or qualitative feedback only), enter `N/A'.
+
+%\vspace{-.3cm}
+\subsection*{Question 4.3.5: How is the scale or other rating instrument presented to evaluators? If none match, select `Other' and describe.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{Multiple-choice options}: choose this option if evaluators select exactly one of multiple options.
+    \item \egcvalue{Check-boxes}: choose this option if evaluators select any number of options from multiple given options.
+    \item \egcvalue{Slider}: choose this option if evaluators move a pointer on a slider scale to the position corresponding to their assessment.
+    \item \egcvalue{N/A (there is no rating instrument)}: choose this option if there is no rating instrument.
+    \item \egcvalue{Other (please specify)}: choose this option if there is a rating instrument, but none of the above adequately describe the way you present it to evaluators. Use the text box to describe the rating instrument and link to a screenshot.
+\end{enumerate}
+
+%\vspace{-.3cm}
+\subsection*{Question 4.3.6: If there is no rating instrument, describe briefly what task the evaluators perform (e.g.\ ranking multiple outputs, finding information, playing a game, etc.), and what information is recorded. Enter `N/A' if there is a rating instrument.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: If (and only if) there is no rating instrument, i.e.\ you entered `N/A' for Questions 4.3.3--4.3.5, describe the task evaluators perform in this space.
+Otherwise, i.e.\ if there \textit{is} a rating instrument, enter `N/A' here.
+
+%\vspace{-.3cm}
+\subsection*{Question 4.3.7: What is the verbatim question, prompt or instruction given to evaluators (visible to them during each individual assessment)?}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: Copy and paste the verbatim text that evaluators see during each assessment and that is intended to convey the evaluation task to them. E.g.\ \textit{Which of these texts do you prefer?} Or \textit{Make any corrections to this text that you think are necessary in order to improve it to the point where you would be happy to provide it to a client.}
+
+%\vspace{-.3cm}
+\subsection*{Question 4.3.8: Form of response elicitation. If none match, select `Other' and describe.}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:\footnote{Explanations adapted from \citet{howcroft-etal-2020-twenty}.}
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{(dis)agreement with quality statement}: Participants specify the degree to which they agree with a given quality statement by indicating their agreement on a rating instrument. The rating instrument is labelled with degrees of agreement and can additionally have numerical labels. E.g.\ \textit{This text is fluent --- 1=strongly disagree...5=strongly agree}.
+    \item \egcvalue{direct quality estimation}: Participants are asked to provide a rating using a rating instrument, which typically (but not always) mentions the quality criterion explicitly. E.g.\ \textit{How fluent is this text? --- 1=not at all fluent...5=very fluent}.
+    \item \egcvalue{relative quality estimation (including ranking)}: Participants evaluate two or more items in terms of which is better.
+    E.g.\ \textit{Rank these texts in terms of fluency}; \textit{Which of these texts is more fluent?}; \textit{Which of these items do you prefer?}.
+    \item \egcvalue{counting occurrences in text}: Evaluators are asked to count how many times some type of phenomenon occurs, e.g.\ the number of facts contained in the output that are inconsistent with the input.
+    \item \egcvalue{qualitative feedback (e.g.\ via comments entered in a text box)}: Typically, these are responses to open-ended questions in a survey or interview.
+    \item \egcvalue{evaluation through post-editing/annotation}: Choose this option if the evaluators' task consists of editing or inserting annotations in text. E.g.\ evaluators may perform error correction and edits are then automatically measured to yield a numerical score.
+    \item \egcvalue{output classification or labelling}: Choose this option if evaluators assign outputs to categories. E.g.\ \textit{What is the overall sentiment of this piece of text? --- Positive/neutral/negative.}
+    \item \egcvalue{user-text interaction measurements}: choose this option if participants in the evaluation experiment interact with a text in some way, and measurements are taken of their interaction. E.g.\ reading speed, eye movement tracking, comprehension questions, etc. Excludes situations where participants are given a task to solve and their performance is measured, which comes under the next option.
+    \item \egcvalue{task performance measurements}: choose this option if participants in the evaluation experiment are given a task to perform, and measurements are taken of their performance at the task.
E.g.\ task is finding information, and task performance measurement is task completion speed and success rate. + \item \egcvalue{user-system interaction measurements}: choose this option if participants in the evaluation experiment interact with a system in some way, while measurements are taken of their interaction. E.g.\ duration of interaction, hyperlinks followed, number of likes, or completed sales. + \item \egcvalue{Other (please specify)}: Use the text box to describe the form of response elicitation used in assessing the quality criterion if it doesn't fall in any of the above categories. +\end{enumerate} + +%\vspace{-.3cm} +\subsection*{Question 4.3.9: How are raw responses from participants aggregated or otherwise processed to obtain reported scores for this quality criterion? State if no scores reported.} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{What to enter in the text box}: normally a set of separate assessments is collected from evaluators and is converted to the results as reported. Describe here the method(s) used in the conversion(s). E.g.\ macro-averages or micro-averages are computed from numerical scores to provide summary, per-system results. + +\vspace{-.3cm} +\subsection*{Question 4.3.10: Method(s) used for determining effect size and significance of findings for this quality criterion.} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{What to enter in the text box}: A list of methods used for calculating the effect size and significance of any results, both as reported in the paper given in Question 1.1, for this quality criterion. If none calculated, state `None'. + +\vspace{-.3cm} +\subsection*{Question 4.3.11: Has the inter-annotator and intra-annotator agreement between evaluators for this quality criterion been measured? If yes, what method was used, and what are the agreement scores?} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: the methods used to compute, and results obtained from, any measures of inter-annotator and intra-annotator agreement obtained for the quality criterion.
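+
+\vspace{.3cm}
+\noindent To illustrate the kind of processing Questions 4.3.9--4.3.11 ask about, here is a minimal sketch in Python (an illustration only, not part of the HEDS template; all scores and variable names are hypothetical, and the \texttt{numpy}, \texttt{scipy} and \texttt{scikit-learn} packages are assumed):
+
+\begin{verbatim}
+import numpy as np
+from scipy import stats
+from sklearn.metrics import cohen_kappa_score
+
+# Hypothetical 1-5 ratings for the same items from two systems.
+scores_a = np.array([4, 5, 3, 4, 2, 5, 4, 3])
+scores_b = np.array([3, 4, 3, 3, 2, 4, 3, 3])
+
+# Q4.3.9: aggregation, e.g. a per-system average over items.
+print("mean A:", scores_a.mean(), "mean B:", scores_b.mean())
+
+# Q4.3.10: significance and effect size, e.g. a paired t-test
+# plus Cohen's d computed on the paired differences.
+diff = scores_a - scores_b
+t, p = stats.ttest_rel(scores_a, scores_b)
+print("t =", round(float(t), 2), "p =", round(float(p), 3),
+      "d =", round(float(diff.mean() / diff.std(ddof=1)), 2))
+
+# Q4.3.11: inter-annotator agreement, e.g. Cohen's kappa between
+# two raters who rated the same items.
+rater_1 = [4, 5, 3, 4, 2, 5, 4, 3]
+rater_2 = [4, 4, 3, 4, 2, 5, 3, 3]
+print("kappa:", round(cohen_kappa_score(rater_1, rater_2), 2))
+\end{verbatim}
+
+\noindent Whichever methods are actually used (micro- vs.\ macro-averaging, bootstrap tests, Krippendorff's alpha, etc.) should be named explicitly in the answers to these questions.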
+ +\vspace{.3cm} +\noindent The section ends with the question \textbf{Is there another quality criterion in the evaluation experiment that you haven't completed this section for yet?} If \textbf{Yes} is selected, please copy this section and complete it for the next criterion. If \textbf{No}, the next section will be the Ethics section below. + +\section{Ethics}\label{sec:ethics} + +The questions in this section relate to ethical aspects of the evaluation. Information can be entered in the text box provided, and/or by linking to a source where complete information can be found. + +\vspace{-.3cm} +\subsection*{Question 5.1: Has the evaluation experiment this sheet is being completed for, or the larger study it is part of, been approved by a research ethics committee? If yes, which research ethics committee?} +\vspace{-.1cm} + + +\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Typically, research organisations, universities and other higher-education institutions require some form of ethical approval before experiments involving human participants, however innocuous, are permitted to proceed. Please provide here the name of the body that approved the experiment, or state `No' if approval has not (yet) been obtained. + +\vspace{-.3cm} +\subsection*{Question 5.2: Do any of the system outputs (or human-authored stand-ins) evaluated, or do any of the responses collected, in the experiment contain personal data (as defined in GDPR Art. 4, §1: https://gdpr.eu/article-4-definitions/)? If yes, describe data and state how addressed.} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{What to enter in the text box}: State `No' if no personal data as defined by GDPR was recorded or collected, otherwise explain how conformity with GDPR requirements such as privacy and security was ensured, e.g.\ by linking to the (successful) application for ethics approval from Question 5.1. + +\vspace{-.3cm} +\subsection*{Question 5.3: Do any of the system outputs (or human-authored stand-ins) evaluated, or do any of the responses collected, in the experiment contain special category information (as defined in GDPR Art. 9, §1: https://gdpr.eu/article-9-processing-special-categories-of-personal-data-prohibited/)? If yes, describe data and state how addressed.} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{What to enter in the text box}: State `No' if no special-category data as defined by GDPR was recorded or collected, otherwise explain how conformity with GDPR requirements relating to special-category data was ensured, e.g.\ by linking to the (successful) application for ethics approval from Question 5.1. + +\vspace{-.3cm} +\subsection*{Question 5.4: Have any impact assessments been carried out for the evaluation experiment, and/or any data collected/evaluated in connection with it? If yes, summarise approach(es) and outcomes.} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Use this box to describe any \textit{ex ante} or \textit{ex post} impact assessments that have been carried out in relation to the evaluation experiment, such that the assessment plan and process, as well as the outcomes, were captured in written form. Link to documents if possible. Types of impact assessment include data protection impact assessments, e.g.\ under GDPR.\footnote{\footnotesize \url{https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/accountability-and-governance/data-protection-impact-assessments/}} Environmental and social impact assessment frameworks are also available. + + +\section*{Credits} + +Questions 2.1--2.5, relating to the evaluated system, and 4.3.1--4.3.8, relating to response elicitation, are based on \citet{howcroft-etal-2020-twenty}, with some significant changes. Questions 4.1.1--4.2.3, relating to quality criteria, and some of the questions about system outputs, evaluators, and experimental design (3.1.1--3.2.3, 4.3.5, 4.3.6, 4.3.9--4.3.11) are based on \citet{belz-etal-2020-disentangling}. +HEDS was also informed by \citet{van2019best, vanderlee2021} and by \citet{gehrmann2021gem}'s\footnote{\footnotesize \url{https://gem-benchmark.com/data\_cards/guide}} data card guide. + +More generally, the original inspiration for creating a `datasheet' for describing human evaluation experiments of course comes from seminal papers by \citet{bender-friedman-2018-data}, \citet{mitchell2019modelcards} and \citet{gebru2018datasheets}.
+ +\bibliography{human-evaluation-datasheet} +\bibliographystyle{acl_natbib} + +\end{document} diff --git a/sheet/latex/human-evaluation-datasheet.bib b/sheet/latex/human-evaluation-datasheet.bib new file mode 100644 index 0000000..4033050 --- /dev/null +++ b/sheet/latex/human-evaluation-datasheet.bib @@ -0,0 +1,163 @@ +@inproceedings{banarescu-etal-2013-abstract, + title = "{A}bstract {M}eaning {R}epresentation for Sembanking", + author = "Banarescu, Laura and + Bonial, Claire and + Cai, Shu and + Georgescu, Madalina and + Griffitt, Kira and + Hermjakob, Ulf and + Knight, Kevin and + Koehn, Philipp and + Palmer, Martha and + Schneider, Nathan", + booktitle = "Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse", + month = aug, + year = "2013", + address = "Sofia, Bulgaria", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/W13-2322", + pages = "178--186", +} + +@article{gebru2018datasheets, + title={Datasheets for Datasets}, + author={Timnit Gebru and Jamie Morgenstern and Briana Vecchione and Jennifer Wortman Vaughan and Hanna Wallach and Hal Daumé III and Kate Crawford}, + year={2020}, + eprint={1803.09010}, + archivePrefix={arXiv}, + primaryClass={cs.DB} +} + +@article{bender-friedman-2018-data, + title = "Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science", + author = "Bender, Emily M. and + Friedman, Batya", + journal = "Transactions of the Association for Computational Linguistics", + volume = "6", + year = "2018", + url = "https://www.aclweb.org/anthology/Q18-1041", + doi = "10.1162/tacl_a_00041", + pages = "587--604", + abstract = "In this paper, we propose data statements as a design solution and professional practice for natural language processing technologists, in both research and development. Through the adoption and widespread use of data statements, the field can begin to address critical scientific and ethical issues that result from the use of data from certain populations in the development of technology for other populations. We present a form that data statements can take and explore the implications of adopting them as part of regular practice. 
We argue that data statements will help alleviate issues related to exclusion and bias in language technology, lead to better precision in claims about how natural language processing research can generalize and thus better engineering results, protect companies from public embarrassment, and ultimately lead to language technology that meets its users in their own preferred linguistic style and furthermore does not misrepresent them to others.", +} + +@book{kamp-reyle2013discourse, + title={From discourse to logic: Introduction to modeltheoretic semantics of natural language, formal logic and discourse representation theory}, + author={Kamp, Hans and Reyle, Uwe}, + volume={42}, + year={2013}, + publisher={Springer Science \& Business Media} +} + +@inproceedings{mitchell2019modelcards, +author = {Mitchell, Margaret and Wu, Simone and Zaldivar, Andrew and Barnes, Parker and Vasserman, Lucy and Hutchinson, Ben and Spitzer, Elena and Raji, Inioluwa Deborah and Gebru, Timnit}, +title = {Model Cards for Model Reporting}, +year = {2019}, +isbn = {9781450361255}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3287560.3287596}, +doi = {10.1145/3287560.3287596}, +abstract = {Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. We propose model cards as a step towards the responsible democratization of machine learning and related artificial intelligence technology, increasing transparency into how well artificial intelligence technology works. 
We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed evaluation numbers and other relevant documentation.}, +booktitle = {Proceedings of the Conference on Fairness, Accountability, and Transparency}, +pages = {220–229}, +numpages = {10}, +keywords = {ethical considerations, documentation, datasheets, model cards, fairness evaluation, disaggregated evaluation, ML model evaluation}, +location = {Atlanta, GA, USA}, +series = {FAT* '19} +} + +@inproceedings{card-etal-2020-little, + title = "With Little Power Comes Great Responsibility", + author = "Card, Dallas and + Henderson, Peter and + Khandelwal, Urvashi and + Jia, Robin and + Mahowald, Kyle and + Jurafsky, Dan", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", + month = nov, + year = "2020", + address = "Online", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.emnlp-main.745", + doi = "10.18653/v1/2020.emnlp-main.745", + pages = "9263--9274", + abstract = "Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state of the art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75{\%} power to detect differences of 1 BLEU point. 
To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.", +} + +@inproceedings{van2019best, + title={Best practices for the human evaluation of automatically generated text}, + author={{van der Lee}, Chris and Gatt, Albert and van Miltenburg, Emiel and Wubben, Sander and Krahmer, Emiel}, + booktitle={Proceedings of the 12th International Conference on Natural Language Generation}, + pages={355--368}, + url={https://www.aclweb.org/anthology/W19-8643.pdf}, + year={2019} +} + +@article{vanderlee2021, +title = {Human evaluation of automatically generated text: Current trends and best practice guidelines}, +journal = {Computer Speech \& Language}, +volume = {67}, +pages = {101151}, +year = {2021}, +issn = {0885-2308}, +doi = {https://doi.org/10.1016/j.csl.2020.101151}, +url = {https://www.sciencedirect.com/science/article/pii/S088523082030084X}, +author = {Chris {van der Lee} and Albert Gatt and Emiel {van Miltenburg} and Emiel Krahmer}, +keywords = {Natural Language Generation, Human evaluation, Recommendations, Literature review, Open science, Ethics}, +abstract = {Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated, with a particularly high degree of variation in the way that human evaluation is carried out. This paper provides an overview of how (mostly intrinsic) human evaluation is currently conducted and presents a set of best practices, grounded in the literature. These best practices are also linked to the stages that researchers go through when conducting an evaluation research (planning stage; execution and release stage), and the specific steps in these stages. With this paper, we hope to contribute to the quality and consistency of human evaluations in NLG.} +} + +@article{gehrmann2021gem, + title={The {GEM} Benchmark: Natural Language Generation, its Evaluation and Metrics}, + author={Sebastian Gehrmann and Tosin Adewumi and Karmanya Aggarwal and Pawan Sasanka Ammanamanchi and Aremu Anuoluwapo and Antoine Bosselut and Khyathi Raghavi Chandu and Miruna Clinciu and Dipanjan Das and Kaustubh D. Dhole and Wanyu Du and Esin Durmus and Ondřej Dušek and Chris Emezue and Varun Gangal and Cristina Garbacea and Tatsunori Hashimoto and Yufang Hou and Yacine Jernite and Harsh Jhamtani and Yangfeng Ji and Shailza Jolly and Dhruv Kumar and Faisal Ladhak and Aman Madaan and Mounica Maddela and Khyati Mahajan and Saad Mahamood and Bodhisattwa Prasad Majumder and Pedro Henrique Martins and Angelina McMillan-Major and Simon Mille and Emiel van Miltenburg and Moin Nadeem and Shashi Narayan and Vitaly Nikolaev and Rubungo Andre Niyongabo and Salomey Osei and Ankur Parikh and Laura Perez-Beltrachini and Niranjan Ramesh Rao and Vikas Raunak and Juan Diego Rodriguez and Sashank Santhanam and João Sedoc and Thibault Sellam and Samira Shaikh and Anastasia Shimorina and Marco Antonio Sobrevilla Cabezudo and Hendrik Strobelt and Nishant Subramani and Wei Xu and Diyi Yang and Akhila Yerukola and Jiawei Zhou}, + year={2021}, + eprint={2102.01672}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} + +@inproceedings{howcroft-etal-2020-twenty, + title = "Twenty Years of Confusion in Human Evaluation: {NLG} Needs Evaluation Sheets and Standardised Definitions", + author = "Howcroft, David M. and + Belz, Anya and + Clinciu, Miruna-Adriana and + Gkatzia, Dimitra and + Hasan, Sadid A. 
and + Mahamood, Saad and + Mille, Simon and + van Miltenburg, Emiel and + Santhanam, Sashank and + Rieser, Verena", + booktitle = "Proceedings of the 13th International Conference on Natural Language Generation", + month = dec, + year = "2020", + address = "Dublin, Ireland", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.inlg-1.23", + pages = "169--182" +} + +@inproceedings{belz-etal-2020-disentangling, + title = "Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing", + author = "Belz, Anya and + Mille, Simon and + Howcroft, David M.", + booktitle = "Proceedings of the 13th International Conference on Natural Language Generation", + month = dec, + year = "2020", + address = "Dublin, Ireland", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.inlg-1.24", + pages = "183--194" +} + +@misc{shimorina-belz-2021-heds, + title={The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP}, + author={Anastasia Shimorina and Anya Belz}, + year={2021}, + eprint={2103.09710}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} diff --git a/sheet/latex/human-evaluation-datasheet.tex b/sheet/latex/human-evaluation-datasheet.tex new file mode 100644 index 0000000..b4e6188 --- /dev/null +++ b/sheet/latex/human-evaluation-datasheet.tex @@ -0,0 +1,717 @@ +% +% File acl2020.tex +% +%% Based on the style files for ACL 2020, which were +%% Based on the style files for ACL 2018, NAACL 2018/19, which were +%% Based on the style files for ACL-2015, with some improvements +%% taken from the NAACL-2016 style +%% Based on the style files for ACL-2014, which were, in turn, +%% based on ACL-2013, ACL-2012, ACL-2011, ACL-2010, ACL-IJCNLP-2009, +%% EACL-2009, IJCNLP-2008... +%% Based on the style files for EACL 2006 by +%%e.agirre@ehu.es or Sergi.Balari@uab.es +%% and that of ACL 08 by Joakim Nivre and Noah Smith + +\documentclass[11pt,a4paper]{article} +\pdfoutput=1 % forces arxiv to use pdflatex for compilation +\usepackage[hyperref]{acl2020} +\usepackage{times} +\usepackage{latexsym} +\renewcommand{\UrlFont}{\ttfamily\small} +\newcommand{\egcattribute}[1]{\textsc{#1}} +\newcommand{\egcvalue}[1]{\textbf{\textit{#1}}} +\usepackage{enumitem} +\usepackage{color} +\usepackage{tcolorbox} +\usepackage{tikz} +\usepackage{amssymb} + +\def\UrlBreaks{\do\/\do-} % allow for breaks in urls + +% This is not strictly necessary, and may be commented out, +% but it will improve the layout of the manuscript, +% and will typically save some space. +\usepackage{microtype} + +\aclfinalcopy % Uncomment this line for the final submission +%\def\aclpaperid{***} % Enter the acl Paper ID here + +\setlength\titlebox{5cm} +% You can expand the titlebox if you need extra space +% to show all the authors. Please do not make the titlebox +% smaller than 5cm (the original size); we will check this +% in the camera-ready version and ask you to change it back. 
+ +\newcommand\BibTeX{B\textsc{ib}\TeX} + +\definecolor{azure}{rgb}{0.0, 0.5, 1.0} +\tcbuselibrary{skins} +\tcbset{enhanced} +\newcommand{\qsecbox}[1]{\begin{tcolorbox}[left=1mm,right=1mm,boxrule=0.2mm,leftrule=2mm,drop fuzzy shadow,colframe=lightgray,frame style={left color=azure!90!lightgray}]#1\end{tcolorbox}} + +\makeatletter +\newcommand\footnoteref[1]{\protected@xdef\@thefnmark{\ref{#1}}\@footnotemark} +\makeatother + + +\title{The Human Evaluation Datasheet 1.0: A Template for Recording\\Details of Human Evaluation Experiments in NLP\\ +\normalsize{(described in \citet{shimorina-belz-2021-heds})}} + +\begin{document} +\maketitle + + +\section{Paper and Supplementary Resources (Questions 1.1--1.3)}\label{sec:paper-resources} + +Questions 1.1--1.3 record bibliographic and related information. These are straightforward and don't warrant much in-depth explanation. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 1.1: Link to paper reporting the evaluation experiment. If the paper reports more than one experiment, state which experiment you're completing this sheet for. Or, if applicable, enter `for preregistration.'}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: a link to an online copy of the main reference for the human evaluation experiment, identifying which of the experiments the form is being completed for if there are several. If the experiment hasn't been run yet, and the form is being completed for the purpose of submitting it for preregistration, simply enter `for preregistration'. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 1.2: Link to website providing resources used in the evaluation experiment (e.g.\ system outputs, evaluation tools, etc.). If there isn't one, enter `N/A'.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: link(s) to any resources used in the evaluation experiment, such as system outputs, evaluation tools, etc.\ If there aren't any publicly shared resources (yet), enter `N/A’. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 1.3: Name, affiliation and email address of person completing this sheet, and of contact author if different.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: names, affiliations and email addresses as appropriate. + + + +\section{System (Questions 2.1--2.5)}\label{sec:system} + +Questions 2.1--2.5 record information about the system(s) (or human-authored stand-ins) whose outputs are evaluated in the Evaluation experiment that this sheet is being completed for. + +The input, output, and task questions in this section are closely interrelated: the value for one partially determines the others, as indicated for some combinations in Question 2.3. + + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 2.1: What type of input do the evaluated system(s) take? Select all that apply. If none match, select `Other' and describe.}}\label{sec:input} +\vspace{-.1cm} + +Describe the type of input, where input refers to the representations and/or data structures shared by all evaluated systems. + +This question is about input type, regardless of number. E.g.\ if the input is a set of documents, you would still select \textit{text: document} below. 
+ +\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{raw/structured data}: numerical, symbolic, and other data, possibly structured into trees, graphs, graphical models, etc. May be the input e.g.\ to Referring Expression Generation (REG), end-to-end text generation, etc. {NB}: excludes linguistic structures. + + \item \egcvalue{deep linguistic representation (DLR)}: any of a variety of deep, underspecified, semantic representations, such as abstract meaning representations \citep[AMRs;][]{banarescu-etal-2013-abstract} or discourse representation structures \citep[DRSs;][]{kamp-reyle2013discourse}. + + \item \egcvalue{shallow linguistic representation (SLR)}: any of a variety of shallow, syntactic representations, e.g.\ Universal Dependency (UD) structures; typically the input to surface realisation. + + \item \egcvalue{text: subsentential unit of text}: a unit of text shorter than a sentence, e.g.\ Referring Expressions (REs), verb phrase, text fragment of any length; includes titles/headlines. + + \item \egcvalue{text: sentence}: a single sentence (or set of sentences). + + \item \egcvalue{text: multiple sentences}: a sequence of multiple sentences, without any document structure (or a set of such sequences). + + \item \egcvalue{text: document}: a text with document structure, such as a title, paragraph breaks or sections, e.g.\ a set of news reports for summarisation. + + \item \egcvalue{text: dialogue}: a dialogue of any length, excluding a single turn which would come under one of the other text types. + + \item \egcvalue{text: other}: input is text but doesn't match any of the above \textit{text:*} categories. + + \item \egcvalue{speech}: a recording of speech. + + \item \egcvalue{visual}: an image or video. + + \item \egcvalue{multi-modal}: catch-all value for any combination of data and/or linguistic representation and/or visual data etc. + + \item \egcvalue{control feature}: a feature or parameter specifically present to control a property of the output text, e.g.\ positive stance, formality, author style. + + \item \egcvalue{no input (human generation)}: human generation\footnote{\label{human-generation}We use the term `human generation' where the items being evaluated have been created manually, rather than generated by an automatic system.}, therefore no system inputs. + + \item \egcvalue{other (please specify)}: if input is none of the above, choose this option and describe it. + +\end{enumerate} + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 2.2: What type of output do the evaluated system(s) generate? Select all that apply. If none match, select `Other' and describe.}}\label{sec:output} + +Describe the type of output, where output refers to the representations and/or data structures shared by all evaluated systems. + +This question is about output type, regardless of number. E.g.\ if the output is a set of documents, you would still select \textit{text: document} below. + +Note that the options for outputs are the same as for inputs minus the \textit{control feature} option. + + +\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{raw/structured data}: numerical, symbolic, and other data, possibly structured into trees, graphs, graphical models, etc.
May be the input e.g.\ to Referring Expression Generation (REG), end-to-end text generation, etc. {NB}: excludes linguistic structures. + + \item \egcvalue{deep linguistic representation (DLR)}: any of a variety of deep, underspecified, semantic representations, such as abstract meaning representations \citep[AMRs;][]{banarescu-etal-2013-abstract} or discourse representation structures \citep[DRSs;][]{kamp-reyle2013discourse}. + + \item \egcvalue{shallow linguistic representation (SLR)}: any of a variety of shallow, syntactic representations, e.g.\ Universal Dependency (UD) structures; typically the input to surface realisation. + + \item \egcvalue{text: subsentential unit of text}: a unit of text shorter than a sentence, e.g.\ Referring Expressions (REs), verb phrase, text fragment of any length; includes titles/headlines. + + \item \egcvalue{text: sentence}: a single sentence (or set of sentences). + + \item \egcvalue{text: multiple sentences}: a sequence of multiple sentences, without any document structure (or a set of such sequences). + + \item \egcvalue{text: document}: a text with document structure, such as a title, paragraph breaks or sections, e.g.\ a set of news reports for summarisation. + + \item \egcvalue{text: dialogue}: a dialogue of any length, excluding a single turn which would come under one of the other text types. + + \item \egcvalue{text: other}: select if output is text but doesn't match any of the above \textit{text:*} categories. + + \item \egcvalue{speech}: a recording of speech. + + \item \egcvalue{visual}: an image or video. + + \item \egcvalue{multi-modal}: catch-all value for any combination of data and/or linguistic representation and/or visual data etc. + + \item \egcvalue{human-generated `outputs'}: manually created stand-ins exemplifying outputs.\footnoteref{human-generation} + + \item \egcvalue{other (please specify)}: if output is none of the above, choose this option and describe it. + + \end{enumerate} + + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 2.3: How would you describe the task that the evaluated system(s) perform in mapping the inputs in Q2.1 to the outputs in Q2.2? Occasionally, more than one of the options below may apply. If none match, select `Other' and describe.}}\label{sec:task} +\vspace{-.1cm} + +This field records the task performed by the system(s) being evaluated. This is independent of the application domain (financial reporting, weather forecasting, etc.), or the specific method (rule-based, neural, etc.) implemented in the system. We indicate mutual constraints between inputs, outputs and task for some of the options below. + + +\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{content selection/determination}: selecting the specific content that will be expressed in the generated text from a representation of possible content. This could be attribute selection for REG (without the surface realisation step). Note that the output here is not text. + + \item \egcvalue{content ordering/structuring}: assigning an order and/or structure to content to be included in generated text. Note that the output here is not text. 
+ + \item \egcvalue{aggregation}: converting inputs (typically \textit{deep linguistic representations} or \textit{shallow linguistic representations}) in some way in order to reduce redundancy (e.g.\ representations for `they like swimming', `they like running' $\rightarrow$ representation for `they like swimming and running'). + + \item \egcvalue{referring expression generation}: generating \textit{text} to refer to a given referent, typically represented in the input as a set of attributes or a linguistic representation. + + \item \egcvalue{lexicalisation}: associating (parts of) an input representation with specific lexical items to be used in their realisation. + + \item \egcvalue{deep generation}: one-step text generation from \textit{raw/structured data} or \textit{deep linguistic representations}. One-step means that no intermediate representations are passed from one independently run module to another. + + \item \egcvalue{surface realisation (SLR to text)}: one-step text generation from \textit{shallow linguistic representations}. One-step means that no intermediate representations are passed from one independently run module to another. + + \item \egcvalue{feature-controlled text generation}: generation of text that varies along specific dimensions where the variation is controlled via \textit{control feature}s specified as part of the input. Input is a non-textual representation (for feature-controlled text-to-text generation select the matching text-to-text task). + + \item \egcvalue{data-to-text generation}: generation from \textit{raw/structured data} which may or may not include some amount of content selection as part of the generation process. Output is likely to be \textit{text:*} or \textit{multi-modal}. + + \item \egcvalue{dialogue turn generation}: generating a dialogue turn (can be a greeting or closing) from a representation of dialogue state and/or last turn(s), etc. + + \item \egcvalue{question generation}: generation of questions from given input text and/or knowledge base such that the question can be answered from the input. + + \item \egcvalue{question answering}: input is a question plus optionally a set of reference texts and/or knowledge base, and the output is the answer to the question. + + \item \egcvalue{paraphrasing/lossless simplification}: text-to-text generation where the aim is to preserve the meaning of the input while changing its wording. This can include the aim of changing the text on a given dimension, e.g.\ making it simpler, changing its stance or sentiment, etc., which may be controllable via input features. Note that this task type includes meaning-preserving text simplification (non-meaning preserving simplification comes under \textit{compression/lossy simplification} below). + + \item \egcvalue{compression/lossy simplification}: text-to-text generation that aims to generate a shorter, or shorter and simpler, version of the input text. This will normally affect meaning to some extent, but as a side effect rather than as the primary aim (which it is in \textit{summarisation}). + + \item \egcvalue{machine translation}: translating text in a source language to text in a target language while maximally preserving the meaning. + + \item \egcvalue{summarisation (text-to-text)}: output is an extractive or abstractive summary of the important/relevant/salient content of the input document(s).
+ + \item \egcvalue{end-to-end text generation}: use this option if the single system task corresponds to more than one of the tasks above, implemented either as separate modules pipelined together, or as one-step generation, other than \textit{deep generation} and \textit{surface realisation}. + + \item \egcvalue{image/video description}: input includes \textit{visual}, and the output describes it in some way. + + \item \egcvalue{post-editing/correction}: system edits and/or corrects the input text (typically itself the textual output from another system) to yield an improved version of the text. + + \item \egcvalue{other (please specify)}: if task is none of the above, choose this option and describe it. + \end{enumerate} + + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 2.4: Input Language(s), or `N/A'.}} +\vspace{-.1cm} + +This field records the language(s) of the inputs accepted by the system(s) being evaluated. + +\vspace{.3cm} +\noindent\textit{What to enter in the text box}: any language name(s) that apply, mapped to standardised full language names in ISO 639-1\footnote{\label{iso}\url{https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes}}. E.g.\ English, Herero, Hindi. +If no language is accepted as (part of) the input, enter `N/A'. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 2.5: Output Language(s), or `N/A'.}} +\vspace{-.1cm} + +This field records the language(s) of the outputs generated by the system(s) being evaluated. + +\vspace{.2cm} +\noindent\textit{What to enter in the text box}: any language name(s) that apply, mapped to standardised full language names in ISO 639-1 (2019)\footnoteref{iso}. E.g.\ English, Herero, Hindi. +If no language is generated, enter `N/A'. + + +\section{Output Sample, Evaluators, Experimental Design}\label{sec:design} + +\subsection{Sample of system outputs (or human-authored stand-ins) evaluated (Questions 3.1.1--3.1.3)} + +Questions 3.1.1--3.1.3 record information about the size of the sample of outputs (or human-authored stand-ins) evaluated per system, how the sample was selected, and what its statistical power is. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.1.1: How many system outputs (or other evaluation items) are evaluated per system in the evaluation experiment? Answer should be an integer.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: The number of system outputs (or other evaluation items) that are evaluated per system by at least one evaluator in the experiment, as an integer. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.1.2: How are system outputs (or other evaluation items) selected for inclusion in the evaluation experiment? If none match, select `Other' and describe.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{Multiple-choice options (select one)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}] + \item \egcvalue{by an automatic random process from a larger set}: outputs were selected for inclusion in the experiment by a script using a pseudo-random number generator; don't use this option if the script selects every $n$th output (which is not random). + \item \egcvalue{by an automatic random process but using stratified sampling over given properties}: use this option if selection was by a random script as above, but with added constraints ensuring that the sample is representative of the set of outputs it was selected from, in terms of given properties, such as sentence length, positive/negative stance, etc. + \item \egcvalue{by manual, arbitrary selection}: output sample was selected by hand, or automatically from a manually compiled list, without a specific selection criterion. + \item \egcvalue{by manual selection aimed at achieving balance or variety relative to given properties}: selection by hand as above, but with specific selection criteria, e.g.\ same number of outputs from each time period. + \item \egcvalue{Other (please specify)}: if selection method is none of the above, choose this option and describe it. +\end{enumerate} + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.1.3: What is the statistical power of the sample size?}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: The results of a statistical power calculation on the output sample: provide numerical results and a link to the script used (or another way of identifying the script). See, e.g., \citet{card-etal-2020-little}.
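+
+\vspace{.3cm}
+\noindent As a minimal sketch of what such a script might look like (in Python; an illustration only, not part of the HEDS template: the output IDs, seed, effect size and alpha below are hypothetical, and the \texttt{statsmodels} package is assumed):
+
+\begin{verbatim}
+import random
+from statsmodels.stats.power import TTestIndPower
+
+# Q3.1.2-style pseudo-random selection of 100 outputs from a larger
+# set; recording the seed makes the selection reproducible.
+all_outputs = [f"output_{i:04d}" for i in range(1000)]
+random.seed(42)
+sample = random.sample(all_outputs, k=100)
+
+# Q3.1.3-style power calculation for a two-sample comparison at
+# n=100, assuming a small-to-medium effect (Cohen's d = 0.3) and
+# alpha = 0.05.
+analysis = TTestIndPower()
+power = analysis.solve_power(effect_size=0.3, nobs1=100, alpha=0.05)
+print("power at n=100:", round(power, 2))
+
+# Conversely, the per-system sample size needed for 80% power:
+n = analysis.solve_power(effect_size=0.3, power=0.8, alpha=0.05)
+print("n per system for 80% power:", round(n))
+\end{verbatim}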
+ + + +\subsection{Evaluators (Questions 3.2.1--3.2.4)} + +Questions 3.2.1--3.2.4 record information about the evaluators participating in the experiment. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.2.1: How many evaluators are there in this experiment? Answer should be an integer.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: the total number of evaluators participating in the experiment, as an integer. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.2.2: What kind of evaluators are in this experiment? Select all that apply. If none match, select `Other' and describe. In all cases, provide details in the text box under `Other'.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{experts}: participants are considered domain experts, e.g.\ meteorologists evaluating a weather forecast generator, or nurses evaluating an ICU report generator. + \item \egcvalue{non-experts}: participants are not domain experts. + \item \egcvalue{paid (including non-monetary compensation such as course credits)}: participants were given some form of compensation for their participation, including vouchers, course credits, and reimbursement for travel unless based on receipts. + \item \egcvalue{not paid}: participants were not given compensation of any kind. + \item \egcvalue{previously known to authors}: (one of the) researchers running the experiment knew some or all of the participants before recruiting them for the experiment. + \item \egcvalue{not previously known to authors}: none of the researchers running the experiment knew any of the participants before recruiting them for the experiment. + \item \egcvalue{evaluators include one or more of the authors}: one or more researchers running the experiment were among the participants. + \item \egcvalue{evaluators do not include any of the authors}: none of the researchers running the experiment were among the participants. + \item \egcvalue{Other} (fewer than 4 of the above apply): we believe you should be able to tick exactly 4 of the options above; if that's not the case, use this box to explain. +\end{enumerate} + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.2.3: How are evaluators recruited?}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Please explain how your evaluators are recruited. Do you send emails to a given list? Do you post invitations on social media? Posters on university walls?
Were there any gatekeepers involved? What are the exclusion/inclusion criteria? + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.2.4: What training and/or practice are evaluators given before starting on the evaluation itself?}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Use this space to describe any training evaluators were given as part of the experiment to prepare them for the evaluation task, including any practice evaluations they did. This includes any introductory explanations they're given, e.g.\ on the start page of an online evaluation tool. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.2.5: What other characteristics do the evaluators have, known either because these were qualifying criteria, or from information gathered as part of the evaluation?}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Use this space to list any characteristics not covered in previous questions that the evaluators are known to have, either because evaluators were selected on the basis of a characteristic, or because information about a characteristic was collected as part of the evaluation. This might include geographic location (e.g.\ as determined from IP address), educational level, or demographic information such as gender, age, etc. Where characteristics differ among evaluators (e.g.\ gender, age, location, etc.), also give numbers for each subgroup. + + +\subsection{Experimental design (Questions 3.3.1--3.3.8)} + +Questions~3.3.1--3.3.8 record information about the experimental design of the evaluation experiment. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.3.1: Has the experimental design been preregistered? If yes, on which registry?}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: State `Yes' or `No'; if `Yes' also give the name of the registry and a link to the registration page for the experiment. + +%\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.3.2: How are responses collected? E.g.\ paper forms, online survey tool, etc.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Use this space to describe how you collected responses, e.g.\ paper forms, Google forms, SurveyMonkey, Mechanical Turk, CrowdFlower, audio/video recording, etc. + +%\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.3.3: What quality assurance methods are used? Select all that apply. If none match, select `Other' and describe. In all cases, provide details in the text box under `Other'.}} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{evaluators are required to be native speakers of the language they evaluate}: mechanisms are in place to ensure all participants are native speakers of the language they evaluate. + \item \egcvalue{automatic quality checking methods are used during/post evaluation}: evaluations are checked for quality by automatic scripts during or after evaluations, e.g.\ on MTurk, evaluators may be given outputs known to be bad/good in order to check that they assign them correspondingly bad/good scores. + \item \egcvalue{manual quality checking methods are used during/post evaluation}: evaluations are checked for quality by a manual process during or after evaluations, e.g.\ scores assigned by evaluators are monitored by researchers conducting the experiment. + \item \egcvalue{evaluators are excluded if they fail quality checks (often or badly enough)}: there are conditions under which evaluations produced by participants are not included in the final results due to quality issues. + \item \egcvalue{some evaluations are excluded because of failed quality checks}: there are conditions under which some (but not all) of the evaluations produced by some participants are not included in the final results due to quality issues. + \item \egcvalue{none of the above}: tick this box if none of the above apply. + \item \egcvalue{Other (please specify)}: use this box to describe any other quality assurance methods used during or after evaluations, and to provide additional details for any of the options selected above. +\end{enumerate}
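+
+\vspace{.3cm}
+\noindent As a minimal sketch of an automatic quality check of the kind mentioned above (in Python; an illustration only, not part of the HEDS template: the item IDs, scores and threshold are hypothetical):
+
+\begin{verbatim}
+# Each evaluator's responses, including two planted control items
+# whose expected quality is known in advance.
+ratings = {
+    "eval_1": {"ctrl_good": 5, "ctrl_bad": 1, "item_17": 4},
+    "eval_2": {"ctrl_good": 2, "ctrl_bad": 4, "item_17": 3},
+}
+
+def passes_checks(scores, threshold=3):
+    # A known-good output should score high, a known-bad one low.
+    return (scores["ctrl_good"] > threshold
+            and scores["ctrl_bad"] < threshold)
+
+kept = {e: s for e, s in ratings.items() if passes_checks(s)}
+print(sorted(kept))  # ['eval_1']
+\end{verbatim}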
+ +%\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.3.4: What do evaluators see when carrying out evaluations? Link to screenshot(s) and/or describe the evaluation interface(s).}} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{What to enter in the text box}: Use this space to describe the interface, paper form, etc.\ that evaluators see when they carry out the evaluation. Link to a screenshot/copy if possible. If there is a separate introductory interface/page, include it under Question 3.2.4. + +%\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.3.5: How free are evaluators regarding when and how quickly to carry out evaluations? Select all that apply. In all cases, provide details in the text box under `Other'.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{evaluators have to complete each individual assessment within a set time}: evaluators are timed while carrying out each assessment and cannot complete the assessment once time has run out. + \item \egcvalue{evaluators have to complete the whole evaluation in one sitting}: partial progress cannot be saved and the evaluation returned to on a later occasion. + \item \egcvalue{neither of the above}: Choose this option if neither of the above are the case in the experiment. + \item \egcvalue{Other (please specify)}: Use this space to describe any other way in which time taken or number of sessions used by evaluators is controlled in the experiment, and to provide additional details for any of the options selected above. +\end{enumerate} + +%\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.3.6: Are evaluators told they can ask questions about the evaluation and/or provide feedback? Select all that apply. In all cases, provide details in the text box under `Other'.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{Check-box options (select all that apply)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\small $\square$}] + \item \egcvalue{evaluators are told they can ask any questions during/after receiving initial training/instructions, and before the start of the evaluation}: evaluators are told explicitly that they can ask questions about the evaluation experiment \textit{before} starting on their assessments, either during or after training. + \item \egcvalue{evaluators are told they can ask any questions during the evaluation}: evaluators are told explicitly that they can ask questions about the evaluation experiment \textit{during} their assessments.
+ \item \egcvalue{evaluators are asked for feedback and/or comments after the evaluation, e.g.\ via an exit questionnaire or a comment box}: evaluators are explicitly asked to provide feedback and/or comments about the experiment \textit{after} their assessments, either verbally or in written form. + \item \egcvalue{None of the above}: Choose this option if none of the above are the case in the experiment. + \item \egcvalue{Other (please specify)}: use this space to describe any other ways you provide for evaluators to ask questions or provide feedback. +\end{enumerate} + +%\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.3.7: What are the experimental conditions in which evaluators carry out the evaluations? If none match, select `Other' and describe.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{Multiple-choice options (select one)}: +\vspace{-.1cm} + +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}] + \item \egcvalue{evaluation carried out by evaluators at a place of their own choosing, e.g.\ online, using a paper form, etc.}: evaluators are given access to the tool or form specified in Question 3.3.2, and subsequently choose where to carry out their evaluations. + \item \egcvalue{evaluation carried out in a lab, and conditions are the same for each evaluator}: evaluations are carried out in a lab, and conditions in which evaluations are carried out \textit{are} controlled to be the same, i.e.\ the different evaluators all carry out the evaluations in identical conditions of quietness, same type of computer, same room, etc. Note we're not after very fine-grained differences here, such as time of day or temperature, but the line is difficult to draw, so some judgment is involved here. + \item \egcvalue{evaluation carried out in a lab, and conditions vary for different evaluators}: choose this option if evaluations are carried out in a lab, but the preceding option does not apply, i.e.\ conditions in which evaluations are carried out are \textit{not} controlled to be the same. + \item \egcvalue{evaluation carried out in a real-life situation, and conditions are the same for each evaluator}: evaluations are carried out in a real-life situation, i.e.\ one that would occur whether or not the evaluation was carried out (e.g.\ evaluating a dialogue system deployed in a live chat function on a website), and conditions in which evaluations are carried out \textit{are} controlled to be the same. + \item \egcvalue{evaluation carried out in a real-life situation, and conditions vary for different evaluators}: choose this option if evaluations are carried out in a real-life situation, but the preceding option does not apply, i.e.\ conditions in which evaluations are carried out are \textit{not} controlled to be the same. + \item \egcvalue{evaluation carried out outside of the lab, in a situation designed to resemble a real-life situation, and conditions are the same for each evaluator}: evaluations are carried out outside of the lab, in a situation intentionally similar to a real-life situation (but not actually a real-life situation), e.g.\ user-testing a navigation system where the destination is part of the evaluation design, rather than chosen by the user. Conditions in which evaluations are carried out \textit{are} controlled to be the same.
+ \item \egcvalue{evaluation carried out outside of the lab, in a situation designed to resemble a real-life situation, and conditions vary for different evaluators}: choose this option if evaluations are carried out outside of the lab, in a situation intentionally similar to a real-life situation, but the preceding option does not apply, i.e.\ conditions in which evaluations are carried out are \textit{not} controlled to be the same. + \item \egcvalue{Other (please specify)}: Use this space to provide additional, or alternative, information about the conditions in which evaluators carry out assessments, not covered by the options above. +\end{enumerate} + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 3.3.8: Unless the evaluation is carried out at a place of the evaluators' own choosing, briefly describe the (range of different) conditions in which evaluators carry out the evaluations.}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{What to enter in the text box}: use this space to describe the variations in the conditions in which evaluators carry out the evaluation, for both situations where those variations are controlled, and situations where they are not controlled. + +\vspace{.3cm} + +\section{Quality Criterion \textit{n} -- Definition and Operationalisation} +\label{sec:criteria} + +Questions in this section collect information about the $n$th quality criterion assessed in the single human evaluation experiment that this sheet is being completed for. The HEDS 1.0 form allows this section to be completed repeatedly, for up to 10 different quality criteria (see further explanation at the end of the section). + +For more information, in particular about quality criterion properties and evaluation mode properties, see \citet{belz-etal-2020-disentangling}. + + +\subsection{Quality criterion properties (Questions 4.1.1--4.1.3)} + +Questions 4.1.1--4.1.3 capture the aspect of quality that is assessed by a given quality criterion in terms of three orthogonal properties. They help determine e.g.\ whether or not the same aspect of quality is being evaluated in different evaluation experiments. The three properties characterise quality criteria in terms of (i) what type of quality is being assessed; (ii) what aspect of the system output is being assessed; and (iii) whether system outputs are assessed in their own right or with reference to some system-internal or system-external frame of reference. + +\vspace{-.3cm} +\subsection*{\qsecbox{Question 4.1.1: What type of quality is assessed by the quality criterion?}} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{Multiple-choice options (select one)}: +\vspace{-.1cm} + +%%% AB: wording same as in INLG paper, keep +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}] + \item \egcvalue{Correctness}: select this option if it is possible to state, generally for all outputs, the conditions under which outputs are maximally correct (hence of maximal quality). E.g.\ for Grammaticality, outputs are (maximally) correct if they contain no grammatical errors; for Semantic Completeness, outputs are correct if they express all the content in the input. + \item \egcvalue{Goodness}: select this option if, in contrast to correctness criteria, there is no single, general mechanism for deciding when outputs are maximally good, only for deciding for two outputs which is better and which is worse. E.g.\ for Fluency, even if outputs contain no disfluencies, there may be other ways in which any given output could be more fluent.
+ \item \egcvalue{Features}: choose this option if, in terms of property $X$ captured by the criterion, outputs are not generally better if they are more $X$, but instead, depending on evaluation context, more $X$ may be better or less $X$ may be better. E.g.\ outputs can be more specific or less specific, but it’s not the case that outputs are, in the general case, better when they are more specific. +\end{enumerate} + + +\subsection*{\qsecbox{Question 4.1.2: Which aspect of system outputs is assessed by the quality criterion?}} +\vspace{-.1cm} + +\vspace{.3cm} +\noindent\textit{Multiple-choice options (select one)}: +\vspace{-.1cm} + +%%% AB: wording same as in INLG paper, keep +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}] + \item \egcvalue{Form of output}: choose this option if the criterion assesses the form of outputs alone, e.g.\ Grammaticality is only about the form, a sentence can be grammatical yet be wrong or nonsensical in terms of content. + \item \egcvalue{Content of output}: choose this option if the criterion assesses the content/meaning of the output alone, e.g.\ Meaning Preservation only assesses output content; two sentences can be considered to have the same meaning, but differ in form. + \item \egcvalue{Both form and content of output}: choose this option if the criterion assesses outputs as a whole, not just form or just content. E.g.\ Coherence is a property of outputs as a whole, either form or meaning can detract from it. +\end{enumerate} + + +%\vspace{-.3cm} +\subsection*{\qsecbox{Question 4.1.3: Is each output assessed for quality in its own right, or with reference to a system-internal or external frame of reference?}} +%\vspace{-.1cm} + +%\vspace{.3cm} +\noindent\textit{Multiple-choice options (select one)}: +\vspace{-.1cm} + +%%% AB: wording same as in INLG paper, keep +\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}] + \item \egcvalue{Quality of output in its own right}: choose this option if output quality is assessed without referring to anything other than the output itself, i.e.\ no system-internal or external frame of reference. E.g.\ Poeticness is assessed by considering (just) the output and how poetic it is. + \item \egcvalue{Quality of output relative to the input}: choose this option if output quality is assessed relative to the input. E.g.\ Answerability is the degree to which the output question can be answered from information in the input. + \item \egcvalue{Quality of output relative to a system-external frame of reference}: choose this option if output quality is assessed with reference to system-external information, such as a knowledge base, a person’s individual writing style, or the performance of an embedding system. E.g.\ Factual Accuracy assesses outputs relative to a source of real-world knowledge. +\end{enumerate} + + + +\subsection{Evaluation mode properties (Questions 4.2.1--4.2.3)} + +Questions 4.2.1--4.2.3 record properties that are orthogonal to quality criteria, i.e.\ any given quality criterion can in principle be combined with any of the modes (although some combinations are more common than others). 
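+
+\vspace{.3cm}
+\noindent For example (a hypothetical classification, for illustration only): an evaluation in which evaluators rate the Fluency of one output at a time on a 5-point scale, without measuring any effect on an embedding system or user task, would typically be classified as subjective (Question 4.2.1), absolute (Question 4.2.2) and intrinsic (Question 4.2.3) under the definitions below.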
+
+\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.2.1: Does an individual assessment involve an objective or a subjective judgment?}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+%%% AB: wording same as in INLG paper, keep
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{Objective}: Examples of objective assessment include any automatically counted or otherwise quantified measurements such as mouse-clicks, occurrences in text, etc. Repeated assessments of the same output with an objective-mode evaluation method always yield the same score/result.
+    \item \egcvalue{Subjective}: Subjective assessments involve ratings, opinions and preferences by evaluators. Some criteria lend themselves more readily to subjective assessments, e.g.\ Friendliness of a conversational agent, but an objective measure e.g.\ based on lexical markers is also conceivable.
+\end{enumerate}
+
+%\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.2.2: Are outputs assessed in absolute or relative terms?}}
+\vspace{-.1cm}
+
+\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{Absolute}: choose this option if evaluators are shown outputs from a single system during each individual assessment.
+    \item \egcvalue{Relative}: choose this option if evaluators are shown outputs from multiple systems at the same time during assessments, typically ranking or preference-judging them.
+\end{enumerate}
+
+\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.2.3: Is the evaluation intrinsic or extrinsic?}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+%%% AB: wording same as in INLG paper, keep
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{Intrinsic}: Choose this option if quality of outputs is assessed \textit{without} considering their \textit{effect} on something external to the system, e.g.\ the performance of an embedding system or of a user at a task.
+    \item \egcvalue{Extrinsic}: Choose this option if quality of outputs is assessed in terms of their \textit{effect} on something external to the system such as the performance of an embedding system or of a user at a task.
+\end{enumerate}
+
+
+\subsection{Response elicitation (Questions 4.3.1--4.3.11)}
+
+Questions 4.3.1--4.3.11 record information about how responses are elicited for the quality criterion this section is being completed for.
+
+\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.1: What do you call the quality criterion in explanations/interfaces to evaluators? Enter `N/A' if criterion not named.}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: the name you use to refer to the quality criterion in explanations and/or interfaces created for evaluators. Examples of quality criterion names include Fluency, Clarity, Meaning Preservation. If no name is used, state `N/A'.
+
+%\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.2: What definition do you give for the quality criterion in explanations/interfaces to evaluators? Enter `N/A' if no definition given.}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: Copy and paste the verbatim definition you give to evaluators to explain the quality criterion they're assessing. If you don't explicitly call it a definition, enter the nearest thing to a definition you give them. If you don't give any definition, state `N/A'.
+
+%\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.3: Size of scale or other rating instrument (i.e.\ how many different possible values there are). Answer should be an integer or `continuous' (if it's not possible to state how many possible responses there are). Enter `N/A' if there is no rating instrument.}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: The number of different response values for this quality criterion. E.g.\ for a 5-point Likert scale, the size to enter is 5. For two-way forced-choice preference judgments, it is 2; if there's also a no-preference option, enter 3. For a slider that is mapped to 100 different values for the purpose of recording assessments, the size to enter is 100. If no rating instrument is used (e.g.\ when evaluation gathers post-edits or qualitative feedback only), enter `N/A'.
+
+%\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.4: List or range of possible values of the scale or other rating instrument. Enter `N/A', if there is no rating instrument.}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: list, or give the range of, the possible values of the rating instrument. The list or range should be of the size specified in Question 4.3.3. If there are too many to list, use a range. E.g.\ for two-way forced-choice preference judgments, the list entered might be \textit{A better, B better}; if there's also a no-preference option, the list
+might be \textit{A better, B better, neither}. For a slider that is mapped to 100 different values for the purpose of recording assessments, the range \textit{1--100} might be entered. If no rating instrument is used (e.g.\ when evaluation gathers post-edits or qualitative feedback only), enter `N/A'.
+
+%\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.5: How is the scale or other rating instrument presented to evaluators? If none match, select `Other' and describe.}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{Multiple-choice options}: choose this option if evaluators select exactly one of multiple options.
+    \item \egcvalue{Check-boxes}: choose this option if evaluators select any number of options from multiple given options.
+    \item \egcvalue{Slider}: choose this option if evaluators move a pointer on a slider scale to the position corresponding to their assessment.
+    \item \egcvalue{N/A (there is no rating instrument)}: choose this option if there is no rating instrument.
+    \item \egcvalue{Other (please specify)}: choose this option if there is a rating instrument, but none of the above adequately describe the way you present it to evaluators. Use the text box to describe the rating instrument and link to a screenshot.
+\end{enumerate}
+
+%\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.6: If there is no rating instrument, describe briefly what task the evaluators perform (e.g.\ ranking multiple outputs, finding information, playing a game, etc.), and what information is recorded. 
Enter `N/A' if there is a rating instrument.}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: If (and only if) there is no rating instrument, i.e.\ you entered `N/A' for Questions 4.3.3--4.3.5, describe the task evaluators perform in this space. Otherwise, enter `N/A' here if there \textit{is} a rating instrument.
+
+%\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.7: What is the verbatim question, prompt or instruction given to evaluators (visible to them during each individual assessment)?}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: Copy and paste the verbatim text that evaluators see during each assessment, which is intended to convey the evaluation task to them. E.g.\ \textit{Which of these texts do you prefer?} Or \textit{Make any corrections to this text that you think are necessary in order to improve it to the point where you would be happy to provide it to a client.}
+
+%\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.8: Form of response elicitation. If none match, select `Other' and describe.}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{Multiple-choice options (select one)}:\footnote{Explanations adapted from \citet{howcroft-etal-2020-twenty}.}
+\vspace{-.1cm}
+
+\begin{enumerate}[itemsep=0cm,leftmargin=0.5cm,label={\LARGE $\circ$}]
+    \item \egcvalue{(dis)agreement with quality statement}: Participants specify the degree to which they agree with a given quality statement by indicating their agreement on a rating instrument. The rating instrument is labelled with degrees of agreement and can additionally have numerical labels. E.g.\ \textit{This text is fluent --- 1=strongly disagree...5=strongly agree}.
+    \item \egcvalue{direct quality estimation}: Participants are asked to provide a rating using a rating instrument, which typically (but not always) mentions the quality criterion explicitly. E.g.\ \textit{How fluent is this text? --- 1=not at all fluent...5=very fluent}.
+    \item \egcvalue{relative quality estimation (including ranking)}: Participants evaluate two or more items in terms of which is better.
+    E.g.\ \textit{Rank these texts in terms of fluency}; \textit{Which of these texts is more fluent?}; \textit{Which of these items do you prefer?}.
+    \item \egcvalue{counting occurrences in text}: Evaluators are asked to count how many times some type of phenomenon occurs, e.g.\ the number of facts contained in the output that are inconsistent with the input.
+    \item \egcvalue{qualitative feedback (e.g.\ via comments entered in a text box)}: Typically, these are responses to open-ended questions in a survey or interview.
+    \item \egcvalue{evaluation through post-editing/annotation}: Choose this option if the evaluators' task consists of editing or inserting annotations in text. E.g.\ evaluators may perform error correction and edits are then automatically measured to yield a numerical score.
+    \item \egcvalue{output classification or labelling}: Choose this option if evaluators assign outputs to categories. E.g.\ \textit{What is the overall sentiment of this piece of text? --- Positive/neutral/negative.}
+    \item \egcvalue{user-text interaction measurements}: choose this option if participants in the evaluation experiment interact with a text in some way, and measurements are taken of their interaction. E.g.\ reading speed, eye movement tracking, comprehension questions, etc. 
Excludes situations where participants are given a task to solve and their performance is measured, which comes under the next option.
+    \item \egcvalue{task performance measurements}: choose this option if participants in the evaluation experiment are given a task to perform, and measurements are taken of their performance at the task. E.g.\ task is finding information, and task performance measurement is task completion speed and success rate.
+    \item \egcvalue{user-system interaction measurements}: choose this option if participants in the evaluation experiment interact with a system in some way, while measurements are taken of their interaction. E.g.\ duration of interaction, hyperlinks followed, number of likes, or completed sales.
+    \item \egcvalue{Other (please specify)}: Use the text box to describe the form of response elicitation used in assessing the quality criterion if it doesn't fall in any of the above categories.
+\end{enumerate}
+
+%\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.9: How are raw responses from participants aggregated or otherwise processed to obtain reported scores for this quality criterion? State if no scores reported.}}
+\vspace{-.1cm}
+
+\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: normally a set of separate assessments is collected from evaluators and is converted to the results as reported. Describe here the method(s) used in the conversion(s). E.g.\ macro-averages or micro-averages are computed from numerical scores to provide summary, per-system results.
+
+\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.10: Method(s) used for determining effect size and significance of findings for this quality criterion.}}
+\vspace{-.1cm}
+
+\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: A list of methods used for calculating the effect size and significance of any results, both as reported in the paper given in Question 1.1, for this quality criterion. If none calculated, state `None'.
+
+\vspace{-.3cm}
+\subsection*{\qsecbox{Question 4.3.11: Has the inter-annotator and intra-annotator agreement between evaluators for this quality criterion been measured? If yes, what method was used, and what are the agreement scores?}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: the methods used to compute, and results obtained from, any measures of inter-annotator and intra-annotator agreement obtained for the quality criterion.
+
+\vspace{.3cm}
+\noindent The section ends with the question \textbf{Is there another quality criterion in the evaluation experiment that you haven't completed this section for yet?} If \textbf{Yes} is selected, please copy this section and complete it for the next criterion. If \textbf{No}, the next section will be the Ethics section below.
+
+\section{Ethics}\label{sec:ethics}
+
+The questions in this section relate to ethical aspects of the evaluation. Information can be entered in the text box provided, and/or by linking to a source where complete information can be found.
+
+\vspace{-.3cm}
+\subsection*{\qsecbox{Question 5.1: Has the evaluation experiment this sheet is being completed for, or the larger study it is part of, been approved by a research ethics committee? If yes, which research ethics committee?}}
+\vspace{-.1cm}
+
+
+\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: Typically, research organisations, universities and other higher-education institutions require some form of ethical approval before experiments involving human participants, however innocuous, are permitted to proceed. Please provide here the name of the body that approved the experiment, or state `No' if approval has not (yet) been obtained.
+
+\vspace{-.3cm}
+\subsection*{\qsecbox{Question 5.2: Do any of the system outputs (or human-authored stand-ins) evaluated, or do any of the responses collected, in the experiment contain personal data (as defined in GDPR Art. 4, §1: https://gdpr.eu/article-4-definitions/)? If yes, describe data and state how addressed.}}
+\vspace{-.1cm}
+
+\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: State `No' if no personal data as defined by GDPR was recorded or collected, otherwise explain how conformity with GDPR requirements such as privacy and security was ensured, e.g.\ by linking to the (successful) application for ethics approval from Question 5.1.
+
+\vspace{-.3cm}
+\subsection*{\qsecbox{Question 5.3: Do any of the system outputs (or human-authored stand-ins) evaluated, or do any of the responses collected, in the experiment contain special category information (as defined in GDPR Art. 9, §1: https://gdpr.eu/article-9-processing-special-categories-of-personal-data-prohibited/)? If yes, describe data and state how addressed.}}
+\vspace{-.1cm}
+
+\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: State `No' if no special-category data as defined by GDPR was recorded or collected, otherwise explain how conformity with GDPR requirements relating to special-category data was ensured, e.g.\ by linking to the (successful) application for ethics approval from Question 5.1.
+
+\vspace{-.3cm}
+\subsection*{\qsecbox{Question 5.4: Have any impact assessments been carried out for the evaluation experiment, and/or any data collected/evaluated in connection with it? If yes, summarise approach(es) and outcomes.}}
+%\vspace{-.1cm}
+
+%\vspace{.3cm}
+\noindent\textit{What to enter in the text box}: Use this box to describe any \textit{ex ante} or \textit{ex post} impact assessments that have been carried out in relation to the evaluation experiment, such that the assessment plan and process, as well as the outcomes, were captured in written form. Link to documents if possible. Types of impact assessment include data protection impact assessments, e.g.\ under GDPR.\footnote{\footnotesize \url{https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/accountability-and-governance/data-protection-impact-assessments/}} Environmental and social impact assessment frameworks are also available.
+
+
+\section*{Credits}
+
+Questions 2.1--2.5 relating to the evaluated system, and 4.3.1--4.3.8 relating to response elicitation, are based on \citet{howcroft-etal-2020-twenty}, with some significant changes. Questions 4.1.1--4.2.3 relating to quality criteria, and some of the questions about system outputs, evaluators, and experimental design (3.1.1--3.2.3, 4.3.5, 4.3.6, 4.3.9--4.3.11) are based on \citet{belz-etal-2020-disentangling}.
+HEDS was also informed by \citet{van2019best, vanderlee2021} and by \citet{gehrmann2021gem}'s\footnote{\footnotesize \url{https://gem-benchmark.com/data\_cards/guide}} data card guide. 
+ +More generally, the original inspiration for creating a `datasheet' for describing human evaluation experiments of course comes from seminal papers by \citet{bender-friedman-2018-data}, \citet{mitchell2019modelcards} and \citet{gebru2018datasheets}. + +\bibliography{human-evaluation-datasheet} +\bibliographystyle{acl_natbib} + +\end{document} diff --git a/sheet/markdown/human-evaluation-datasheet.md b/sheet/markdown/human-evaluation-datasheet.md new file mode 100644 index 0000000..273ebfb --- /dev/null +++ b/sheet/markdown/human-evaluation-datasheet.md @@ -0,0 +1,1023 @@ +--- +bibliography: ../sheet/latex/human-evaluation-datasheet.bib +csl: ../sheet/latex/apa-annotated-bibliography.csl +title: | + The Human Evaluation Datasheet 1.0: A Template for Recording + Details of Human Evaluation Experiments in NLP + (described in Shimorina & Belz (2021)) +--- + +# Paper and Supplementary Resources (Questions 1.1–1.3) + +Questions 1.1–1.3 record bibliographic and related information. These +are straightforward and don’t warrant much in-depth explanation. + +## Question 1.1: Link to paper reporting the evaluation experiment. If the paper reports more than one experiment, state which experiment you’re completing this sheet for. Or, if applicable, enter ‘for preregistration.’ + +*What to enter in the text box*: a link to an online copy of the main +reference for the human evaluation experiment, identifying which of the +experiments the form is being completed for if there are several. If the +experiment hasn’t been run yet, and the form is being completed for the +purpose of submitting it for preregistration, simply enter ‘for +preregistration.’ + +## Question 1.2: Link to website providing resources used in the evaluation experiment (e.g. system outputs, evaluation tools, etc.). If there isn’t one, enter ‘N/A.’ + +*What to enter in the text box*: link(s) to any resources used in the +evaluation experiment, such as system outputs, evaluation tools, etc. If +there aren’t any publicly shared resources (yet), enter ‘N/A’.   + +## Question 1.3: Name, affiliation and email address of person completing this sheet, and of contact author if different. + +*What to enter in the text box*: names, affiliations and email addresses +as appropriate. + +# System (Questions 2.1–2.5) + +Questions 2.1–2.5 record information about the system(s) (or +human-authored stand-ins) whose outputs are evaluated in the Evaluation +experiment that this sheet is being completed for. + +The input, output, and task questions in this section are closely +interrelated: the value for one partially determines the others, as +indicated for some combinations in Question 2.3. + +## Question 2.1: What type of input do the evaluated system(s) take? Select all that apply. If none match, select ‘Other’ and describe. + +Describe the type of input, where input refers to the representations +and/or data structures shared by all evaluated systems. + +This question is about input type, regardless of number. E.g. if the +input is a set of documents, you would still select *text: document* +below. + +*Check-box options (select all that apply)*: + +1. ***raw/structured data***: numerical, symbolic, and other data, + possibly structured into trees, graphs, graphical models, etc. May + be the input e.g. to Referring Expression Generation (REG), + end-to-end text generation, etc. NB: excludes linguistic structures. + +2. 
***deep linguistic representation (DLR)***: any of a variety of
+    deep, underspecified, semantic representations, such as abstract
+    meaning representations (AMRs; Banarescu et al., 2013) or discourse
+    representation structures (DRSs; Kamp & Reyle, 2013).
+
+3.  ***shallow linguistic representation (SLR)***: any of a variety of
+    shallow, syntactic representations, e.g. Universal Dependency (UD)
+    structures; typically the input to surface realisation.
+
+4.  ***text: subsentential unit of text***: a unit of text shorter than
+    a sentence, e.g. Referring Expressions (REs), verb phrase, text
+    fragment of any length; includes titles/headlines.
+
+5.  ***text: sentence***: a single sentence (or set of sentences).
+
+6.  ***text: multiple sentences***: a sequence of multiple sentences,
+    without any document structure (or a set of such sequences).
+
+7.  ***text: document***: a text with document structure, such as a
+    title, paragraph breaks or sections, e.g. a set of news reports for
+    summarisation.
+
+8.  ***text: dialogue***: a dialogue of any length, excluding a single
+    turn which would come under one of the other text types.
+
+9.  ***text: other***: input is text but doesn’t match any of the above
+    *text:\** categories.
+
+10. ***speech***: a recording of speech.
+
+11. ***visual***: an image or video.
+
+12. ***multi-modal***: catch-all value for any combination of data
+    and/or linguistic representation and/or visual data etc.
+
+13. ***control feature***: a feature or parameter specifically present
+    to control a property of the output text, e.g. positive stance,
+    formality, author style.
+
+14. ***no input (human generation)***: human generation[1], therefore no
+    system inputs.
+
+15. ***other (please specify)***: if input is none of the above, choose
+    this option and describe it.
+
+## Question 2.2: What type of output do the evaluated system(s) generate? Select all that apply. If none match, select ‘Other’ and describe.
+
+Describe the type of output, where output refers to the representations
+and/or data structures shared by all evaluated systems.
+
+This question is about output type, regardless of number. E.g. if the
+output is a set of documents, you would still select *text: document*
+below.
+
+Note that the options for outputs are the same as for inputs minus the
+*control feature* option.
+
+*Check-box options (select all that apply)*:
+
+1.  ***raw/structured data***: numerical, symbolic, and other data,
+    possibly structured into trees, graphs, graphical models, etc. May
+    be the input e.g. to Referring Expression Generation (REG),
+    end-to-end text generation, etc. NB: excludes linguistic structures.
+
+2.  ***deep linguistic representation (DLR)***: any of a variety of
+    deep, underspecified, semantic representations, such as abstract
+    meaning representations (AMRs; Banarescu et al., 2013) or discourse
+    representation structures (DRSs; Kamp & Reyle, 2013).
+
+3.  ***shallow linguistic representation (SLR)***: any of a variety of
+    shallow, syntactic representations, e.g. Universal Dependency (UD)
+    structures; typically the input to surface realisation.
+
+4.  ***text: subsentential unit of text***: a unit of text shorter than
+    a sentence, e.g. Referring Expressions (REs), verb phrase, text
+    fragment of any length; includes titles/headlines.
+
+5.  ***text: sentence***: a single sentence (or set of sentences).
+
+6.  ***text: multiple sentences***: a sequence of multiple sentences,
+    without any document structure (or a set of such sequences).
+
+7. 
***text: document***: a text with document structure, such as a + title, paragraph breaks or sections, e.g. a set of news reports for + summarisation. + +8. ***text: dialogue***: a dialogue of any length, excluding a single + turn which would come under one of the other text types. + +9. ***text: other***: select if output is text but doesn’t match any of + the above *text:\** categories. + +10. ***speech***: a recording of speech. + +11. ***visual***: an image or video. + +12. ***multi-modal***: catch-all value for any combination of data + and/or linguistic representation and/or visual data etc. + +13. ***human-generated ‘outputs’***: manually created stand-ins + exemplifying outputs. + +14. ***other (please specify)***: if output is none of the above, choose + this option and describe it. + +## Question 2.3: How would you describe the task that the evaluated system(s) perform in mapping the inputs in Q2.1 to the outputs in Q2.2? Occasionally, more than one of the options below may apply. If none match, select ‘Other’ and describe. + +This field records the task performed by the system(s) being evaluated. +This is independent of the application domain (financial reporting, +weather forecasting, etc.), or the specific method (rule-based, neural, +etc.) implemented in the system. We indicate mutual constraints between +inputs, outputs and task for some of the options below. + +*Check-box options (select all that apply)*: + +1. ***content selection/determination***: selecting the specific + content that will be expressed in the generated text from a + representation of possible content. This could be attribute + selection for REG (without the surface realisation step). Note that + the output here is not text. + +2. ***content ordering/structuring***: assigning an order and/or + structure to content to be included in generated text. Note that the + output here is not text. + +3. ***aggregation***: converting inputs (typically *deep linguistic + representations* or *shallow linguistic representations*) in some + way in order to reduce redundancy (e.g. representations for ‘they + like swimming,’ ‘they like running’ → representation for ‘they like + swimming and running’). + +4. ***referring expression generation***: generating *text* to refer to + a given referent, typically represented in the input as a set of + attributes or a linguistic representation. + +5. ***lexicalisation***: associating (parts of) an input representation + with specific lexical items to be used in their realisation. + +6. ***deep generation***: one-step text generation from *raw/structured + data* or *deep linguistic representations*. One-step means that no + intermediate representations are passed from one independently run + module to another. + +7. ***surface realisation (SLR to text)***: one-step text generation + from *shallow linguistic representations*. One-step means that no + intermediate representations are passed from one independently run + module to another. + +8. ***feature-controlled text generation***: generation of text that + varies along specific dimensions where the variation is controlled + via *control feature*s specified as part of the input. Input is a + non-textual representation (for feature-controlled text-to-text + generation select the matching text-to-text task). + +9. ***data-to-text generation***: generation from *raw/structured data* + which may or may not include some amount of content selection as + part of the generation process. Output is likely to be *text:\** or + *multi-modal*. 
+
+10. ***dialogue turn generation***: generating a dialogue turn (can be a
+    greeting or closing) from a representation of dialogue state and/or
+    last turn(s), etc.
+
+11. ***question generation***: generation of questions from given input
+    text and/or knowledge base such that the question can be answered
+    from the input.
+
+12. ***question answering***: input is a question plus optionally a set
+    of reference texts and/or knowledge base, and the output is the
+    answer to the question.
+
+13. ***paraphrasing/lossless simplification***: text-to-text generation
+    where the aim is to preserve the meaning of the input while changing
+    its wording. This can include the aim of changing the text on a
+    given dimension, e.g. making it simpler, changing its stance or
+    sentiment, etc., which may be controllable via input features. Note
+    that this task type includes meaning-preserving text simplification
+    (non-meaning preserving simplification comes under
+    *compression/lossy simplification* below).
+
+14. ***compression/lossy simplification***: text-to-text generation that
+    has the aim to generate a shorter, or shorter and simpler, version
+    of the input text. This will normally affect meaning to some extent,
+    but as a side effect, rather than the primary aim, as is the case in
+    *summarisation*.
+
+15. ***machine translation***: translating text in a source language to
+    text in a target language while maximally preserving the meaning.
+
+16. ***summarisation (text-to-text)***: output is an extractive or
+    abstractive summary of the important/relevant/salient content of the
+    input document(s).
+
+17. ***end-to-end text generation***: use this option if the single
+    system task corresponds to more than one of the tasks above,
+    implemented either as separate modules pipelined together, or as
+    one-step generation, other than *deep generation* and *surface
+    realisation*.
+
+18. ***image/video description***: input includes *visual*, and the
+    output describes it in some way.
+
+19. ***post-editing/correction***: system edits and/or corrects the
+    input text (typically itself the textual output from another system)
+    to yield an improved version of the text.
+
+20. ***other (please specify)***: if task is none of the above, choose
+    this option and describe it.
+
+## Question 2.4: Input Language(s), or ‘N/A.’
+
+This field records the language(s) of the inputs accepted by the
+system(s) being evaluated.
+
+*What to enter in the text box*: any language name(s) that apply, mapped
+to standardised full language names in ISO 639-1[2]. E.g. English,
+Herero, Hindi. If no language is accepted as (part of) the input, enter
+‘N/A.’
+
+## Question 2.5: Output Language(s), or ‘N/A.’
+
+This field records the language(s) of the outputs generated by the
+system(s) being evaluated.
+
+*What to enter in the text box*: any language name(s) that apply, mapped
+to standardised full language names in ISO 639-1 (2019). E.g. English,
+Herero, Hindi. If no language is generated, enter ‘N/A.’
+
+# Output Sample, Evaluators, Experimental Design
+
+## Sample of system outputs (or human-authored stand-ins) evaluated (Questions 3.1.1–3.1.3)
+
+Questions 3.1.1–3.1.3 record information about the size of the sample of
+outputs (or human-authored stand-ins) evaluated per system, how the
+sample was selected, and what its statistical power is.
+
+## Question 3.1.1: How many system outputs (or other evaluation items) are evaluated per system in the evaluation experiment? Answer should be an integer. 
+ +*What to enter in the text box*: The number of system outputs (or other +evaluation items) that are evaluated per system by at least one +evaluator in the experiment, as an integer. + +## Question 3.1.2: How are system outputs (or other evaluation items) selected for inclusion in the evaluation experiment? If none match, select ‘Other’ and describe. + +*Multiple-choice options (select one)*: + +1. ***by an automatic random process from a larger set***: outputs were + selected for inclusion in the experiment by a script using a + pseudo-random number generator; don’t use this option if the script + selects every *n*th output (which is not random). + +2. ***by an automatic random process but using stratified sampling over + given properties***: use this option if selection was by a random + script as above, but with added constraints ensuring that the sample + is representative of the set of outputs it was selected from, in + terms of given properties, such as sentence length, + positive/negative stance, etc. + +3. ***by manual, arbitrary selection***: output sample was selected by + hand, or automatically from a manually compiled list, without a + specific selection criterion. + +4. ***by manual selection aimed at achieving balance or variety + relative to given properties***: selection by hand as above, but + with specific selection criteria, e.g. same number of outputs from + each time period. + +5. ***Other (please specify)***: if selection method is none of the + above, choose this option and describe it. + +## Question 3.1.3: What is the statistical power of the sample size? + +*What to enter in the text box*: The results of a statistical power +calculation on the output sample: provide numerical results and a link +to the script used (or another way of identifying the script). See, +e.g., Card et al. (2020). + +## Evaluators (Questions 3.2.1–3.2.4) + +Questions 3.2.1–3.2.4 record information about the evaluators +participating in the experiment. + +## Question 3.2.1: How many evaluators are there in this experiment? Answer should be an integer. + +*What to enter in the text box*: the total number of evaluators +participating in the experiment, as an integer. + +## Question 3.2.2: What kind of evaluators are in this experiment? Select all that apply. If none match, select ‘Other’ and describe. In all cases, provide details in the text box under ‘Other.’ + +*Check-box options (select all that apply)*: + +1. ***experts***: participants are considered domain experts, + e.g. meteorologists evaluating a weather forecast generator, or + nurses evaluating an ICU report generator. + +2. ***non-experts***: participants are not domain experts. + +3. ***paid (including non-monetary compensation such as course + credits)***: participants were given some form of compensation for + their participation, including vouchers, course credits, and + reimbursement for travel unless based on receipts. + +4. ***not paid***: participants were not given compensation of any + kind. + +5. ***previously known to authors***: (one of the) researchers running + the experiment knew some or all of the participants before + recruiting them for the experiment. + +6. ***not previously known to authors***: none of the researchers + running the experiment knew any of the participants before + recruiting them for the experiment. + +7. ***evaluators include one or more of the authors***: one or more + researchers running the experiment was among the participants. + +8. 
***evaluators do not include any of the authors***: none of the
+    researchers running the experiment were among the participants.
+
+9.  ***Other*** (fewer than 4 of the above apply): we believe you should
+    be able to tick 4 of the options above. If that’s not the case, use
+    this box to explain.
+
+## Question 3.2.3: How are evaluators recruited?
+
+*What to enter in the text box*: Please explain how your evaluators are
+recruited. Do you send emails to a given list? Do you post invitations
+on social media? Posters on university walls? Were there any gatekeepers
+involved? What are the exclusion/inclusion criteria?
+
+## Question 3.2.4: What training and/or practice are evaluators given before starting on the evaluation itself?
+
+*What to enter in the text box*: Use this space to describe any training
+evaluators were given as part of the experiment to prepare them for the
+evaluation task, including any practice evaluations they did. This
+includes any introductory explanations they’re given, e.g. on the start
+page of an online evaluation tool.
+
+## Question 3.2.5: What other characteristics do the evaluators have, known either because these were qualifying criteria, or from information gathered as part of the evaluation?
+
+*What to enter in the text box*: Use this space to list any
+characteristics not covered in previous questions that the evaluators
+are known to have, either because evaluators were selected on the basis
+of a characteristic, or because information about a characteristic was
+collected as part of the evaluation. This might include geographic
+location of IP address, educational level, or demographic information
+such as gender, age, etc. Where characteristics differ among evaluators
+(e.g. gender, age, location etc.), also give numbers for each subgroup.
+
+## Experimental design (Questions 3.3.1–3.3.8)
+
+Questions 3.3.1–3.3.8 record information about the experimental design
+of the evaluation experiment.
+
+## Question 3.3.1: Has the experimental design been preregistered? If yes, on which registry?
+
+*What to enter in the text box*: State ‘Yes’ or ‘No’; if ‘Yes’ also give
+the name of the registry and a link to the registration page for the
+experiment.
+
+## Question 3.3.2: How are responses collected? E.g. paper forms, online survey tool, etc.
+
+*What to enter in the text box*: Use this space to describe how you
+collected responses, e.g. paper forms, Google forms, SurveyMonkey,
+Mechanical Turk, CrowdFlower, audio/video recording, etc.
+
+## Question 3.3.3: What quality assurance methods are used? Select all that apply. If none match, select ‘Other’ and describe. In all cases, provide details in the text box under ‘Other.’
+
+*Check-box options (select all that apply)*:
+
+1.  ***evaluators are required to be native speakers of the language
+    they evaluate***: mechanisms are in place to ensure all participants
+    are native speakers of the language they evaluate.
+
+2.  ***automatic quality checking methods are used during/post
+    evaluation***: evaluations are checked for quality by automatic
+    scripts during or after evaluations, e.g. evaluators are given known
+    bad/good outputs to check they’re given bad/good scores on MTurk.
+
+3.  ***manual quality checking methods are used during/post
+    evaluation***: evaluations are checked for quality by a manual
+    process during or after evaluations, e.g. scores assigned by
+    evaluators are monitored by researchers conducting the experiment.
+
+4. 
***evaluators are excluded if they fail quality checks (often or + badly enough)***: there are conditions under which evaluations + produced by participants are not included in the final results due + to quality issues. + +5. ***some evaluations are excluded because of failed quality + checks***: there are conditions under which some (but not all) of + the evaluations produced by some participants are not included in + the final results due to quality issues. + +6. ***none of the above***: tick this box if none of the above apply. + +7. ***Other (please specify)***: use this box to describe any other + quality assurance methods used during or after evaluations, and to + provide additional details for any of the options selected above. + +## Question 3.3.4: What do evaluators see when carrying out evaluations? Link to screenshot(s) and/or describe the evaluation interface(s). + +*What to enter in the text box*: Use this space to describe the +interface, paper form, etc. that evaluators see when they carry out the +evaluation. Link to a screenshot/copy if possible. If there is a +separate introductory interface/page, include it under Question 3.2.4. + +## 3.3.5: How free are evaluators regarding when and how quickly to carry out evaluations? Select all that apply. In all cases, provide details in the text box under ‘Other.’ + +*Check-box options (select all that apply)*: + +1. ***evaluators have to complete each individual assessment within a + set time***: evaluators are timed while carrying out each assessment + and cannot complete the assessment once time has run out. + +2. ***evaluators have to complete the whole evaluation in one + sitting***: partial progress cannot be saved and the evaluation + returned to on a later occasion. + +3. ***neither of the above***: Choose this option if neither of the + above are the case in the experiment. + +4. ***Other (please specify)***: Use this space to describe any other + way in which time taken or number of sessions used by evaluators is + controlled in the experiment, and to provide additional details for + any of the options selected above. + +## 3.3.6: Are evaluators told they can ask questions about the evaluation and/or provide feedback? Select all that apply. In all cases, provide details in the text box under ‘Other.’ + +*Check-box options (select all that apply)*: + +1. ***evaluators are told they can ask any questions during/after + receiving initial training/instructions, and before the start of the + evaluation***: evaluators are told explicitly that they can ask + questions about the evaluation experiment *before* starting on their + assessments, either during or after training. + +2. ***evaluators are told they can ask any questions during the + evaluation***: evaluators are told explicitly that they can ask + questions about the evaluation experiment *during* their + assessments. + +3. ***evaluators are asked for feedback and/or comments after the + evaluation, e.g. via an exit questionnaire or a comment box***: + evaluators are explicitly asked to provide feedback and/or comments + about the experiment *after* their assessments, either verbally or + in written form. + +4. ***None of the above***: Choose this option if none of the above are + the case in the experiment. + +5. ***Other (please specify)***: use this space to describe any other + ways you provide for evaluators to ask questions or provide + feedback. + +## 3.3.7: What are the experimental conditions in which evaluators carry out the evaluations? 
If none match, select ‘Other’ and describe. + +*Multiple-choice options (select one)*: + +1. ***evaluation carried out by evaluators at a place of their own + choosing, e.g. online, using a paper form, etc.***: evaluators are + given access to the tool or form specified in Question 3.3.2, and + subsequently choose where to carry out their evaluations. + +2. ***evaluation carried out in a lab, and conditions are the same for + each evaluator***: evaluations are carried out in a lab, and + conditions in which evaluations are carried out *are* controlled to + be the same, i.e. the different evaluators all carry out the + evaluations in identical conditions of quietness, same type of + computer, same room, etc. Note we’re not after very fine-grained + differences here, such as time of day or temperature, but the line + is difficult to draw, so some judgment is involved here. + +3. ***evaluation carried out in a lab, and conditions vary for + different evaluators***: choose this option if evaluations are + carried out in a lab, but the preceding option does not apply, + i.e. conditions in which evaluations are carried out are *not* + controlled to be the same. + +4. ***evaluation carried out in a real-life situation, and conditions + are the same for each evaluator***: evaluations are carried out in a + real-life situation, i.e. one that would occur whether or not the + evaluation was carried out (e.g. evaluating a dialogue system + deployed in a live chat function on a website), and conditions in + which evaluations are carried out *are* controlled to be the same. + +5. ***evaluation carried out in a real-life situation, and conditions + vary for different evaluators***: choose this option if evaluations + are carried out in a real-life situation, but the preceding option + does not apply, i.e. conditions in which evaluations are carried out + are *not* controlled to be the same. + +6. ***evaluation carried out outside of the lab, in a situation + designed to resemble a real-life situation, and conditions are the + same for each evaluator***: evaluations are carried out outside of + the lab, in a situation intentionally similar to a real-life + situation (but not actually a real-life situation), + e.g. user-testing a navigation system where the destination is part + of the evaluation design, rather than chosen by the user. Conditions + in which evaluations are carried out *are* controlled to be the + same. + +7. ***evaluation carried out outside of the lab, in a situation + designed to resemble a real-life situation, and conditions vary for + different evaluators***: choose this option if evaluations are + carried out outside of the lab, in a situation intentionally similar + to a real-life situation, but the preceding option does not apply, + i.e. conditions in which evaluations are carried out are *not* + controlled to be the same. + +8. ***Other (please specify)***: Use this space to provide additional, + or alternative, information about the conditions in which evaluators + carry out assessments, not covered by the options above. + +## 3.3.8: Unless the evaluation is carried out at a place of the evaluators’ own choosing, briefly describe the (range of different) conditions in which evaluators carry out the evaluations. + +*What to enter in the text box*: use this space to describe the +variations in the conditions in which evaluators carry out the +evaluation, for both situations where those variations are controlled, +and situations where they are not controlled. 
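+
+As an illustration of the kind of script Question 3.1.3 asks about, the
+following is a minimal sketch of a simple parametric power calculation
+for a two-system comparison, written in Python with placeholder numbers;
+it is not the script from any particular experiment, and Card et
+al. (2020) describe more tailored, simulation-based alternatives:
+
+```python
+# Hypothetical power check for Question 3.1.3 (illustrative only).
+# Assumes roughly normal per-output quality scores and an expected
+# standardised effect size (Cohen's d) between the two systems.
+from statsmodels.stats.power import TTestIndPower
+
+n_outputs_per_system = 100  # sample size, as reported in Question 3.1.1
+expected_d = 0.3            # smallest effect size considered meaningful
+alpha = 0.05                # significance level
+
+power = TTestIndPower().solve_power(
+    effect_size=expected_d, nobs1=n_outputs_per_system, alpha=alpha
+)
+print(f"Estimated power: {power:.2f}")  # the number Question 3.1.3 asks for
+```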
+ +# Quality Criterion *n* – Definition and Operationalisation + +Questions in this section collect information about the *n*th quality +criterion assessed in the single human evaluation experiment that this +sheet is being completed for. The HEDS 1.0 form allows this section to +be completed repeatedly, for up to 10 different quality criteria (see +further explanation at the end of the section). + +For more information, in particular about quality criterion properties +and evaluation mode properties, see Belz et al. (2020). + +## Quality criterion properties (Questions 4.1.1–4.1.3) + +Questions 4.1.1–4.1.3 capture the aspect of quality that is assessed by +a given quality criterion in terms of three orthogonal properties. They +help determine e.g. whether or not the same aspect of quality is being +evaluated in different evaluation experiments. The three properties +characterise quality criteria in terms of (i) what type of quality is +being assessed; (ii) what aspect of the system output is being assessed; +and (iii) whether system outputs are assessed in their own right or with +reference to some system-internal or system-external frame of reference. + +## Question 4.1.1: What type of quality is assessed by the quality criterion? + +*Multiple-choice options (select one)*: + +1. ***Correctness***: select this option if it is possible to state, + generally for all outputs, the conditions under which outputs are + maximally correct (hence of maximal quality). E.g. for + Grammaticality, outputs are (maximally) correct if they contain no + grammatical errors; for Semantic Completeness, outputs are correct + if they express all the content in the input. + +2. ***Goodness***: select this option if, in contrast to correctness + criteria, there is no single, general mechanism for deciding when + outputs are maximally good, only for deciding for two outputs which + is better and which is worse. E.g. for Fluency, even if outputs + contain no disfluencies, there may be other ways in which any given + output could be more fluent. + +3. ***Features***: choose this option if, in terms of property *X* + captured by the criterion, outputs are not generally better if they + are more *X*, but instead, depending on evaluation context, more *X* + may be better or less *X* may be better. E.g. outputs can be more + specific or less specific, but it’s not the case that outputs are, + in the general case, better when they are more specific. + +## Question 4.1.2: Which aspect of system outputs is assessed by the quality criterion? + +*Multiple-choice options (select one)*: + +1. ***Form of output***: choose this option if the criterion assesses + the form of outputs alone, e.g. Grammaticality is only about the + form, a sentence can be grammatical yet be wrong or nonsensical in + terms of content. + +2. ***Content of output***: choose this option if the criterion + assesses the content/meaning of the output alone, e.g. Meaning + Preservation only assesses output content; two sentences can be + considered to have the same meaning, but differ in form. + +3. ***Both form and content of output***: choose this option if the + criterion assesses outputs as a whole, not just form or just + content. E.g. Coherence is a property of outputs as a whole, either + form or meaning can detract from it. + +## Question 4.1.3: Is each output assessed for quality in its own right, or with reference to a system-internal or external frame of reference? + +*Multiple-choice options (select one)*: + +1. 
***Quality of output in its own right***: choose this option if + output quality is assessed without referring to anything other than + the output itself, i.e. no system-internal or external frame of + reference. E.g. Poeticness is assessed by considering (just) the + output and how poetic it is. + +2. ***Quality of output relative to the input***: choose this option if + output quality is assessed relative to the input. E.g. Answerability + is the degree to which the output question can be answered from + information in the input. + +3. ***Quality of output relative to a system-external frame of + reference***: choose this option if output quality is assessed with + reference to system-external information, such as a knowledge base, + a person’s individual writing style, or the performance of an + embedding system. E.g. Factual Accuracy assesses outputs relative to + a source of real-world knowledge. + +## Evaluation mode properties (Questions 4.2.1–4.2.3) + +Questions 4.2.1–4.2.3 record properties that are orthogonal to quality +criteria, i.e. any given quality criterion can in principle be combined +with any of the modes (although some combinations are more common than +others). + +## Question 4.2.1: Does an individual assessment involve an objective or a subjective judgment? + +*Multiple-choice options (select one)*: + +1. ***Objective***: Examples of objective assessment include any + automatically counted or otherwise quantified measurements such as + mouse-clicks, occurrences in text, etc. Repeated assessments of the + same output with an objective-mode evaluation method always yield + the same score/result. + +2. ***Subjective***: Subjective assessments involve ratings, opinions + and preferences by evaluators. Some criteria lend themselves more + readily to subjective assessments, e.g. Friendliness of a + conversational agent, but an objective measure e.g. based on lexical + markers is also conceivable. + +## Question 4.2.2: Are outputs assessed in absolute or relative terms? + +*Multiple-choice options (select one)*: + +1. ***Absolute***: choose this option if evaluators are shown outputs + from a single system during each individual assessment. + +2. ***Relative***: choose this option if evaluators are shown outputs + from multiple systems at the same time during assessments, typically + ranking or preference-judging them. + +## Question 4.2.3: Is the evaluation intrinsic or extrinsic? + +*Multiple-choice options (select one)*: + +1. ***Intrinsic***: Choose this option if quality of outputs is + assessed *without* considering their *effect* on something external + to the system, e.g. the performance of an embedding system or of a + user at a task. + +2. ***Extrinsic***: Choose this option if quality of outputs is + assessed in terms of their *effect* on something external to the + system such as the performance of an embedding system or of a user + at a task. + +## Response elicitation (Questions 4.3.1–4.3.11) + +Questions 4.3.1–4.3.11 record information about how responses are +elicited for the quality criterion this section is being completed for. + +## Question 4.3.1: What do you call the quality criterion in explanations/interfaces to evaluators? Enter ‘N/A’ if criterion not named. + +*What to enter in the text box*: the name you use to refer to the +quality criterion in explanations and/or interfaces created for +evaluators. Examples of quality criterion names include Fluency, +Clarity, Meaning Preservation. 
If no name is used, state ‘N/A.’
+
+## Question 4.3.2: What definition do you give for the quality criterion in explanations/interfaces to evaluators? Enter ‘N/A’ if no definition given.
+
+*What to enter in the text box*: Copy and paste the verbatim definition
+you give to evaluators to explain the quality criterion they’re
+assessing. If you don’t explicitly call it a definition, enter the
+nearest thing to a definition you give them. If you don’t give any
+definition, state ‘N/A.’
+
+## Question 4.3.3: Size of scale or other rating instrument (i.e. how many different possible values there are). Answer should be an integer or ‘continuous’ (if it’s not possible to state how many possible responses there are). Enter ‘N/A’ if there is no rating instrument.
+
+*What to enter in the text box*: The number of different response values
+for this quality criterion. E.g. for a 5-point Likert scale, the size to
+enter is 5. For two-way forced-choice preference judgments, it is 2; if
+there’s also a no-preference option, enter 3. For a slider that is
+mapped to 100 different values for the purpose of recording assessments,
+the size to enter is 100. If no rating instrument is used (e.g. when
+evaluation gathers post-edits or qualitative feedback only), enter
+‘N/A.’
+
+## Question 4.3.4: List or range of possible values of the scale or other rating instrument. Enter ‘N/A,’ if there is no rating instrument.
+
+*What to enter in the text box*: list, or give the range of, the
+possible values of the rating instrument. The list or range should be of
+the size specified in Question 4.3.3. If there are too many to list, use
+a range. E.g. for two-way forced-choice preference judgments, the list
+entered might be *A better, B better*; if there’s also a no-preference
+option, the list might be *A better, B better, neither*. For a slider
+that is mapped to 100 different values for the purpose of recording
+assessments, the range *1–100* might be entered. If no rating instrument
+is used (e.g. when evaluation gathers post-edits or qualitative feedback
+only), enter ‘N/A.’
+
+## Question 4.3.5: How is the scale or other rating instrument presented to evaluators? If none match, select ‘Other’ and describe.
+
+*Multiple-choice options (select one)*:
+
+1.  ***Multiple-choice options***: choose this option if evaluators
+    select exactly one of multiple options.
+
+2.  ***Check-boxes***: choose this option if evaluators select any
+    number of options from multiple given options.
+
+3.  ***Slider***: choose this option if evaluators move a pointer on a
+    slider scale to the position corresponding to their assessment.
+
+4.  ***N/A (there is no rating instrument)***: choose this option if
+    there is no rating instrument.
+
+5.  ***Other (please specify)***: choose this option if there is a
+    rating instrument, but none of the above adequately describe the way
+    you present it to evaluators. Use the text box to describe the
+    rating instrument and link to a screenshot.
+
+## Question 4.3.6: If there is no rating instrument, describe briefly what task the evaluators perform (e.g. ranking multiple outputs, finding information, playing a game, etc.), and what information is recorded. Enter ‘N/A’ if there is a rating instrument.
+
+*What to enter in the text box*: If (and only if) there is no rating
+instrument, i.e. you entered ‘N/A’ for Questions 4.3.3–4.3.5, describe
+the task evaluators perform in this space. Otherwise, enter ‘N/A’ here
+if there *is* a rating instrument. 
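+
+To make the consistency constraint linking Questions 4.3.3 and 4.3.4
+concrete, here is a small hypothetical sketch (not part of the HEDS form
+itself) that records one rating instrument as a data structure; the
+final check mirrors the requirement that the list of values given for
+Question 4.3.4 has the size stated for Question 4.3.3:
+
+```python
+# Hypothetical record of one rating instrument (illustrative only).
+rating_instrument = {
+    "size": 5,                                  # Question 4.3.3
+    "values": [1, 2, 3, 4, 5],                  # Question 4.3.4
+    "end_labels": {1: "strongly disagree",      # optional verbal anchors
+                   5: "strongly agree"},
+    "presentation": "multiple-choice options",  # Question 4.3.5
+}
+
+# The values listed for 4.3.4 must match the size given for 4.3.3.
+assert len(rating_instrument["values"]) == rating_instrument["size"]
+```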
+
+## Question 4.3.7: What is the verbatim question, prompt or instruction given to evaluators (visible to them during each individual assessment)?
+
+*What to enter in the text box*: Copy and paste the verbatim text that
+evaluators see during each assessment, which is intended to convey the
+evaluation task to them. E.g. *Which of these texts do you prefer?* Or
+*Make any corrections to this text that you think are necessary in order
+to improve it to the point where you would be happy to provide it to a
+client.*
+
+## Question 4.3.8: Form of response elicitation. If none match, select ‘Other’ and describe.
+
+*Multiple-choice options (select one)*:[3]
+
+1.  ***(dis)agreement with quality statement***: Participants specify
+    the degree to which they agree with a given quality statement by
+    indicating their agreement on a rating instrument. The rating
+    instrument is labelled with degrees of agreement and can
+    additionally have numerical labels. E.g. *This text is fluent —
+    1=strongly disagree...5=strongly agree*.
+
+2.  ***direct quality estimation***: Participants are asked to provide a
+    rating using a rating instrument, which typically (but not always)
+    mentions the quality criterion explicitly. E.g. *How fluent is this
+    text? — 1=not at all fluent...5=very fluent*.
+
+3.  ***relative quality estimation (including ranking)***: Participants
+    evaluate two or more items in terms of which is better. E.g. *Rank
+    these texts in terms of fluency*; *Which of these texts is more
+    fluent?*; *Which of these items do you prefer?*.
+
+4.  ***counting occurrences in text***: Evaluators are asked to count
+    how many times some type of phenomenon occurs, e.g. the number of
+    facts contained in the output that are inconsistent with the input.
+
+5.  ***qualitative feedback (e.g. via comments entered in a text
+    box)***: Typically, these are responses to open-ended questions in a
+    survey or interview.
+
+6.  ***evaluation through post-editing/annotation***: Choose this option
+    if the evaluators’ task consists of editing or inserting annotations
+    in text. E.g. evaluators may perform error correction and edits are
+    then automatically measured to yield a numerical score.
+
+7.  ***output classification or labelling***: Choose this option if
+    evaluators assign outputs to categories. E.g. *What is the overall
+    sentiment of this piece of text? — Positive/neutral/negative.*
+
+8.  ***user-text interaction measurements***: choose this option if
+    participants in the evaluation experiment interact with a text in
+    some way, and measurements are taken of their interaction.
+    E.g. reading speed, eye movement tracking, comprehension questions,
+    etc. Excludes situations where participants are given a task to
+    solve and their performance is measured, which comes under the next
+    option.
+
+9.  ***task performance measurements***: choose this option if
+    participants in the evaluation experiment are given a task to
+    perform, and measurements are taken of their performance at the
+    task. E.g. task is finding information, and task performance
+    measurement is task completion speed and success rate.
+
+10. ***user-system interaction measurements***: choose this option if
+    participants in the evaluation experiment interact with a system in
+    some way, while measurements are taken of their interaction.
+    E.g. duration of interaction, hyperlinks followed, number of likes,
+    or completed sales.
+
+11. 
+
+## Question 4.3.9: How are raw responses from participants aggregated or otherwise processed to obtain reported scores for this quality criterion? State if no scores reported.
+
+*What to enter in the text box*: Normally, a set of separate assessments
+is collected from evaluators and converted to the results as reported.
+Describe here the method(s) used in the conversion(s). E.g.
+macro-averages or micro-averages are computed from numerical scores to
+provide summary, per-system results.
+
+## Question 4.3.10: Method(s) used for determining effect size and significance of findings for this quality criterion.
+
+*What to enter in the text box*: A list of the methods used for
+calculating the effect size and significance of any results for this
+quality criterion, as reported in the paper given in Question 1.1. If
+none were calculated, state ‘None’.
+
+## Question 4.3.11: Has the inter-annotator and intra-annotator agreement between evaluators for this quality criterion been measured? If yes, what method was used, and what are the agreement scores?
+
+*What to enter in the text box*: The methods used to compute, and the
+results obtained from, any measures of inter-annotator and
+intra-annotator agreement obtained for this quality criterion.
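+
+To illustrate Questions 4.3.9–4.3.11, here is a short Python sketch
+(purely illustrative, not part of the datasheet template) of one way raw
+ratings might be aggregated, compared for significance, and checked for
+inter-annotator agreement. All systems, evaluators and scores are
+invented, and the sign-flip permutation test and Cohen’s kappa used here
+are just two of many methods you might report.
+
+```python
+from statistics import mean
+import random
+
+# ratings[system][evaluator] -> scores (1-5 scale), one per output item
+ratings = {
+    "system_A": {"eval_1": [4, 5, 3, 4], "eval_2": [4, 4, 3, 5]},
+    "system_B": {"eval_1": [2, 3, 3, 2], "eval_2": [3, 3, 2, 2]},
+}
+
+# Question 4.3.9: aggregation, here a macro-average over evaluators
+def system_score(by_evaluator):
+    return mean(mean(scores) for scores in by_evaluator.values())
+
+# Question 4.3.10: two-sided sign-flip permutation test on paired
+# per-item score differences (one possible significance test)
+def permutation_p(xs, ys, n_resamples=10_000, seed=0):
+    rng = random.Random(seed)
+    diffs = [x - y for x, y in zip(xs, ys)]
+    observed = abs(mean(diffs))
+    hits = sum(
+        abs(mean(d if rng.random() < 0.5 else -d for d in diffs)) >= observed
+        for _ in range(n_resamples)
+    )
+    return hits / n_resamples
+
+# Question 4.3.11: Cohen's kappa for two evaluators rating the same items
+def cohens_kappa(a, b):
+    n = len(a)
+    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
+    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
+    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
+
+per_item = {s: [mean(col) for col in zip(*e.values())] for s, e in ratings.items()}
+print({s: round(system_score(e), 2) for s, e in ratings.items()})
+print("p =", permutation_p(per_item["system_A"], per_item["system_B"]))
+print("kappa(A) =", round(cohens_kappa(*ratings["system_A"].values()), 2))
+```
+
+In practice you would more likely use established implementations
+(e.g. Krippendorff’s alpha, bootstrap resampling) from standard
+statistics packages; the point is that whatever processing is applied to
+the raw responses, these questions ask you to document it.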
+
+The section ends with the question **Is there another quality criterion
+in the evaluation experiment that you haven’t completed this section for
+yet?** If **Yes** is selected, please copy this section and complete it
+for the next criterion. If **No**, the next section is the Ethics
+section below.
+
+# Ethics
+
+The questions in this section relate to ethical aspects of the
+evaluation. Information can be entered in the text box provided, and/or
+by linking to a source where complete information can be found.
+
+## Question 5.1: Has the evaluation experiment this sheet is being completed for, or the larger study it is part of, been approved by a research ethics committee? If yes, which research ethics committee?
+
+*What to enter in the text box*: Typically, research organisations,
+universities and other higher-education institutions require some form
+of ethical approval before experiments involving human participants,
+however innocuous, are permitted to proceed. Please provide here the
+name of the body that approved the experiment, or state ‘No’ if approval
+has not (yet) been obtained.
+
+## Question 5.2: Do any of the system outputs (or human-authored stand-ins) evaluated, or do any of the responses collected, in the experiment contain personal data (as defined in GDPR Art. 4, §1: https://gdpr.eu/article-4-definitions/)? If yes, describe data and state how addressed.
+
+*What to enter in the text box*: State ‘No’ if no personal data as
+defined by GDPR was recorded or collected; otherwise explain how
+conformity with GDPR requirements such as privacy and security was
+ensured, e.g. by linking to the (successful) application for ethics
+approval from Question 5.1.
+
+## Question 5.3: Do any of the system outputs (or human-authored stand-ins) evaluated, or do any of the responses collected, in the experiment contain special category information (as defined in GDPR Art. 9, §1: https://gdpr.eu/article-9-processing-special-categories-of-personal-data-prohibited/)? If yes, describe data and state how addressed.
+
+*What to enter in the text box*: State ‘No’ if no special-category data
+as defined by GDPR was recorded or collected; otherwise explain how
+conformity with GDPR requirements relating to special-category data was
+ensured, e.g. by linking to the (successful) application for ethics
+approval from Question 5.1.
+
+## Question 5.4: Have any impact assessments been carried out for the evaluation experiment, and/or any data collected/evaluated in connection with it? If yes, summarise approach(es) and outcomes.
+
+*What to enter in the text box*: Use this box to describe any *ex ante*
+or *ex post* impact assessments that have been carried out in relation
+to the evaluation experiment, such that the assessment plan and process,
+as well as the outcomes, were captured in written form. Link to
+documents if possible. Types of impact assessment include data
+protection impact assessments, e.g. under GDPR.[4] Environmental and
+social impact assessment frameworks are also available.
+
+# Credits
+
+Questions 2.1–2.5, relating to the evaluated system, and Questions
+4.3.1–4.3.8, relating to response elicitation, are based on Howcroft et
+al. (2020), with some significant changes. Questions 4.1.1–4.2.3,
+relating to quality criteria, and some of the questions about system
+outputs, evaluators, and experimental design (3.1.1–3.2.3, 4.3.5, 4.3.6,
+4.3.9–4.3.11) are based on Belz et al. (2020). HEDS was also informed by
+van der Lee et al. (2019, 2021) and by the data card guide of Gehrmann
+et al. (2021).[5]
+
+More generally, the original inspiration for creating a ‘datasheet’ for
+describing human evaluation experiments of course comes from the seminal
+papers by Bender & Friedman (2018), Mitchell et al. (2019) and Gebru et
+al. (2020).
+
+# References
+
+Gehrmann, S., Adewumi, T., Aggarwal, K., Ammanamanchi, P. S.,
+Anuoluwapo, A., Bosselut, A., Chandu, K. R., Clinciu, M., Das, D.,
+Dhole, K. D., Du, W., Durmus, E., Dušek, O., Emezue, C., Gangal, V.,
+Garbacea, C., Hashimoto, T., Hou, Y., Jernite, Y., … Zhou, J. (2021).
+*The GEM benchmark: Natural language generation, its evaluation and
+metrics*.
+
+Shimorina, A., & Belz, A. (2021). *The human evaluation datasheet 1.0: A
+template for recording details of human evaluation experiments in NLP*.
+
+van der Lee, C., Gatt, A., van Miltenburg, E., & Krahmer, E. (2021).
+Human evaluation of automatically generated text: Current trends and
+best practice guidelines. *Computer Speech & Language*, *67*, 101151.
+https://doi.org/10.1016/j.csl.2020.101151
+
+Belz, A., Mille, S., & Howcroft, D. M. (2020). Disentangling the
+properties of human evaluation methods: A classification system to
+support comparability, meta-evaluation and reproducibility testing.
+*Proceedings of the 13th International Conference on Natural Language
+Generation*, 183–194.
+
+Howcroft, D. M., Belz, A., Clinciu, M.-A., Gkatzia, D., Hasan, S. A.,
+Mahamood, S., Mille, S., van Miltenburg, E., Santhanam, S., & Rieser, V.
+(2020). Twenty years of confusion in human evaluation: NLG needs
+evaluation sheets and standardised definitions. *Proceedings of the 13th
+International Conference on Natural Language Generation*, 169–182.
+
+Card, D., Henderson, P., Khandelwal, U., Jia, R., Mahowald, K., &
+Jurafsky, D. (2020). With little power comes great responsibility.
+*Proceedings of the 2020 Conference on Empirical Methods in Natural
+Language Processing (EMNLP)*, 9263–9274.
+
+Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H.,
+Daumé III, H., & Crawford, K. (2020). *Datasheets for datasets*.
+
+Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L.,
+Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model
+cards for model reporting. *Proceedings of the Conference on Fairness,
+Accountability, and Transparency*, 220–229.
+
+van der Lee, C., Gatt, A., van Miltenburg, E., Wubben, S., & Krahmer, E.
+(2019). Best practices for the human evaluation of automatically
+generated text. *Proceedings of the 12th International Conference on
+Natural Language Generation*, 355–368.
+
+Bender, E. M., & Friedman, B. (2018). Data statements for natural
+language processing: Toward mitigating system bias and enabling better
+science. *Transactions of the Association for Computational
+Linguistics*, *6*, 587–604.
+
+Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K.,
+Hermjakob, U., Knight, K., Koehn, P., Palmer, M., & Schneider, N.
+(2013). Abstract Meaning Representation for sembanking. *Proceedings of
+the 7th Linguistic Annotation Workshop and Interoperability with
+Discourse*, 178–186.
+
+Kamp, H., & Reyle, U. (2013). *From discourse to logic: Introduction to
+modeltheoretic semantics of natural language, formal logic and discourse
+representation theory* (Vol. 42). Springer Science & Business Media.
+
+[1] We use the term ‘human generation’ where the items being evaluated
+have been created manually, rather than generated by an automatic
+system.
+
+[2]
+
+[3] Explanations adapted from Howcroft et al. (2020).
+
+[4]
+
+[5]