\section{Impact of annotations on supervised polarity classification}
\label{sect:classifier}
This section evaluates the impact of the AMT-generated annotations on a supervised polarity classification task.
To this end, a comparative evaluation between two groups of polarity classifiers is conducted:
baseline or reference classifiers trained on the noisy metadata available in the original corpus, and
contrastive classifiers trained on the AMT-generated annotations. Note that only the annotations collected in HIT1,
and not those of HIT2 and HIT3, are used for classifier training.
Although more sophisticated classification schemes could be devised for this task, a simple SVM-based binary supervised classification approach is adopted here.
\subsection{Description of datasets}
\label{datasets}
As was mentioned in Section \ref{sect:design}, all sentences were extracted from a corpus of user opinions on cars from the automotive section of \texttt{www.ciao.es} (Spanish). For conducting the experimental evaluation, three different datasets were considered:
\begin{enumerate}
\item Baseline: the dataset used for training the baseline or reference classifiers.
Annotations for this dataset were obtained automatically with the following naive approach: sentences extracted from
comments with a rating\footnote{The corpus at \texttt{www.ciao.es} contains consumer opinions marked with a score between 1 (negative) and 5 (positive).} of 5 were assigned to category `positive', sentences extracted from comments with a rating
of 3 were assigned to `neutral', and sentences extracted from comments with a rating of 1 were assigned to
`negative' (this labeling rule is sketched after the list). This dataset contains a total of 5570 sentences, with a vocabulary of 11797 words.
\item Annotated: the dataset that was manually annotated by AMT workers in HIT1.
This dataset is used for training the contrastive classifiers, which are compared with the baseline systems.
The three independent annotations generated by AMT workers for each sentence in this dataset were consolidated into a single annotation
by majority voting (see the sketch after this list): if the three provided annotations happened to be
different\footnote{This kind of total disagreement among annotators occurred in only 13 sentences out of 1000.},
the sentence was assigned to category `neutral'; otherwise, the sentence was assigned to the category on which
at least two annotators agreed. This dataset contains a total of 1000 sentences, with a vocabulary
of 3022 words.
\item Evaluation: the gold standard used for evaluating classifier performance.
This dataset was manually annotated by three experts working independently. The gold standard annotation
was consolidated by using the same majority-voting criterion as for the previous dataset\footnote{In this case,
inter-annotator agreement was above 80\%, and total disagreement among annotators occurred in only 1 sentence
out of 500.}. This dataset contains a total of 500 sentences, with a vocabulary of 2004 words.
\end{enumerate}
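For illustration purposes, the rating-based labeling used for the baseline dataset and the majority-voting
consolidation used for the annotated and evaluation datasets can be sketched as follows. This is a minimal
Python sketch; the function and variable names are illustrative assumptions and do not correspond to the
actual implementation.
\begin{verbatim}
def label_from_rating(rating):
    # Naive labeling of the baseline dataset from Ciao comment ratings.
    if rating == 5:
        return "positive"
    if rating == 3:
        return "neutral"
    if rating == 1:
        return "negative"
    return None  # comments rated 2 or 4 are not used

def consolidate(labels):
    # Majority vote over the three annotations collected for one sentence.
    for label in set(labels):
        if labels.count(label) >= 2:
            return label
    return "neutral"  # total disagreement: assign to 'neutral'
\end{verbatim}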
These three datasets were constructed by randomly extracting sample sentences from an original corpus
of over 25000 user comments containing more than 1000000 sentences in total. The sampling was conducted
under the following constraints: (i) the three resulting datasets must not overlap, (ii) only sentences
containing more than 3 tokens were considered, and (iii) each resulting dataset had to be balanced, as far
as possible, in terms of the number of sentences per category. Table \ref{tc_corpus} presents the
distribution of sentences per category for each of the three datasets.
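This sampling procedure can be summarized with the following minimal Python sketch; the data structures and
the helper name are illustrative assumptions rather than the actual implementation, and in practice the
resulting datasets were only approximately balanced.
\begin{verbatim}
import random

def draw_dataset(sentences, per_class, used_ids, min_tokens=4):
    # `sentences` is a list of (id, text, label) tuples; `used_ids` keeps
    # track of sentences already assigned to another dataset so that the
    # three datasets do not overlap. Only sentences with more than 3
    # tokens are eligible, and each class is sampled in equal amounts.
    dataset = []
    for label in ("positive", "negative", "neutral"):
        pool = [s for s in sentences
                if s[2] == label
                and s[0] not in used_ids
                and len(s[1].split()) >= min_tokens]
        chosen = random.sample(pool, per_class)
        used_ids.update(s[0] for s in chosen)
        dataset.extend(chosen)
    return dataset
\end{verbatim}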
\begin{table}
\begin{tabular}{|l|l|l|l|}
\hline
&Baseline &Annotated &Evaluation \\
\hline
Positive &1882 &341 &200 \\
\hline
Negative &1876 &323 &137 \\
\hline
Neutral &1812 &336 &161 \\
\hline
Totals &5570 &1000 &500 \\
\hline
\end{tabular}
\caption{Sentence-per-category distributions for baseline, annotated and evaluation datasets.}
\label{tc_corpus}
\end{table}
\subsection{Experimental settings}
As mentioned above, a simple SVM-based supervised classification approach was adopted for the
polarity detection task. Accordingly, two different groups of classifiers were
trained: a baseline or reference group, and a contrastive group. Classifiers in these two groups were
trained with data samples extracted from the baseline and annotated datasets, respectively. Within each group
of classifiers, three different binary classification subtasks were considered: positive/not\_positive,
negative/not\_negative and neutral/not\_neutral. All trained binary classifiers were evaluated by computing
precision and recall for each considered category, as well as overall classification accuracy, over the
evaluation dataset.
A feature space representation of the data was constructed by using the standard bag-of-words approach.
In this way, a sparse vector was obtained for each sentence in the datasets. Stop-word removal was not
conducted before computing the vector models, and standard normalization and TF-IDF weighting schemes were used.
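As an illustration, such a representation can be obtained with a standard toolkit. The following minimal
Python sketch uses scikit-learn, which is an assumption on our part since the text does not name the
software employed; the sentence lists are placeholders.
\begin{verbatim}
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder sentence lists; in practice these are the raw sentences of
# the training and evaluation datasets.
train_sentences = ["el coche consume poco", "el motor es muy ruidoso"]
eval_sentences = ["la suspension es comoda"]

# Bag-of-words with TF-IDF weighting and L2 normalization; no stop-word
# removal is applied, in line with the setup described above.
vectorizer = TfidfVectorizer(stop_words=None, norm="l2")
X_train = vectorizer.fit_transform(train_sentences)  # sparse vectors
X_eval = vectorizer.transform(eval_sentences)
\end{verbatim}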
Multiple-fold cross-validation was used in all experiments to account for the statistical variability of the
data. In this sense, twenty independent realizations were conducted for each experiment presented and,
instead of individual output results, mean values and standard deviations of the evaluation metrics are reported.
Each binary classifier realization was trained with a random subsample of 600 sentences extracted from
the training dataset corresponding to the classifier group, i.e. the baseline dataset for reference systems
and the annotated dataset for contrastive systems. Training subsamples were always balanced with respect to
the original three categories: `positive', `negative' and `neutral'.
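The overall training and evaluation protocol for one binary subtask can be sketched as follows. This is a
minimal Python/scikit-learn sketch under the assumptions stated above; in particular, a linear-kernel SVM is
assumed, since the kernel is not specified in the text, and all names are illustrative.
\begin{verbatim}
import random
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def run_subtask(train_data, eval_data, target, n_runs=20, per_class=200):
    # One binary subtask, e.g. target="positive" for positive/not_positive.
    # `train_data` and `eval_data` are lists of (sentence, label) pairs
    # with labels in {"positive", "negative", "neutral"}.
    accuracies = []
    for _ in range(n_runs):
        # Balanced random subsample of 3 x 200 = 600 training sentences.
        sample = []
        for label in ("positive", "negative", "neutral"):
            pool = [d for d in train_data if d[1] == label]
            sample.extend(random.sample(pool, per_class))
        texts = [text for text, _ in sample]
        y = [1 if label == target else 0 for _, label in sample]

        classifier = make_pipeline(TfidfVectorizer(norm="l2"), LinearSVC())
        classifier.fit(texts, y)

        y_true = [1 if label == target else 0 for _, label in eval_data]
        y_pred = classifier.predict([text for text, _ in eval_data])
        accuracies.append(accuracy_score(y_true, y_pred))
    # Per-class precision and recall can be computed analogously with
    # sklearn.metrics.precision_score / recall_score.
    return np.mean(accuracies), np.std(accuracies)
\end{verbatim}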
\subsection{Results and discussion}
Table \ref{tc_pre_rec} presents the resulting mean values of precision and recall for each considered class,
for classifiers trained with either the baseline or the annotated dataset. As seen in the table, with the
exception of recall for category `negative' and precision for category `not\_negative', both metrics improve substantially
when the annotated dataset is used for training the classifiers. The largest improvements
are observed for `neutral' precision and recall.
\begin{table}
\begin{tabular}{|l|l|l|}
\hline
class &precision &recall \\
\hline
positive &50.10 (3.79) &62.00 (7.47) \\
&60.21 (2.07) &71.00 (2.18) \\
\hline
not\_positive &69.64 (2.70) &58.05 (7.54) \\
&77.95 (1.32) &68.54 (2.75) \\
\hline
negative &35.25 (2.63) &53.46 (10.55) \\
&39.07 (1.78) &55.52 (3.26) \\
\hline
not\_negative &78.04 (2.19) &62.62 (6.76) \\
&79.73 (1.10) &66.87 (2.31) \\
\hline
neutral &32.51 (3.02) &48.03 (7.33) \\
&44.72 (2.00) &67.12 (2.96) \\
\hline
not\_neutral &68.17 (2.65) &52.81 (3.84) \\
&79.41 (1.58) &60.40 (2.96) \\
\hline
\end{tabular}
\caption{Mean precision and recall over 20 independent simulations (standard deviations in parentheses)
for each considered class, for classifiers trained with either the baseline dataset (upper values) or the annotated dataset (lower values).}
\label{tc_pre_rec}
\end{table}
Table \ref{tc_accu} presents the resulting mean values of accuracy for each considered subtask,
for classifiers trained with either the baseline or the annotated dataset. As seen in the table,
all subtasks benefit from using the annotated dataset for training the classifiers; however, it is
worth noting that while similar absolute gains are observed for the `positive/not\_positive'
and `neutral/not\_neutral' subtasks, the `negative/not\_negative' subtask
gains much less than the other two.
\begin{table}
\begin{tabular}{|l|l|l|}
\hline
classifier &baseline &annotated \\
\hline
positive/not\_positive &59.63 (3.04) &69.53 (1.70) \\
\hline
negative/not\_negative &60.09 (2.90) &63.73 (1.60) \\
\hline
neutral/not\_neutral &51.27 (2.49) &62.57 (2.08) \\
\hline
\end{tabular}
\caption{Mean accuracy over 20 independent simulations (standard deviations in parentheses)
for each classification subtask, for classifiers trained with either the baseline or the annotated dataset.}
\label{tc_accu}
\end{table}
After considering all evaluation metrics, the benefit provided by the availability of human-annotated data
for categories `neutral' and `positive' is evident. However, in the case of category `negative', although some
gain is also observed, the benefit of human-annotated data does not seem to be as large as for the other two
categories. This, along with the fact that the `negative/not\_negative' subtask is actually the best performing
one (in terms of accuracy) when baseline training data is used, might suggest that low-rating comments contain
a better representation of sentences belonging to category `negative' than medium- and high-rating comments do with
respect to classes `neutral' and `positive'.
In any case, this experimental work verifies the feasibility of using AMT to construct training datasets for
opinionated content analysis, and it provides an approximate idea of the costs involved in generating
this type of resource.