
Riposte! A Large Corpus of Counter-Arguments


Paul Reisert   Benjamin Heinzerling   Naoya Inoue   Shun Kiyono   Kentaro Inui  
 RIKEN Center for Advanced Intelligence Project   Tohoku University
{paul.reisert,benjamin.heinzerling,shun.kiyono}@riken.jp
{naoya-i,inui}@ecei.tohoku.ac.jp

Abstract

Constructive feedback is an effective method for improving critical thinking skills. Counter-arguments (CAs), one form of constructive feedback, have been shown to be useful for developing critical thinking skills. However, little work has been done on constructing a large-scale corpus of CAs that could drive research on automatic generation of CAs for fallacious micro-level arguments (i.e. a single claim and premise pair). In this work, we cast providing constructive feedback as a natural language processing task and, towards this goal, create Riposte!, a corpus of CAs. Produced by crowdworkers, Riposte! contains over 18k CAs. We instruct workers to first identify common fallacy types and then produce a CA that addresses the fallacy. We analyze how workers create CAs and construct a baseline model based on our analysis.

Introduction

Critical thinking is a crucial skill necessary for valid reasoning, especially for students in a pedagogical context. To improve students' critical thinking skills, educators evaluate the contents of a work and provide constructive feedback (i.e. criticism) to the student. Although such methods are effective, they require educators to carefully evaluate the contents of an essay, which is time-consuming and varies depending on an educator's own critical thinking skills.

Figure 1: CAs in Riposte! produced by crowdworkers. The fallacy type selected by a worker is shown in parentheses.
Fallacy Type | Definition | Template
Begging the Question | The truth of the premise is already assumed by the claim. | "If [something] is assumed to be true, then [something else] is already assumed to be true."
Hasty Generalization | Someone assumes something is generally always the case based on a few instances. | "It's too hasty to assume that [text]."
Questionable Cause | The cause of an effect is questionable. | "There is a questionable cause in the argument because [questionable cause] does/will not cause [effect]."
Red Herring | Someone reverts attention away from the original claim by changing the topic. | "The topic being discussed is [first topic], but it is being changed to [second topic]."
Table 1: Definitions and templates of the fallacy types used in our experiments.

In the field of educational research, the usefulness of identifying fallacies and counter-arguments, henceforth CAs, as constructive feedback has been emphasized (de2008constructive; oktavia2014analysis; indah2015fallacies; song2013teaching), as both can help writers produce high-quality arguments while simultaneously improving their critical thinking skills. Figure 1 shows an example of an argument with a fallacy (i.e. an error in the logical reasoning of the argument) and its CAs (i.e. attacks on the argument). In the field of NLP, previous work has addressed fallacy identification (HABERNAL18.494), CA retrieval (wachsmuth2017computational), and CA generation for macro-level arguments (hua2018neural), and has evaluated essay criteria such as thesis clarity (persing2013modeling), argument strength (persing2015modeling), and stance (persing2016modeling). However, in the pedagogical context, macro-level arguments (e.g., an essay) may consist of several micro-level arguments (i.e. one claim/premise pair) that can each contain multiple fallacies. To bridge this gap, we create CAs for micro-level arguments, which can be useful for automatic constructive feedback generation.

Several challenges exist for creating a corpus of CAs for constructive feedback. First, the corpus must contain a variety of different topics and arguments in order to both train and evaluate a model on unseen topics. Second, an argument can have many different fallacies which are not easily identifiable (oktavia2014analysis; indah2015fallacies; el2017logical). Third, producing CAs is costly and time-consuming.

In this work, we design a task for automatic constructive feedback and create Riposte!, a large-scale corpus of CAs, via crowdsourcing. Workers are first instructed to identify fallacy types common in educational research (begging the question, hasty generalization, questionable cause, and red herring; de2008constructive; oktavia2014analysis; indah2015fallacies; song2013teaching) and then create a CA for micro-level arguments. In total, we collect 18,887 CAs (see Figure 1 for examples of CAs in Riposte!). We then cast automatic constructive feedback as a text generation task and create a baseline model.

Criteria | Begging the Question | Hasty Generalization | Questionable Cause | Red Herring | Total
Unsure | 2,043 | 315 | 2,136 | 1,879 | 6,373
CAs (FS) | 3,365 | 3,818 | 2,121 | 1,772 | 11,076
CAs (O) | 907 | 2,182 | 2,058 | 2,664 | 7,811
CAs (total) | 4,272 | 6,000 | 4,179 | 4,436 | 18,887

Table 2: Full statistics of the Riposte! corpus, where FS represents fallacy-specific CAs and O represents other.

The Riposte! corpus

In this section, we describe the construction of Riposte! and examine whether training data for automatic constructive feedback can easily be created. To the best of our knowledge, this is the first work to address corpus construction for automatic constructive feedback.

2.1 Counter-arguments as an NLP task

When designing a task for automatic constructive feedback, one must take real-world situations into account. In the pedagogical context, educators may choose the same topic for students annually; in that case, they may use a pretrained, supervised model for a single topic with editable background knowledge (i.e., educators can choose which knowledge is necessary to automatically construct feedback). On the other hand, educators may choose a new topic each year, so a model conditioned on multiple topics should also be considered. The input to a model should be a topic and several claim and premise argument pairs, and the output should be a set of CAs useful for improving the argument.
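To make the task I/O concrete, here is a minimal sketch of the input/output structure described above; the class and field names are illustrative assumptions, not the authors' actual data format.

```python
from dataclasses import dataclass
from typing import List

# A minimal sketch of the task I/O described above. The class and field
# names are illustrative assumptions, not the authors' actual data format.
@dataclass
class MicroArgument:
    topic: str    # essay topic (reused annually or newly introduced)
    claim: str    # the claim of a single micro-level argument
    premise: str  # the premise supporting the claim

@dataclass
class FeedbackExample:
    argument: MicroArgument
    counter_arguments: List[str]  # target CAs useful for improving the argument
```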

2.2 Existing corpus of arguments

When training a model for constructive feedback, the data should consist of many CAs for a wide variety of topics. We use the Argument Reasoning Comprehension (ARC) dataset Habernal.et.al.2018.NAACL.ARCT, a corpus of 1,263 unique topic-claim-premise pairs (172 unique topics and 264 unique claims). We assume the arguments in ARC contain many fallacies because they were created by non-expert crowdworkers (i.e., workers are not experts in the field of argumentation).

2.3 Riposte! creation

For creating Riposte!, we use the crowdsourcing platform Amazon Mechanical Turk (https://www.mturk.com/).

Data Collection

One challenge in collecting training data for automatic constructive feedback is that the CAs should be useful for improving an argument. To collect such CAs, we adopt reisert2019annotation's protocol for collecting CAs via crowdsourcing, with several modifications (see Appendix). We create 4 separate crowdsourcing tasks (i.e., one for each fallacy type). For each of the 1,263 arguments in ARC, we ask 5 workers to produce a CA. For each fallacy type, we assist workers by providing them with a "fill-in-the-blank" template, instructing them to fill in text boxes for a given pattern. The fallacy types and templates are shown in Table 1; the sketch below illustrates the mechanism.
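The following sketch renders the Table 1 templates from worker-provided slot fillers; the dictionary keys and slot names are hypothetical, not taken from the authors' interface.

```python
# Illustrative rendering of the Table 1 "fill-in-the-blank" templates.
# The dictionary keys and slot names are hypothetical; workers filled
# the slots via text boxes in the crowdsourcing interface.
TEMPLATES = {
    "begging_the_question": "If {something} is assumed to be true, "
                            "then {something_else} is already assumed to be true.",
    "hasty_generalization": "It's too hasty to assume that {text}.",
    "questionable_cause": "There is a questionable cause in the argument "
                          "because {cause} does/will not cause {effect}.",
    "red_herring": "The topic being discussed is {first_topic}, "
                   "but it is being changed to {second_topic}.",
}

def render_ca(fallacy: str, **slots: str) -> str:
    """Render a counter-argument from a fallacy-specific template."""
    return TEMPLATES[fallacy].format(**slots)

# Example: render_ca("hasty_generalization", text="all children want to play sports")
```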

2.4 Riposte! statistics

The statistics of Riposte! are shown in Table 2. 11,076 of the CAs are fallacy-specific (i.e. workers first identified a fallacy and then created the CA), while 7,811 CAs were created when a worker did not believe the specified fallacy existed in the argument. 6,373 instances were labeled as unsure (i.e. the worker was unsure about the fallacy type).

How did workers create CAs?

Criteria | Begging the Question | Hasty Generalization | Questionable Cause | Red Herring | Total
Score | 0.61 | 0.17 | 0.35 | 0.36 | 0.24

Table 3: Average Jaccard similarity scores between CAs for a single argument, for each fallacy type.

When creating training data for automatic constructive feedback, CAs should be useful and diverse. We examine how workers create CAs by calculating the similarity between i) a CA and its argument and ii) the CAs produced for a single argument.

How similar is one CA to the premise-claim?

Figure 2: BLEU scores calculated between each worker-produced CA and the original argument (claim and premise), shown per fallacy type (a-d) and overall (e). The results indicate that workers used keywords directly from the argument.

To determine how annotators created their CAs, we calculate the BLEU (papineni2002bleu) score between each CA and the argument (i.e., the premise and claim). The distribution in Figure 2 indicates that in some cases workers copied keywords directly from the original argument.
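A minimal sketch of this computation follows, assuming NLTK's sentence-level BLEU with smoothing and simple whitespace tokenization; the paper does not specify these details.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Sketch of the CA-vs-argument overlap analysis behind Figure 2. The paper
# does not specify BLEU settings, so smoothing and whitespace tokenization
# here are assumptions.
def ca_argument_bleu(claim: str, premise: str, ca: str) -> float:
    reference = (claim + " " + premise).lower().split()  # the original argument
    hypothesis = ca.lower().split()                      # the worker-produced CA
    return sentence_bleu([reference], hypothesis,
                         smoothing_function=SmoothingFunction().method1)
```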

How similar are the CAs across annotators?

One design decision when building Riposte! was that, with more annotators, we could collect a wide variety of diverse CAs for a single argument regardless of the fallacy type. We first calculate the similarity of the CAs across annotators for a single argument. We tokenize the corpus using spaCy (https://spacy.io/) and remove stop words and punctuation. We then calculate the average Jaccard similarity score over all combinations of CAs per unique argument and average over all arguments. The results (see Table 3) indicate that the CAs are diverse.
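The analysis can be sketched as follows; the specific spaCy pipeline (en_core_web_sm here) is an assumption, as the paper does not name the model.

```python
from itertools import combinations
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

def content_tokens(text: str) -> set:
    # Tokenize with spaCy and drop stop words and punctuation, as described above.
    return {t.text.lower() for t in nlp(text) if not t.is_stop and not t.is_punct}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def avg_pairwise_jaccard(cas: list) -> float:
    # Average Jaccard similarity over all pairs of CAs for one argument;
    # averaging this over all arguments gives the scores in Table 3.
    token_sets = [content_tokens(ca) for ca in cas]
    pairs = list(combinations(token_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
```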

Experiment

4.1 Experimental design

In Section 3, we observed that workers copied keywords from the argument when creating a CA. Based on this observation, we experiment with different input settings to the model to better understand which parts of the argument annotators used to create their CAs (e.g., topic (T) only, premise (P) only, claim (C) only, and so forth). We cast the task of automatic constructive feedback as a generation task and experiment with these settings.

Since both new and existing essay topics can be used and introduced by educators, we consider two possible settings: i) in-domain (i.e. topics are shared between splits) and ii) out-of-domain (i.e. topics are not shared).

For our generation model, we use gold fallacy type information. (We built an LSTM-encoder multi-label classifier; its 4-way classification F1 score was 36.02%, indicating that more sophisticated features such as background knowledge and reasoning are necessary.) This allows us to understand how well the model can generate CAs when the correct fallacy types are predicted.

4.2 Data preparation

We filter out all unsure instances. We use majority vote for selecting CAs and their fallacy types. We split the data into 80% train, 10% test, and 10% dev. In each setting, we ensure that no unique claim-premise pairs are shared across splits.
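One plausible implementation of such a grouped split is sketched below; the authors do not describe their exact procedure, and `examples` is a hypothetical list of dicts with "claim" and "premise" keys.

```python
import random
from collections import defaultdict

def grouped_split(examples, seed=0):
    """Split 80/10/10 so that no unique claim-premise pair crosses splits.

    `examples` is a hypothetical list of dicts with "claim" and "premise"
    keys; the authors' exact grouping procedure is not specified.
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[(ex["claim"], ex["premise"])].append(ex)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n = len(keys)
    train = [ex for k in keys[:int(0.8 * n)] for ex in groups[k]]
    test = [ex for k in keys[int(0.8 * n):int(0.9 * n)] for ex in groups[k]]
    dev = [ex for k in keys[int(0.9 * n):] for ex in groups[k]]
    return train, test, dev
```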

For each experiment, we tokenize using spaCy and lowercase all tokens. For hasty generalization CAs, we replace the template with a special token (i.e. hg). For all other CAs, we discard the original template and add a special token between slot-fillers. This allows our baseline model to focus more on the content words found in the original argument.
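A sketch of this preprocessing step is shown below; the surface forms of the special tokens (<hg>, <sep>) are assumptions, as the paper only names the hasty generalization token "hg".

```python
# Sketch of the CA preprocessing step. The surface forms of the special
# tokens (<hg>, <sep>) are assumptions; the paper only names the hasty
# generalization token "hg".
def preprocess_ca(fallacy: str, slot_fillers: list) -> str:
    fillers = [s.lower() for s in slot_fillers]
    if fallacy == "hasty_generalization":
        # The single-slot HG template is replaced by one special token.
        return "<hg> " + fillers[0]
    # Other fallacy types: discard the template text and join the
    # slot-fillers with a special separator token.
    return " <sep> ".join(fillers)
```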

4.3 Baselines

Based on our observations in Section 4.1, we create baselines to determine which parts of an argument annotators used to create CAs and how well a model can generate a CA.

Simple Overlap (SO)

As a baseline, we calculate simple BLEU overlap between each input setting and the CA. To compare directly against our seq2seq baseline model, we calculate the BLEU scores on the preprocessed data from our seq2seq baseline model, with unknown words included.

Seq2Seq

We preprocess and train our model using fairseq (ott2019fairseq). We use pre-trained word embeddings (300-dimensional GloVe embeddings; pennington2014glove), which are useful for generation tasks (qi2018). We create two models (seq2seq-i and seq2seq-o) for the in-domain and out-of-domain settings, respectively. (For seq2seq-i and seq2seq-o, we use the best hyperparameters found for seq2seq-i (P+C) and seq2seq-o (P+C), respectively, across all settings.)
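The training setup might look like the following, using fairseq's LSTM model with the seq2seq-i hyperparameters reported in Appendix C; the data directory, embedding file, and batching settings are assumptions.

```python
import subprocess

# A hedged sketch of training seq2seq-i with fairseq's LSTM architecture,
# using the best hyperparameters reported in Appendix C. The data directory,
# embedding file, and --max-tokens setting are assumptions.
subprocess.run([
    "fairseq-train", "data-bin/riposte",          # binarized by fairseq-preprocess
    "--arch", "lstm",
    "--encoder-layers", "1", "--decoder-layers", "1",
    "--encoder-hidden-size", "256", "--decoder-hidden-size", "256",
    "--encoder-embed-dim", "300", "--decoder-embed-dim", "300",
    "--encoder-embed-path", "glove.6B.300d.txt",  # pre-trained GloVe vectors
    "--dropout", "0.4",
    "--optimizer", "sgd", "--lr", "0.01",
    "--max-tokens", "4000",
    "--save-dir", "checkpoints/seq2seq-i",
], check=True)
```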

Baseline | T | C | P | T+P+C | T+C | T+P | P+C
SO | 3.98 | 6.37 | 15.59 | 13.56 | 10.69 | 13.76 | 18.16
seq2seq-i | 12.28 | 12.31 | 5.96 | 14.54 | 12.63 | 13.37 | 16.57
seq2seq-o | 1.31 | 1.05 | 1.49 | 4.78 | 1.60 | 1.53 | 5.53

Table 4: BLEU scores of our baselines using gold fallacy type, for inputs built from the topic (T), premise (P), and claim (C).

4.4 Evaluation

Attribute | Score (GO) | α (GO) | Score (GE) | α (GE)
Strength | 2.30 | 0.20 | 1.98 | 0.20
Persuasiveness | 2.26 | 0.71 | 1.94 | 0.15
Relevance | 2.74 | 0.20 | 2.84 | 0.72

Table 5: Mean scores and agreement (Krippendorff's α) for gold (GO) and generated (GE) CAs.

We evaluate the results of our baselines using BLEU (see Table 4). Our SO results indicate that workers mainly used the premise and claim when creating CAs. We observe that seq2seq-o’s performance is low, indicating a simple model is not sufficient when unknown topics are introduced.

For evaluation, we would also like to compare the quality of gold CAs against generated CAs. We conduct an annotation study using AMT (3 workers per CA) and evaluate CA quality along 3 dimensions: Strength, Persuasiveness, and Relevance. (We use carlile2018give's guidelines, slightly modified for CAs; see the Appendix for our criteria.) In total, we show 50 arguments and their gold/generated CAs, where each argument is annotated by 3 workers. (We use 50 generated CAs from seq2seq-i (P+C).) The results are shown in Table 5. (We convert from a 5-point to a 3-point scale for score calculation.) We observed that workers found generated CAs more relevant, but weaker and less persuasive. Examples of the generated output for our best model (seq2seq-i P+C) are shown in Table 6.

Source | Reference | Hypothesis
home-schoolers should play for high school teams because all children should be able to participate in sports . | all children are to play in sports <sep> even home-schoolers will be playing sports . | all children should be able to participate in sports <sep> home-schoolers should play for high school teams .
the u.s. should lift sanctions with cuba because the embargo hurts our own economy . | the u.s . <sep> the embargo . | us sanctions <sep> our own economy .
Table 6: Examples of output from seq2seq-i (P+C); <sep> marks the special token between slot-fillers.

Conclusion and future work

In this work, we construct Riposte!, a large corpus of 18,887 crowdworker-produced CAs. Our analysis of Riposte! reveals that non-expert crowdworkers can produce reasonably diverse CAs. We cast automatic constructive feedback as a text generation task and create a baseline model.

In our future work, we will explore injecting background knowledge and reasoning into our model to generate CAs for unknown topics and provide detailed information to students about how to improve their original argumentation.

References

    Appendix A Annotation Interface and Guidelines

    We show the annotation interface used in our full-fledged crowdsourcing experiment in Figure 3. The conditions shown to workers for 3 fallacy types are shown in Figure 4. The interface for the remaining fallacy types is shown in Figure 5.

    Figure 3: Interface shown to crowdworkers for our hasty generalization full-fledged experiment.

    The guidelines shown to workers are shown in Figure 6.

    Figure 4: Conditions for rejecting a worker's responses, shown to workers for the three remaining fallacy-type experiments.
    Figure 5: Interface for the remaining fallacy types.
    Figure 6: Guidelines shown to crowdworkers.
    Figure 7: CAs produced for a single argument (hasty generalization) with perfect annotator agreement. All 5 workers agreed the fallacy existed.

    Appendix B Crowdsourcing settings

    For our full-fledged experiment, we use the following settings: workers were required to have at least 100 approved Human Intelligence Tasks (HITs) and a HIT Approval Rate of at least 96%. For each HIT, workers were rewarded $0.20 (in the case of hasty generalization, $0.10). An example of the guidelines for one fallacy type (questionable cause) is shown in Figure 6. For each of our experiments, the settings are as follows. If workers selected no or unsure, they were required to provide a CA or a reason, respectively. We informed workers that their work would be rejected if one or more of the following conditions were met: the CA is i) blank, ii) not a sentence, iii) a direct copy-paste of the original argument or of the guidelines, or iv) not written in English. We manually rejected responses that met these criteria.
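    For illustration, these rejection conditions could be auto-flagged with a crude filter like the one below; the authors rejected responses manually, and the length and ASCII heuristics are our own stand-ins for conditions ii) and iv).

```python
# Illustrative auto-flagging of the rejection conditions above. The authors
# rejected responses manually; the length and ASCII heuristics below are
# crude stand-ins for "not a sentence" and "not written in English".
def should_reject(ca: str, argument: str, guidelines: str) -> bool:
    text = ca.strip()
    if not text:                                  # i) blank
        return True
    if len(text.split()) < 3:                     # ii) too short to be a sentence
        return True
    if text.lower() in (argument.lower(), guidelines.lower()):
        return True                               # iii) direct copy-paste
    if any(ord(c) > 127 for c in text):           # iv) rough non-English check
        return True
    return False
```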

    Appendix C Model Hyperparameters

    Hyperparameter | Values
    dropout | 0.1, 0.2, 0.3, 0.4, 0.5
    encoder/decoder layers | 1, 2, 3
    hidden size | 128, 256, 512, 1024
    learning rate | 0.1, 0.01, 0.001
    optimizer | adam, sgd
    Table 7: Hyperparameters explored in our experiments for seq2seq-i (P+C) and seq2seq-o (P+C).

    For seq2seq-i (P+C) and seq2seq-o (P+C), we experiment with the hyperparameters shown in Table 7. The best hyperparameters for our experiment are as follows. For seq2seq-i, we use the following settings. The dropout is set to 0.4. We use SGD as an optimizer with a learning rate of 0.01. The number of encoder/decoder layers is set to 1, and the encoder/decoder hidden size is 256.

    For seq2seq-o, we use the following settings. The dropout is set to 0.2. We use SGD as an optimizer with a learning rate of 0.01. The number of encoder/decoder layers is set to 1, and the encoder/decoder hidden size is 256.
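    For reference, the Table 7 grid spans 5 × 3 × 4 × 3 × 2 = 360 configurations; a sketch of enumerating them follows, and how the authors scheduled runs and selected the best configuration (presumably by dev-set performance) is an assumption.

```python
from itertools import product

# Enumerating the Table 7 grid: 5 x 3 x 4 x 3 x 2 = 360 configurations.
# How the authors scheduled runs and selected the best configuration
# (presumably by dev-set performance) is an assumption.
grid = {
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "layers": [1, 2, 3],
    "hidden_size": [128, 256, 512, 1024],
    "lr": [0.1, 0.01, 0.001],
    "optimizer": ["adam", "sgd"],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 360
```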

    Attribute Description (Strong)
    Relevant Anyone can see how the counter-argument attacks the argument. The relationship between the two components is either explicit or extremely easy to infer. The relationship is thoroughly explained in the text because the two components contain the same words or exhibit coreference.
    Persuasive A very strong, clear counter-argument. It would persuade most readers and is devoid of errors that might detract from its strength or make it difficult to understand.
    Strength A very strong counter-argument with no fallacies. Not much can be improved in order to attack the argument better.
    Table 8: Guidelines for annotating the quality of the CAs in our corpus; the description shown is for the highest score (5). Each dimension is scored on a 1-5 scale. Annotators are shown the criteria for only the highest and lowest scores.

    Appendix D Annotation Criteria and Examples

    The guidelines shown to crowdworkers when annotating the quality of CAs are shown in Table 8. We show the description for strong dimensions (i.e., score of 5).

    Examples of CAs for one argument are shown in Figure 7.

