Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention



Abstract

A lack of corpora has so far limited advances in integrating human gaze data as a supervisory signal in neural attention mechanisms for natural language processing (NLP).
We propose a novel hybrid text saliency model (TSM) that, for the first time, combines a cognitive model of reading with explicit human gaze supervision in a single machine learning framework.
On four different corpora we demonstrate that our hybrid TSM duration predictions are highly correlated with human gaze ground truth.
We further propose a novel joint modelling approach to integrate TSM predictions into the attention layer of a network designed for a specific upstream NLP task without the need for any task-specific human gaze data.
We demonstrate that our joint model outperforms the state of the art in paraphrase generation on the Quora Question Pairs corpus by more than 10% in BLEU-4 and achieves state-of-the-art performance for sentence compression on the challenging Google Sentence Compression corpus.
As such, our work introduces a practical approach for bridging between data-driven and cognitive models and demonstrates a new way to integrate human gaze-guided neural attention into NLP tasks.


1 Introduction

Neural attention mechanisms have been widely applied in natural language processing and computer vision.
By mimicking human attention Sood et al. (2020), they have enabled neural networks to only focus on those aspects of their input that are important for a given task Mnih et al. (2014); Xu et al. (2015b).
While neural networks are able to learn meaningful attention mechanisms using only supervision received from the target task, the addition of human gaze information has been shown to be beneficial in many cases Karessli et al. (2017); Qiao et al. (2018); Xu et al. (2015a); Yun et al. (2013).
An especially interesting way of leveraging gaze information was demonstrated by works incorporating human gaze into neural attention mechanisms, for example for image and video captioning Sugano and Bulling (2016); Yu et al. (2017) or visual question answering Qiao et al. (2018).

While attention is at least as important for reading text as it is for viewing images Commodari and Guarnera (2005); Wolfe and Horowitz (2017), integration of human gaze into neural attention mechanisms for natural language processing (NLP) tasks remains under-explored.
A major obstacle to studying such integration is data scarcity:
Available corpora of human gaze during reading consist of too few samples to provide effective supervision for modern data-hungry architectures, and human gaze data is only available for a small number of NLP tasks.
For paraphrase generation and sentence compression, which play an important role in applications such as reading comprehension systems Gupta et al. (2018); Hermann et al. (2015); Patro et al. (2018), no human gaze data is available.

We address this data scarcity in two novel ways:
First, to overcome the low number of human gaze samples for reading, we propose a novel hybrid text saliency model (TSM) in which we combine a cognitive model of reading behaviour with human gaze supervision in a single machine learning framework.
More specifically, we use the E-Z Reader model of attention allocation during reading Reichle et al. (1998) to obtain a large number of synthetic training examples.
We use these examples to pre-train a BiLSTM Graves and Schmidhuber (2005) network with a Transformer Vaswani et al. (2017), whose weights we subsequently refine by training on only a small amount of human gaze data.
We demonstrate that our model yields predictions that are well-correlated with human gaze on out-of-domain data.
Second, we propose a novel joint modelling approach of attention and comprehension that allows human gaze predictions to be flexibly adapted to different NLP tasks by integrating TSM predictions into an attention layer.
By jointly training the TSM with a task-specific network, the saliency predictions are adapted to this upstream task without the need for explicit supervision using real gaze data.
Using this approach, we outperform the state of the art in paraphrase generation on the Quora Question Pairs corpus by more than 10% in BLEU-4 and achieve state-of-the-art performance on the Google Sentence Compression corpus.
As such, our work demonstrates the significant potential of combining cognitive and data-driven models and establishes a general principle for flexible gaze integration into NLP that has the potential to also benefit tasks beyond paraphrase generation and sentence compression.

2 Related work

Our work is related to previous works on 1) NLP tasks for text comprehension, 2) human attention modelling, as well as 3) gaze integration in neural network architectures.

2.1 NLP tasks for text comprehension

Two key tasks in machine text comprehension
are paraphrasing and summarization Chen et al. (2016); Hermann et al. (2015); Cho et al. (2019); Li et al. (2018); Gupta and Lehal (2010).
While paraphrasing is the task of “conveying the same meaning, but with different expressions” Cho et al. (2019); Fader et al. (2013); Li et al. (2018),
summarization deals with
extracting or abstracting the key points of a larger input sequence Frintrop et al. (2010); Tas and Kiyani (2007); Kaushik and Lipton (2018).
Though advances have helped bring machine comprehension closer to human performance, humans are still superior for most tasks Blohm et al. (2018); Xia et al. (2019); Zhang et al. (2018).
While attention mechanisms can improve performance by helping models to focus on relevant parts of the input Prakash et al. (2016); Rush et al. (2015); Rocktäschel et al. (2016); Cao et al. (2016); Hasan et al. (2016); Cho et al. (2015), the benefit of explicit supervision through human attention remains under-explored.

2.2 Human attention modelling

Predicting where people look (saliency prediction) in images is a long-standing challenge in neuroscience and computer vision Borji and Itti (2012); Bylinskii et al. (2016); Kümmerer et al. (2015).
In contrast to images, most attention models for eye movement behaviors during reading are cognitive process models, i.e. models that do not involve machine learning but implement cognitive theories Engbert et al. (2005); Rayner (1978); Reichle et al. (1998).
Key challenges for such models are their limited number of parameters and hand-crafted rules, which make it difficult to adapt them to different tasks and domains, as well as the difficulty of using them as part of end-to-end trained machine learning architectures Duch et al. (2008); Kotseruba and Tsotsos (2018); Ma and Peters (2020).
One of the most influential cognitive models of gaze during reading is the E-Z Reader model Reichle et al. (1998).
It assumes attention shifts to be strictly serial in nature and that saccade production depends on different stages of lexical processing.
The E-Z Reader model has been very successful in explaining different effects seen in attention allocation during reading Reichle et al. (2009, 2013).

In contrast, learning-based attention models for text remain under-explored. Nilsson and Nivre (2009)
trained person-specific models on features including length and frequency of words to predict fixations on words and later extended their approach to also predict fixation durations Nilsson and Nivre (2010). The first work to present a person-independent model for fixation prediction on text used a linear CRF model Matthies and Søgaard (2013).
A separate line of work has instead tried to incorporate assumptions about the human reading process into the model design.
For example, the Neural Attention Trade-off (NEAT) language model by Hahn and Keller (2016)
is trained with hard attention, assigning a cost to each fixation.
Subsequent work applied the NEAT model to question answering tasks, showing task-specific effects on learned attention patterns that reflect human behavior Hahn and Keller (2016).
Further works include sentence representation learning using surprisal and part of speech tags as proxies to human attention Wang et al. (2017), attention as a way to improve time complexity for NLP tasks Seo et al. (2018), and learning saliency scores by training for sentence comparison Samardzhiev et al. (2018). Our work is fundamentally different from all of these works given that we, for the first time, combine cognitive theory and data-driven approaches.

2.3 Gaze integration in neural network architectures

Integration of human gaze data into neural network architectures has been explored for a range of computer vision tasks Karessli et al. (2017); Shcherbatyi et al. (2015); Xu et al. (2015a); Yu et al. (2017); Yun et al. (2013).
Sugano and Bulling (2016) used gaze as an additional input to the attention layer for image captioning, while Qiao et al. (2018) used human-like attention maps as an additional supervision for the attention layer for a visual question answering task.
Most previous work in gaze-supported NLP has used gaze as an input feature, e.g. for syntactic sequence labeling Klerke and Plank (2019), classifying referential versus non-referential use of pronouns Yaneva et al. (2018), reference resolution Iida et al. (2011), keyphrase extraction Zhang and Zhang (2019), or prediction of multi-word expressions Rohanian et al. (2017). Recently, Hollenstein et al. (2019)
proposed to build a lexicon of gaze features given word types, overcoming the need for gaze data at test time.
Two key recent works in NLP pioneered methods for incorporating gaze data into NLP classification models, inspired by multi-task learning: Klerke et al. (2016) added a gaze prediction task to regularize their sentence compression model. While they do not integrate gaze into an attention layer, their supervision method still improved performance on this task. Barrett et al. (2018) proposed an architecture for sequence classification tasks that could alternate between supervisory signals from labeled sequences and disjoint eye tracking data. In this work, the authors do not predict gaze on the specific task corpus, but rather use eye tracking data from a different corpus and task to regularize the neural attention function used in their classification system.
In stark contrast, our work provides human gaze predictions over any given NLP text corpus and therefore, for the first time, allows NLP attention models to be supervised by integrating human gaze predictions made over the task corpus directly into neural attention layers.

3 Method

Figure 1: High-level architecture of our model. Given an input sentence, the Text Saliency Model (embedding layer, BiLSTM layer, Transformer layer, and output layer) produces attention scores for each word of the input sentence. The task-specific Task Model combines this information with the original input sentence in its attention layer to produce an output sentence.

We make two distinct contributions:
A hybrid text saliency model, as well as two attention-based models for paraphrase generation and sentence compression employed in a novel joint modelling approach.

3.1 Hybrid text saliency model

To overcome the limited amount of eye-tracking data for reading comprehension tasks, we propose a hybrid approach when training our text saliency model.
In the first stage of training, we leverage the E-Z Reader model Reichle et al. (1998) to generate a large amount of training data over the CNN and Daily Mail Reading Comprehension Corpus Hermann et al. (2015).
After training the text saliency model to convergence on this synthetic data, in a second training phase we fine-tune the network with real eye tracking data of humans reading, taken from the Provo and Geco corpora Luke and Christianson (2018); Cop et al. (2017).
We used the most recent implementation of E-Z Reader (Version 10.2) available from the authors’ website.

The task of text saliency is to predict fixation durations for each word of an input sentence.
In our text saliency model, we combine a BiLSTM network Graves and Schmidhuber (2005) with a Transformer Vaswani et al. (2017) (see  Figure 1 for an overview).
Each word of the input sentence is encoded using pre-trained GloVe embeddings
 Pennington et al. (2014).
The resulting embeddings are fed into a single-layer BiLSTM network Graves and Schmidhuber (2005) that integrates information over the whole input sentence.
The outputs from the BiLSTM network are fed into a Transformer network with multi-headed self-attention Vaswani et al. (2017).
In contrast to Vaswani et al. (2017), we only use the encoder of the Transformer network.
Furthermore, we do not provide positional encodings as input, as this information is implicitly present in the outputs produced by the BiLSTM layer.
In initial experiments we found an advantage of using only four layers with four attention heads each for the Transformer network as opposed to six layers with 12 heads in the original Transformer architecture Vaswani et al. (2017).
The combination of a BiLSTM network with a subsequent Transformer network allows our model to better capture the sequential context while still maintaining computational efficiency.
Finally, a fully connected layer is used to obtain an attention score for each input word.
We apply sigmoid nonlinearities with subsequent normalization over the input sentence to obtain a probability distribution over the sentence.
As loss function we use the mean squared error.
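For illustration, a minimal PyTorch sketch of this architecture is shown below; the hyperparameters follow the implementation details in Section 4.2 (300-dimensional embeddings, hidden size 128, four Transformer layers with four heads each), while the class and variable names are illustrative choices rather than the exact implementation.

import torch
import torch.nn as nn

class TextSaliencyModel(nn.Module):
    """Sketch: embeddings -> BiLSTM -> Transformer encoder -> per-word saliency scores."""
    def __init__(self, vocab_size, emb_dim=300, hidden=128, n_layers=4, n_heads=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # initialised from pre-trained GloVe in practice
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # no positional encodings added
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids):
        x = self.embedding(token_ids)                    # (batch, seq, emb_dim)
        x, _ = self.bilstm(x)                            # (batch, seq, 2*hidden)
        x = self.encoder(x)                              # multi-headed self-attention over BiLSTM outputs
        scores = torch.sigmoid(self.out(x)).squeeze(-1)  # per-word scores in [0, 1]
        return scores / scores.sum(dim=-1, keepdim=True) # normalise over the sentence

loss_fn = nn.MSELoss()   # trained against fixation durations (synthetic, then human)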

3.2 Joint modelling for natural language processing tasks

To model the relationship between attention allocation and text comprehension, we integrate the TSM with two different NLP task attention-based networks in a joint model (see
 Figure 1).
Specifically, we propose a modification of the Luong attention layer Luong et al. (2015), a computationally light-weight but highly effective multiplicative attention mechanism Luong et al. (2015); Britz et al. (2017).
We compute attention scores as

$$a_t(s) = \frac{\exp\left(\mathrm{score}(h_t, \bar{h}_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, \bar{h}_{s'})\right)} \qquad (1)$$

using our task-specific modified score functions $\mathrm{score}(\cdot,\cdot)$.
For the tasks of paraphrase generation and sentence compression, respectively, we propose the novel score functions

$$\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} W_a \left(\bar{h}_s \odot T(x)_s\right) \qquad (2)$$
$$\mathrm{score}(h_t, \bar{h}_s) = v_a^{\top} \tanh\left(W_a \left[h_t ; \bar{h}_s \odot T(x)_s\right]\right) \qquad (3)$$

where $h_t$ is the current hidden state, $\bar{h}_s$ are the hidden states of the encoder, and $W_a$ and $v_a$ are learnable parameters of the attention mechanism.
The outputs $T(x)$ of the TSM model on the input sentence $x$ are incorporated into the score function by element-wise multiplication.
This way, attention scores in the upstream task network reflect word saliencies learnt from humans. In addition, the error signal from the upstream loss function can be propagated back to the TSM in order to adapt its parameters to the upstream task, thereby defining an implicit loss on $T(x)$.
This way, the attention distribution returned by the TSM is adapted to the specific upstream task, allowing us to incorporate and adapt a neural model of attention to tasks for which no human gaze data is available.
Note that, since we address two different types of tasks, namely a generative task (paraphrase generation) and a classification task (sentence compression), we use different score functions as suggested by previous work Luong et al. (2015).
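For illustration, a minimal PyTorch sketch of the modified score functions and the resulting attention weights is shown below; the tensor shapes, parameter objects, and function names are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def paraphrase_score(h_t, h_enc, t_sal, W_a):
    # Eq. (2): multiplicative score; TSM saliency t_sal is folded into the encoder states element-wise
    h_mod = h_enc * t_sal.unsqueeze(-1)                          # (batch, src_len, d)
    return torch.bmm(h_mod, W_a(h_t).unsqueeze(-1)).squeeze(-1)  # (batch, src_len)

def compression_score(h_t, h_enc, t_sal, W_a, v_a):
    # Eq. (3): concat-style score; W_a maps 2d -> d, v_a maps d -> 1
    h_mod = h_enc * t_sal.unsqueeze(-1)
    concat = torch.cat([h_t.unsqueeze(1).expand_as(h_mod), h_mod], dim=-1)
    return v_a(torch.tanh(W_a(concat))).squeeze(-1)

def attention_weights(scores):
    # Eq. (1): normalise scores over source positions
    return F.softmax(scores, dim=-1)

# Example shapes (hypothetical): d = 1024 hidden units, a batch of 2 sentences of length 7
d, batch, src_len = 1024, 2, 7
W_a_gen = nn.Linear(d, d, bias=False)
h_t, h_enc, t_sal = torch.randn(batch, d), torch.randn(batch, src_len, d), torch.rand(batch, src_len)
a_t = attention_weights(paraphrase_score(h_t, h_enc, t_sal, W_a_gen))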

4 Experiments

4.1 Joint model with upstream tasks

Evaluation details

Datasets. We used two standard benchmark corpora to evaluate each upstream NLP task.
For paraphrase generation, we used the Quora Question Pairs corpus, which consists of human-annotated pairs of paraphrased questions crawled from Quora.
We followed the common practice of excluding negative paraphrase examples from the corpus to obtain training data for paraphrase generation Patro et al. (2018); Gupta et al. (2018).
We split the data according to Gupta et al. (2018); Patro et al. (2018), using either 100K or 50K examples for training, 45K examples for validation, and 4K examples for testing.
For the sentence compression task we used the Google Sentence Compression corpus Filippova et al. (2015) containing 200K sentence compression pairs that were crawled from news articles.
We split the data according to Zhao et al. (2018), taking the first 1K examples as test data, and the next 1K as validation data.

Paraphrase generation. Our first task was paraphrase generation where, given a source sentence, the model has to produce a different target sentence with the same meaning that may have a different length.
We used a sequence-to-sequence network with word-level attention that was originally proposed for neural machine translation Bahdanau et al. (2015).
The model consisted of two recurrent neural networks, an encoder and an attention decoder (see  Figure 1).
The encoder consisted of an embedding layer followed by a gated recurrent unit (GRU) Cho et al. (2014).
The decoder produced an output sentence step-by-step given the hidden state of the encoder and the input sentence.
At each output step, the encoded input word and the previous hidden state are used to produce attention weights using our modified Luong attention (see Equation 2).
These attention weights are combined with the embedded input sentence and fed into a GRU to produce an output sentence.
The loss between predicted and the ground-truth paraphrase was calculated over the entire vocabulary using cross-entropy.
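A simplified sketch of a single decoding step is shown below; it combines the attention context with the embedded previous token, which is one standard way of realising the description above, and all names and shapes are illustrative assumptions.

import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    """Sketch of one decoding step for paraphrase generation (names and shapes are assumptions)."""
    def __init__(self, vocab_size, emb_dim=300, hidden=1024):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.W_a = nn.Linear(hidden, hidden, bias=False)   # learnable parameter of the modified score (Eq. 2)
        self.gru = nn.GRU(emb_dim + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_token, prev_hidden, enc_states, t_sal):
        emb = self.embedding(prev_token)                          # (batch, 1, emb_dim)
        h_mod = enc_states * t_sal.unsqueeze(-1)                  # fold TSM saliency into encoder states
        scores = torch.bmm(h_mod, self.W_a(prev_hidden[-1]).unsqueeze(-1)).squeeze(-1)
        attn = torch.softmax(scores, dim=-1)                      # attention weights over the source words
        context = torch.bmm(attn.unsqueeze(1), enc_states)        # (batch, 1, hidden)
        output, hidden = self.gru(torch.cat([emb, context], dim=-1), prev_hidden)
        logits = self.out(output.squeeze(1))                      # scored against the full vocabulary (cross-entropy)
        return logits, hidden, attn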

Sentence compression. As a second text comprehension task, we opted for deletion-based sentence compression that aims to delete unimportant words from the input sentence Jing (2000); Knight and Marcu (2002); McDonald (2006); Clarke and Lapata (2008); Filippova et al. (2015).
We incorporated the attention mechanism into the baseline architecture presented in Filippova et al. (2015).
The network consisted of three stacked LSTM layers, with dropout after each LSTM layer as a regularization method. The outputs of the last LSTM layer were fed through our modified Luong attention mechanism (see Equation 3) and two fully connected layers which predicted for each word whether it should be deleted. The loss between predicted and ground truth deletion mask was calculated with cross-entropy.
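A minimal sketch of this compression network is shown below; for brevity, the integration of TSM saliency is reduced to an element-wise weighting of the LSTM outputs, standing in for the full modified attention of Equation 3, and all names are illustrative assumptions.

import torch
import torch.nn as nn

class DeletionCompressor(nn.Module):
    """Sketch of the deletion-based compressor; saliency weighting stands in for the Eq. (3) attention."""
    def __init__(self, vocab_size, emb_dim=300, hidden=1024, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=3, dropout=dropout,
                            bidirectional=True, batch_first=True)
        self.fc1 = nn.Linear(2 * hidden, hidden)
        self.fc2 = nn.Linear(hidden, 2)                  # per-word keep/delete logits (cross-entropy loss)

    def forward(self, token_ids, t_sal):
        x, _ = self.lstm(self.embedding(token_ids))      # (batch, seq, 2*hidden), dropout between LSTM layers
        x = x * t_sal.unsqueeze(-1)                      # simplified integration of TSM saliency
        return self.fc2(torch.relu(self.fc1(x)))         # (batch, seq, 2)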

Training. We used pre-trained 300-dimensional GloVe embeddings in both the TSM and the upstream task network to represent the input words Pennington et al. (2014).
We trained both upstream task models using the ADAM optimizer Kingma and Ba (2015) with a learning rate of 0.0001.
For paraphrase generation we used uni-directional GRUs with hidden layer size 1,024 and dropout probability of 0.2. For sentence compression we used Bi-LSTMs with hidden layer size 1,024 and dropout probability of 0.1.

Metrics. The most common metric to evaluate text generative tasks is BLEU Papineni et al. (2002), which measures the n-gram overlap between the produced and target sequence.
To ensure reproducibility, we followed the standard Sacrebleu Post (2018) implementation that uses BLEU-4.
For sentence compression, we followed previous works Filippova et al. (2015); Zhao et al. (2018) by reporting the F1 score as well as the compression ratio calculated as the length of the compressed sentence divided by the input sentence length measured in characters Filippova et al. (2015).
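For illustration, the snippet below shows how these metrics can be computed with the sacrebleu package and a simple character-based compression ratio; the example strings are hypothetical.

import sacrebleu

# BLEU-4 with sacrebleu (example strings are hypothetical)
hypotheses = ["how can i visit the usa"]
references = [["how can i travel to the usa"]]      # one reference stream, one reference per hypothesis
print(sacrebleu.corpus_bleu(hypotheses, references).score)

# Compression ratio: length of the compressed sentence divided by the input length, in characters
def compression_ratio(compressed: str, source: str) -> float:
    return len(compressed) / len(source)

print(compression_ratio("Floods hit the region.", "Severe floods hit the region on Monday."))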

Results and discussion

Paraphrase Generation (BLEU-4)                          Sentence Compression
Method                  50K     100K    Params          Method              F1      CR      Params
Baseline (Seq-to-Seq)   7.11    8.91    45M             Baseline (BiLSTM)   81.3    0.39    12M
Patro et al. (2018)     16.5    17.9    -               Zhao et al. (2018)  85.1    0.39    -
No Fixation             24.62   27.81   69M             No Fixation         83.4    0.38    129M
Random TSM Init         25.26   27.11   79M             Random TSM Init     83.7    0.38    178M
TSM Weight Swap         23.43   27.60   79M             TSM Weight Swap     83.8    0.38    178M
Frozen TSM              25.73   28.26   79M             Frozen TSM          83.9    0.37    178M
Ours                    26.24   28.82   79M             Ours                85.0    0.39    178M

Table 1: Ablation study results and comparison with the state of the art for paraphrase generation in terms of BLEU-4 score for both training set sizes and for sentence compression in terms of F1 score and compression ratio (CR).
Also shown is the number of model parameters.

Results for our joint model on paraphrase generation and sentence compression in comparison to the state of the art are shown in  Table 1.
For paraphrase generation, our approach achieves a BLEU-4 score of 28.82 when using 100K training examples, clearly outperforming the previous state of the art for this task from Patro et al. (2018) (17.9 BLEU-4).
The same holds for 50K training examples (26.24 vs. 16.5 BLEU-4).
For sentence compression, our joint model reaches state-of-the-art performance with an F1 score of 85.0 and a compression ratio of 0.39, compared to the 85.1 F1 score and 0.39 compression ratio achieved by the approach of Zhao et al. (2018).

To further analyze the impact of our joint modelling approach, we evaluated several ablated versions of our model:

  • Baseline (Seq-to-Seq): Stand-alone models based on a Seq-to-Seq network Bahdanau et al. (2015) for paraphrase generation and a BiLSTM network Schuster and Paliwal (1997) for sentence compression.

  • No Fixation: Stand-alone upstream task network with original Luong attention (no TSM).

  • Random TSM Init: Random initialization of the TSM instead of training on E-Z Reader and human data. Still implicit supervision by the upstream task during joint training.

  • TSM Weight Swap: Exchange of the weights of the TSM model between tasks, i.e. sentence compression using the TSM weights obtained from the best-performing paraphrase generation model and vice versa.

  • Frozen TSM: Training of the TSM with E-Z Reader and human gaze predictions but with frozen weights in the joint training with the upstream task, i.e. no adaptation of the TSM.

Figure 2: Paraphrase generation (top) and sentence compression (bottom) attention maps for both sub-networks (TSM predictions and upstream task attention) in our joint architecture. Left (blue): TSM fixation predictions over training epochs, showing how the predictions change during training; the last epoch corresponds to the converged model. Right: two-dimensional neural attention maps for our model and for the No Fixation model from the ablation study, with the input sequence on the horizontal axis and the predicted sequence on the vertical axis. The fixation predictions for each epoch are computed over the words of the input sequence and are integrated into the neural attention mechanism, which in turn is used to make the prediction.

As can be seen from Table 1, all ablated models obtain inferior performance to our full model on both tasks (statistically significant at the 0.05 level).
Notably, even the No Fixation model improves drastically over the Seq-to-Seq baseline for paraphrase generation, most likely due to the significant increase in network parameters.
The benefit of training the TSM with our hybrid approach before using it in the joint model is underlined by the performance differences between the Random TSM Init model (a decrease in performance for both tasks) and our full model (best performance and differently adapted saliency predictions; see Table 1 and Figure 2).

Most importantly, our full model achieves higher performance than the Frozen TSM model in all evaluations (e.g. 85.0 vs. 83.9 F1 for sentence compression), indicating that our model successfully adapts the TSM predictions during joint training.
This is further underlined by the inferior performance of the TSM Weight Swap model:
Swapping the optimal TSM weights between different upstream tasks leads to a notable performance decrease (e.g. 85.0 vs. 83.7 F1 for sentence compression), implying that the TSM model adaptation is specific to the upstream task.

To gain qualitative insights into how our joint model training adapts TSM predictions to specific upstream tasks, we visualize the saliency predictions over time.
 Figure 2 shows a visualization of representative samples for both tasks.
In addition, we show the 2D neural attention map of the converged models with the input sequence on the horizontal and the subsequent prediction on the vertical axis for our (with fixations) and the No Fixation model predictions and weights, respectively.
As can be seen, the adapted saliency predictions (left) differ significantly from each other, particularly when analyzed over training epochs, where the final epoch shows the fixation duration predictions of our converged model.
In paraphrase generation (left top) the saliency predictions focus on fewer words in the sentence within 11 epochs, specifically the word “travel”, as this word is replaced in the correct paraphrase by the word “visit”.
Our model correctly predicts the paraphrase, while the No Fixation model does not.
For sentence compression (left bottom) the predictions continue to be spread over the whole sentence, with only slight changes in the distribution over the words. This makes sense given that the task of this network is to delete as many words in the input sequence as possible while still maintaining syntactic structure and meaning. In the 2D attention map (bottom right), the neural attention weights of the two converged models differ with respect to the allocation of probability mass. We see that the No Fixation model densely concentrates attention on a few specific input words (horizontal axis) when predicting several words (vertical axis). In contrast, the attention mass of our model is more spread out.

4.2 Pre-training of the hybrid text saliency model (TSM)

Figure 3: Heatmaps showing human fixation durations (red) and hybrid TSM duration predictions (blue) for three example sentences, illustrating the similarity between TSM word-level duration predictions and the human ground-truth word-level durations.

Evaluation details

Training datasets. Training the TSM consists of two stages: pre-training with synthetic data generated by E-Z Reader, and subsequent fine-tuning on human gaze data.
For the first step, we ran E-Z Reader on the CNN and Daily Mail corpus Hermann et al. (2015), consisting of 300K online news articles with an average of 3.75 sentences.
As recommended in Reichle et al. (1998), we ran E-Z Reader 10 times for each sentence to ensure stability in the fixation predictions.
For training, we obtained a total of 7.6M annotated sentences from Daily Mail and 3.1M from CNN; for validation, we obtained 850K sentences from Daily Mail and 350K from CNN.
For the second step, we used the two established gaze corpora Provo Luke and Christianson (2018) and Geco Cop et al. (2017).
Provo contains 55 short passages, extracted from different sources such as popular science magazines and fiction stories Luke and Christianson (2018).
We split the data into 10K sentence pairs for training and 1K sentence pairs for validation, where a pair consists of a sentence and the gaze data of one reader, since multiple participants read the same sentence.
Geco is comprised of long passages from a popular novel Cop et al. (2017).
We split the data into 65K sentence pairs for training and 8K sentence pairs for validation.

Test datasets. We evaluated our model on the validation sets of the Provo and Geco corpora, as well as on the Dundee Kennedy and Pynte (2005) and MQA-RC corpora Sood et al. (2020).
The combined validation corpora of Provo and Geco contained 18K sentence pairs.
Dundee consists of recordings from 10 participants reading 20 news articles, while the MQA-RC corpus is a 3-condition reading comprehension corpus using 32 documents from the MovieQA question answering dataset Tapaswi et al. (2016).
For our evaluation we used 1K sentence pairs from the free reading condition.
This dataset is substantially different from the other eye tracking corpora because its stimuli are scraped from online sources and contain noise not found in text intended for human reading.

Implementation details. We used pre-trained 300 dimensional GloVe word embeddings Pennington et al. (2014).
Our network consists of a bidirectional LSTM followed by four Transformer self-attention layers with four heads each and a hidden size of 128.
The model objective is to predict normalized fixation durations for each word in the input sentence, resulting in saliency scores between 0 and 1.
We used the ADAM optimizer Kingma and Ba (2015) with a learning rate of 0.00001, batch size of 100, and dropout of 0.5 after the embedding layer and the recurrent layer.
We pre-trained our network on the synthetic training data for four epochs and then fine-tuned it on human data for 10 epochs.
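The two-stage schedule can be summarised by the following sketch, which reuses the TextSaliencyModel class from the sketch in Section 3.1; ez_reader_loader and human_gaze_loader are placeholder data loaders yielding token ids and normalized fixation durations.

import torch

model = TextSaliencyModel(vocab_size=50000)       # class from the sketch in Section 3.1; vocabulary size is hypothetical
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = torch.nn.MSELoss()

def run_epochs(loader, n_epochs):
    for _ in range(n_epochs):
        for token_ids, durations in loader:        # durations: normalized fixation durations per word
            optimizer.zero_grad()
            loss = loss_fn(model(token_ids), durations)
            loss.backward()
            optimizer.step()

run_epochs(ez_reader_loader, n_epochs=4)    # stage 1: synthetic E-Z Reader data (CNN/Daily Mail)
run_epochs(human_gaze_loader, n_epochs=10)  # stage 2: fine-tuning on human gaze (Provo + Geco)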

              TSM                     TSM w/o pre-train       TSM w/o fine-tune
Corpus        MSE     JSD     ρ       MSE     JSD     ρ       MSE     JSD     ρ
Dundee        0.063   0.39    0.99*   0.071   0.39    0.99*   0.096   0.47    -0.68
Provo + Geco  0.105   0.34    1.00*   0.112   0.36    0.99*   0.238   0.46    0.10
Provo         0.003   0.24    0.88*   0.008   0.44    0.83*   0.032   0.52    -0.25
Geco          0.118   0.35    0.99*   0.127   0.35    0.98*   0.267   0.45    -0.10
MQA-RC        0.064   0.36    0.94*   0.071   0.36    0.76*   0.083   0.42    -0.05
Table 2: Comparison of predicted and human ground-truth fixation durations for the different TSM conditions and corpora in terms of mean squared error (MSE), Jensen-Shannon divergence (JSD), and Spearman's rank correlation (ρ) between the POS-tag-based fixation distributions of model predictions and ground truth.
A star indicates a statistically significant correlation.

Metrics. To evaluate the TSM model, we compute mean squared error (MSE) between the predicted and ground truth fixation durations as well as the Jensen-Shannon Divergence (JSD) Lin (1991).
JSD is widely used in eye tracking research to evaluate inter-gaze agreement Mozaffari et al. (2018); Fang et al. (2009); Davies et al. (2016); Oertel and Salvi (2013) as, unlike Kullback-Leibler Divergence, JSD is symmetric.
In addition we measured the word type predictability as it is a well-known predictor of fixation probabilities Hahn and Keller (2016); Nilsson and Nivre (2009).
We used the Stanford tagger Toutanova et al. (2003) to predict part-of-speech (POS) tags for our corpora and computed the average fixation probability per tag,
allowing us to compute the correlation between our model and the ground truth using Spearman's ρ.
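For illustration, the following sketch computes these metrics with SciPy; the inputs are assumed to be normalized fixation durations per word and per-POS-tag mean fixation probabilities.

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import spearmanr

def fixation_metrics(pred, gold):
    """pred, gold: arrays of normalized fixation durations for one sentence (illustrative inputs)."""
    mse = float(np.mean((np.asarray(pred) - np.asarray(gold)) ** 2))
    jsd = jensenshannon(pred, gold, base=2) ** 2   # scipy returns the JS distance; squaring gives the divergence
    return mse, jsd

def pos_correlation(pred_by_tag, gold_by_tag):
    """Spearman's rho between mean fixation probabilities per POS tag (dicts keyed by tag)."""
    tags = sorted(pred_by_tag)
    rho, p_value = spearmanr([pred_by_tag[t] for t in tags], [gold_by_tag[t] for t in tags])
    return rho, p_value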

Results and discussion

Table 2 shows the performance of our model and of the ablation conditions in terms of mean squared error (MSE), Jensen-Shannon divergence (JSD), and correlation with the human ground truth.
As ablation conditions we evaluate a model only trained on human data (w/o pretrain) as well as a model that is not fine-tuned on human data (w/o finetune), but only trained with E-Z Reader data.

Most importantly, our model is superior to or on par with both ablation variants across all metrics and corpora, showing the importance of both the E-Z Reader pre-training and the fine-tuning with human data.
Pretraining with data obtained from E-Z Reader is most beneficial in the case of the small Provo corpus, where we observe a reduction from 0.44 JSD to 0.24 JSD by adding the pretraining step.
For the larger corpora this difference is less pronounced but still present.
It is interesting to note that TSM w/o fine-tune consistently performs the worst, indicating that training on E-Z Reader data alone is insufficient, even though it provides benefits when combined with human data.

Using the correlations to human gaze over the POS distributions, we can compare our approach to Hahn and Keller (2016), who achieved a ρ of 0.85 on the Dundee corpus, compared to the ρ of 0.99 achieved by our model.
Furthermore, we observe an especially large improvement in ρ as a result of E-Z Reader pre-training on the MQA-RC dataset.
This dataset, unlike the other eye tracking corpora, is generated from stimuli which were scraped from online sources regarding movie plots, underlining the effectiveness of our approach in generalising to out-of-domain data.
In further analyses of the POS-based correlations we observed that content words, such as adjectives, adverbs, nouns, and verbs, are more predictive than function words. Lastly, we provide a qualitative impression of our method by comparing attention maps using our TSM predictions to ground-truth human data (see Figure 3).

5 Conclusion

In this work we made two original contributions towards improving natural language processing tasks using human gaze predictions as a supervisory signal.
First, we introduced a novel hybrid text saliency model that, for the first time, integrates a cognitive reading model with a data-driven approach to address the scarcity of human gaze data on text.
Second, we proposed a novel joint modelling approach that allows the TSM to be flexibly adapted to different NLP tasks without the need for task-specific ground-truth human gaze data.
We showed that both advances result in significant performance improvements over the state of the art in paraphrase generation, as well as competitive performance for sentence compression with a much less complex model than the state of the art.
We further demonstrated that this approach is effective in yielding task-specific attention predictions.
Taken together, our findings not only demonstrate the feasibility and significant potential of combining cognitive and data-driven models for NLP tasks – and potentially beyond – but also show how saliency predictions can be effectively integrated into the attention layer of task-specific neural network architectures to improve performance.

Acknowledgements

E. Sood was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2075 – 390740016;
S. Tannert was supported by IBM Research AI through the IBM AI Horizons Network;
P. Müller and A. Bulling were funded by the European Research Council (ERC; grant agreement 801708).
We would like to thank the following people for their helpful insights and contributions: Sean Papay, Pavel Denisov, Prajit Dhar, Manuel Mager, and Diego Frassinelli.
Additional revenues related to, but not supporting, this work: Scholarship by Google for E. Sood.

Appendix A Appendix

a.1 Sentence Compression Comparison To Previous SOTA

To gain further insight into the comparison between our model and the current state of the art in sentence compression, we show results of our method and its ablations in relation to ablations of the method by Zhao et al. (2018) (see Table 3).
In their work, the authors added a “syntax-based language model” to their sentence compression network with which they obtained the state-of-the-art performance of 85.1 F1 score.
The authors employ a syntax-based language model which is trained to learn the syntactic dependencies between lexical items in the given input sequence. Together with this language model, they use a reinforcement learning algorithm to improve the deletion proposed by their Bi-LSTM model.
Using a naive language model without syntactic features, their model obtained an F1 score of 85.0.
With their stand-alone Bi-LSTM method, in which they do not employ the reinforcement learning language model policy, they obtain an F1 score of 84.8.
In contrast, our method includes neither a reinforcement-learning-based language model nor additional syntactic features.
Nevertheless, our method is still competitive with the state of the art (achieving an F1 score of 85.0), and
arguably might benefit from additional incorporation of syntactic information in future work.

                    Method                      F1      CR      Params
Zhao et al. (2018)  LSTM implementation         84.8    0.40    -
                    Evaluator LM                85.0    0.41    -
                    Syntax-Based Evaluator LM   85.1    0.39    -
Our paper           Baseline (BiLSTM)           81.3    0.39    12M
                    No Fixation                 83.4    0.38    129M
                    Random TSM Init             83.7    0.38    178M
                    TSM Weight Swap             83.8    0.38    178M
                    Frozen TSM                  83.9    0.37    178M
                    Ours                        85.0    0.39    178M

Table 3: Ablation study results and comparison with the state of the art for sentence compression in terms of F1 score and compression ratio (CR).
Also shown is the number of model parameters. Our model obtains state-of-the-art performance without the additional syntactic information used in previous methods.

a.2 Ablation Study – Attention Maps

To shed more light on the adapted TSM predictions for the conditions in our ablation study, we present saliency and neural attention maps for the conditions Random TSM Init and TSM Weight Swap.
In Figure 4, we show that the adapted saliency predictions for paraphrase generation (blue, left) vary between the two conditions (top vs. bottom) with respect to which words are predicted to be most salient and to how the predictions adapt over the course of training. The last epoch corresponds to the respective converged models.
There exist notable differences in the adapted TSM predictions for the two ablations.
However, we assume that they do not explain the performance difference between these two conditions, as this difference is not statistically significant.
Both conditions do, however, perform significantly worse than our full model (see the main paper for results).
As shown in the main paper, our model allocates the most attention to the word “travel” in the example, as this word is replaced by “visit” in the correct paraphrase.